2017-06-17 03:53:02

by Ram Pai

Subject: [RFC v2 00/12] powerpc: Memory Protection Keys

Memory protection keys enable an application to protect its
address space from inadvertent access or corruption by
the application itself.

The overall idea:

A process allocates a key and associates it with
an address range within its address space.
The process can then dynamically set read/write
permissions on the key without involving the
kernel. Any code that violates the permissions
of the address space, as defined by its associated
key, will receive a segmentation fault.
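
A minimal sketch of that flow from userspace, assuming the
pkey_alloc()/pkey_mprotect() system calls added later in this series
and a hypothetical pkey_set() helper that updates the AMR directly
(as in the documentation patch):

    int pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);
    char *ptr = mmap(NULL, PAGE_SIZE, PROT_NONE,
                     MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

    pkey_mprotect(ptr, PAGE_SIZE, PROT_READ|PROT_WRITE, pkey);

    pkey_set(pkey, 0);                  /* pure AMR update, no syscall */
    *ptr = 1;                           /* write is now permitted      */
    pkey_set(pkey, PKEY_DISABLE_WRITE); /* writes fault again          */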

This patch series enables the feature on PPC64.
It is enabled on the HPTE 64K-page platform.

ISA 3.0, Section 5.7.13, describes the detailed specification.


Testing:
This patch series has passed all the protection key
tests available in the selftests directory.
The tests are updated to work on both x86 and powerpc.


version v2:
(1) documentation and selftest added
(2) fixed a bug in 4K-HPTE-backed 64K PTEs where page
invalidation was not done correctly, and where the
second part of the PTE was not initialized
correctly if the PTE was not yet hashed
with an HPTE. Reported by Aneesh.
(3) Fixed ABI breakage caused in siginfo structure.
Reported by Anshuman.

Outstanding known issue:
Calls to sys_swapcontext with a made-up context will end
up with a bogus AMR if done by code that didn't know about
that register. -- Reported by Ben.

version v1: Initial version

Thanks-to: Dave Hansen, Aneesh, Paul Mackerras,
Michael Ellerman


Ram Pai (12):
Free up four 64K PTE bits in 4K backed hpte pages.
Free up four 64K PTE bits in 64K backed hpte pages.
Implement sys_pkey_alloc and sys_pkey_free system call.
store and restore the pkey state across context switches.
Implementation for sys_mprotect_pkey() system call.
Program HPTE key protection bits.
Macro the mask used for checking DSI exception
Handle exceptions caused by violation of pkey protection.
Deliver SEGV signal on pkey violation.
Read AMR only if pkey-violation caused the exception.
Documentation updates.
Updated protection key selftest

Documentation/vm/protection-keys.txt | 110 ++
Documentation/x86/protection-keys.txt | 85 --
arch/powerpc/Kconfig | 15 +
arch/powerpc/include/asm/book3s/64/hash-4k.h | 20 +
arch/powerpc/include/asm/book3s/64/hash-64k.h | 48 +-
arch/powerpc/include/asm/book3s/64/hash.h | 15 +-
arch/powerpc/include/asm/book3s/64/mmu-hash.h | 10 +
arch/powerpc/include/asm/book3s/64/mmu.h | 10 +
arch/powerpc/include/asm/book3s/64/pgtable.h | 84 +-
arch/powerpc/include/asm/mman.h | 29 +-
arch/powerpc/include/asm/mmu_context.h | 12 +
arch/powerpc/include/asm/paca.h | 1 +
arch/powerpc/include/asm/pkeys.h | 159 +++
arch/powerpc/include/asm/processor.h | 5 +
arch/powerpc/include/asm/reg.h | 10 +-
arch/powerpc/include/asm/systbl.h | 3 +
arch/powerpc/include/asm/unistd.h | 6 +-
arch/powerpc/include/uapi/asm/ptrace.h | 3 +-
arch/powerpc/include/uapi/asm/unistd.h | 3 +
arch/powerpc/kernel/asm-offsets.c | 5 +
arch/powerpc/kernel/exceptions-64s.S | 18 +-
arch/powerpc/kernel/process.c | 18 +
arch/powerpc/kernel/signal_32.c | 14 +
arch/powerpc/kernel/signal_64.c | 14 +
arch/powerpc/kernel/traps.c | 49 +
arch/powerpc/mm/Makefile | 1 +
arch/powerpc/mm/dump_linuxpagetables.c | 3 +-
arch/powerpc/mm/fault.c | 25 +-
arch/powerpc/mm/hash64_4k.c | 14 +-
arch/powerpc/mm/hash64_64k.c | 93 +-
arch/powerpc/mm/hash_utils_64.c | 35 +-
arch/powerpc/mm/hugetlbpage-hash64.c | 16 +-
arch/powerpc/mm/mmu_context_book3s64.c | 5 +
arch/powerpc/mm/pkeys.c | 267 +++++
include/linux/mm.h | 32 +-
include/uapi/asm-generic/mman-common.h | 2 +-
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/pkey-helpers.h | 365 +++++++
tools/testing/selftests/vm/protection_keys.c | 1451 +++++++++++++++++++++++++
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/pkey-helpers.h | 219 ----
tools/testing/selftests/x86/protection_keys.c | 1395 ------------------------
42 files changed, 2828 insertions(+), 1844 deletions(-)
create mode 100644 Documentation/vm/protection-keys.txt
delete mode 100644 Documentation/x86/protection-keys.txt
create mode 100644 arch/powerpc/include/asm/pkeys.h
create mode 100644 arch/powerpc/mm/pkeys.c
create mode 100644 tools/testing/selftests/vm/pkey-helpers.h
create mode 100644 tools/testing/selftests/vm/protection_keys.c
delete mode 100644 tools/testing/selftests/x86/pkey-helpers.h
delete mode 100644 tools/testing/selftests/x86/protection_keys.c

--
1.8.3.1


2017-06-17 03:53:07

by Ram Pai

Subject: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
in the 4K backed hpte pages. These bits continue to be used
for 64K backed hpte pages in this patch, but will be freed
up in the next patch.

The patch makes the following changes to the 64K PTE format:

H_PAGE_BUSY moves from bit 3 to bit 9.
H_PAGE_F_SECOND, which occupied bit 4, moves to the second part
of the PTE.
H_PAGE_F_GIX, which occupied bits 5, 6 and 7, also moves to the
second part of the PTE.

The four bits (H_PAGE_F_SECOND | H_PAGE_F_GIX) that represent a slot
are initialized to 0xF, indicating an invalid slot. If an hpte
gets cached in the 0xF slot (i.e. the 7th slot of the secondary
hash bucket), it is released immediately. In other words, even though
0xF is a valid slot, we discard it and consider it invalid;
see hpte_soft_invalid(). This gives us the opportunity to not
depend on a bit in the primary PTE in order to determine the
validity of a slot.

When we release an hpte in the 0xF slot we also release a
legitimate primary slot and unmap that entry. This is to
ensure that we do get a legitimate non-0xF slot the next time we
retry for a slot.

Though treating the 0xF slot as invalid reduces the number of
available slots and may have an effect on performance, the
probability of hitting a 0xF slot is extremely low.
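
For reference, the validity test this scheme relies on is just a
nibble check on the 4-bit hidx value kept in the second part of the
PTE, as added by this patch:

    static inline bool hpte_soft_invalid(unsigned long slot)
    {
            return ((slot & 0xfUL) == 0xfUL);
    }

    /* a 4K subpage is valid only if its hidx nibble is not 0xF */
    bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
    {
            return !(hpte_soft_invalid(rpte.hidx >> (index << 2)));
    }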

Compared to the current scheme, the scheme described above
significantly reduces the number of false hash-table updates and has
the added advantage of freeing up four valuable PTE bits for other
purposes.

This idea was jointly developed by Paul Mackerras, Aneesh, Michael
Ellerman and myself.

The 4K PTE format remains unchanged for now.

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/include/asm/book3s/64/hash-4k.h | 20 +++++++
arch/powerpc/include/asm/book3s/64/hash-64k.h | 32 +++++++----
arch/powerpc/include/asm/book3s/64/hash.h | 15 +++--
arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 ++
arch/powerpc/mm/dump_linuxpagetables.c | 3 +-
arch/powerpc/mm/hash64_4k.c | 14 ++---
arch/powerpc/mm/hash64_64k.c | 81 ++++++++++++---------------
arch/powerpc/mm/hash_utils_64.c | 30 +++++++---
8 files changed, 122 insertions(+), 78 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index b4b5e6b..5ef1d81 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -16,6 +16,18 @@
#define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
#define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)

+
+/*
+ * Only supported by 4k linux page size
+ */
+#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
+#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
+#define H_PAGE_F_GIX_SHIFT 56
+
+#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
+#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
+
+
/* PTE flags to conserve for HPTE identification */
#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
H_PAGE_F_SECOND | H_PAGE_F_GIX)
@@ -48,6 +60,14 @@ static inline int hash__hugepd_ok(hugepd_t hpd)
}
#endif

+static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+ unsigned int subpg_index, unsigned long slot)
+{
+ return (slot << H_PAGE_F_GIX_SHIFT) &
+ (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+}
+
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE

static inline char *get_hpte_slot_array(pmd_t *pmdp)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 9732837..0eb3c89 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -10,23 +10,25 @@
* 64k aligned address free up few of the lower bits of RPN for us
* We steal that here. For more deatils look at pte_pfn/pfn_pte()
*/
-#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
-#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
+#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
+#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
+#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
+#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
+#define H_PAGE_F_GIX_SHIFT 56
+
+
+#define H_PAGE_BUSY _RPAGE_RPN42 /* software: PTE & hash are busy */
+#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
+
/*
* We need to differentiate between explicit huge page and THP huge
* page, since THP huge page also need to track real subpage details
*/
#define H_PAGE_THP_HUGE H_PAGE_4K_PFN

-/*
- * Used to track subpage group valid if H_PAGE_COMBO is set
- * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
- */
-#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
-
/* PTE flags to conserve for HPTE identification */
-#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
- H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
+#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO)
+
/*
* we support 16 fragments per PTE page of 64K size.
*/
@@ -74,6 +76,16 @@ static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
}

+static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+ unsigned int subpg_index, unsigned long slot)
+{
+ unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
+
+ rpte.hidx &= ~(0xfUL << (subpg_index << 2));
+ *hidxp = rpte.hidx | (slot << (subpg_index << 2));
+ return 0x0UL;
+}
+
#define __rpte_to_pte(r) ((r).pte)
extern bool __rpte_sub_valid(real_pte_t rpte, unsigned long index);
/*
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index 4e957b0..e7cf03a 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -8,11 +8,8 @@
*
*/
#define H_PTE_NONE_MASK _PAGE_HPTEFLAGS
-#define H_PAGE_F_GIX_SHIFT 56
-#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
-#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
-#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
-#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
+
+#define INIT_HIDX (~0x0UL)

#ifdef CONFIG_PPC_64K_PAGES
#include <asm/book3s/64/hash-64k.h>
@@ -160,6 +157,14 @@ static inline int hash__pte_none(pte_t pte)
return (pte_val(pte) & ~H_PTE_NONE_MASK) == 0;
}

+static inline bool hpte_soft_invalid(unsigned long slot)
+{
+ return ((slot & 0xfUL) == 0xfUL);
+}
+
+unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
+ int ssize, real_pte_t rpte, unsigned int subpg_index);
+
/* This low level function performs the actual PTE insertion
* Setting the PTE depends on the MMU type and other factors. It's
* an horrible mess that I'm not going to try to clean up now but
diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 6981a52..cfb8169 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
extern int __hash_page_64K(unsigned long ea, unsigned long access,
unsigned long vsid, pte_t *ptep, unsigned long trap,
unsigned long flags, int ssize);
+extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+ unsigned int subpg_index, unsigned long slot);
+extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
+ int ssize, real_pte_t rpte, unsigned int subpg_index);
+
struct mm_struct;
unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
index 44fe483..b832ed3 100644
--- a/arch/powerpc/mm/dump_linuxpagetables.c
+++ b/arch/powerpc/mm/dump_linuxpagetables.c
@@ -213,7 +213,7 @@ struct flag_info {
.val = H_PAGE_4K_PFN,
.set = "4K_pfn",
}, {
-#endif
+#else
.mask = H_PAGE_F_GIX,
.val = H_PAGE_F_GIX,
.set = "f_gix",
@@ -224,6 +224,7 @@ struct flag_info {
.val = H_PAGE_F_SECOND,
.set = "f_second",
}, {
+#endif /* CONFIG_PPC_64K_PAGES */
#endif
.mask = _PAGE_SPECIAL,
.val = _PAGE_SPECIAL,
diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
index 6fa450c..c673829 100644
--- a/arch/powerpc/mm/hash64_4k.c
+++ b/arch/powerpc/mm/hash64_4k.c
@@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
pte_t *ptep, unsigned long trap, unsigned long flags,
int ssize, int subpg_prot)
{
+ real_pte_t rpte;
unsigned long hpte_group;
unsigned long rflags, pa;
unsigned long old_pte, new_pte;
@@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
* need to add in 0x1 if it's a read-only user page
*/
rflags = htab_convert_pte_flags(new_pte);
+ rpte = __real_pte(__pte(old_pte), ptep);

if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -64,13 +66,10 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
/*
* There MIGHT be an HPTE for this pte
*/
- hash = hpt_hash(vpn, shift, ssize);
- if (old_pte & H_PAGE_F_SECOND)
- hash = ~hash;
- slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
- slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
+ unsigned long gslot = get_hidx_gslot(vpn, shift,
+ ssize, rpte, 0);

- if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_4K,
+ if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, MMU_PAGE_4K,
MMU_PAGE_4K, ssize, flags) == -1)
old_pte &= ~_PAGE_HPTEFLAGS;
}
@@ -118,8 +117,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
return -1;
}
new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
- new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
- (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+ new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
}
*ptep = __pte(new_pte & ~H_PAGE_BUSY);
return 0;
diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
index 1a68cb1..3702a3c 100644
--- a/arch/powerpc/mm/hash64_64k.c
+++ b/arch/powerpc/mm/hash64_64k.c
@@ -15,34 +15,13 @@
#include <linux/mm.h>
#include <asm/machdep.h>
#include <asm/mmu.h>
+
/*
* index from 0 - 15
*/
bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
{
- unsigned long g_idx;
- unsigned long ptev = pte_val(rpte.pte);
-
- g_idx = (ptev & H_PAGE_COMBO_VALID) >> H_PAGE_F_GIX_SHIFT;
- index = index >> 2;
- if (g_idx & (0x1 << index))
- return true;
- else
- return false;
-}
-/*
- * index from 0 - 15
- */
-static unsigned long mark_subptegroup_valid(unsigned long ptev, unsigned long index)
-{
- unsigned long g_idx;
-
- if (!(ptev & H_PAGE_COMBO))
- return ptev;
- index = index >> 2;
- g_idx = 0x1 << index;
-
- return ptev | (g_idx << H_PAGE_F_GIX_SHIFT);
+ return !(hpte_soft_invalid(rpte.hidx >> (index << 2)));
}

int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
@@ -50,10 +29,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
int ssize, int subpg_prot)
{
real_pte_t rpte;
- unsigned long *hidxp;
unsigned long hpte_group;
unsigned int subpg_index;
- unsigned long rflags, pa, hidx;
+ unsigned long rflags, pa;
unsigned long old_pte, new_pte, subpg_pte;
unsigned long vpn, hash, slot;
unsigned long shift = mmu_psize_defs[MMU_PAGE_4K].shift;
@@ -116,28 +94,23 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
* On hash insert failure we use old pte value and we don't
* want slot information there if we have a insert failure.
*/
- old_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
- new_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
+ old_pte &= ~(H_PAGE_HASHPTE);
+ new_pte &= ~(H_PAGE_HASHPTE);
goto htab_insert_hpte;
}
/*
* Check for sub page valid and update
*/
if (__rpte_sub_valid(rpte, subpg_index)) {
- int ret;

- hash = hpt_hash(vpn, shift, ssize);
- hidx = __rpte_to_hidx(rpte, subpg_index);
- if (hidx & _PTEIDX_SECONDARY)
- hash = ~hash;
- slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
- slot += hidx & _PTEIDX_GROUP_IX;
+ unsigned long gslot = get_hidx_gslot(vpn, shift,
+ ssize, rpte, subpg_index);

- ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
+ int ret = mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn,
MMU_PAGE_4K, MMU_PAGE_4K,
ssize, flags);
/*
- *if we failed because typically the HPTE wasn't really here
+ * if we failed because typically the HPTE wasn't really here
* we try an insertion.
*/
if (ret == -1)
@@ -148,6 +121,15 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
}

htab_insert_hpte:
+
+ /*
+ * initialize all hidx entries to a invalid value,
+ * the first time the PTE is about to allocate
+ * a 4K hpte
+ */
+ if (!(old_pte & H_PAGE_COMBO))
+ rpte.hidx = INIT_HIDX;
+
/*
* handle H_PAGE_4K_PFN case
*/
@@ -177,10 +159,20 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
rflags, HPTE_V_SECONDARY,
MMU_PAGE_4K, MMU_PAGE_4K,
ssize);
- if (slot == -1) {
- if (mftb() & 0x1)
+
+ if (unlikely(hpte_soft_invalid(slot))) {
+ slot = slot & _PTEIDX_GROUP_IX;
+ mmu_hash_ops.hpte_invalidate(hpte_group+slot, vpn,
+ MMU_PAGE_4K, MMU_PAGE_4K,
+ ssize, flags);
+ }
+
+ if (unlikely(slot == -1 || hpte_soft_invalid(slot))) {
+
+ if (hpte_soft_invalid(slot) || (mftb() & 0x1))
hpte_group = ((hash & htab_hash_mask) *
HPTES_PER_GROUP) & ~0x7UL;
+
mmu_hash_ops.hpte_remove(hpte_group);
/*
* FIXME!! Should be try the group from which we removed ?
@@ -204,11 +196,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
* Since we have H_PAGE_BUSY set on ptep, we can be sure
* nobody is undating hidx.
*/
- hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
- rpte.hidx &= ~(0xfUL << (subpg_index << 2));
- *hidxp = rpte.hidx | (slot << (subpg_index << 2));
- new_pte = mark_subptegroup_valid(new_pte, subpg_index);
- new_pte |= H_PAGE_HASHPTE;
+ new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot);
+ new_pte |= H_PAGE_HASHPTE;
+
/*
* check __real_pte for details on matching smp_rmb()
*/
@@ -322,9 +312,10 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
MMU_PAGE_64K, MMU_PAGE_64K, old_pte);
return -1;
}
- new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
+
new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
- (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+ (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+ new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
}
*ptep = __pte(new_pte & ~H_PAGE_BUSY);
return 0;
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index f2095ce..c0f4b46 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -975,8 +975,9 @@ void __init hash__early_init_devtree(void)

void __init hash__early_init_mmu(void)
{
+#ifndef CONFIG_PPC_64K_PAGES
/*
- * We have code in __hash_page_64K() and elsewhere, which assumes it can
+ * We have code in __hash_page_4K() and elsewhere, which assumes it can
* do the following:
* new_pte |= (slot << H_PAGE_F_GIX_SHIFT) & (H_PAGE_F_SECOND | H_PAGE_F_GIX);
*
@@ -987,6 +988,7 @@ void __init hash__early_init_mmu(void)
* with a BUILD_BUG_ON().
*/
BUILD_BUG_ON(H_PAGE_F_SECOND != (1ul << (H_PAGE_F_GIX_SHIFT + 3)));
+#endif /* CONFIG_PPC_64K_PAGES */

htab_init_page_sizes();

@@ -1589,29 +1591,39 @@ static inline void tm_flush_hash_page(int local)
}
#endif

+unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
+ int ssize, real_pte_t rpte, unsigned int subpg_index)
+{
+ unsigned long hash, slot, hidx;
+
+ hash = hpt_hash(vpn, shift, ssize);
+ hidx = __rpte_to_hidx(rpte, subpg_index);
+ if (hidx & _PTEIDX_SECONDARY)
+ hash = ~hash;
+ slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+ slot += hidx & _PTEIDX_GROUP_IX;
+ return slot;
+}
+
+
/* WARNING: This is called from hash_low_64.S, if you change this prototype,
* do not forget to update the assembly call site !
*/
void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, int ssize,
unsigned long flags)
{
- unsigned long hash, index, shift, hidx, slot;
+ unsigned long hash, index, shift, hidx, gslot;
int local = flags & HPTE_LOCAL_UPDATE;

DBG_LOW("flush_hash_page(vpn=%016lx)\n", vpn);
pte_iterate_hashed_subpages(pte, psize, vpn, index, shift) {
- hash = hpt_hash(vpn, shift, ssize);
- hidx = __rpte_to_hidx(pte, index);
- if (hidx & _PTEIDX_SECONDARY)
- hash = ~hash;
- slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
- slot += hidx & _PTEIDX_GROUP_IX;
+ gslot = get_hidx_gslot(vpn, shift, ssize, pte, index);
DBG_LOW(" sub %ld: hash=%lx, hidx=%lx\n", index, slot, hidx);
/*
* We use same base page size and actual psize, because we don't
* use these functions for hugepage
*/
- mmu_hash_ops.hpte_invalidate(slot, vpn, psize, psize,
+ mmu_hash_ops.hpte_invalidate(gslot, vpn, psize, psize,
ssize, local);
} pte_iterate_hashed_end();

--
1.8.3.1

2017-06-17 03:53:21

by Ram Pai

Subject: [RFC v2 08/12] powerpc: Handle exceptions caused by violation of pkey protection.

Handle data- and instruction-access exceptions caused by memory
protection-key violations.
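
From userspace, the visible effect is that an access blocked by a
key raises SIGSEGV with si_code set to SEGV_PKUERR rather than
SEGV_ACCERR. An illustrative handler (handle_key_fault() and
handle_access_fault() are placeholders) could tell the two apart
like this:

    static void segv_handler(int sig, siginfo_t *si, void *uctx)
    {
            if (si->si_code == SEGV_PKUERR)
                    handle_key_fault(si->si_addr);    /* blocked by AMR/IAMR */
            else
                    handle_access_fault(si->si_addr); /* ordinary access error */
    }

    struct sigaction sa = {
            .sa_sigaction = segv_handler,
            .sa_flags     = SA_SIGINFO,
    };
    sigaction(SIGSEGV, &sa, NULL);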

Signed-off-by: Ram Pai <[email protected]>
(cherry picked from commit a5e5217619a0c475fe0cacc3b0cf1d3d33c79a09)

Conflicts:
arch/powerpc/include/asm/reg.h
arch/powerpc/kernel/exceptions-64s.S
---
arch/powerpc/include/asm/mmu_context.h | 12 +++++
arch/powerpc/include/asm/pkeys.h | 9 ++++
arch/powerpc/include/asm/reg.h | 7 +--
arch/powerpc/mm/fault.c | 21 +++++++-
arch/powerpc/mm/pkeys.c | 90 ++++++++++++++++++++++++++++++++++
5 files changed, 134 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index da7e943..71fffe0 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
{
}

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+bool arch_pte_access_permitted(pte_t pte, bool write);
+bool arch_vma_access_permitted(struct vm_area_struct *vma,
+ bool write, bool execute, bool foreign);
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+ /* by default, allow everything */
+ return true;
+}
static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
bool write, bool execute, bool foreign)
{
/* by default, allow everything */
return true;
}
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
#endif /* __KERNEL__ */
#endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 9b6820d..405e7db 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -14,6 +14,15 @@
VM_PKEY_BIT3 | \
VM_PKEY_BIT4)

+static inline u16 pte_flags_to_pkey(unsigned long pte_flags)
+{
+ return ((pte_flags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0) |
+ ((pte_flags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0) |
+ ((pte_flags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0) |
+ ((pte_flags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0) |
+ ((pte_flags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0);
+}
+
#define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) | \
((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) | \
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 2dcb8a1..a11977f 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -285,9 +285,10 @@
#define DSISR_UNSUPP_MMU 0x00080000 /* Unsupported MMU config */
#define DSISR_SET_RC 0x00040000 /* Failed setting of R/C bits */
#define DSISR_PGDIRFAULT 0x00020000 /* Fault on page directory */
-#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
- DSISR_PAGEATTR_CONFLT | \
- DSISR_BADACCESS | \
+#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
+ DSISR_PAGEATTR_CONFLT | \
+ DSISR_BADACCESS | \
+ DSISR_KEYFAULT | \
DSISR_BIT43)
#define SPRN_TBRL 0x10C /* Time Base Read Lower Register (user, R/O) */
#define SPRN_TBRU 0x10D /* Time Base Read Upper Register (user, R/O) */
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 3a7d580..c31624f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -216,9 +216,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
* bits we are interested in. But there are some bits which
* indicate errors in DSISR but can validly be set in SRR1.
*/
- if (trap == 0x400)
+ if (trap == 0x400) {
error_code &= 0x48200000;
- else
+ flags |= FAULT_FLAG_INSTRUCTION;
+ } else
is_write = error_code & DSISR_ISSTORE;
#else
is_write = error_code & ESR_DST;
@@ -261,6 +262,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
}
#endif

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ if (error_code & DSISR_KEYFAULT) {
+ code = SEGV_PKUERR;
+ goto bad_area_nosemaphore;
+ }
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
/* We restore the interrupt state now */
if (!arch_irq_disabled_regs(regs))
local_irq_enable();
@@ -441,6 +449,15 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
#endif /* CONFIG_PPC_STD_MMU */

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+ flags & FAULT_FLAG_INSTRUCTION,
+ 0)) {
+ code = SEGV_PKUERR;
+ goto bad_area;
+ }
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 11a32b3..439241a 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -27,6 +27,37 @@ static inline bool pkey_allows_readwrite(int pkey)
return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
}

+static inline bool pkey_allows_read(int pkey)
+{
+ int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+
+ if (!(read_uamor() & (0x3ul << pkey_shift)))
+ return true;
+
+ return !(read_amr() & (AMR_AD_BIT << pkey_shift));
+}
+
+static inline bool pkey_allows_write(int pkey)
+{
+ int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+
+ if (!(read_uamor() & (0x3ul << pkey_shift)))
+ return true;
+
+ return !(read_amr() & (AMR_WD_BIT << pkey_shift));
+}
+
+static inline bool pkey_allows_execute(int pkey)
+{
+ int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+
+ if (!(read_uamor() & (0x3ul << pkey_shift)))
+ return true;
+
+ return !(read_iamr() & (IAMR_EX_BIT << pkey_shift));
+}
+
+
/*
* set the access right in AMR IAMR and UAMOR register
* for @pkey to that specified in @init_val.
@@ -175,3 +206,62 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot,
*/
return vma_pkey(vma);
}
+
+bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+ int pkey = pte_flags_to_pkey(pte_val(pte));
+
+ if (!pkey_allows_read(pkey))
+ return false;
+ if (write && !pkey_allows_write(pkey))
+ return false;
+ return true;
+}
+
+/*
+ * We only want to enforce protection keys on the current process
+ * because we effectively have no access to AMR/IAMR for other
+ * processes or any way to tell *which * AMR/IAMR in a threaded
+ * process we could use.
+ *
+ * So do not enforce things if the VMA is not from the current
+ * mm, or if we are in a kernel thread.
+ */
+static inline bool vma_is_foreign(struct vm_area_struct *vma)
+{
+ if (!current->mm)
+ return true;
+ /*
+ * if the VMA is from another process, then AMR/IAMR has no
+ * relevance and should not be enforced.
+ */
+ if (current->mm != vma->vm_mm)
+ return true;
+
+ return false;
+}
+
+bool arch_vma_access_permitted(struct vm_area_struct *vma,
+ bool write, bool execute, bool foreign)
+{
+ int pkey;
+ /* allow access if the VMA is not one from this process */
+ if (foreign || vma_is_foreign(vma))
+ return true;
+
+ pkey = vma_pkey(vma);
+
+ if (!pkey)
+ return true;
+
+ if (execute)
+ return pkey_allows_execute(pkey);
+
+ if (!pkey_allows_read(pkey))
+ return false;
+
+ if (write)
+ return pkey_allows_write(pkey);
+
+ return true;
+}
--
1.8.3.1

2017-06-17 03:53:35

by Ram Pai

Subject: [RFC v2 11/12] Documentation: Documentation updates.

The documentation file is moved from x86 into the generic area,
since this feature is now supported by more than one architecture.

Signed-off-by: Ram Pai <[email protected]>
---
Documentation/vm/protection-keys.txt | 110 ++++++++++++++++++++++++++++++++++
Documentation/x86/protection-keys.txt | 85 --------------------------
2 files changed, 110 insertions(+), 85 deletions(-)
create mode 100644 Documentation/vm/protection-keys.txt
delete mode 100644 Documentation/x86/protection-keys.txt

diff --git a/Documentation/vm/protection-keys.txt b/Documentation/vm/protection-keys.txt
new file mode 100644
index 0000000..b49e6bb
--- /dev/null
+++ b/Documentation/vm/protection-keys.txt
@@ -0,0 +1,110 @@
+Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
+found in new generations of Intel CPUs and on PowerPC CPUs.
+
+Memory Protection Keys provides a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables
+when an application changes protection domains.
+
+
+On Intel:
+
+It works by dedicating 4 previously ignored bits in each page table
+entry to a "protection key", giving 16 possible keys.
+
+There is also a new user-accessible register (PKRU) with two separate
+bits (Access Disable and Write Disable) for each key. Being a CPU
+register, PKRU is inherently thread-local, potentially giving each
+thread a different set of protections from every other thread.
+
+There are two new instructions (RDPKRU/WRPKRU) for reading and writing
+to the new register. The feature is only available in 64-bit mode,
+even though there is theoretically space in the PAE PTEs. These
+permissions are enforced on data access only and have no effect on
+instruction fetches.
+
+
+On PowerPC:
+
+It works by dedicating 5 bits in each page table entry to a "protection key",
+giving 32 possible keys.
+
+There is a user-accessible register (AMR) with two separate bits
+(Access Disable and Write Disable) for each key. Being a CPU
+register, AMR is inherently thread-local, potentially giving each
+thread a different set of protections from every other thread.
+NOTE: Disabling read permission does not disable
+write and vice-versa.
+
+The feature is available on 64-bit HPTE mode only.
+
+'mfspr mem, 0xd' reads the AMR register.
+'mtspr 0xd, mem' writes into the AMR register.
+
+Permissions are enforced on data access only and have no effect on
+instruction fetches.
+
+=========================== Syscalls ===========================
+
+There are 3 system calls which directly interact with pkeys:
+
+ int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
+ int pkey_free(int pkey);
+ int pkey_mprotect(unsigned long start, size_t len,
+ unsigned long prot, int pkey);
+
+Before a pkey can be used, it must first be allocated with
+pkey_alloc(). An application calls the WRPKRU instruction
+directly in order to change access permissions to memory covered
+with a key. In this example WRPKRU is wrapped by a C function
+called pkey_set().
+
+ int real_prot = PROT_READ|PROT_WRITE;
+ pkey = pkey_alloc(0, PKEY_DENY_WRITE);
+ ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+ ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
+ ... application runs here
+
+Now, if the application needs to update the data at 'ptr', it can
+gain access, do the update, then remove its write access:
+
+ pkey_set(pkey, 0); // clear PKEY_DENY_WRITE
+ *ptr = foo; // assign something
+ pkey_set(pkey, PKEY_DENY_WRITE); // set PKEY_DENY_WRITE again
+
+Now when it frees the memory, it will also free the pkey since it
+is no longer in use:
+
+ munmap(ptr, PAGE_SIZE);
+ pkey_free(pkey);
+
+(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
+ An example implementation can be found in
+ tools/testing/selftests/x86/protection_keys.c)
+
+=========================== Behavior ===========================
+
+The kernel attempts to make protection keys consistent with the
+behavior of a plain mprotect(). For instance if you do this:
+
+ mprotect(ptr, size, PROT_NONE);
+ something(ptr);
+
+you can expect the same effects with protection keys when doing this:
+
+ pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ);
+ pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
+ something(ptr);
+
+That should be true whether something() is a direct access to 'ptr'
+like:
+
+ *ptr = foo;
+
+or when the kernel does the access on the application's behalf like
+with a read():
+
+ read(fd, ptr, 1);
+
+The kernel will send a SIGSEGV in both cases, but si_code will be set
+to SEGV_PKUERR when violating protection keys versus SEGV_ACCERR when
+the plain mprotect() permissions are violated.
diff --git a/Documentation/x86/protection-keys.txt b/Documentation/x86/protection-keys.txt
deleted file mode 100644
index b643045..0000000
--- a/Documentation/x86/protection-keys.txt
+++ /dev/null
@@ -1,85 +0,0 @@
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
-which will be found on future Intel CPUs.
-
-Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains. It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
-
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key. Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
-
-There are two new instructions (RDPKRU/WRPKRU) for reading and writing
-to the new register. The feature is only available in 64-bit mode,
-even though there is theoretically space in the PAE PTEs. These
-permissions are enforced on data access only and have no effect on
-instruction fetches.
-
-=========================== Syscalls ===========================
-
-There are 3 system calls which directly interact with pkeys:
-
- int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
- int pkey_free(int pkey);
- int pkey_mprotect(unsigned long start, size_t len,
- unsigned long prot, int pkey);
-
-Before a pkey can be used, it must first be allocated with
-pkey_alloc(). An application calls the WRPKRU instruction
-directly in order to change access permissions to memory covered
-with a key. In this example WRPKRU is wrapped by a C function
-called pkey_set().
-
- int real_prot = PROT_READ|PROT_WRITE;
- pkey = pkey_alloc(0, PKEY_DENY_WRITE);
- ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
- ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
- ... application runs here
-
-Now, if the application needs to update the data at 'ptr', it can
-gain access, do the update, then remove its write access:
-
- pkey_set(pkey, 0); // clear PKEY_DENY_WRITE
- *ptr = foo; // assign something
- pkey_set(pkey, PKEY_DENY_WRITE); // set PKEY_DENY_WRITE again
-
-Now when it frees the memory, it will also free the pkey since it
-is no longer in use:
-
- munmap(ptr, PAGE_SIZE);
- pkey_free(pkey);
-
-(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
- An example implementation can be found in
- tools/testing/selftests/x86/protection_keys.c)
-
-=========================== Behavior ===========================
-
-The kernel attempts to make protection keys consistent with the
-behavior of a plain mprotect(). For instance if you do this:
-
- mprotect(ptr, size, PROT_NONE);
- something(ptr);
-
-you can expect the same effects with protection keys when doing this:
-
- pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ);
- pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
- something(ptr);
-
-That should be true whether something() is a direct access to 'ptr'
-like:
-
- *ptr = foo;
-
-or when the kernel does the access on the application's behalf like
-with a read():
-
- read(fd, ptr, 1);
-
-The kernel will send a SIGSEGV in both cases, but si_code will be set
-to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
-the plain mprotect() permissions are violated.
--
1.8.3.1

2017-06-17 03:53:14

by Ram Pai

Subject: [RFC v2 05/12] powerpc: Implementation for sys_mprotect_pkey() system call.

This system call associates the pkey with the PTEs of all
pages corresponding to the given address range.
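
A minimal userspace sketch of invoking the new call directly, using
the syscall number this patch assigns on powerpc (no libc wrapper
exists yet, so the wrapper below is local to the example):

    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef __NR_pkey_mprotect
    #define __NR_pkey_mprotect 386          /* powerpc, per this patch */
    #endif

    static int pkey_mprotect(void *addr, size_t len, int prot, int pkey)
    {
            return syscall(__NR_pkey_mprotect, addr, len, prot, pkey);
    }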

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/include/asm/book3s/64/pgtable.h | 22 ++++++-
arch/powerpc/include/asm/mman.h | 29 +++++----
arch/powerpc/include/asm/pkeys.h | 21 ++++++-
arch/powerpc/include/asm/systbl.h | 1 +
arch/powerpc/include/asm/unistd.h | 4 +-
arch/powerpc/include/uapi/asm/unistd.h | 1 +
arch/powerpc/mm/pkeys.c | 93 +++++++++++++++++++++++++++-
include/linux/mm.h | 1 +
8 files changed, 154 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 87e9a89..bc845cd 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -37,6 +37,7 @@
#define _RPAGE_RSV2 0x0800000000000000UL
#define _RPAGE_RSV3 0x0400000000000000UL
#define _RPAGE_RSV4 0x0200000000000000UL
+#define _RPAGE_RSV5 0x00040UL

#define _PAGE_PTE 0x4000000000000000UL /* distinguishes PTEs from pointers */
#define _PAGE_PRESENT 0x8000000000000000UL /* pte contains a translation */
@@ -56,6 +57,20 @@
/* Max physical address bit as per radix table */
#define _RPAGE_PA_MAX 57

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+#define H_PAGE_PKEY_BIT0 _RPAGE_RSV1
+#define H_PAGE_PKEY_BIT1 _RPAGE_RSV2
+#define H_PAGE_PKEY_BIT2 _RPAGE_RSV3
+#define H_PAGE_PKEY_BIT3 _RPAGE_RSV4
+#define H_PAGE_PKEY_BIT4 _RPAGE_RSV5
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+#define H_PAGE_PKEY_BIT0 0
+#define H_PAGE_PKEY_BIT1 0
+#define H_PAGE_PKEY_BIT2 0
+#define H_PAGE_PKEY_BIT3 0
+#define H_PAGE_PKEY_BIT4 0
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
/*
* Max physical address bit we will use for now.
*
@@ -122,7 +137,12 @@
#define PAGE_PROT_BITS (_PAGE_SAO | _PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT | \
H_PAGE_4K_PFN | _PAGE_PRIVILEGED | _PAGE_ACCESSED | \
_PAGE_READ | _PAGE_WRITE | _PAGE_DIRTY | _PAGE_EXEC | \
- _PAGE_SOFT_DIRTY)
+ _PAGE_SOFT_DIRTY | \
+ H_PAGE_PKEY_BIT0 | \
+ H_PAGE_PKEY_BIT1 | \
+ H_PAGE_PKEY_BIT2 | \
+ H_PAGE_PKEY_BIT3 | \
+ H_PAGE_PKEY_BIT4)
/*
* We define 2 sets of base prot bits, one for basic pages (ie,
* cacheable kernel and user pages) and one for non cacheable
diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 30922f6..14cc1aa 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -13,24 +13,31 @@

#include <asm/cputable.h>
#include <linux/mm.h>
+#include <linux/pkeys.h>
#include <asm/cpu_has_feature.h>

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+
/*
* This file is included by linux/mman.h, so we can't use cacl_vm_prot_bits()
* here. How important is the optimization?
*/
-static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
- unsigned long pkey)
-{
- return (prot & PROT_SAO) ? VM_SAO : 0;
-}
-#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
+#define arch_calc_vm_prot_bits(prot, key) ( \
+ ((prot) & PROT_SAO ? VM_SAO : 0) | \
+ pkey_to_vmflag_bits(key))
+#define arch_vm_get_page_prot(vm_flags) __pgprot( \
+ ((vm_flags) & VM_SAO ? _PAGE_SAO : 0) | \
+ vmflag_to_page_pkey_bits(vm_flags))
+
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+#define arch_calc_vm_prot_bits(prot, key) ( \
+ ((prot) & PROT_SAO ? VM_SAO : 0))
+#define arch_vm_get_page_prot(vm_flags) __pgprot( \
+ ((vm_flags) & VM_SAO ? _PAGE_SAO : 0))
+
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */

-static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
-{
- return (vm_flags & VM_SAO) ? __pgprot(_PAGE_SAO) : __pgprot(0);
-}
-#define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)

static inline bool arch_validate_prot(unsigned long prot)
{
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 7bc8746..0f3dca8 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -14,6 +14,19 @@
VM_PKEY_BIT3 | \
VM_PKEY_BIT4)

+#define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
+ ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) | \
+ ((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) | \
+ ((key & 0x8UL) ? VM_PKEY_BIT3 : 0x0UL) | \
+ ((key & 0x10UL) ? VM_PKEY_BIT4 : 0x0UL))
+
+#define vmflag_to_page_pkey_bits(vm_flags) \
+ (((vm_flags & VM_PKEY_BIT0) ? H_PAGE_PKEY_BIT4 : 0x0UL)| \
+ ((vm_flags & VM_PKEY_BIT1) ? H_PAGE_PKEY_BIT3 : 0x0UL) | \
+ ((vm_flags & VM_PKEY_BIT2) ? H_PAGE_PKEY_BIT2 : 0x0UL) | \
+ ((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) | \
+ ((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL))
+
/*
* Bits are in BE format.
* NOTE: key 31, 1, 0 are not used.
@@ -42,6 +55,12 @@
#define mm_set_pkey_is_reserved(mm, pkey) (PKEY_INITIAL_ALLOCAION & \
pkeybit_mask(pkey))

+
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+ return (vma->vm_flags & ARCH_VM_PKEY_FLAGS) >> VM_PKEY_SHIFT;
+}
+
static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
{
/* a reserved key is never considered as 'explicitly allocated' */
@@ -114,7 +133,7 @@ static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
return __arch_set_user_pkey_access(tsk, pkey, init_val);
}

-static inline pkey_mm_init(struct mm_struct *mm)
+static inline void pkey_mm_init(struct mm_struct *mm)
{
mm_pkey_allocation_map(mm) = PKEY_INITIAL_ALLOCAION;
/* -1 means unallocated or invalid */
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 22dd776..b33b551 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -390,3 +390,4 @@
SYSCALL(statx)
SYSCALL(pkey_alloc)
SYSCALL(pkey_free)
+SYSCALL(pkey_mprotect)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index e0273bc..daf1ba9 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,12 +12,10 @@
#include <uapi/asm/unistd.h>


-#define NR_syscalls 386
+#define NR_syscalls 387

#define __NR__exit __NR_exit

-#define __IGNORE_pkey_mprotect
-
#ifndef __ASSEMBLY__

#include <linux/types.h>
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index 7993a07..71ae45e 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -396,5 +396,6 @@
#define __NR_statx 383
#define __NR_pkey_alloc 384
#define __NR_pkey_free 385
+#define __NR_pkey_mprotect 386

#endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index b97366e..11a32b3 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -15,6 +15,17 @@
#include <linux/pkeys.h> /* PKEY_* */
#include <uapi/asm-generic/mman-common.h>

+#define pkeyshift(pkey) ((arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY)
+
+static inline bool pkey_allows_readwrite(int pkey)
+{
+ int pkey_shift = pkeyshift(pkey);
+
+ if (!(read_uamor() & (0x3UL << pkey_shift)))
+ return true;
+
+ return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
+}

/*
* set the access right in AMR IAMR and UAMOR register
@@ -68,7 +79,60 @@ int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,

int __execute_only_pkey(struct mm_struct *mm)
{
- return -1;
+ bool need_to_set_mm_pkey = false;
+ int execute_only_pkey = mm->context.execute_only_pkey;
+ int ret;
+
+ /* Do we need to assign a pkey for mm's execute-only maps? */
+ if (execute_only_pkey == -1) {
+ /* Go allocate one to use, which might fail */
+ execute_only_pkey = mm_pkey_alloc(mm);
+ if (execute_only_pkey < 0)
+ return -1;
+ need_to_set_mm_pkey = true;
+ }
+
+ /*
+ * We do not want to go through the relatively costly
+ * dance to set AMR if we do not need to. Check it
+ * first and assume that if the execute-only pkey is
+ * readwrite-disabled than we do not have to set it
+ * ourselves.
+ */
+ if (!need_to_set_mm_pkey &&
+ !pkey_allows_readwrite(execute_only_pkey))
+ return execute_only_pkey;
+
+ /*
+ * Set up AMR so that it denies access for everything
+ * other than execution.
+ */
+ ret = __arch_set_user_pkey_access(current, execute_only_pkey,
+ (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
+ /*
+ * If the AMR-set operation failed somehow, just return
+ * 0 and effectively disable execute-only support.
+ */
+ if (ret) {
+ mm_set_pkey_free(mm, execute_only_pkey);
+ return -1;
+ }
+
+ /* We got one, store it and use it from here on out */
+ if (need_to_set_mm_pkey)
+ mm->context.execute_only_pkey = execute_only_pkey;
+ return execute_only_pkey;
+}
+
+static inline bool vma_is_pkey_exec_only(struct vm_area_struct *vma)
+{
+ /* Do this check first since the vm_flags should be hot */
+ if ((vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) != VM_EXEC)
+ return false;
+ if (vma_pkey(vma) != vma->vm_mm->context.execute_only_pkey)
+ return false;
+
+ return true;
}

/*
@@ -84,5 +148,30 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot,
if (pkey != -1)
return pkey;

- return 0;
+ /*
+ * Look for a protection-key-drive execute-only mapping
+ * which is now being given permissions that are not
+ * execute-only. Move it back to the default pkey.
+ */
+ if (vma_is_pkey_exec_only(vma) &&
+ (prot & (PROT_READ|PROT_WRITE))) {
+ return 0;
+ }
+ /*
+ * The mapping is execute-only. Go try to get the
+ * execute-only protection key. If we fail to do that,
+ * fall through as if we do not have execute-only
+ * support.
+ */
+ if (prot == PROT_EXEC) {
+ pkey = execute_only_pkey(vma->vm_mm);
+ if (pkey > 0)
+ return pkey;
+ }
+ /*
+ * This is a vanilla, non-pkey mprotect (or we failed to
+ * setup execute-only), inherit the pkey from the VMA we
+ * are working on.
+ */
+ return vma_pkey(vma);
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 34ddac7..5399031 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -227,6 +227,7 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
#define VM_PKEY_BIT3 VM_HIGH_ARCH_3
#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
#elif defined(CONFIG_PPC)
+#define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 5-bit value */
#define VM_PKEY_BIT1 VM_HIGH_ARCH_1
#define VM_PKEY_BIT2 VM_HIGH_ARCH_2
--
1.8.3.1

2017-06-17 03:53:26

by Ram Pai

Subject: [RFC v2 10/12] powerpc: Read AMR only if pkey-violation caused the exception.
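
In C terms, the change to the data- and instruction-access common
paths below is roughly the following (a sketch; it assumes the paca
field introduced earlier in this series is named 'amr'):

    /* only pay for the SPR read when the fault is a key fault */
    if (dsisr & DSISR_KEYFAULT)
            get_paca()->amr = mfspr(SPRN_AMR);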

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/kernel/exceptions-64s.S | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 8db9ef8..a4de1b4 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -493,13 +493,15 @@ EXC_COMMON_BEGIN(data_access_common)
ld r12,_MSR(r1)
ld r3,PACA_EXGEN+EX_DAR(r13)
lwz r4,PACA_EXGEN+EX_DSISR(r13)
+ std r3,_DAR(r1)
+ std r4,_DSISR(r1)
#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ andis. r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
+ beq+ 1f
mfspr r5,SPRN_AMR
std r5,PACA_AMR(r13)
#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
- li r5,0x300
- std r3,_DAR(r1)
- std r4,_DSISR(r1)
+1: li r5,0x300
BEGIN_MMU_FTR_SECTION
b do_hash_page /* Try to handle as hpte fault */
MMU_FTR_SECTION_ELSE
@@ -565,13 +567,15 @@ EXC_COMMON_BEGIN(instruction_access_common)
ld r12,_MSR(r1)
ld r3,_NIP(r1)
andis. r4,r12,0x5820
+ std r3,_DAR(r1)
+ std r4,_DSISR(r1)
#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ andis. r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
+ beq+ 1f
mfspr r5,SPRN_AMR
std r5,PACA_AMR(r13)
#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
- li r5,0x400
- std r3,_DAR(r1)
- std r4,_DSISR(r1)
+1: li r5,0x400
BEGIN_MMU_FTR_SECTION
b do_hash_page /* Try to handle as hpte fault */
MMU_FTR_SECTION_ELSE
--
1.8.3.1

2017-06-17 03:53:41

by Ram Pai

Subject: [RFC v2 12/12] selftest: Updated protection key selftest

Added test support for the PowerPC implementation of protection keys.

Signed-off-by: Ram Pai <[email protected]>
---
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/pkey-helpers.h | 365 +++++++
tools/testing/selftests/vm/protection_keys.c | 1451 +++++++++++++++++++++++++
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/pkey-helpers.h | 219 ----
tools/testing/selftests/x86/protection_keys.c | 1395 ------------------------
6 files changed, 1818 insertions(+), 1615 deletions(-)
create mode 100644 tools/testing/selftests/vm/pkey-helpers.h
create mode 100644 tools/testing/selftests/vm/protection_keys.c
delete mode 100644 tools/testing/selftests/x86/pkey-helpers.h
delete mode 100644 tools/testing/selftests/x86/protection_keys.c

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index cbb29e4..1d32f78 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -17,6 +17,7 @@ TEST_GEN_FILES += transhuge-stress
TEST_GEN_FILES += userfaultfd
TEST_GEN_FILES += mlock-random-test
TEST_GEN_FILES += virtual_address_range
+TEST_GEN_FILES += protection_keys

TEST_PROGS := run_vmtests

diff --git a/tools/testing/selftests/vm/pkey-helpers.h b/tools/testing/selftests/vm/pkey-helpers.h
new file mode 100644
index 0000000..5fec0a2
--- /dev/null
+++ b/tools/testing/selftests/vm/pkey-helpers.h
@@ -0,0 +1,365 @@
+#ifndef _PKEYS_HELPER_H
+#define _PKEYS_HELPER_H
+#define _GNU_SOURCE
+#include <string.h>
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <assert.h>
+#include <stdlib.h>
+#include <ucontext.h>
+#include <sys/mman.h>
+
+/* Define some kernel-like types */
+#define u8 uint8_t
+#define u16 uint16_t
+#define u32 uint32_t
+#define u64 uint64_t
+
+#ifdef __i386__ /* arch */
+
+#define SYS_mprotect_key 380
+#define SYS_pkey_alloc 381
+#define SYS_pkey_free 382
+#define REG_IP_IDX REG_EIP
+#define si_pkey_offset 0x14
+
+#define NR_PKEYS 16
+#define NR_RESERVED_PKEYS 1
+#define PKRU_BITS_PER_PKEY 2
+#define PKEY_DISABLE_ACCESS 0x1
+#define PKEY_DISABLE_WRITE 0x2
+#define HPAGE_SIZE (1UL<<21)
+
+#define INIT_PRKU 0x0UL
+
+#elif __powerpc64__ /* arch */
+
+#define SYS_mprotect_key 386
+#define SYS_pkey_alloc 384
+#define SYS_pkey_free 385
+#define si_pkey_offset 0x20
+#define REG_IP_IDX PT_NIP
+#define REG_TRAPNO PT_TRAP
+#define REG_AMR 45
+#define gregs gp_regs
+#define fpregs fp_regs
+
+#define NR_PKEYS 32
+#define NR_RESERVED_PKEYS 3
+#define PKRU_BITS_PER_PKEY 2
+#define PKEY_DISABLE_ACCESS 0x3 /* disable read and write */
+#define PKEY_DISABLE_WRITE 0x2
+#define HPAGE_SIZE (1UL<<24)
+
+#define INIT_PRKU 0x3UL
+#else /* arch */
+
+ NOT SUPPORTED
+
+#endif /* arch */
+
+
+#ifndef DEBUG_LEVEL
+#define DEBUG_LEVEL 0
+#endif
+#define DPRINT_IN_SIGNAL_BUF_SIZE 4096
+
+
+static inline u32 pkey_to_shift(int pkey)
+{
+#ifdef __i386__
+ return pkey * PKRU_BITS_PER_PKEY;
+#elif __powerpc64__
+ return (NR_PKEYS - pkey - 1) * PKRU_BITS_PER_PKEY;
+#endif
+}
+
+
+extern int dprint_in_signal;
+extern char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE];
+static inline void sigsafe_printf(const char *format, ...)
+{
+ va_list ap;
+
+ va_start(ap, format);
+ if (!dprint_in_signal) {
+ vprintf(format, ap);
+ } else {
+ int len = vsnprintf(dprint_in_signal_buffer,
+ DPRINT_IN_SIGNAL_BUF_SIZE,
+ format, ap);
+ /*
+ * len is amount that would have been printed,
+ * but actual write is truncated at BUF_SIZE.
+ */
+ if (len > DPRINT_IN_SIGNAL_BUF_SIZE)
+ len = DPRINT_IN_SIGNAL_BUF_SIZE;
+ write(1, dprint_in_signal_buffer, len);
+ }
+ va_end(ap);
+}
+#define dprintf_level(level, args...) do { \
+ if (level <= DEBUG_LEVEL) \
+ sigsafe_printf(args); \
+ fflush(NULL); \
+} while (0)
+#define dprintf0(args...) dprintf_level(0, args)
+#define dprintf1(args...) dprintf_level(1, args)
+#define dprintf2(args...) dprintf_level(2, args)
+#define dprintf3(args...) dprintf_level(3, args)
+#define dprintf4(args...) dprintf_level(4, args)
+
+extern u64 shadow_pkey_reg;
+
+static inline u64 __rdpkey_reg(void)
+{
+#ifdef __i386__
+ unsigned int eax, edx;
+ unsigned int ecx = 0;
+ unsigned int pkey_reg;
+
+ asm volatile(".byte 0x0f,0x01,0xee\n\t"
+ : "=a" (eax), "=d" (edx)
+ : "c" (ecx));
+#elif __powerpc64__
+ u64 eax;
+ u64 pkey_reg;
+
+ asm volatile("mfspr %0, 0xd" : "=r" ((u64)(eax)));
+#endif
+ pkey_reg = (u64)eax;
+ return pkey_reg;
+}
+
+static inline u64 _rdpkey_reg(int line)
+{
+ u64 pkey_reg = __rdpkey_reg();
+
+ dprintf4("rdpkey_reg(line=%d) pkey_reg: %lx shadow: %lx\n",
+ line, pkey_reg, shadow_pkey_reg);
+ assert(pkey_reg == shadow_pkey_reg);
+
+ return pkey_reg;
+}
+
+#define rdpkey_reg() _rdpkey_reg(__LINE__)
+
+static inline void __wrpkey_reg(u64 pkey_reg)
+{
+#ifdef __i386__
+ unsigned int eax = pkey_reg;
+ unsigned int ecx = 0;
+ unsigned int edx = 0;
+
+ dprintf4("%s() changing %lx to %lx\n",
+ __func__, __rdpkey_reg(), pkey_reg);
+ asm volatile(".byte 0x0f,0x01,0xef\n\t"
+ : : "a" (eax), "c" (ecx), "d" (edx));
+ dprintf4("%s() PKRUP after changing %lx to %lx\n",
+ __func__, __rdpkey_reg(), pkey_reg);
+#else
+ u64 eax = pkey_reg;
+
+ dprintf4("%s() changing %llx to %llx\n",
+ __func__, __rdpkey_reg(), pkey_reg);
+ asm volatile("mtspr 0xd, %0" : : "r" ((unsigned long)(eax)) : "memory");
+ dprintf4("%s() PKRUP after changing %llx to %llx\n",
+ __func__, __rdpkey_reg(), pkey_reg);
+#endif
+ assert(pkey_reg == __rdpkey_reg());
+}
+
+static inline void wrpkey_reg(u64 pkey_reg)
+{
+ dprintf4("%s() changing %lx to %lx\n",
+ __func__, __rdpkey_reg(), pkey_reg);
+ /* will do the shadow check for us: */
+ rdpkey_reg();
+ __wrpkey_reg(pkey_reg);
+ shadow_pkey_reg = pkey_reg;
+ dprintf4("%s(%lx) pkey_reg: %lx\n",
+ __func__, pkey_reg, __rdpkey_reg());
+}
+
+/*
+ * These are technically racy. since something could
+ * change PKRU between the read and the write.
+ */
+static inline void __pkey_access_allow(int pkey, int do_allow)
+{
+ u64 pkey_reg = rdpkey_reg();
+ int bit = pkey * 2;
+
+ if (do_allow)
+ pkey_reg &= (1<<bit);
+ else
+ pkey_reg |= (1<<bit);
+
+ dprintf4("pkey_reg now: %lx\n", rdpkey_reg());
+ wrpkey_reg(pkey_reg);
+}
+
+static inline void __pkey_write_allow(int pkey, int do_allow_write)
+{
+ u64 pkey_reg = rdpkey_reg();
+ int bit = pkey * 2 + 1;
+
+ if (do_allow_write)
+ pkey_reg &= (1<<bit);
+ else
+ pkey_reg |= (1<<bit);
+
+ wrpkey_reg(pkey_reg);
+ dprintf4("pkey_reg now: %lx\n", rdpkey_reg());
+}
+
+#define MB (1<<20)
+
+#ifdef __i386__
+
+#define PAGE_SIZE 4096
+static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
+ unsigned int *ecx, unsigned int *edx)
+{
+ /* ecx is often an input as well as an output. */
+ asm volatile(
+ "cpuid;"
+ : "=a" (*eax),
+ "=b" (*ebx),
+ "=c" (*ecx),
+ "=d" (*edx)
+ : "0" (*eax), "2" (*ecx));
+}
+
+/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx) */
+#define X86_FEATURE_PKU (1<<3) /* Protection Keys for Userspace */
+#define X86_FEATURE_OSPKE (1<<4) /* OS Protection Keys Enable */
+
+static inline int cpu_has_pkey(void)
+{
+ unsigned int eax;
+ unsigned int ebx;
+ unsigned int ecx;
+ unsigned int edx;
+
+ eax = 0x7;
+ ecx = 0x0;
+ __cpuid(&eax, &ebx, &ecx, &edx);
+
+ if (!(ecx & X86_FEATURE_PKU)) {
+ dprintf2("cpu does not have PKU\n");
+ return 0;
+ }
+ if (!(ecx & X86_FEATURE_OSPKE)) {
+ dprintf2("cpu does not have OSPKE\n");
+ return 0;
+ }
+ return 1;
+}
+
+#define XSTATE_PKRU_BIT (9)
+#define XSTATE_PKRU 0x200
+int pkru_xstate_offset(void)
+{
+ unsigned int eax;
+ unsigned int ebx;
+ unsigned int ecx;
+ unsigned int edx;
+ int xstate_offset;
+ int xstate_size;
+ unsigned long XSTATE_CPUID = 0xd;
+ int leaf;
+
+ /* assume that XSTATE_PKRU is set in XCR0 */
+ leaf = XSTATE_PKRU_BIT;
+ {
+ eax = XSTATE_CPUID;
+ ecx = leaf;
+ __cpuid(&eax, &ebx, &ecx, &edx);
+
+ if (leaf == XSTATE_PKRU_BIT) {
+ xstate_offset = ebx;
+ xstate_size = eax;
+ }
+ }
+
+ if (xstate_size == 0) {
+ printf("could not find size/offset of PKRU in xsave state\n");
+ return 0;
+ }
+
+ return xstate_offset;
+}
+
+/* 8 bytes per instruction * 512 instructions = 4K = 1 page */
+#define __page_o_noops() asm(".rept 512 ; nopl 0x7eeeeeee(%eax) ; .endr")
+
+#elif __powerpc64__
+
+#define PAGE_SIZE (0x1UL << 16)
+static inline int cpu_has_pkey(void)
+{
+ return 1;
+}
+
+/* 4 bytes per instruction * 16384 instructions = 64K = 1 page */
+#define __page_o_noops() asm(".rept 16384 ; nop; .endr")
+
+#endif
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+#define ALIGN_UP(x, align_to) (((x) + ((align_to)-1)) & ~((align_to)-1))
+#define ALIGN_DOWN(x, align_to) ((x) & ~((align_to)-1))
+#define ALIGN_PTR_UP(p, ptr_align_to) \
+ ((typeof(p))ALIGN_UP((unsigned long)(p), ptr_align_to))
+#define ALIGN_PTR_DOWN(p, ptr_align_to) \
+ ((typeof(p))ALIGN_DOWN((unsigned long)(p), ptr_align_to))
+#define __stringify_1(x...) #x
+#define __stringify(x...) __stringify_1(x)
+
+#define PTR_ERR_ENOTSUP ((void *)-ENOTSUP)
+
+extern void abort_hooks(void);
+#define pkey_assert(condition) do { \
+ if (!(condition)) { \
+ dprintf0("assert() at %s::%d test_nr: %d iteration: %d\n", \
+ __FILE__, __LINE__, \
+ test_nr, iteration_nr); \
+ dprintf0("errno at assert: %d", errno); \
+ abort_hooks(); \
+ assert(condition); \
+ } \
+} while (0)
+#define raw_assert(cond) assert(cond)
+
+
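+/*
+ * The default huge page size differs by architecture (2M on x86,
+ * 16M on powerpc), hence the different sysfs paths below.
+ */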
+static inline int open_hugepage_file(int flag)
+{
+ int fd;
+#ifdef __i386__
+ fd = open("/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages",
+ O_RDONLY);
+#elif __powerpc64__
+ fd = open("/sys/kernel/mm/hugepages/hugepages-16384kB/nr_hugepages",
+ O_RDONLY);
+#else
+#error "architecture not supported"
+#endif
+ return fd;
+}
+
+static inline int get_start_key(void)
+{
+#ifdef __i386__
+ return 1;
+#elif __powerpc64__
+ return 0;
+#else
+#error "architecture not supported"
+#endif
+}
+
+#endif /* _PKEYS_HELPER_H */
diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c
new file mode 100644
index 0000000..26c5e5a
--- /dev/null
+++ b/tools/testing/selftests/vm/protection_keys.c
@@ -0,0 +1,1451 @@
+/*
+ * Tests Memory Protection Keys (see Documentation/vm/protection-keys.txt)
+ *
+ * There are examples in here of:
+ * * how to set protection keys on memory
+ * * how to set/clear bits in PKRU (the rights register)
+ * * how to handle SEGV_PKRU signals and extract pkey-relevant
+ * information from the siginfo
+ *
+ * Things to add:
+ * make sure KSM and KSM COW breaking works
+ * prefault pages in at malloc, or not
+ * protect MPX bounds tables with protection keys?
+ * make sure VMA splitting/merging is working correctly
+ * OOMs can destroy mm->mmap (see exit_mmap()),
+ * so make sure it is immune to pkeys
+ * look for pkey "leaks" where it is still set on a VMA
+ * but "freed" back to the kernel
+ * do a plain mprotect() to a mprotect_pkey() area and make
+ * sure the pkey sticks
+ *
+ * Compile like this:
+ * gcc -o protection_keys -O2 -g -std=gnu99
+ * -pthread -Wall protection_keys.c -lrt -ldl -lm
+ * gcc -m32 -o protection_keys_32 -O2 -g -std=gnu99
+ * -pthread -Wall protection_keys.c -lrt -ldl -lm
+ */
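+/*
+ * A minimal sketch of the flow the tests below exercise (the syscall
+ * numbers are assumed to come from pkey-helpers.h):
+ *
+ *	int pkey = syscall(SYS_pkey_alloc, 0, 0);
+ *	syscall(SYS_mprotect_key, ptr, size, PROT_READ|PROT_WRITE, pkey);
+ *	// ...flip the per-key access/write-disable bits, touch memory...
+ *	syscall(SYS_pkey_free, pkey);
+ */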
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/futex.h>
+#include <time.h>
+#include <sys/time.h>
+#include <sys/syscall.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <assert.h>
+#include <stdlib.h>
+#include <ucontext.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+#include <setjmp.h>
+
+#include "pkey-helpers.h"
+
+int iteration_nr = 1;
+int test_nr;
+u64 shadow_pkey_reg;
+
+int dprint_in_signal;
+char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE];
+
+void cat_into_file(char *str, char *file)
+{
+ int fd = open(file, O_RDWR);
+ int ret;
+
+ dprintf2("%s(): writing '%s' to '%s'\n", __func__, str, file);
+ /*
+ * these need to be raw because they are called under
+ * pkey_assert()
+ */
+ raw_assert(fd >= 0);
+ ret = write(fd, str, strlen(str));
+ if (ret != strlen(str)) {
+ perror("write to file failed");
+ fprintf(stderr, "filename: '%s' str: '%s'\n", file, str);
+ raw_assert(0);
+ }
+ close(fd);
+}
+
+#if CONTROL_TRACING > 0
+static int warned_tracing;
+int tracing_root_ok(void)
+{
+ if (geteuid() != 0) {
+ if (!warned_tracing)
+ fprintf(stderr, "WARNING: not run as root, "
+ "can not do tracing control\n");
+ warned_tracing = 1;
+ return 0;
+ }
+ return 1;
+}
+#endif
+
+void tracing_on(void)
+{
+#if CONTROL_TRACING > 0
+#define TRACEDIR "/sys/kernel/debug/tracing"
+ char pidstr[32];
+
+ if (!tracing_root_ok())
+ return;
+
+ sprintf(pidstr, "%d", getpid());
+ cat_into_file("0", TRACEDIR "/tracing_on");
+ cat_into_file("\n", TRACEDIR "/trace");
+ if (1) {
+ cat_into_file("function_graph", TRACEDIR "/current_tracer");
+ cat_into_file("1", TRACEDIR "/options/funcgraph-proc");
+ } else {
+ cat_into_file("nop", TRACEDIR "/current_tracer");
+ }
+ cat_into_file(pidstr, TRACEDIR "/set_ftrace_pid");
+ cat_into_file("1", TRACEDIR "/tracing_on");
+ dprintf1("enabled tracing\n");
+#endif
+}
+
+void tracing_off(void)
+{
+#if CONTROL_TRACING > 0
+ if (!tracing_root_ok())
+ return;
+ cat_into_file("0", "/sys/kernel/debug/tracing/tracing_on");
+#endif
+}
+
+void abort_hooks(void)
+{
+ fprintf(stderr, "running %s()...\n", __func__);
+ tracing_off();
+#ifdef SLEEP_ON_ABORT
+ sleep(SLEEP_ON_ABORT);
+#endif
+}
+
+
+/*
+ * This attempts to have roughly a page of instructions followed by a few
+ * instructions that do a write, and another page of instructions. That
+ * way, we are pretty sure that the write is in the second page of
+ * instructions and has at least a page of padding behind it.
+ *
+ * *That* lets us be sure to madvise() away the write instruction, which
+ * will then fault, which makes sure that the fault code handles
+ * execute-only memory properly.
+ */
+__attribute__((__aligned__(PAGE_SIZE)))
+void lots_o_noops_around_write(int *write_to_me)
+{
+ dprintf3("running %s()\n", __func__);
+ __page_o_noops();
+ /* Assume this happens in the second page of instructions: */
+ *write_to_me = __LINE__;
+ /* pad out by another page: */
+ __page_o_noops();
+ dprintf3("%s() done\n", __func__);
+}
+
+void dump_mem(void *dumpme, int len_bytes)
+{
+ char *c = (void *)dumpme;
+ int i;
+
+ for (i = 0; i < len_bytes; i += sizeof(u64)) {
+ u64 *ptr = (u64 *)(c + i);
+
+ dprintf1("dump[%03d][@%p]: %016jx\n", i, ptr, *ptr);
+ }
+}
+
+#define __SI_FAULT (3 << 16)
+#define SEGV_BNDERR (__SI_FAULT|3) /* failed address bound checks */
+#define SEGV_PKUERR (__SI_FAULT|4)
+
+static char *si_code_str(int si_code)
+{
+ if (si_code & SEGV_MAPERR)
+ return "SEGV_MAPERR";
+ if (si_code & SEGV_ACCERR)
+ return "SEGV_ACCERR";
+ if (si_code & SEGV_BNDERR)
+ return "SEGV_BNDERR";
+ if (si_code & SEGV_PKUERR)
+ return "SEGV_PKUERR";
+ return "UNKNOWN";
+}
+
+int pkey_faults;
+int last_si_pkey = -1;
+
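+/*
+ * Each pkey owns a pair of disable bits (access and write) in the
+ * rights register (PKRU on x86, AMR on powerpc).  pkey_to_shift() is
+ * expected to return the bit offset of 'pkey' within that register;
+ * the helpers below build masks relative to that offset.  reset_bits()
+ * returns a mask suitable for ANDing the given bits away.
+ */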
+u64 reset_bits(int pkey, u64 bits)
+{
+ u32 shift = pkey_to_shift(pkey);
+
+ return ~(bits << shift);
+}
+
+u64 left_shift_bits(int pkey, u64 bits)
+{
+ u32 shift = pkey_to_shift(pkey);
+
+ return (bits << shift);
+}
+
+u64 right_shift_bits(int pkey, u64 bits)
+{
+ u32 shift = pkey_to_shift(pkey);
+
+ return (bits >> shift);
+}
+
+void signal_handler(int signum, siginfo_t *si, void *vucontext)
+{
+ ucontext_t *uctxt = vucontext;
+ int trapno;
+ unsigned long ip;
+ char *fpregs;
+ u64 *pkey_reg_ptr;
+ u64 si_pkey;
+ u32 *si_pkey_ptr;
+
+ dprint_in_signal = 1;
+ dprintf1(">>>>===============SIGSEGV============================\n");
+ dprintf1("%s()::%d, pkey_reg: 0x%lx shadow: %lx\n", __func__, __LINE__,
+ __rdpkey_reg(), shadow_pkey_reg);
+
+ trapno = uctxt->uc_mcontext.gregs[REG_TRAPNO];
+ ip = uctxt->uc_mcontext.gregs[REG_IP_IDX];
+ fpregs = (char *) uctxt->uc_mcontext.fpregs;
+
+ dprintf2("%s() trapno: %d ip: 0x%lx info->si_code: %s/%d\n", __func__,
+ trapno, ip, si_code_str(si->si_code), si->si_code);
+#ifdef __i386__
+ /*
+ * 32-bit has some extra padding so that userspace can tell whether
+ * the XSTATE header is present in addition to the "legacy" FPU
+ * state. We just assume that it is here.
+ */
+ fpregs += 0x70;
+ pkey_reg_ptr = (void *)(&fpregs[pkru_xstate_offset()]);
+ /*
+ * If we got a PKRU fault, we *HAVE* to have at least one bit set in
+ * here.
+ */
+ dprintf1("pkru_xstate_offset: %d\n", pkru_xstate_offset());
+ if (DEBUG_LEVEL > 4)
+ dump_mem(pkey_reg_ptr - 128, 256);
+#elif __powerpc64__
+	/* the AMR saved at fault time is exposed via the gregs slot REG_AMR */
+	pkey_reg_ptr = &uctxt->uc_mcontext.gregs[REG_AMR];
+#endif
+
+
+ dprintf1("siginfo: %p\n", si);
+ dprintf1(" fpregs: %p\n", fpregs);
+ pkey_assert(*pkey_reg_ptr);
+
+ si_pkey_ptr = (u32 *)(((u8 *)si) + si_pkey_offset);
+ dprintf1("si_pkey_ptr: %p\n", si_pkey_ptr);
+ dump_mem(si_pkey_ptr - 8, 24);
+ si_pkey = *si_pkey_ptr;
+ pkey_assert(si_pkey < NR_PKEYS);
+ last_si_pkey = si_pkey;
+
+ if ((si->si_code == SEGV_MAPERR) ||
+ (si->si_code == SEGV_ACCERR) ||
+ (si->si_code == SEGV_BNDERR)) {
+ printf("non-PK si_code, exiting...\n");
+ exit(4);
+ }
+
+	dprintf1("signal pkey_reg from context: %016jx\n", *pkey_reg_ptr);
+	/*
+	 * need the __rdpkey_reg() version so we do not do
+	 * shadow_pkey_reg checking
+	 */
+	dprintf1("signal pkey_reg from register: %016jx\n", __rdpkey_reg());
+	dprintf1("si_pkey from siginfo: %jx\n", si_pkey);
+	*(u64 *)pkey_reg_ptr &= reset_bits(si_pkey, PKEY_DISABLE_ACCESS);
+	shadow_pkey_reg &= reset_bits(si_pkey, PKEY_DISABLE_ACCESS);
+	dprintf1("WARNING: cleared the access-disable bit for pkey %jx "
+		"to allow the faulting instruction to continue\n", si_pkey);
+ pkey_faults++;
+ dprintf1("<<<<==================================================\n");
+}
+
+int wait_all_children(void)
+{
+ int status;
+
+ return waitpid(-1, &status, 0);
+}
+
+void sig_chld(int x)
+{
+ dprint_in_signal = 1;
+ dprintf2("[%d] SIGCHLD: %d\n", getpid(), x);
+ dprint_in_signal = 0;
+}
+
+void setup_sigsegv_handler(void)
+{
+ int r, rs;
+ struct sigaction newact;
+ struct sigaction oldact;
+
+ /* #PF is mapped to sigsegv */
+ int signum = SIGSEGV;
+
+ newact.sa_handler = 0;
+ newact.sa_sigaction = signal_handler;
+
+ /*sigset_t - signals to block while in the handler */
+ /* get the old signal mask. */
+ rs = sigprocmask(SIG_SETMASK, 0, &newact.sa_mask);
+ pkey_assert(rs == 0);
+
+ /* call sa_sigaction, not sa_handler*/
+ newact.sa_flags = SA_SIGINFO;
+
+ newact.sa_restorer = 0; /* void(*)(), obsolete */
+ r = sigaction(signum, &newact, &oldact);
+ r = sigaction(SIGALRM, &newact, &oldact);
+ pkey_assert(r == 0);
+}
+
+void setup_handlers(void)
+{
+ signal(SIGCHLD, &sig_chld);
+ setup_sigsegv_handler();
+}
+
+pid_t fork_lazy_child(void)
+{
+ pid_t forkret;
+
+ forkret = fork();
+ pkey_assert(forkret >= 0);
+ dprintf3("[%d] fork() ret: %d\n", getpid(), forkret);
+
+ if (!forkret) {
+ /* in the child */
+ while (1) {
+ dprintf1("child sleeping...\n");
+ sleep(30);
+ }
+ }
+ return forkret;
+}
+
+void davecmp(void *_a, void *_b, int len)
+{
+ int i;
+ unsigned long *a = _a;
+ unsigned long *b = _b;
+
+ for (i = 0; i < len / sizeof(*a); i++) {
+ if (a[i] == b[i])
+ continue;
+
+ dprintf3("[%3d]: a: %016lx b: %016lx\n", i, a[i], b[i]);
+ }
+}
+
+void dumpit(char *f)
+{
+ int fd = open(f, O_RDONLY);
+ char buf[100];
+ int nr_read;
+
+ dprintf2("maps fd: %d\n", fd);
+ do {
+ nr_read = read(fd, &buf[0], sizeof(buf));
+ write(1, buf, nr_read);
+ } while (nr_read > 0);
+ close(fd);
+}
+
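+/*
+ * pkey_get()/pkey_set() read and write just the two rights bits
+ * (PKEY_DISABLE_ACCESS/PKEY_DISABLE_WRITE) for a single key, leaving
+ * the rest of the register untouched.
+ */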
+u64 pkey_get(int pkey, unsigned long flags)
+{
+ u64 mask = (PKEY_DISABLE_ACCESS|PKEY_DISABLE_WRITE);
+ u64 pkey_reg = __rdpkey_reg();
+ u64 shifted_pkey_reg;
+ u64 masked_pkey_reg;
+
+ dprintf1("%s(pkey=%d, flags=%lx) = %x / %d\n",
+ __func__, pkey, flags, 0, 0);
+ dprintf2("%s() raw pkey_reg: %lx\n", __func__, pkey_reg);
+
+ shifted_pkey_reg = right_shift_bits(pkey, pkey_reg);
+ dprintf2("%s() shifted_pkey_reg: %lx\n", __func__, shifted_pkey_reg);
+ masked_pkey_reg = shifted_pkey_reg & mask;
+ dprintf2("%s() masked pkey_reg: %lx\n", __func__, masked_pkey_reg);
+ /*
+ * shift down the relevant bits to the lowest two, then
+ * mask off all the other high bits.
+ */
+ return masked_pkey_reg;
+}
+
+int pkey_set(int pkey, unsigned long rights, unsigned long flags)
+{
+ u64 mask = (PKEY_DISABLE_ACCESS|PKEY_DISABLE_WRITE);
+ u64 old_pkey_reg = __rdpkey_reg();
+ u64 new_pkey_reg;
+
+ /* make sure that 'rights' only contains the bits we expect: */
+ assert(!(rights & ~mask));
+
+ /* copy old pkey_reg */
+ new_pkey_reg = old_pkey_reg;
+ /* mask out bits from pkey in old value: */
+ new_pkey_reg &= reset_bits(pkey, mask);
+ /* OR in new bits for pkey: */
+ new_pkey_reg |= left_shift_bits(pkey, rights);
+
+ __wrpkey_reg(new_pkey_reg);
+
+ dprintf3("%s(pkey=%d, rights=%lx, flags=%lx) = %x "
+ "pkey_reg now: %x old_pkey_reg: %x\n",
+ __func__, pkey, rights, flags,
+ 0, __rdpkey_reg(), old_pkey_reg);
+ return 0;
+}
+
+void pkey_disable_set(int pkey, int flags)
+{
+ unsigned long syscall_flags = 0;
+ int ret;
+ u64 pkey_rights;
+ u64 orig_pkey_reg = rdpkey_reg();
+
+ dprintf1("START->%s(%d, 0x%x)\n", __func__,
+ pkey, flags);
+ pkey_assert(flags & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
+
+ pkey_rights = pkey_get(pkey, syscall_flags);
+
+ dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
+ pkey, pkey, pkey_rights);
+ pkey_assert(pkey_rights >= 0);
+
+ /* process flags only if they have some new bits enabled */
+ if (flags && !(pkey_rights & flags)) {
+ pkey_rights |= flags;
+
+ ret = pkey_set(pkey, pkey_rights, syscall_flags);
+ assert(!ret);
+ /*pkey_reg and flags have the same format */
+ shadow_pkey_reg |= left_shift_bits(pkey, flags);
+ dprintf1("%s(%d) shadow: 0x%x\n",
+ __func__, pkey, shadow_pkey_reg);
+
+ pkey_assert(ret >= 0);
+
+ pkey_rights = pkey_get(pkey, syscall_flags);
+ dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
+ pkey, pkey, pkey_rights);
+
+ dprintf1("%s(%d) pkey_reg: 0x%lx\n",
+ __func__, pkey, rdpkey_reg());
+ if (flags)
+ pkey_assert(rdpkey_reg() > orig_pkey_reg);
+ }
+ dprintf1("END<---%s(%d, 0x%x)\n", __func__,
+ pkey, flags);
+}
+
+void pkey_disable_clear(int pkey, int flags)
+{
+ unsigned long syscall_flags = 0;
+ int ret;
+ u64 pkey_rights = pkey_get(pkey, syscall_flags);
+ u64 orig_pkey_reg = rdpkey_reg();
+
+ pkey_assert(flags & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
+
+ dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
+ pkey, pkey, pkey_rights);
+ pkey_assert(pkey_rights >= 0);
+
+ pkey_rights &= ~flags;
+
+ ret = pkey_set(pkey, pkey_rights, 0);
+ /* pkey_reg and flags have the same format */
+ shadow_pkey_reg &= reset_bits(pkey, flags);
+ pkey_assert(ret >= 0);
+
+ pkey_rights = pkey_get(pkey, syscall_flags);
+ dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
+ pkey, pkey, pkey_rights);
+
+ dprintf1("%s(%d) pkey_reg: 0x%x\n",
+ __func__, pkey, rdpkey_reg());
+	if (flags)
+		assert(rdpkey_reg() <= orig_pkey_reg);
+}
+
+void pkey_write_allow(int pkey)
+{
+ pkey_disable_clear(pkey, PKEY_DISABLE_WRITE);
+}
+void pkey_write_deny(int pkey)
+{
+ pkey_disable_set(pkey, PKEY_DISABLE_WRITE);
+}
+void pkey_access_allow(int pkey)
+{
+ pkey_disable_clear(pkey, PKEY_DISABLE_ACCESS);
+}
+void pkey_access_deny(int pkey)
+{
+ pkey_disable_set(pkey, PKEY_DISABLE_ACCESS);
+}
+
+int sys_mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot,
+ unsigned long pkey)
+{
+ int sret;
+
+ dprintf2("%s(0x%p, %zx, prot=%lx, pkey=%lx)\n", __func__,
+ ptr, size, orig_prot, pkey);
+
+ errno = 0;
+ sret = syscall(SYS_mprotect_key, ptr, size, orig_prot, pkey);
+ if (errno) {
+ dprintf2("SYS_mprotect_key sret: %d\n", sret);
+ dprintf2("SYS_mprotect_key prot: 0x%lx\n", orig_prot);
+ dprintf2("SYS_mprotect_key failed, errno: %d\n", errno);
+ if (DEBUG_LEVEL >= 2)
+ perror("SYS_mprotect_pkey");
+ }
+ return sret;
+}
+
+int sys_pkey_alloc(unsigned long flags, unsigned long init_val)
+{
+ int ret = syscall(SYS_pkey_alloc, flags, init_val);
+
+ dprintf1("%s(flags=%lx, init_val=%lx) syscall ret: %d errno: %d\n",
+ __func__, flags, init_val, ret, errno);
+ return ret;
+}
+
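+/*
+ * The test mirrors every change it makes to the pkey register in
+ * shadow_pkey_reg; rdpkey_reg() asserts that the hardware value and
+ * the shadow never drift apart.
+ */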
+void pkey_setup_shadow(void)
+{
+ shadow_pkey_reg = __rdpkey_reg();
+}
+
+void pkey_reset_shadow(u32 key)
+{
+ shadow_pkey_reg &= reset_bits(key, 0x3);
+}
+
+void pkey_set_shadow(u32 key, u64 init_val)
+{
+ shadow_pkey_reg |= left_shift_bits(key, init_val);
+}
+
+int alloc_pkey(void)
+{
+ int ret;
+ u64 init_val = 0x0;
+
+ dprintf1("%s()::%d, pkey_reg: 0x%x shadow: %x\n",
+ __func__, __LINE__, __rdpkey_reg(),
+ shadow_pkey_reg);
+ ret = sys_pkey_alloc(0, init_val);
+ /*
+ * pkey_alloc() sets PKRU, so we need to reflect it in
+ * shadow_pkey_reg:
+ */
+ dprintf4("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%x\n",
+ __func__, __LINE__, ret, __rdpkey_reg(),
+ shadow_pkey_reg);
+	if (ret > 0) {
+ /* clear both the bits: */
+ pkey_reset_shadow(ret);
+ dprintf4("%s()::%d, ret: %d pkey_reg: 0x%x shadow:"
+ " 0x%x\n",
+ __func__, __LINE__, ret,
+ __rdpkey_reg(), shadow_pkey_reg);
+ /*
+ * move the new state in from init_val
+ * (remember, we cheated and init_val == pkey_reg format)
+ */
+ pkey_set_shadow(ret, init_val);
+ }
+ dprintf4("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%x\n",
+ __func__, __LINE__, ret, __rdpkey_reg(),
+ shadow_pkey_reg);
+ dprintf1("%s()::%d errno: %d\n", __func__, __LINE__, errno);
+ /* for shadow checking: */
+ rdpkey_reg();
+ dprintf4("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%x\n",
+ __func__, __LINE__, ret, __rdpkey_reg(),
+ shadow_pkey_reg);
+ return ret;
+}
+
+int sys_pkey_free(unsigned long pkey)
+{
+ int ret = syscall(SYS_pkey_free, pkey);
+
+ dprintf1("%s(pkey=%ld) syscall ret: %d\n", __func__, pkey, ret);
+ return ret;
+}
+
+/*
+ * I had a bug where pkey bits could be set by mprotect() but
+ * not cleared. This ensures we get lots of random bit sets
+ * and clears on the vma and pte pkey bits.
+ */
+int alloc_random_pkey(void)
+{
+ int max_nr_pkey_allocs;
+ int ret;
+ int i;
+ int alloced_pkeys[NR_PKEYS];
+ int nr_alloced = 0;
+ int random_index;
+
+ memset(alloced_pkeys, 0, sizeof(alloced_pkeys));
+ srand((unsigned int)time(NULL));
+
+ /* allocate every possible key and make a note of which ones we got */
+ max_nr_pkey_allocs = NR_PKEYS;
+ for (i = 0; i < max_nr_pkey_allocs; i++) {
+ int new_pkey = alloc_pkey();
+
+ if (new_pkey < 0)
+ break;
+ alloced_pkeys[nr_alloced++] = new_pkey;
+ }
+
+ pkey_assert(nr_alloced > 0);
+ /* select a random one out of the allocated ones */
+ random_index = rand() % nr_alloced;
+ ret = alloced_pkeys[random_index];
+ /* now zero it out so we don't free it next */
+ alloced_pkeys[random_index] = 0;
+
+ /* go through the allocated ones that we did not want and free them */
+ for (i = 0; i < nr_alloced; i++) {
+ int free_ret;
+
+ if (!alloced_pkeys[i])
+ continue;
+ free_ret = sys_pkey_free(alloced_pkeys[i]);
+ pkey_assert(!free_ret);
+ }
+ dprintf1("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%x\n", __func__,
+ __LINE__, ret, __rdpkey_reg(), shadow_pkey_reg);
+ return ret;
+}
+
+int mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot,
+ unsigned long pkey)
+{
+ int nr_iterations = random() % 100;
+ int ret;
+
+ while (0) {
+ int rpkey = alloc_random_pkey();
+
+ ret = sys_mprotect_pkey(ptr, size, orig_prot, pkey);
+
+ dprintf1("sys_mprotect_pkey(%p, %zx, prot=0x%lx, pkey=%ld) "
+ "ret: %d\n",
+ ptr, size, orig_prot, pkey, ret);
+ if (nr_iterations-- < 0)
+ break;
+
+ dprintf1("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%x\n",
+ __func__, __LINE__, ret, __rdpkey_reg(),
+ shadow_pkey_reg);
+ sys_pkey_free(rpkey);
+ dprintf1("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%x\n",
+ __func__, __LINE__, ret, __rdpkey_reg(),
+ shadow_pkey_reg);
+ }
+ pkey_assert(pkey < NR_PKEYS);
+
+ ret = sys_mprotect_pkey(ptr, size, orig_prot, pkey);
+ dprintf1("mprotect_pkey(%p, %zx, prot=0x%lx, pkey=%ld) ret: %d\n",
+ ptr, size, orig_prot, pkey, ret);
+ pkey_assert(!ret);
+ dprintf1("%s()::%d, ret: %d pkey_reg: 0x%x shadow: 0x%x\n", __func__,
+ __LINE__, ret, __rdpkey_reg(), shadow_pkey_reg);
+ return ret;
+}
+
+struct pkey_malloc_record {
+ void *ptr;
+ long size;
+};
+struct pkey_malloc_record *pkey_malloc_records;
+long nr_pkey_malloc_records;
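+/*
+ * Every pkey_malloc() variant records its {ptr, size} here so that
+ * free_pkey_malloc() can find the mapping again and munmap() it.
+ */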
+void record_pkey_malloc(void *ptr, long size)
+{
+ long i;
+ struct pkey_malloc_record *rec = NULL;
+
+	for (i = 0; i < nr_pkey_malloc_records; i++) {
+		rec = &pkey_malloc_records[i];
+		/* find a free (previously munmap()'d) record */
+		if (!rec->ptr)
+			break;
+		rec = NULL;
+	}
+ if (!rec) {
+ /* every record is full */
+ size_t old_nr_records = nr_pkey_malloc_records;
+ size_t new_nr_records = (nr_pkey_malloc_records * 2 + 1);
+ size_t new_size = new_nr_records *
+ sizeof(struct pkey_malloc_record);
+
+ dprintf2("new_nr_records: %zd\n", new_nr_records);
+ dprintf2("new_size: %zd\n", new_size);
+ pkey_malloc_records = realloc(pkey_malloc_records, new_size);
+ pkey_assert(pkey_malloc_records != NULL);
+ rec = &pkey_malloc_records[nr_pkey_malloc_records];
+ /*
+ * realloc() does not initialize memory, so zero it from
+ * the first new record all the way to the end.
+ */
+ for (i = 0; i < new_nr_records - old_nr_records; i++)
+ memset(rec + i, 0, sizeof(*rec));
+ }
+ dprintf3("filling malloc record[%d/%p]: {%p, %ld}\n",
+ (int)(rec - pkey_malloc_records), rec, ptr, size);
+ rec->ptr = ptr;
+ rec->size = size;
+ nr_pkey_malloc_records++;
+}
+
+void free_pkey_malloc(void *ptr)
+{
+ long i;
+ int ret;
+
+ dprintf3("%s(%p)\n", __func__, ptr);
+ for (i = 0; i < nr_pkey_malloc_records; i++) {
+ struct pkey_malloc_record *rec = &pkey_malloc_records[i];
+
+ dprintf4("looking for ptr %p at record[%ld/%p]: {%p, %ld}\n",
+ ptr, i, rec, rec->ptr, rec->size);
+ if ((ptr < rec->ptr) ||
+ (ptr >= rec->ptr + rec->size))
+ continue;
+
+ dprintf3("found ptr %p at record[%ld/%p]: {%p, %ld}\n",
+ ptr, i, rec, rec->ptr, rec->size);
+ nr_pkey_malloc_records--;
+ ret = munmap(rec->ptr, rec->size);
+ dprintf3("munmap ret: %d\n", ret);
+ pkey_assert(!ret);
+ dprintf3("clearing rec->ptr, rec: %p\n", rec);
+ rec->ptr = NULL;
+ dprintf3("done clearing rec->ptr, rec: %p\n", rec);
+ return;
+ }
+ pkey_assert(false);
+}
+
+
+void *malloc_pkey_with_mprotect(long size, int prot, u16 pkey)
+{
+ void *ptr;
+ int ret;
+
+ rdpkey_reg();
+ dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__,
+ size, prot, pkey);
+ pkey_assert(pkey < NR_PKEYS);
+ ptr = mmap(NULL, size, prot, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+ pkey_assert(ptr != (void *)-1);
+ ret = mprotect_pkey((void *)ptr, PAGE_SIZE, prot, pkey);
+ pkey_assert(!ret);
+ record_pkey_malloc(ptr, size);
+ rdpkey_reg();
+
+ dprintf1("%s() for pkey %d @ %p\n", __func__, pkey, ptr);
+ return ptr;
+}
+
+void *malloc_pkey_anon_huge(long size, int prot, u16 pkey)
+{
+ int ret;
+ void *ptr;
+
+ dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__,
+ size, prot, pkey);
+ /*
+ * Guarantee we can fit at least one huge page in the resulting
+ * allocation by allocating space for 2:
+ */
+ size = ALIGN_UP(size, HPAGE_SIZE * 2);
+ ptr = mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+ pkey_assert(ptr != (void *)-1);
+ record_pkey_malloc(ptr, size);
+ mprotect_pkey(ptr, size, prot, pkey);
+
+ dprintf1("unaligned ptr: %p\n", ptr);
+ ptr = ALIGN_PTR_UP(ptr, HPAGE_SIZE);
+ dprintf1(" aligned ptr: %p\n", ptr);
+ ret = madvise(ptr, HPAGE_SIZE, MADV_HUGEPAGE);
+ dprintf1("MADV_HUGEPAGE ret: %d\n", ret);
+ ret = madvise(ptr, HPAGE_SIZE, MADV_WILLNEED);
+ dprintf1("MADV_WILLNEED ret: %d\n", ret);
+ memset(ptr, 0, HPAGE_SIZE);
+
+ dprintf1("mmap()'d thp for pkey %d @ %p\n", pkey, ptr);
+ return ptr;
+}
+
+int hugetlb_setup_ok;
+#define GET_NR_HUGE_PAGES 10
+void setup_hugetlbfs(void)
+{
+ int err;
+ int fd;
+ char buf[] = "123";
+
+ if (geteuid() != 0) {
+ fprintf(stderr,
+ "WARNING: not run as root, can not do hugetlb test\n");
+ return;
+ }
+
+ cat_into_file(__stringify(GET_NR_HUGE_PAGES),
+ "/proc/sys/vm/nr_hugepages");
+
+	/*
+	 * Now go make sure that we got the pages and that they are
+	 * the default huge page size for this architecture (2M on
+	 * x86, 16M on powerpc).  Someone might have made a larger
+	 * size the default.
+	 */
+ fd = open_hugepage_file(O_RDONLY);
+ if (fd < 0) {
+		perror("opening sysfs hugetlb config");
+ return;
+ }
+
+ /* -1 to guarantee leaving the trailing \0 */
+ err = read(fd, buf, sizeof(buf)-1);
+ close(fd);
+ if (err <= 0) {
+		perror("reading sysfs hugetlb config");
+ return;
+ }
+
+ if (atoi(buf) != GET_NR_HUGE_PAGES) {
+		fprintf(stderr, "could not confirm huge pages, got:"
+			" '%s' expected %d\n",
+ buf, GET_NR_HUGE_PAGES);
+ return;
+ }
+
+ hugetlb_setup_ok = 1;
+}
+
+void *malloc_pkey_hugetlb(long size, int prot, u16 pkey)
+{
+ void *ptr;
+ int flags = MAP_ANONYMOUS|MAP_PRIVATE|MAP_HUGETLB;
+
+ if (!hugetlb_setup_ok)
+ return PTR_ERR_ENOTSUP;
+
+ dprintf1("doing %s(%ld, %x, %x)\n", __func__, size, prot, pkey);
+ size = ALIGN_UP(size, HPAGE_SIZE * 2);
+ pkey_assert(pkey < NR_PKEYS);
+ ptr = mmap(NULL, size, PROT_NONE, flags, -1, 0);
+ pkey_assert(ptr != (void *)-1);
+ mprotect_pkey(ptr, size, prot, pkey);
+
+ record_pkey_malloc(ptr, size);
+
+ dprintf1("mmap()'d hugetlbfs for pkey %d @ %p\n", pkey, ptr);
+ return ptr;
+}
+
+void *malloc_pkey_mmap_dax(long size, int prot, u16 pkey)
+{
+ void *ptr;
+ int fd;
+
+ dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__,
+ size, prot, pkey);
+ pkey_assert(pkey < NR_PKEYS);
+ fd = open("/dax/foo", O_RDWR);
+ pkey_assert(fd >= 0);
+
+ ptr = mmap(0, size, prot, MAP_SHARED, fd, 0);
+ pkey_assert(ptr != (void *)-1);
+
+ mprotect_pkey(ptr, size, prot, pkey);
+
+ record_pkey_malloc(ptr, size);
+
+ dprintf1("mmap()'d for pkey %d @ %p\n", pkey, ptr);
+ close(fd);
+ return ptr;
+}
+
+void *(*pkey_malloc[])(long size, int prot, u16 pkey) = {
+
+ malloc_pkey_with_mprotect,
+ malloc_pkey_anon_huge,
+ malloc_pkey_hugetlb
+/* can not do direct with the pkey_mprotect() API:
+ * malloc_pkey_mmap_direct,
+ * malloc_pkey_mmap_dax,
+ */
+};
+
+void *malloc_pkey(long size, int prot, u16 pkey)
+{
+ void *ret;
+ static int malloc_type;
+ int nr_malloc_types = ARRAY_SIZE(pkey_malloc);
+
+ pkey_assert(pkey < NR_PKEYS);
+
+ while (1) {
+ pkey_assert(malloc_type < nr_malloc_types);
+
+ ret = pkey_malloc[malloc_type](size, prot, pkey);
+ pkey_assert(ret != (void *)-1);
+
+ malloc_type++;
+ if (malloc_type >= nr_malloc_types)
+ malloc_type = (random()%nr_malloc_types);
+
+ /* try again if the malloc_type we tried is unsupported */
+ if (ret == PTR_ERR_ENOTSUP)
+ continue;
+
+ break;
+ }
+
+ dprintf3("%s(%ld, prot=%x, pkey=%x) returning: %p\n", __func__,
+ size, prot, pkey, ret);
+ return ret;
+}
+
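+/*
+ * The SIGSEGV handler bumps pkey_faults for every pkey fault it sees.
+ * expected_pkey_faults() checks that exactly one new fault arrived for
+ * the expected key and then restores the pkey register from the shadow;
+ * do_not_expect_pk_fault() checks that the count did not move.
+ */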
+int last_pkey_faults;
+void expected_pkey_faults(int pkey)
+{
+ dprintf2("%s(): last_pkey_faults: %d pkey_faults: %d\n",
+ __func__, last_pkey_faults, pkey_faults);
+ dprintf2("%s(%d): last_si_pkey: %d\n", __func__, pkey, last_si_pkey);
+ pkey_assert(last_pkey_faults + 1 == pkey_faults);
+ pkey_assert(last_si_pkey == pkey);
+	/*
+	 * The signal handler should have cleared the access-disable
+	 * bits for the faulting pkey to let the test program continue.
+	 * We now have to restore the register from the shadow.
+	 */
+	if (__rdpkey_reg() != shadow_pkey_reg)
+		pkey_assert(0);
+
+	__wrpkey_reg(shadow_pkey_reg);
+	dprintf1("%s() set pkey_reg=%lx to restore state after the signal "
+		"handler cleared it\n", __func__, shadow_pkey_reg);
+ last_pkey_faults = pkey_faults;
+ last_si_pkey = -1;
+}
+
+void do_not_expect_pk_fault(void)
+{
+ pkey_assert(last_pkey_faults == pkey_faults);
+}
+
+int test_fds[10] = { -1 };
+int nr_test_fds;
+void __save_test_fd(int fd)
+{
+ pkey_assert(fd >= 0);
+ pkey_assert(nr_test_fds < ARRAY_SIZE(test_fds));
+ test_fds[nr_test_fds] = fd;
+ nr_test_fds++;
+}
+
+int get_test_read_fd(void)
+{
+ int test_fd = open("/etc/passwd", O_RDONLY);
+
+ __save_test_fd(test_fd);
+ return test_fd;
+}
+
+void close_test_fds(void)
+{
+ int i;
+
+ for (i = 0; i < nr_test_fds; i++) {
+ if (test_fds[i] < 0)
+ continue;
+ close(test_fds[i]);
+ test_fds[i] = -1;
+ }
+ nr_test_fds = 0;
+}
+
+#define barrier() (__asm__ __volatile__("" : : : "memory"))
+__attribute__((noinline)) int read_ptr(int *ptr)
+{
+ /*
+ * Keep GCC from optimizing this away somehow
+ */
+ barrier();
+ return *ptr;
+}
+
+void test_read_of_write_disabled_region(int *ptr, u16 pkey)
+{
+ int ptr_contents;
+
+	dprintf1("disabling write access to PKEY[%02d], doing read\n", pkey);
+ pkey_write_deny(pkey);
+ ptr_contents = read_ptr(ptr);
+ dprintf1("*ptr: %d\n", ptr_contents);
+ dprintf1("\n");
+ do_not_expect_pk_fault();
+}
+
+void test_read_of_access_disabled_region(int *ptr, u16 pkey)
+{
+ int ptr_contents;
+
+ dprintf1("disabling access to PKEY[%02d], doing read @ %p\n",
+ pkey, ptr);
+ rdpkey_reg();
+ pkey_access_deny(pkey);
+ ptr_contents = read_ptr(ptr);
+ dprintf1("*ptr: %d\n", ptr_contents);
+ expected_pkey_faults(pkey);
+}
+
+void test_read_of_access_disabled_region_with_page_already_mapped(int *ptr,
+ u16 pkey)
+{
+ int ptr_contents;
+
+ dprintf1("disabling access to PKEY[%02d], doing read @ %p\n",
+ pkey, ptr);
+ ptr_contents = read_ptr(ptr);
+ dprintf1("reading ptr before disabling the read : %d\n",
+ ptr_contents);
+ rdpkey_reg();
+ pkey_access_deny(pkey);
+ ptr_contents = read_ptr(ptr);
+ dprintf1("*ptr: %d\n", ptr_contents);
+ expected_pkey_faults(pkey);
+}
+
+void test_write_of_write_disabled_region_with_page_already_mapped(int *ptr,
+ u16 pkey)
+{
+ *ptr = __LINE__;
+	dprintf1("disabling write access to PKEY[%02d] after accessing "
+		"the page, doing write\n", pkey);
+ pkey_write_deny(pkey);
+ *ptr = __LINE__;
+ expected_pkey_faults(pkey);
+}
+
+void test_write_of_write_disabled_region(int *ptr, u16 pkey)
+{
+ dprintf1("disabling write access to PKEY[%02d], doing write\n", pkey);
+ pkey_write_deny(pkey);
+ *ptr = __LINE__;
+ expected_pkey_faults(pkey);
+}
+void test_write_of_access_disabled_region(int *ptr, u16 pkey)
+{
+ dprintf1("disabling access to PKEY[%02d], doing write\n", pkey);
+ pkey_access_deny(pkey);
+ *ptr = __LINE__;
+ expected_pkey_faults(pkey);
+}
+
+void test_write_of_access_disabled_region_with_page_already_mapped(int *ptr,
+ u16 pkey)
+{
+ *ptr = __LINE__;
+	dprintf1("disabling access to PKEY[%02d] after accessing the page, "
+		"doing write\n", pkey);
+ pkey_access_deny(pkey);
+ *ptr = __LINE__;
+ expected_pkey_faults(pkey);
+}
+
+void test_kernel_write_of_access_disabled_region(int *ptr, u16 pkey)
+{
+ int ret;
+ int test_fd = get_test_read_fd();
+
+ dprintf1("disabling access to PKEY[%02d], "
+ "having kernel read() to buffer\n", pkey);
+ pkey_access_deny(pkey);
+ ret = read(test_fd, ptr, 1);
+ dprintf1("read ret: %d\n", ret);
+ pkey_assert(ret);
+}
+void test_kernel_write_of_write_disabled_region(int *ptr, u16 pkey)
+{
+ int ret;
+ int test_fd = get_test_read_fd();
+
+ pkey_write_deny(pkey);
+ ret = read(test_fd, ptr, 100);
+ dprintf1("read ret: %d\n", ret);
+ if (ret < 0 && (DEBUG_LEVEL > 0))
+ perror("verbose read result (OK for this to be bad)");
+ pkey_assert(ret);
+}
+
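+/*
+ * vmsplice() makes the kernel pin the user buffer (get_user_pages).
+ * With access to the key disabled, that pin is expected to fail, which
+ * is what the assert on the vmsplice() return value checks.
+ */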
+void test_kernel_gup_of_access_disabled_region(int *ptr, u16 pkey)
+{
+ int pipe_ret, vmsplice_ret;
+ struct iovec iov;
+ int pipe_fds[2];
+
+ pipe_ret = pipe(pipe_fds);
+
+ pkey_assert(pipe_ret == 0);
+ dprintf1("disabling access to PKEY[%02d], "
+ "having kernel vmsplice from buffer\n", pkey);
+ pkey_access_deny(pkey);
+ iov.iov_base = ptr;
+ iov.iov_len = PAGE_SIZE;
+ vmsplice_ret = vmsplice(pipe_fds[1], &iov, 1, SPLICE_F_GIFT);
+ dprintf1("vmsplice() ret: %d\n", vmsplice_ret);
+ pkey_assert(vmsplice_ret == -1);
+
+ close(pipe_fds[0]);
+ close(pipe_fds[1]);
+}
+
+void test_kernel_gup_write_to_write_disabled_region(int *ptr, u16 pkey)
+{
+ int ignored = 0xdada;
+ int futex_ret;
+ int some_int = __LINE__;
+
+ dprintf1("disabling write to PKEY[%02d], "
+ "doing futex gunk in buffer\n", pkey);
+ *ptr = some_int;
+ pkey_write_deny(pkey);
+ futex_ret = syscall(SYS_futex, ptr, FUTEX_WAIT, some_int-1, NULL,
+ &ignored, ignored);
+ if (DEBUG_LEVEL > 0)
+ perror("futex");
+ dprintf1("futex() ret: %d\n", futex_ret);
+}
+
+/* Assumes that all pkeys other than 'pkey' are unallocated */
+void test_pkey_syscalls_on_non_allocated_pkey(int *ptr, u16 pkey)
+{
+ int err;
+ int i = get_start_key();
+
+ /* Note: 0 is the default pkey, so don't mess with it */
+ for (; i < NR_PKEYS; i++) {
+ if (pkey == i)
+ continue;
+
+ dprintf1("trying get/set/free to non-allocated pkey: %2d\n", i);
+ err = sys_pkey_free(i);
+ pkey_assert(err);
+
+ err = sys_pkey_free(i);
+ pkey_assert(err);
+
+ err = sys_mprotect_pkey(ptr, PAGE_SIZE, PROT_READ, i);
+ pkey_assert(err);
+ }
+}
+
+/* Assumes that all pkeys other than 'pkey' are unallocated */
+void test_pkey_syscalls_bad_args(int *ptr, u16 pkey)
+{
+ int err;
+ int bad_pkey = NR_PKEYS+pkey;
+
+ /* pass a known-invalid pkey in: */
+ err = sys_mprotect_pkey(ptr, PAGE_SIZE, PROT_READ, bad_pkey);
+ pkey_assert(err);
+}
+
+/* Assumes that all pkeys other than 'pkey' are unallocated */
+void test_pkey_alloc_exhaust(int *ptr, u16 pkey)
+{
+ int err = 0;
+ int allocated_pkeys[NR_PKEYS] = {0};
+ int nr_allocated_pkeys = 0;
+ int i;
+
+ for (i = 0; i < NR_PKEYS*2; i++) {
+ int new_pkey;
+
+ dprintf1("%s() alloc loop: %d\n", __func__, i);
+ new_pkey = alloc_pkey();
+ dprintf4("%s()::%d, err: %d pkey_reg: 0x%x shadow: 0x%x\n",
+ __func__, __LINE__, err, __rdpkey_reg(),
+ shadow_pkey_reg);
+ rdpkey_reg(); /* for shadow checking */
+ dprintf2("%s() errno: %d ENOSPC: %d\n", __func__, errno,
+ ENOSPC);
+ if ((new_pkey == -1) && (errno == ENOSPC)) {
+ dprintf2("%s() allocate failed pkey after %d tries\n",
+ __func__, nr_allocated_pkeys);
+ break;
+ }
+ pkey_assert(nr_allocated_pkeys < NR_PKEYS);
+ allocated_pkeys[nr_allocated_pkeys++] = new_pkey;
+ }
+
+ dprintf3("%s()::%d\n", __func__, __LINE__);
+
+ /*
+ * ensure it did not reach the end of the loop without
+ * failure:
+ */
+ pkey_assert(i < NR_PKEYS*2);
+	/*
+	 * There are NR_PKEYS pkeys supported in hardware.  NR_RESERVED_PKEYS
+	 * of them are reserved.  One more can be taken up by an execute-only
+	 * mapping.  Ensure that we can allocate at least the remaining.
+	 */
+ pkey_assert(i >= (NR_PKEYS-NR_RESERVED_PKEYS-1));
+
+ for (i = 0; i < nr_allocated_pkeys; i++) {
+ err = sys_pkey_free(allocated_pkeys[i]);
+ pkey_assert(!err);
+ rdpkey_reg(); /* for shadow checking */
+ }
+}
+
+void test_ptrace_of_child(int *ptr, u16 pkey)
+{
+ __attribute__((__unused__)) int peek_result;
+ pid_t child_pid;
+ void *ignored = 0;
+ long ret;
+ int status;
+ /*
+	 * This is the "control" for our little experiment.  Make sure
+ * we can always access it when ptracing.
+ */
+ int *plain_ptr_unaligned = malloc(HPAGE_SIZE);
+ int *plain_ptr = ALIGN_PTR_UP(plain_ptr_unaligned, PAGE_SIZE);
+
+ /*
+ * Fork a child which is an exact copy of this process, of course.
+ * That means we can do all of our tests via ptrace() and then plain
+ * memory access and ensure they work differently.
+ */
+ child_pid = fork_lazy_child();
+ dprintf1("[%d] child pid: %d\n", getpid(), child_pid);
+
+ ret = ptrace(PTRACE_ATTACH, child_pid, ignored, ignored);
+ if (ret)
+ perror("attach");
+ dprintf1("[%d] attach ret: %ld %d\n", getpid(), ret, __LINE__);
+ pkey_assert(ret != -1);
+ ret = waitpid(child_pid, &status, WUNTRACED);
+ if ((ret != child_pid) || !(WIFSTOPPED(status))) {
+ fprintf(stderr, "weird waitpid result %ld stat %x\n",
+ ret, status);
+ pkey_assert(0);
+ }
+ dprintf2("waitpid ret: %ld\n", ret);
+ dprintf2("waitpid status: %d\n", status);
+
+ pkey_access_deny(pkey);
+ pkey_write_deny(pkey);
+
+ /* Write access, untested for now:
+ * ret = ptrace(PTRACE_POKEDATA, child_pid, peek_at, data);
+ * pkey_assert(ret != -1);
+ * dprintf1("poke at %p: %ld\n", peek_at, ret);
+ */
+
+ /*
+ * Try to access the pkey-protected "ptr" via ptrace:
+ */
+ ret = ptrace(PTRACE_PEEKDATA, child_pid, ptr, ignored);
+ /* expect it to work, without an error: */
+ pkey_assert(ret != -1);
+ /* Now access from the current task, and expect an exception: */
+ peek_result = read_ptr(ptr);
+ expected_pkey_faults(pkey);
+
+ /*
+ * Try to access the NON-pkey-protected "plain_ptr" via ptrace:
+ */
+ ret = ptrace(PTRACE_PEEKDATA, child_pid, plain_ptr, ignored);
+ /* expect it to work, without an error: */
+ pkey_assert(ret != -1);
+ /* Now access from the current task, and expect NO exception: */
+ peek_result = read_ptr(plain_ptr);
+ do_not_expect_pk_fault();
+
+ ret = ptrace(PTRACE_DETACH, child_pid, ignored, 0);
+ pkey_assert(ret != -1);
+
+ ret = kill(child_pid, SIGKILL);
+ pkey_assert(ret != -1);
+
+ wait(&status);
+
+ free(plain_ptr_unaligned);
+}
+
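+/*
+ * Mark the second page of lots_o_noops_around_write() execute-only
+ * under the key and deny access: executing it should not fault, but a
+ * data read of the same page should.
+ */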
+void test_executing_on_unreadable_memory(int *ptr, u16 pkey)
+{
+ void *p1;
+ int scratch;
+ int ptr_contents;
+ int ret;
+
+ p1 = ALIGN_PTR_UP(&lots_o_noops_around_write, PAGE_SIZE);
+ dprintf3("&lots_o_noops: %p\n", &lots_o_noops_around_write);
+ /* lots_o_noops_around_write should be page-aligned already */
+ assert(p1 == &lots_o_noops_around_write);
+
+ /* Point 'p1' at the *second* page of the function: */
+ p1 += PAGE_SIZE;
+
+ madvise(p1, PAGE_SIZE, MADV_DONTNEED);
+ lots_o_noops_around_write(&scratch);
+ ptr_contents = read_ptr(p1);
+ dprintf2("ptr (%p) contents@%d: %x\n", p1, __LINE__, ptr_contents);
+
+ ret = mprotect_pkey(p1, PAGE_SIZE, PROT_EXEC, (u64)pkey);
+ pkey_assert(!ret);
+ pkey_access_deny(pkey);
+
+ dprintf2("pkey_reg: %x\n", rdpkey_reg());
+
+ /*
+ * Make sure this is an *instruction* fault
+ */
+ madvise(p1, PAGE_SIZE, MADV_DONTNEED);
+ lots_o_noops_around_write(&scratch);
+ do_not_expect_pk_fault();
+ ptr_contents = read_ptr(p1);
+ dprintf2("ptr (%p) contents@%d: %x\n", p1, __LINE__, ptr_contents);
+ expected_pkey_faults(pkey);
+}
+
+void test_mprotect_pkey_on_unsupported_cpu(int *ptr, u16 pkey)
+{
+ int size = PAGE_SIZE;
+ int sret;
+
+	if (cpu_has_pkey()) {
+		dprintf1("SKIP: %s: CPU has pkey support\n", __func__);
+		return;
+	}
+
+ sret = syscall(SYS_mprotect_key, ptr, size, PROT_READ, pkey);
+ pkey_assert(sret < 0);
+}
+
+void (*pkey_tests[])(int *ptr, u16 pkey) = {
+ test_read_of_write_disabled_region,
+ test_read_of_access_disabled_region,
+ test_read_of_access_disabled_region_with_page_already_mapped,
+ test_write_of_write_disabled_region,
+ test_write_of_write_disabled_region_with_page_already_mapped,
+ test_write_of_access_disabled_region,
+ test_write_of_access_disabled_region_with_page_already_mapped,
+ test_kernel_write_of_access_disabled_region,
+ test_kernel_write_of_write_disabled_region,
+ test_kernel_gup_of_access_disabled_region,
+ test_kernel_gup_write_to_write_disabled_region,
+ test_executing_on_unreadable_memory,
+ test_ptrace_of_child,
+ test_pkey_syscalls_on_non_allocated_pkey,
+ test_pkey_syscalls_bad_args,
+ test_pkey_alloc_exhaust,
+};
+
+void run_tests_once(void)
+{
+ int *ptr;
+ int prot = PROT_READ|PROT_WRITE;
+
+ for (test_nr = 0; test_nr < ARRAY_SIZE(pkey_tests); test_nr++) {
+ int pkey;
+ int orig_pkey_faults = pkey_faults;
+
+ dprintf1("======================\n");
+ dprintf1("test %d preparing...\n", test_nr);
+
+ tracing_on();
+ pkey = alloc_random_pkey();
+ dprintf1("test %d starting with pkey: %d\n", test_nr, pkey);
+ ptr = malloc_pkey(PAGE_SIZE, prot, pkey);
+ dprintf1("test %d starting...\n", test_nr);
+ pkey_tests[test_nr](ptr, pkey);
+ dprintf1("freeing test memory: %p\n", ptr);
+ free_pkey_malloc(ptr);
+ sys_pkey_free(pkey);
+
+ dprintf1("pkey_faults: %d\n", pkey_faults);
+ dprintf1("orig_pkey_faults: %d\n", orig_pkey_faults);
+
+ tracing_off();
+ close_test_fds();
+
+ printf("test %2d PASSED (iteration %d)\n",
+ test_nr, iteration_nr);
+ dprintf1("======================\n\n");
+ }
+ iteration_nr++;
+}
+
+int main(void)
+{
+ int nr_iterations = 22;
+
+ setup_handlers();
+
+ printf("has pkey support: %d\n", cpu_has_pkey());
+
+ if (!cpu_has_pkey()) {
+ int size = PAGE_SIZE;
+ int *ptr;
+
+ printf("running PKEY tests for unsupported CPU/OS\n");
+
+ ptr = mmap(NULL, size, PROT_NONE,
+ MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+ assert(ptr != (void *)-1);
+ test_mprotect_pkey_on_unsupported_cpu(ptr, 1);
+ exit(0);
+ }
+
+ pkey_setup_shadow();
+ printf("startup pkey_reg: %lx\n", rdpkey_reg());
+ setup_hugetlbfs();
+
+ while (nr_iterations-- > 0)
+ run_tests_once();
+
+ printf("done (all tests OK)\n");
+ return 0;
+}
diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 97f187e..fee6181 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -6,7 +6,7 @@ include ../lib.mk

TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt ptrace_syscall test_mremap_vdso \
check_initial_reg_state sigreturn ldt_gdt iopl mpx-mini-test ioperm \
- protection_keys test_vdso
+ test_vdso
TARGETS_C_32BIT_ONLY := entry_from_vm86 syscall_arg_fault test_syscall_vdso unwind_vdso \
test_FCMOV test_FCOMI test_FISTTP \
vdso_restorer
diff --git a/tools/testing/selftests/x86/pkey-helpers.h b/tools/testing/selftests/x86/pkey-helpers.h
deleted file mode 100644
index b202939..0000000
--- a/tools/testing/selftests/x86/pkey-helpers.h
+++ /dev/null
@@ -1,219 +0,0 @@
-#ifndef _PKEYS_HELPER_H
-#define _PKEYS_HELPER_H
-#define _GNU_SOURCE
-#include <string.h>
-#include <stdarg.h>
-#include <stdio.h>
-#include <stdint.h>
-#include <stdbool.h>
-#include <signal.h>
-#include <assert.h>
-#include <stdlib.h>
-#include <ucontext.h>
-#include <sys/mman.h>
-
-#define NR_PKEYS 16
-#define PKRU_BITS_PER_PKEY 2
-
-#ifndef DEBUG_LEVEL
-#define DEBUG_LEVEL 0
-#endif
-#define DPRINT_IN_SIGNAL_BUF_SIZE 4096
-extern int dprint_in_signal;
-extern char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE];
-static inline void sigsafe_printf(const char *format, ...)
-{
- va_list ap;
-
- va_start(ap, format);
- if (!dprint_in_signal) {
- vprintf(format, ap);
- } else {
- int len = vsnprintf(dprint_in_signal_buffer,
- DPRINT_IN_SIGNAL_BUF_SIZE,
- format, ap);
- /*
- * len is amount that would have been printed,
- * but actual write is truncated at BUF_SIZE.
- */
- if (len > DPRINT_IN_SIGNAL_BUF_SIZE)
- len = DPRINT_IN_SIGNAL_BUF_SIZE;
- write(1, dprint_in_signal_buffer, len);
- }
- va_end(ap);
-}
-#define dprintf_level(level, args...) do { \
- if (level <= DEBUG_LEVEL) \
- sigsafe_printf(args); \
- fflush(NULL); \
-} while (0)
-#define dprintf0(args...) dprintf_level(0, args)
-#define dprintf1(args...) dprintf_level(1, args)
-#define dprintf2(args...) dprintf_level(2, args)
-#define dprintf3(args...) dprintf_level(3, args)
-#define dprintf4(args...) dprintf_level(4, args)
-
-extern unsigned int shadow_pkru;
-static inline unsigned int __rdpkru(void)
-{
- unsigned int eax, edx;
- unsigned int ecx = 0;
- unsigned int pkru;
-
- asm volatile(".byte 0x0f,0x01,0xee\n\t"
- : "=a" (eax), "=d" (edx)
- : "c" (ecx));
- pkru = eax;
- return pkru;
-}
-
-static inline unsigned int _rdpkru(int line)
-{
- unsigned int pkru = __rdpkru();
-
- dprintf4("rdpkru(line=%d) pkru: %x shadow: %x\n",
- line, pkru, shadow_pkru);
- assert(pkru == shadow_pkru);
-
- return pkru;
-}
-
-#define rdpkru() _rdpkru(__LINE__)
-
-static inline void __wrpkru(unsigned int pkru)
-{
- unsigned int eax = pkru;
- unsigned int ecx = 0;
- unsigned int edx = 0;
-
- dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
- asm volatile(".byte 0x0f,0x01,0xef\n\t"
- : : "a" (eax), "c" (ecx), "d" (edx));
- assert(pkru == __rdpkru());
-}
-
-static inline void wrpkru(unsigned int pkru)
-{
- dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
- /* will do the shadow check for us: */
- rdpkru();
- __wrpkru(pkru);
- shadow_pkru = pkru;
- dprintf4("%s(%08x) pkru: %08x\n", __func__, pkru, __rdpkru());
-}
-
-/*
- * These are technically racy. since something could
- * change PKRU between the read and the write.
- */
-static inline void __pkey_access_allow(int pkey, int do_allow)
-{
- unsigned int pkru = rdpkru();
- int bit = pkey * 2;
-
- if (do_allow)
- pkru &= (1<<bit);
- else
- pkru |= (1<<bit);
-
- dprintf4("pkru now: %08x\n", rdpkru());
- wrpkru(pkru);
-}
-
-static inline void __pkey_write_allow(int pkey, int do_allow_write)
-{
- long pkru = rdpkru();
- int bit = pkey * 2 + 1;
-
- if (do_allow_write)
- pkru &= (1<<bit);
- else
- pkru |= (1<<bit);
-
- wrpkru(pkru);
- dprintf4("pkru now: %08x\n", rdpkru());
-}
-
-#define PROT_PKEY0 0x10 /* protection key value (bit 0) */
-#define PROT_PKEY1 0x20 /* protection key value (bit 1) */
-#define PROT_PKEY2 0x40 /* protection key value (bit 2) */
-#define PROT_PKEY3 0x80 /* protection key value (bit 3) */
-
-#define PAGE_SIZE 4096
-#define MB (1<<20)
-
-static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
- unsigned int *ecx, unsigned int *edx)
-{
- /* ecx is often an input as well as an output. */
- asm volatile(
- "cpuid;"
- : "=a" (*eax),
- "=b" (*ebx),
- "=c" (*ecx),
- "=d" (*edx)
- : "0" (*eax), "2" (*ecx));
-}
-
-/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx) */
-#define X86_FEATURE_PKU (1<<3) /* Protection Keys for Userspace */
-#define X86_FEATURE_OSPKE (1<<4) /* OS Protection Keys Enable */
-
-static inline int cpu_has_pku(void)
-{
- unsigned int eax;
- unsigned int ebx;
- unsigned int ecx;
- unsigned int edx;
-
- eax = 0x7;
- ecx = 0x0;
- __cpuid(&eax, &ebx, &ecx, &edx);
-
- if (!(ecx & X86_FEATURE_PKU)) {
- dprintf2("cpu does not have PKU\n");
- return 0;
- }
- if (!(ecx & X86_FEATURE_OSPKE)) {
- dprintf2("cpu does not have OSPKE\n");
- return 0;
- }
- return 1;
-}
-
-#define XSTATE_PKRU_BIT (9)
-#define XSTATE_PKRU 0x200
-
-int pkru_xstate_offset(void)
-{
- unsigned int eax;
- unsigned int ebx;
- unsigned int ecx;
- unsigned int edx;
- int xstate_offset;
- int xstate_size;
- unsigned long XSTATE_CPUID = 0xd;
- int leaf;
-
- /* assume that XSTATE_PKRU is set in XCR0 */
- leaf = XSTATE_PKRU_BIT;
- {
- eax = XSTATE_CPUID;
- ecx = leaf;
- __cpuid(&eax, &ebx, &ecx, &edx);
-
- if (leaf == XSTATE_PKRU_BIT) {
- xstate_offset = ebx;
- xstate_size = eax;
- }
- }
-
- if (xstate_size == 0) {
- printf("could not find size/offset of PKRU in xsave state\n");
- return 0;
- }
-
- return xstate_offset;
-}
-
-#endif /* _PKEYS_HELPER_H */
diff --git a/tools/testing/selftests/x86/protection_keys.c b/tools/testing/selftests/x86/protection_keys.c
deleted file mode 100644
index 3237bc0..0000000
--- a/tools/testing/selftests/x86/protection_keys.c
+++ /dev/null
@@ -1,1395 +0,0 @@
-/*
- * Tests x86 Memory Protection Keys (see Documentation/x86/protection-keys.txt)
- *
- * There are examples in here of:
- * * how to set protection keys on memory
- * * how to set/clear bits in PKRU (the rights register)
- * * how to handle SEGV_PKRU signals and extract pkey-relevant
- * information from the siginfo
- *
- * Things to add:
- * make sure KSM and KSM COW breaking works
- * prefault pages in at malloc, or not
- * protect MPX bounds tables with protection keys?
- * make sure VMA splitting/merging is working correctly
- * OOMs can destroy mm->mmap (see exit_mmap()), so make sure it is immune to pkeys
- * look for pkey "leaks" where it is still set on a VMA but "freed" back to the kernel
- * do a plain mprotect() to a mprotect_pkey() area and make sure the pkey sticks
- *
- * Compile like this:
- * gcc -o protection_keys -O2 -g -std=gnu99 -pthread -Wall protection_keys.c -lrt -ldl -lm
- * gcc -m32 -o protection_keys_32 -O2 -g -std=gnu99 -pthread -Wall protection_keys.c -lrt -ldl -lm
- */
-#define _GNU_SOURCE
-#include <errno.h>
-#include <linux/futex.h>
-#include <sys/time.h>
-#include <sys/syscall.h>
-#include <string.h>
-#include <stdio.h>
-#include <stdint.h>
-#include <stdbool.h>
-#include <signal.h>
-#include <assert.h>
-#include <stdlib.h>
-#include <ucontext.h>
-#include <sys/mman.h>
-#include <sys/types.h>
-#include <sys/wait.h>
-#include <sys/stat.h>
-#include <fcntl.h>
-#include <unistd.h>
-#include <sys/ptrace.h>
-#include <setjmp.h>
-
-#include "pkey-helpers.h"
-
-int iteration_nr = 1;
-int test_nr;
-
-unsigned int shadow_pkru;
-
-#define HPAGE_SIZE (1UL<<21)
-#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
-#define ALIGN_UP(x, align_to) (((x) + ((align_to)-1)) & ~((align_to)-1))
-#define ALIGN_DOWN(x, align_to) ((x) & ~((align_to)-1))
-#define ALIGN_PTR_UP(p, ptr_align_to) ((typeof(p))ALIGN_UP((unsigned long)(p), ptr_align_to))
-#define ALIGN_PTR_DOWN(p, ptr_align_to) ((typeof(p))ALIGN_DOWN((unsigned long)(p), ptr_align_to))
-#define __stringify_1(x...) #x
-#define __stringify(x...) __stringify_1(x)
-
-#define PTR_ERR_ENOTSUP ((void *)-ENOTSUP)
-
-int dprint_in_signal;
-char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE];
-
-extern void abort_hooks(void);
-#define pkey_assert(condition) do { \
- if (!(condition)) { \
- dprintf0("assert() at %s::%d test_nr: %d iteration: %d\n", \
- __FILE__, __LINE__, \
- test_nr, iteration_nr); \
- dprintf0("errno at assert: %d", errno); \
- abort_hooks(); \
- assert(condition); \
- } \
-} while (0)
-#define raw_assert(cond) assert(cond)
-
-void cat_into_file(char *str, char *file)
-{
- int fd = open(file, O_RDWR);
- int ret;
-
- dprintf2("%s(): writing '%s' to '%s'\n", __func__, str, file);
- /*
- * these need to be raw because they are called under
- * pkey_assert()
- */
- raw_assert(fd >= 0);
- ret = write(fd, str, strlen(str));
- if (ret != strlen(str)) {
- perror("write to file failed");
- fprintf(stderr, "filename: '%s' str: '%s'\n", file, str);
- raw_assert(0);
- }
- close(fd);
-}
-
-#if CONTROL_TRACING > 0
-static int warned_tracing;
-int tracing_root_ok(void)
-{
- if (geteuid() != 0) {
- if (!warned_tracing)
- fprintf(stderr, "WARNING: not run as root, "
- "can not do tracing control\n");
- warned_tracing = 1;
- return 0;
- }
- return 1;
-}
-#endif
-
-void tracing_on(void)
-{
-#if CONTROL_TRACING > 0
-#define TRACEDIR "/sys/kernel/debug/tracing"
- char pidstr[32];
-
- if (!tracing_root_ok())
- return;
-
- sprintf(pidstr, "%d", getpid());
- cat_into_file("0", TRACEDIR "/tracing_on");
- cat_into_file("\n", TRACEDIR "/trace");
- if (1) {
- cat_into_file("function_graph", TRACEDIR "/current_tracer");
- cat_into_file("1", TRACEDIR "/options/funcgraph-proc");
- } else {
- cat_into_file("nop", TRACEDIR "/current_tracer");
- }
- cat_into_file(pidstr, TRACEDIR "/set_ftrace_pid");
- cat_into_file("1", TRACEDIR "/tracing_on");
- dprintf1("enabled tracing\n");
-#endif
-}
-
-void tracing_off(void)
-{
-#if CONTROL_TRACING > 0
- if (!tracing_root_ok())
- return;
- cat_into_file("0", "/sys/kernel/debug/tracing/tracing_on");
-#endif
-}
-
-void abort_hooks(void)
-{
- fprintf(stderr, "running %s()...\n", __func__);
- tracing_off();
-#ifdef SLEEP_ON_ABORT
- sleep(SLEEP_ON_ABORT);
-#endif
-}
-
-static inline void __page_o_noops(void)
-{
- /* 8-bytes of instruction * 512 bytes = 1 page */
- asm(".rept 512 ; nopl 0x7eeeeeee(%eax) ; .endr");
-}
-
-/*
- * This attempts to have roughly a page of instructions followed by a few
- * instructions that do a write, and another page of instructions. That
- * way, we are pretty sure that the write is in the second page of
- * instructions and has at least a page of padding behind it.
- *
- * *That* lets us be sure to madvise() away the write instruction, which
- * will then fault, which makes sure that the fault code handles
- * execute-only memory properly.
- */
-__attribute__((__aligned__(PAGE_SIZE)))
-void lots_o_noops_around_write(int *write_to_me)
-{
- dprintf3("running %s()\n", __func__);
- __page_o_noops();
- /* Assume this happens in the second page of instructions: */
- *write_to_me = __LINE__;
- /* pad out by another page: */
- __page_o_noops();
- dprintf3("%s() done\n", __func__);
-}
-
-/* Define some kernel-like types */
-#define u8 uint8_t
-#define u16 uint16_t
-#define u32 uint32_t
-#define u64 uint64_t
-
-#ifdef __i386__
-#define SYS_mprotect_key 380
-#define SYS_pkey_alloc 381
-#define SYS_pkey_free 382
-#define REG_IP_IDX REG_EIP
-#define si_pkey_offset 0x14
-#else
-#define SYS_mprotect_key 329
-#define SYS_pkey_alloc 330
-#define SYS_pkey_free 331
-#define REG_IP_IDX REG_RIP
-#define si_pkey_offset 0x20
-#endif
-
-void dump_mem(void *dumpme, int len_bytes)
-{
- char *c = (void *)dumpme;
- int i;
-
- for (i = 0; i < len_bytes; i += sizeof(u64)) {
- u64 *ptr = (u64 *)(c + i);
- dprintf1("dump[%03d][@%p]: %016jx\n", i, ptr, *ptr);
- }
-}
-
-#define __SI_FAULT (3 << 16)
-#define SEGV_BNDERR (__SI_FAULT|3) /* failed address bound checks */
-#define SEGV_PKUERR (__SI_FAULT|4)
-
-static char *si_code_str(int si_code)
-{
- if (si_code & SEGV_MAPERR)
- return "SEGV_MAPERR";
- if (si_code & SEGV_ACCERR)
- return "SEGV_ACCERR";
- if (si_code & SEGV_BNDERR)
- return "SEGV_BNDERR";
- if (si_code & SEGV_PKUERR)
- return "SEGV_PKUERR";
- return "UNKNOWN";
-}
-
-int pkru_faults;
-int last_si_pkey = -1;
-void signal_handler(int signum, siginfo_t *si, void *vucontext)
-{
- ucontext_t *uctxt = vucontext;
- int trapno;
- unsigned long ip;
- char *fpregs;
- u32 *pkru_ptr;
- u64 si_pkey;
- u32 *si_pkey_ptr;
- int pkru_offset;
- fpregset_t fpregset;
-
- dprint_in_signal = 1;
- dprintf1(">>>>===============SIGSEGV============================\n");
- dprintf1("%s()::%d, pkru: 0x%x shadow: %x\n", __func__, __LINE__,
- __rdpkru(), shadow_pkru);
-
- trapno = uctxt->uc_mcontext.gregs[REG_TRAPNO];
- ip = uctxt->uc_mcontext.gregs[REG_IP_IDX];
- fpregset = uctxt->uc_mcontext.fpregs;
- fpregs = (void *)fpregset;
-
- dprintf2("%s() trapno: %d ip: 0x%lx info->si_code: %s/%d\n", __func__,
- trapno, ip, si_code_str(si->si_code), si->si_code);
-#ifdef __i386__
- /*
- * 32-bit has some extra padding so that userspace can tell whether
- * the XSTATE header is present in addition to the "legacy" FPU
- * state. We just assume that it is here.
- */
- fpregs += 0x70;
-#endif
- pkru_offset = pkru_xstate_offset();
- pkru_ptr = (void *)(&fpregs[pkru_offset]);
-
- dprintf1("siginfo: %p\n", si);
- dprintf1(" fpregs: %p\n", fpregs);
- /*
- * If we got a PKRU fault, we *HAVE* to have at least one bit set in
- * here.
- */
- dprintf1("pkru_xstate_offset: %d\n", pkru_xstate_offset());
- if (DEBUG_LEVEL > 4)
- dump_mem(pkru_ptr - 128, 256);
- pkey_assert(*pkru_ptr);
-
- si_pkey_ptr = (u32 *)(((u8 *)si) + si_pkey_offset);
- dprintf1("si_pkey_ptr: %p\n", si_pkey_ptr);
- dump_mem(si_pkey_ptr - 8, 24);
- si_pkey = *si_pkey_ptr;
- pkey_assert(si_pkey < NR_PKEYS);
- last_si_pkey = si_pkey;
-
- if ((si->si_code == SEGV_MAPERR) ||
- (si->si_code == SEGV_ACCERR) ||
- (si->si_code == SEGV_BNDERR)) {
- printf("non-PK si_code, exiting...\n");
- exit(4);
- }
-
- dprintf1("signal pkru from xsave: %08x\n", *pkru_ptr);
- /* need __rdpkru() version so we do not do shadow_pkru checking */
- dprintf1("signal pkru from pkru: %08x\n", __rdpkru());
- dprintf1("si_pkey from siginfo: %jx\n", si_pkey);
- *(u64 *)pkru_ptr = 0x00000000;
- dprintf1("WARNING: set PRKU=0 to allow faulting instruction to continue\n");
- pkru_faults++;
- dprintf1("<<<<==================================================\n");
- return;
- if (trapno == 14) {
- fprintf(stderr,
- "ERROR: In signal handler, page fault, trapno = %d, ip = %016lx\n",
- trapno, ip);
- fprintf(stderr, "si_addr %p\n", si->si_addr);
- fprintf(stderr, "REG_ERR: %lx\n",
- (unsigned long)uctxt->uc_mcontext.gregs[REG_ERR]);
- exit(1);
- } else {
- fprintf(stderr, "unexpected trap %d! at 0x%lx\n", trapno, ip);
- fprintf(stderr, "si_addr %p\n", si->si_addr);
- fprintf(stderr, "REG_ERR: %lx\n",
- (unsigned long)uctxt->uc_mcontext.gregs[REG_ERR]);
- exit(2);
- }
- dprint_in_signal = 0;
-}
-
-int wait_all_children(void)
-{
- int status;
- return waitpid(-1, &status, 0);
-}
-
-void sig_chld(int x)
-{
- dprint_in_signal = 1;
- dprintf2("[%d] SIGCHLD: %d\n", getpid(), x);
- dprint_in_signal = 0;
-}
-
-void setup_sigsegv_handler(void)
-{
- int r, rs;
- struct sigaction newact;
- struct sigaction oldact;
-
- /* #PF is mapped to sigsegv */
- int signum = SIGSEGV;
-
- newact.sa_handler = 0;
- newact.sa_sigaction = signal_handler;
-
- /*sigset_t - signals to block while in the handler */
- /* get the old signal mask. */
- rs = sigprocmask(SIG_SETMASK, 0, &newact.sa_mask);
- pkey_assert(rs == 0);
-
- /* call sa_sigaction, not sa_handler*/
- newact.sa_flags = SA_SIGINFO;
-
- newact.sa_restorer = 0; /* void(*)(), obsolete */
- r = sigaction(signum, &newact, &oldact);
- r = sigaction(SIGALRM, &newact, &oldact);
- pkey_assert(r == 0);
-}
-
-void setup_handlers(void)
-{
- signal(SIGCHLD, &sig_chld);
- setup_sigsegv_handler();
-}
-
-pid_t fork_lazy_child(void)
-{
- pid_t forkret;
-
- forkret = fork();
- pkey_assert(forkret >= 0);
- dprintf3("[%d] fork() ret: %d\n", getpid(), forkret);
-
- if (!forkret) {
- /* in the child */
- while (1) {
- dprintf1("child sleeping...\n");
- sleep(30);
- }
- }
- return forkret;
-}
-
-void davecmp(void *_a, void *_b, int len)
-{
- int i;
- unsigned long *a = _a;
- unsigned long *b = _b;
-
- for (i = 0; i < len / sizeof(*a); i++) {
- if (a[i] == b[i])
- continue;
-
- dprintf3("[%3d]: a: %016lx b: %016lx\n", i, a[i], b[i]);
- }
-}
-
-void dumpit(char *f)
-{
- int fd = open(f, O_RDONLY);
- char buf[100];
- int nr_read;
-
- dprintf2("maps fd: %d\n", fd);
- do {
- nr_read = read(fd, &buf[0], sizeof(buf));
- write(1, buf, nr_read);
- } while (nr_read > 0);
- close(fd);
-}
-
-#define PKEY_DISABLE_ACCESS 0x1
-#define PKEY_DISABLE_WRITE 0x2
-
-u32 pkey_get(int pkey, unsigned long flags)
-{
- u32 mask = (PKEY_DISABLE_ACCESS|PKEY_DISABLE_WRITE);
- u32 pkru = __rdpkru();
- u32 shifted_pkru;
- u32 masked_pkru;
-
- dprintf1("%s(pkey=%d, flags=%lx) = %x / %d\n",
- __func__, pkey, flags, 0, 0);
- dprintf2("%s() raw pkru: %x\n", __func__, pkru);
-
- shifted_pkru = (pkru >> (pkey * PKRU_BITS_PER_PKEY));
- dprintf2("%s() shifted_pkru: %x\n", __func__, shifted_pkru);
- masked_pkru = shifted_pkru & mask;
- dprintf2("%s() masked pkru: %x\n", __func__, masked_pkru);
- /*
- * shift down the relevant bits to the lowest two, then
- * mask off all the other high bits.
- */
- return masked_pkru;
-}
-
-int pkey_set(int pkey, unsigned long rights, unsigned long flags)
-{
- u32 mask = (PKEY_DISABLE_ACCESS|PKEY_DISABLE_WRITE);
- u32 old_pkru = __rdpkru();
- u32 new_pkru;
-
- /* make sure that 'rights' only contains the bits we expect: */
- assert(!(rights & ~mask));
-
- /* copy old pkru */
- new_pkru = old_pkru;
- /* mask out bits from pkey in old value: */
- new_pkru &= ~(mask << (pkey * PKRU_BITS_PER_PKEY));
- /* OR in new bits for pkey: */
- new_pkru |= (rights << (pkey * PKRU_BITS_PER_PKEY));
-
- __wrpkru(new_pkru);
-
- dprintf3("%s(pkey=%d, rights=%lx, flags=%lx) = %x pkru now: %x old_pkru: %x\n",
- __func__, pkey, rights, flags, 0, __rdpkru(), old_pkru);
- return 0;
-}
-
-void pkey_disable_set(int pkey, int flags)
-{
- unsigned long syscall_flags = 0;
- int ret;
- int pkey_rights;
- u32 orig_pkru = rdpkru();
-
- dprintf1("START->%s(%d, 0x%x)\n", __func__,
- pkey, flags);
- pkey_assert(flags & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
-
- pkey_rights = pkey_get(pkey, syscall_flags);
-
- dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
- pkey, pkey, pkey_rights);
- pkey_assert(pkey_rights >= 0);
-
- pkey_rights |= flags;
-
- ret = pkey_set(pkey, pkey_rights, syscall_flags);
- assert(!ret);
- /*pkru and flags have the same format */
- shadow_pkru |= flags << (pkey * 2);
- dprintf1("%s(%d) shadow: 0x%x\n", __func__, pkey, shadow_pkru);
-
- pkey_assert(ret >= 0);
-
- pkey_rights = pkey_get(pkey, syscall_flags);
- dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
- pkey, pkey, pkey_rights);
-
- dprintf1("%s(%d) pkru: 0x%x\n", __func__, pkey, rdpkru());
- if (flags)
- pkey_assert(rdpkru() > orig_pkru);
- dprintf1("END<---%s(%d, 0x%x)\n", __func__,
- pkey, flags);
-}
-
-void pkey_disable_clear(int pkey, int flags)
-{
- unsigned long syscall_flags = 0;
- int ret;
- int pkey_rights = pkey_get(pkey, syscall_flags);
- u32 orig_pkru = rdpkru();
-
- pkey_assert(flags & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
-
- dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
- pkey, pkey, pkey_rights);
- pkey_assert(pkey_rights >= 0);
-
- pkey_rights |= flags;
-
- ret = pkey_set(pkey, pkey_rights, 0);
- /* pkru and flags have the same format */
- shadow_pkru &= ~(flags << (pkey * 2));
- pkey_assert(ret >= 0);
-
- pkey_rights = pkey_get(pkey, syscall_flags);
- dprintf1("%s(%d) pkey_get(%d): %x\n", __func__,
- pkey, pkey, pkey_rights);
-
- dprintf1("%s(%d) pkru: 0x%x\n", __func__, pkey, rdpkru());
- if (flags)
- assert(rdpkru() > orig_pkru);
-}
-
-void pkey_write_allow(int pkey)
-{
- pkey_disable_clear(pkey, PKEY_DISABLE_WRITE);
-}
-void pkey_write_deny(int pkey)
-{
- pkey_disable_set(pkey, PKEY_DISABLE_WRITE);
-}
-void pkey_access_allow(int pkey)
-{
- pkey_disable_clear(pkey, PKEY_DISABLE_ACCESS);
-}
-void pkey_access_deny(int pkey)
-{
- pkey_disable_set(pkey, PKEY_DISABLE_ACCESS);
-}
-
-int sys_mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot,
- unsigned long pkey)
-{
- int sret;
-
- dprintf2("%s(0x%p, %zx, prot=%lx, pkey=%lx)\n", __func__,
- ptr, size, orig_prot, pkey);
-
- errno = 0;
- sret = syscall(SYS_mprotect_key, ptr, size, orig_prot, pkey);
- if (errno) {
- dprintf2("SYS_mprotect_key sret: %d\n", sret);
- dprintf2("SYS_mprotect_key prot: 0x%lx\n", orig_prot);
- dprintf2("SYS_mprotect_key failed, errno: %d\n", errno);
- if (DEBUG_LEVEL >= 2)
- perror("SYS_mprotect_pkey");
- }
- return sret;
-}
-
-int sys_pkey_alloc(unsigned long flags, unsigned long init_val)
-{
- int ret = syscall(SYS_pkey_alloc, flags, init_val);
- dprintf1("%s(flags=%lx, init_val=%lx) syscall ret: %d errno: %d\n",
- __func__, flags, init_val, ret, errno);
- return ret;
-}
-
-int alloc_pkey(void)
-{
- int ret;
- unsigned long init_val = 0x0;
-
- dprintf1("alloc_pkey()::%d, pkru: 0x%x shadow: %x\n",
- __LINE__, __rdpkru(), shadow_pkru);
- ret = sys_pkey_alloc(0, init_val);
- /*
- * pkey_alloc() sets PKRU, so we need to reflect it in
- * shadow_pkru:
- */
- dprintf4("alloc_pkey()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n",
- __LINE__, ret, __rdpkru(), shadow_pkru);
- if (ret) {
- /* clear both the bits: */
- shadow_pkru &= ~(0x3 << (ret * 2));
- dprintf4("alloc_pkey()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n",
- __LINE__, ret, __rdpkru(), shadow_pkru);
- /*
- * move the new state in from init_val
- * (remember, we cheated and init_val == pkru format)
- */
- shadow_pkru |= (init_val << (ret * 2));
- }
- dprintf4("alloc_pkey()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n",
- __LINE__, ret, __rdpkru(), shadow_pkru);
- dprintf1("alloc_pkey()::%d errno: %d\n", __LINE__, errno);
- /* for shadow checking: */
- rdpkru();
- dprintf4("alloc_pkey()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n",
- __LINE__, ret, __rdpkru(), shadow_pkru);
- return ret;
-}
-
-int sys_pkey_free(unsigned long pkey)
-{
- int ret = syscall(SYS_pkey_free, pkey);
- dprintf1("%s(pkey=%ld) syscall ret: %d\n", __func__, pkey, ret);
- return ret;
-}
-
-/*
- * I had a bug where pkey bits could be set by mprotect() but
- * not cleared. This ensures we get lots of random bit sets
- * and clears on the vma and pte pkey bits.
- */
-int alloc_random_pkey(void)
-{
- int max_nr_pkey_allocs;
- int ret;
- int i;
- int alloced_pkeys[NR_PKEYS];
- int nr_alloced = 0;
- int random_index;
- memset(alloced_pkeys, 0, sizeof(alloced_pkeys));
-
- /* allocate every possible key and make a note of which ones we got */
- max_nr_pkey_allocs = NR_PKEYS;
- max_nr_pkey_allocs = 1;
- for (i = 0; i < max_nr_pkey_allocs; i++) {
- int new_pkey = alloc_pkey();
- if (new_pkey < 0)
- break;
- alloced_pkeys[nr_alloced++] = new_pkey;
- }
-
- pkey_assert(nr_alloced > 0);
- /* select a random one out of the allocated ones */
- random_index = rand() % nr_alloced;
- ret = alloced_pkeys[random_index];
- /* now zero it out so we don't free it next */
- alloced_pkeys[random_index] = 0;
-
- /* go through the allocated ones that we did not want and free them */
- for (i = 0; i < nr_alloced; i++) {
- int free_ret;
- if (!alloced_pkeys[i])
- continue;
- free_ret = sys_pkey_free(alloced_pkeys[i]);
- pkey_assert(!free_ret);
- }
- dprintf1("%s()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n", __func__,
- __LINE__, ret, __rdpkru(), shadow_pkru);
- return ret;
-}
-
-int mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot,
- unsigned long pkey)
-{
- int nr_iterations = random() % 100;
- int ret;
-
- while (0) {
- int rpkey = alloc_random_pkey();
- ret = sys_mprotect_pkey(ptr, size, orig_prot, pkey);
- dprintf1("sys_mprotect_pkey(%p, %zx, prot=0x%lx, pkey=%ld) ret: %d\n",
- ptr, size, orig_prot, pkey, ret);
- if (nr_iterations-- < 0)
- break;
-
- dprintf1("%s()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n", __func__,
- __LINE__, ret, __rdpkru(), shadow_pkru);
- sys_pkey_free(rpkey);
- dprintf1("%s()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n", __func__,
- __LINE__, ret, __rdpkru(), shadow_pkru);
- }
- pkey_assert(pkey < NR_PKEYS);
-
- ret = sys_mprotect_pkey(ptr, size, orig_prot, pkey);
- dprintf1("mprotect_pkey(%p, %zx, prot=0x%lx, pkey=%ld) ret: %d\n",
- ptr, size, orig_prot, pkey, ret);
- pkey_assert(!ret);
- dprintf1("%s()::%d, ret: %d pkru: 0x%x shadow: 0x%x\n", __func__,
- __LINE__, ret, __rdpkru(), shadow_pkru);
- return ret;
-}
-
-struct pkey_malloc_record {
- void *ptr;
- long size;
-};
-struct pkey_malloc_record *pkey_malloc_records;
-long nr_pkey_malloc_records;
-void record_pkey_malloc(void *ptr, long size)
-{
- long i;
- struct pkey_malloc_record *rec = NULL;
-
- for (i = 0; i < nr_pkey_malloc_records; i++) {
- rec = &pkey_malloc_records[i];
- /* find a free record */
- if (rec)
- break;
- }
- if (!rec) {
- /* every record is full */
- size_t old_nr_records = nr_pkey_malloc_records;
- size_t new_nr_records = (nr_pkey_malloc_records * 2 + 1);
- size_t new_size = new_nr_records * sizeof(struct pkey_malloc_record);
- dprintf2("new_nr_records: %zd\n", new_nr_records);
- dprintf2("new_size: %zd\n", new_size);
- pkey_malloc_records = realloc(pkey_malloc_records, new_size);
- pkey_assert(pkey_malloc_records != NULL);
- rec = &pkey_malloc_records[nr_pkey_malloc_records];
- /*
- * realloc() does not initialize memory, so zero it from
- * the first new record all the way to the end.
- */
- for (i = 0; i < new_nr_records - old_nr_records; i++)
- memset(rec + i, 0, sizeof(*rec));
- }
- dprintf3("filling malloc record[%d/%p]: {%p, %ld}\n",
- (int)(rec - pkey_malloc_records), rec, ptr, size);
- rec->ptr = ptr;
- rec->size = size;
- nr_pkey_malloc_records++;
-}
-
-void free_pkey_malloc(void *ptr)
-{
- long i;
- int ret;
- dprintf3("%s(%p)\n", __func__, ptr);
- for (i = 0; i < nr_pkey_malloc_records; i++) {
- struct pkey_malloc_record *rec = &pkey_malloc_records[i];
- dprintf4("looking for ptr %p at record[%ld/%p]: {%p, %ld}\n",
- ptr, i, rec, rec->ptr, rec->size);
- if ((ptr < rec->ptr) ||
- (ptr >= rec->ptr + rec->size))
- continue;
-
- dprintf3("found ptr %p at record[%ld/%p]: {%p, %ld}\n",
- ptr, i, rec, rec->ptr, rec->size);
- nr_pkey_malloc_records--;
- ret = munmap(rec->ptr, rec->size);
- dprintf3("munmap ret: %d\n", ret);
- pkey_assert(!ret);
- dprintf3("clearing rec->ptr, rec: %p\n", rec);
- rec->ptr = NULL;
- dprintf3("done clearing rec->ptr, rec: %p\n", rec);
- return;
- }
- pkey_assert(false);
-}
-
-
-void *malloc_pkey_with_mprotect(long size, int prot, u16 pkey)
-{
- void *ptr;
- int ret;
-
- rdpkru();
- dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__,
- size, prot, pkey);
- pkey_assert(pkey < NR_PKEYS);
- ptr = mmap(NULL, size, prot, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
- pkey_assert(ptr != (void *)-1);
- ret = mprotect_pkey((void *)ptr, PAGE_SIZE, prot, pkey);
- pkey_assert(!ret);
- record_pkey_malloc(ptr, size);
- rdpkru();
-
- dprintf1("%s() for pkey %d @ %p\n", __func__, pkey, ptr);
- return ptr;
-}
-
-void *malloc_pkey_anon_huge(long size, int prot, u16 pkey)
-{
- int ret;
- void *ptr;
-
- dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__,
- size, prot, pkey);
- /*
- * Guarantee we can fit at least one huge page in the resulting
- * allocation by allocating space for 2:
- */
- size = ALIGN_UP(size, HPAGE_SIZE * 2);
- ptr = mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
- pkey_assert(ptr != (void *)-1);
- record_pkey_malloc(ptr, size);
- mprotect_pkey(ptr, size, prot, pkey);
-
- dprintf1("unaligned ptr: %p\n", ptr);
- ptr = ALIGN_PTR_UP(ptr, HPAGE_SIZE);
- dprintf1(" aligned ptr: %p\n", ptr);
- ret = madvise(ptr, HPAGE_SIZE, MADV_HUGEPAGE);
- dprintf1("MADV_HUGEPAGE ret: %d\n", ret);
- ret = madvise(ptr, HPAGE_SIZE, MADV_WILLNEED);
- dprintf1("MADV_WILLNEED ret: %d\n", ret);
- memset(ptr, 0, HPAGE_SIZE);
-
- dprintf1("mmap()'d thp for pkey %d @ %p\n", pkey, ptr);
- return ptr;
-}
-
-int hugetlb_setup_ok;
-#define GET_NR_HUGE_PAGES 10
-void setup_hugetlbfs(void)
-{
- int err;
- int fd;
- char buf[] = "123";
-
- if (geteuid() != 0) {
- fprintf(stderr, "WARNING: not run as root, can not do hugetlb test\n");
- return;
- }
-
- cat_into_file(__stringify(GET_NR_HUGE_PAGES), "/proc/sys/vm/nr_hugepages");
-
- /*
- * Now go make sure that we got the pages and that they
- * are 2M pages. Someone might have made 1G the default.
- */
- fd = open("/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages", O_RDONLY);
- if (fd < 0) {
- perror("opening sysfs 2M hugetlb config");
- return;
- }
-
- /* -1 to guarantee leaving the trailing \0 */
- err = read(fd, buf, sizeof(buf)-1);
- close(fd);
- if (err <= 0) {
- perror("reading sysfs 2M hugetlb config");
- return;
- }
-
- if (atoi(buf) != GET_NR_HUGE_PAGES) {
- fprintf(stderr, "could not confirm 2M pages, got: '%s' expected %d\n",
- buf, GET_NR_HUGE_PAGES);
- return;
- }
-
- hugetlb_setup_ok = 1;
-}
-
-void *malloc_pkey_hugetlb(long size, int prot, u16 pkey)
-{
- void *ptr;
- int flags = MAP_ANONYMOUS|MAP_PRIVATE|MAP_HUGETLB;
-
- if (!hugetlb_setup_ok)
- return PTR_ERR_ENOTSUP;
-
- dprintf1("doing %s(%ld, %x, %x)\n", __func__, size, prot, pkey);
- size = ALIGN_UP(size, HPAGE_SIZE * 2);
- pkey_assert(pkey < NR_PKEYS);
- ptr = mmap(NULL, size, PROT_NONE, flags, -1, 0);
- pkey_assert(ptr != (void *)-1);
- mprotect_pkey(ptr, size, prot, pkey);
-
- record_pkey_malloc(ptr, size);
-
- dprintf1("mmap()'d hugetlbfs for pkey %d @ %p\n", pkey, ptr);
- return ptr;
-}
-
-void *malloc_pkey_mmap_dax(long size, int prot, u16 pkey)
-{
- void *ptr;
- int fd;
-
- dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__,
- size, prot, pkey);
- pkey_assert(pkey < NR_PKEYS);
- fd = open("/dax/foo", O_RDWR);
- pkey_assert(fd >= 0);
-
- ptr = mmap(0, size, prot, MAP_SHARED, fd, 0);
- pkey_assert(ptr != (void *)-1);
-
- mprotect_pkey(ptr, size, prot, pkey);
-
- record_pkey_malloc(ptr, size);
-
- dprintf1("mmap()'d for pkey %d @ %p\n", pkey, ptr);
- close(fd);
- return ptr;
-}
-
-void *(*pkey_malloc[])(long size, int prot, u16 pkey) = {
-
- malloc_pkey_with_mprotect,
- malloc_pkey_anon_huge,
- malloc_pkey_hugetlb
-/* can not do direct with the pkey_mprotect() API:
- malloc_pkey_mmap_direct,
- malloc_pkey_mmap_dax,
-*/
-};
-
-void *malloc_pkey(long size, int prot, u16 pkey)
-{
- void *ret;
- static int malloc_type;
- int nr_malloc_types = ARRAY_SIZE(pkey_malloc);
-
- pkey_assert(pkey < NR_PKEYS);
-
- while (1) {
- pkey_assert(malloc_type < nr_malloc_types);
-
- ret = pkey_malloc[malloc_type](size, prot, pkey);
- pkey_assert(ret != (void *)-1);
-
- malloc_type++;
- if (malloc_type >= nr_malloc_types)
- malloc_type = (random()%nr_malloc_types);
-
- /* try again if the malloc_type we tried is unsupported */
- if (ret == PTR_ERR_ENOTSUP)
- continue;
-
- break;
- }
-
- dprintf3("%s(%ld, prot=%x, pkey=%x) returning: %p\n", __func__,
- size, prot, pkey, ret);
- return ret;
-}
-
-int last_pkru_faults;
-void expected_pk_fault(int pkey)
-{
- dprintf2("%s(): last_pkru_faults: %d pkru_faults: %d\n",
- __func__, last_pkru_faults, pkru_faults);
- dprintf2("%s(%d): last_si_pkey: %d\n", __func__, pkey, last_si_pkey);
- pkey_assert(last_pkru_faults + 1 == pkru_faults);
- pkey_assert(last_si_pkey == pkey);
- /*
- * The signal handler shold have cleared out PKRU to let the
- * test program continue. We now have to restore it.
- */
- if (__rdpkru() != 0)
- pkey_assert(0);
-
- __wrpkru(shadow_pkru);
- dprintf1("%s() set PKRU=%x to restore state after signal nuked it\n",
- __func__, shadow_pkru);
- last_pkru_faults = pkru_faults;
- last_si_pkey = -1;
-}
-
-void do_not_expect_pk_fault(void)
-{
- pkey_assert(last_pkru_faults == pkru_faults);
-}
-
-int test_fds[10] = { -1 };
-int nr_test_fds;
-void __save_test_fd(int fd)
-{
- pkey_assert(fd >= 0);
- pkey_assert(nr_test_fds < ARRAY_SIZE(test_fds));
- test_fds[nr_test_fds] = fd;
- nr_test_fds++;
-}
-
-int get_test_read_fd(void)
-{
- int test_fd = open("/etc/passwd", O_RDONLY);
- __save_test_fd(test_fd);
- return test_fd;
-}
-
-void close_test_fds(void)
-{
- int i;
-
- for (i = 0; i < nr_test_fds; i++) {
- if (test_fds[i] < 0)
- continue;
- close(test_fds[i]);
- test_fds[i] = -1;
- }
- nr_test_fds = 0;
-}
-
-#define barrier() __asm__ __volatile__("": : :"memory")
-__attribute__((noinline)) int read_ptr(int *ptr)
-{
- /*
- * Keep GCC from optimizing this away somehow
- */
- barrier();
- return *ptr;
-}
-
-void test_read_of_write_disabled_region(int *ptr, u16 pkey)
-{
- int ptr_contents;
-
- dprintf1("disabling write access to PKEY[1], doing read\n");
- pkey_write_deny(pkey);
- ptr_contents = read_ptr(ptr);
- dprintf1("*ptr: %d\n", ptr_contents);
- dprintf1("\n");
-}
-void test_read_of_access_disabled_region(int *ptr, u16 pkey)
-{
- int ptr_contents;
-
- dprintf1("disabling access to PKEY[%02d], doing read @ %p\n", pkey, ptr);
- rdpkru();
- pkey_access_deny(pkey);
- ptr_contents = read_ptr(ptr);
- dprintf1("*ptr: %d\n", ptr_contents);
- expected_pk_fault(pkey);
-}
-void test_write_of_write_disabled_region(int *ptr, u16 pkey)
-{
- dprintf1("disabling write access to PKEY[%02d], doing write\n", pkey);
- pkey_write_deny(pkey);
- *ptr = __LINE__;
- expected_pk_fault(pkey);
-}
-void test_write_of_access_disabled_region(int *ptr, u16 pkey)
-{
- dprintf1("disabling access to PKEY[%02d], doing write\n", pkey);
- pkey_access_deny(pkey);
- *ptr = __LINE__;
- expected_pk_fault(pkey);
-}
-void test_kernel_write_of_access_disabled_region(int *ptr, u16 pkey)
-{
- int ret;
- int test_fd = get_test_read_fd();
-
- dprintf1("disabling access to PKEY[%02d], "
- "having kernel read() to buffer\n", pkey);
- pkey_access_deny(pkey);
- ret = read(test_fd, ptr, 1);
- dprintf1("read ret: %d\n", ret);
- pkey_assert(ret);
-}
-void test_kernel_write_of_write_disabled_region(int *ptr, u16 pkey)
-{
- int ret;
- int test_fd = get_test_read_fd();
-
- pkey_write_deny(pkey);
- ret = read(test_fd, ptr, 100);
- dprintf1("read ret: %d\n", ret);
- if (ret < 0 && (DEBUG_LEVEL > 0))
- perror("verbose read result (OK for this to be bad)");
- pkey_assert(ret);
-}
-
-void test_kernel_gup_of_access_disabled_region(int *ptr, u16 pkey)
-{
- int pipe_ret, vmsplice_ret;
- struct iovec iov;
- int pipe_fds[2];
-
- pipe_ret = pipe(pipe_fds);
-
- pkey_assert(pipe_ret == 0);
- dprintf1("disabling access to PKEY[%02d], "
- "having kernel vmsplice from buffer\n", pkey);
- pkey_access_deny(pkey);
- iov.iov_base = ptr;
- iov.iov_len = PAGE_SIZE;
- vmsplice_ret = vmsplice(pipe_fds[1], &iov, 1, SPLICE_F_GIFT);
- dprintf1("vmsplice() ret: %d\n", vmsplice_ret);
- pkey_assert(vmsplice_ret == -1);
-
- close(pipe_fds[0]);
- close(pipe_fds[1]);
-}
-
-void test_kernel_gup_write_to_write_disabled_region(int *ptr, u16 pkey)
-{
- int ignored = 0xdada;
- int futex_ret;
- int some_int = __LINE__;
-
- dprintf1("disabling write to PKEY[%02d], "
- "doing futex gunk in buffer\n", pkey);
- *ptr = some_int;
- pkey_write_deny(pkey);
- futex_ret = syscall(SYS_futex, ptr, FUTEX_WAIT, some_int-1, NULL,
- &ignored, ignored);
- if (DEBUG_LEVEL > 0)
- perror("futex");
- dprintf1("futex() ret: %d\n", futex_ret);
-}
-
-/* Assumes that all pkeys other than 'pkey' are unallocated */
-void test_pkey_syscalls_on_non_allocated_pkey(int *ptr, u16 pkey)
-{
- int err;
- int i;
-
- /* Note: 0 is the default pkey, so don't mess with it */
- for (i = 1; i < NR_PKEYS; i++) {
- if (pkey == i)
- continue;
-
- dprintf1("trying get/set/free to non-allocated pkey: %2d\n", i);
- err = sys_pkey_free(i);
- pkey_assert(err);
-
- err = sys_pkey_free(i);
- pkey_assert(err);
-
- err = sys_mprotect_pkey(ptr, PAGE_SIZE, PROT_READ, i);
- pkey_assert(err);
- }
-}
-
-/* Assumes that all pkeys other than 'pkey' are unallocated */
-void test_pkey_syscalls_bad_args(int *ptr, u16 pkey)
-{
- int err;
- int bad_pkey = NR_PKEYS+99;
-
- /* pass a known-invalid pkey in: */
- err = sys_mprotect_pkey(ptr, PAGE_SIZE, PROT_READ, bad_pkey);
- pkey_assert(err);
-}
-
-/* Assumes that all pkeys other than 'pkey' are unallocated */
-void test_pkey_alloc_exhaust(int *ptr, u16 pkey)
-{
- int err;
- int allocated_pkeys[NR_PKEYS] = {0};
- int nr_allocated_pkeys = 0;
- int i;
-
- for (i = 0; i < NR_PKEYS*2; i++) {
- int new_pkey;
- dprintf1("%s() alloc loop: %d\n", __func__, i);
- new_pkey = alloc_pkey();
- dprintf4("%s()::%d, err: %d pkru: 0x%x shadow: 0x%x\n", __func__,
- __LINE__, err, __rdpkru(), shadow_pkru);
- rdpkru(); /* for shadow checking */
- dprintf2("%s() errno: %d ENOSPC: %d\n", __func__, errno, ENOSPC);
- if ((new_pkey == -1) && (errno == ENOSPC)) {
- dprintf2("%s() failed to allocate pkey after %d tries\n",
- __func__, nr_allocated_pkeys);
- break;
- }
- pkey_assert(nr_allocated_pkeys < NR_PKEYS);
- allocated_pkeys[nr_allocated_pkeys++] = new_pkey;
- }
-
- dprintf3("%s()::%d\n", __func__, __LINE__);
-
- /*
- * ensure it did not reach the end of the loop without
- * failure:
- */
- pkey_assert(i < NR_PKEYS*2);
-
- /*
- * There are 16 pkeys supported in hardware. One is taken
- * up for the default (0) and another can be taken up by
- * an execute-only mapping. Ensure that we can allocate
- * at least 14 (16-2).
- */
- pkey_assert(i >= NR_PKEYS-2);
-
- for (i = 0; i < nr_allocated_pkeys; i++) {
- err = sys_pkey_free(allocated_pkeys[i]);
- pkey_assert(!err);
- rdpkru(); /* for shadow checking */
- }
-}
-
-void test_ptrace_of_child(int *ptr, u16 pkey)
-{
- __attribute__((__unused__)) int peek_result;
- pid_t child_pid;
- void *ignored = 0;
- long ret;
- int status;
- /*
- * This is the "control" for our little expermient. Make sure
- * we can always access it when ptracing.
- */
- int *plain_ptr_unaligned = malloc(HPAGE_SIZE);
- int *plain_ptr = ALIGN_PTR_UP(plain_ptr_unaligned, PAGE_SIZE);
-
- /*
- * Fork a child which is an exact copy of this process, of course.
- * That means we can do all of our tests via ptrace() and then plain
- * memory access and ensure they work differently.
- */
- child_pid = fork_lazy_child();
- dprintf1("[%d] child pid: %d\n", getpid(), child_pid);
-
- ret = ptrace(PTRACE_ATTACH, child_pid, ignored, ignored);
- if (ret)
- perror("attach");
- dprintf1("[%d] attach ret: %ld %d\n", getpid(), ret, __LINE__);
- pkey_assert(ret != -1);
- ret = waitpid(child_pid, &status, WUNTRACED);
- if ((ret != child_pid) || !(WIFSTOPPED(status))) {
- fprintf(stderr, "weird waitpid result %ld stat %x\n",
- ret, status);
- pkey_assert(0);
- }
- dprintf2("waitpid ret: %ld\n", ret);
- dprintf2("waitpid status: %d\n", status);
-
- pkey_access_deny(pkey);
- pkey_write_deny(pkey);
-
- /* Write access, untested for now:
- ret = ptrace(PTRACE_POKEDATA, child_pid, peek_at, data);
- pkey_assert(ret != -1);
- dprintf1("poke at %p: %ld\n", peek_at, ret);
- */
-
- /*
- * Try to access the pkey-protected "ptr" via ptrace:
- */
- ret = ptrace(PTRACE_PEEKDATA, child_pid, ptr, ignored);
- /* expect it to work, without an error: */
- pkey_assert(ret != -1);
- /* Now access from the current task, and expect an exception: */
- peek_result = read_ptr(ptr);
- expected_pk_fault(pkey);
-
- /*
- * Try to access the NON-pkey-protected "plain_ptr" via ptrace:
- */
- ret = ptrace(PTRACE_PEEKDATA, child_pid, plain_ptr, ignored);
- /* expect it to work, without an error: */
- pkey_assert(ret != -1);
- /* Now access from the current task, and expect NO exception: */
- peek_result = read_ptr(plain_ptr);
- do_not_expect_pk_fault();
-
- ret = ptrace(PTRACE_DETACH, child_pid, ignored, 0);
- pkey_assert(ret != -1);
-
- ret = kill(child_pid, SIGKILL);
- pkey_assert(ret != -1);
-
- wait(&status);
-
- free(plain_ptr_unaligned);
-}
-
-void test_executing_on_unreadable_memory(int *ptr, u16 pkey)
-{
- void *p1;
- int scratch;
- int ptr_contents;
- int ret;
-
- p1 = ALIGN_PTR_UP(&lots_o_noops_around_write, PAGE_SIZE);
- dprintf3("&lots_o_noops: %p\n", &lots_o_noops_around_write);
- /* lots_o_noops_around_write should be page-aligned already */
- assert(p1 == &lots_o_noops_around_write);
-
- /* Point 'p1' at the *second* page of the function: */
- p1 += PAGE_SIZE;
-
- madvise(p1, PAGE_SIZE, MADV_DONTNEED);
- lots_o_noops_around_write(&scratch);
- ptr_contents = read_ptr(p1);
- dprintf2("ptr (%p) contents@%d: %x\n", p1, __LINE__, ptr_contents);
-
- ret = mprotect_pkey(p1, PAGE_SIZE, PROT_EXEC, (u64)pkey);
- pkey_assert(!ret);
- pkey_access_deny(pkey);
-
- dprintf2("pkru: %x\n", rdpkru());
-
- /*
- * Make sure this is an *instruction* fault
- */
- madvise(p1, PAGE_SIZE, MADV_DONTNEED);
- lots_o_noops_around_write(&scratch);
- do_not_expect_pk_fault();
- ptr_contents = read_ptr(p1);
- dprintf2("ptr (%p) contents@%d: %x\n", p1, __LINE__, ptr_contents);
- expected_pk_fault(pkey);
-}
-
-void test_mprotect_pkey_on_unsupported_cpu(int *ptr, u16 pkey)
-{
- int size = PAGE_SIZE;
- int sret;
-
- if (cpu_has_pku()) {
- dprintf1("SKIP: %s: no CPU support\n", __func__);
- return;
- }
-
- sret = syscall(SYS_mprotect_key, ptr, size, PROT_READ, pkey);
- pkey_assert(sret < 0);
-}
-
-void (*pkey_tests[])(int *ptr, u16 pkey) = {
- test_read_of_write_disabled_region,
- test_read_of_access_disabled_region,
- test_write_of_write_disabled_region,
- test_write_of_access_disabled_region,
- test_kernel_write_of_access_disabled_region,
- test_kernel_write_of_write_disabled_region,
- test_kernel_gup_of_access_disabled_region,
- test_kernel_gup_write_to_write_disabled_region,
- test_executing_on_unreadable_memory,
- test_ptrace_of_child,
- test_pkey_syscalls_on_non_allocated_pkey,
- test_pkey_syscalls_bad_args,
- test_pkey_alloc_exhaust,
-};
-
-void run_tests_once(void)
-{
- int *ptr;
- int prot = PROT_READ|PROT_WRITE;
-
- for (test_nr = 0; test_nr < ARRAY_SIZE(pkey_tests); test_nr++) {
- int pkey;
- int orig_pkru_faults = pkru_faults;
-
- dprintf1("======================\n");
- dprintf1("test %d preparing...\n", test_nr);
-
- tracing_on();
- pkey = alloc_random_pkey();
- dprintf1("test %d starting with pkey: %d\n", test_nr, pkey);
- ptr = malloc_pkey(PAGE_SIZE, prot, pkey);
- dprintf1("test %d starting...\n", test_nr);
- pkey_tests[test_nr](ptr, pkey);
- dprintf1("freeing test memory: %p\n", ptr);
- free_pkey_malloc(ptr);
- sys_pkey_free(pkey);
-
- dprintf1("pkru_faults: %d\n", pkru_faults);
- dprintf1("orig_pkru_faults: %d\n", orig_pkru_faults);
-
- tracing_off();
- close_test_fds();
-
- printf("test %2d PASSED (iteration %d)\n", test_nr, iteration_nr);
- dprintf1("======================\n\n");
- }
- iteration_nr++;
-}
-
-void pkey_setup_shadow(void)
-{
- shadow_pkru = __rdpkru();
-}
-
-int main(void)
-{
- int nr_iterations = 22;
-
- setup_handlers();
-
- printf("has pku: %d\n", cpu_has_pku());
-
- if (!cpu_has_pku()) {
- int size = PAGE_SIZE;
- int *ptr;
-
- printf("running PKEY tests for unsupported CPU/OS\n");
-
- ptr = mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
- assert(ptr != (void *)-1);
- test_mprotect_pkey_on_unsupported_cpu(ptr, 1);
- exit(0);
- }
-
- pkey_setup_shadow();
- printf("startup pkru: %x\n", rdpkru());
- setup_hugetlbfs();
-
- while (nr_iterations-- > 0)
- run_tests_once();
-
- printf("done (all tests OK)\n");
- return 0;
-}
--
1.8.3.1

2017-06-17 03:54:18

by Ram Pai

[permalink] [raw]
Subject: [RFC v2 09/12] powerpc: Deliver SEGV signal on pkey violation.

The value of the AMR register at the time of the exception
is made available in gp_regs[PT_AMR] of the signal context.

This field can be used to reprogram the permission bits of
any valid pkey.

Similarly, the value of the pkey whose protection got violated
is made available in the si_pkey field of the siginfo structure.
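
As an illustration (not part of this patch), a userspace SIGSEGV handler
could consume both fields roughly as follows. This is only a sketch: it
assumes ppc64, that the libc headers expose si_pkey and the PT_AMR index
added here, and that the AMR bit layout follows the AMR_BITS_PER_PKEY
convention from the pkeys header.

#include <signal.h>
#include <ucontext.h>

static void segv_handler(int sig, siginfo_t *si, void *vctx)
{
	ucontext_t *ctx = vctx;
	unsigned long amr = ctx->uc_mcontext.gp_regs[PT_AMR];
	int pkey = si->si_pkey;		/* key whose protection was violated */

	/* clear the access/write disable bits for this key; sigreturn
	 * writes the updated value back into the AMR so the faulting
	 * access can be retried */
	amr &= ~(0x3UL << ((31 - pkey) * 2));
	ctx->uc_mcontext.gp_regs[PT_AMR] = amr;
}

The handler must be installed with SA_SIGINFO for the siginfo and
context arguments to be valid.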

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/include/asm/paca.h | 1 +
arch/powerpc/include/uapi/asm/ptrace.h | 3 ++-
arch/powerpc/kernel/asm-offsets.c | 5 ++++
arch/powerpc/kernel/exceptions-64s.S | 8 ++++++
arch/powerpc/kernel/signal_32.c | 14 ++++++++++
arch/powerpc/kernel/signal_64.c | 14 ++++++++++
arch/powerpc/kernel/traps.c | 49 ++++++++++++++++++++++++++++++++++
arch/powerpc/mm/fault.c | 4 +++
8 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 1c09f8f..a41afd3 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -92,6 +92,7 @@ struct paca_struct {
struct dtl_entry *dispatch_log_end;
#endif /* CONFIG_PPC_STD_MMU_64 */
u64 dscr_default; /* per-CPU default DSCR */
+ u64 paca_amr; /* value of amr at exception */

#ifdef CONFIG_PPC_STD_MMU_64
/*
diff --git a/arch/powerpc/include/uapi/asm/ptrace.h b/arch/powerpc/include/uapi/asm/ptrace.h
index 8036b38..7ec2428 100644
--- a/arch/powerpc/include/uapi/asm/ptrace.h
+++ b/arch/powerpc/include/uapi/asm/ptrace.h
@@ -108,8 +108,9 @@ struct pt_regs {
#define PT_DAR 41
#define PT_DSISR 42
#define PT_RESULT 43
-#define PT_DSCR 44
#define PT_REGS_COUNT 44
+#define PT_DSCR 44
+#define PT_AMR 45

#define PT_FPR0 48 /* each FP reg occupies 2 slots in this space */

diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 709e234..17f5d8a 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -241,6 +241,11 @@ int main(void)
OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ OFFSET(PACA_AMR, paca_struct, paca_amr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime);
OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user);
OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 3fd0528..8db9ef8 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -493,6 +493,10 @@ EXC_COMMON_BEGIN(data_access_common)
ld r12,_MSR(r1)
ld r3,PACA_EXGEN+EX_DAR(r13)
lwz r4,PACA_EXGEN+EX_DSISR(r13)
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ mfspr r5,SPRN_AMR
+ std r5,PACA_AMR(r13)
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
li r5,0x300
std r3,_DAR(r1)
std r4,_DSISR(r1)
@@ -561,6 +565,10 @@ EXC_COMMON_BEGIN(instruction_access_common)
ld r12,_MSR(r1)
ld r3,_NIP(r1)
andis. r4,r12,0x5820
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ mfspr r5,SPRN_AMR
+ std r5,PACA_AMR(r13)
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
li r5,0x400
std r3,_DAR(r1)
std r4,_DSISR(r1)
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 97bb138..059766a 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -500,6 +500,11 @@ static int save_user_regs(struct pt_regs *regs, struct mcontext __user *frame,
(unsigned long) &frame->tramp[2]);
}

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ if (__put_user(get_paca()->paca_amr, &frame->mc_gregs[PT_AMR]))
+ return 1;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
return 0;
}

@@ -661,6 +666,9 @@ static long restore_user_regs(struct pt_regs *regs,
long err;
unsigned int save_r2 = 0;
unsigned long msr;
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ unsigned long amr;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
#ifdef CONFIG_VSX
int i;
#endif
@@ -750,6 +758,12 @@ static long restore_user_regs(struct pt_regs *regs,
return 1;
#endif /* CONFIG_SPE */

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ err |= __get_user(amr, &sr->mc_gregs[PT_AMR]);
+ if (!err && amr != get_paca()->paca_amr)
+ write_amr(amr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
return 0;
}

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index c83c115..35df2e4 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -174,6 +174,10 @@ static long setup_sigcontext(struct sigcontext __user *sc,
if (set != NULL)
err |= __put_user(set->sig[0], &sc->oldmask);

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ err |= __put_user(get_paca()->paca_amr, &sc->gp_regs[PT_AMR]);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
return err;
}

@@ -327,6 +331,9 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
unsigned long save_r13 = 0;
unsigned long msr;
struct pt_regs *regs = tsk->thread.regs;
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ unsigned long amr;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
#ifdef CONFIG_VSX
int i;
#endif
@@ -406,6 +413,13 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
tsk->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = 0;
}
#endif
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ err |= __get_user(amr, &sc->gp_regs[PT_AMR]);
+ if (!err && amr != get_paca()->paca_amr)
+ write_amr(amr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
return err;
}

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index d4e545d..cc4bde8b 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -20,6 +20,7 @@
#include <linux/sched/debug.h>
#include <linux/kernel.h>
#include <linux/mm.h>
+#include <linux/pkeys.h>
#include <linux/stddef.h>
#include <linux/unistd.h>
#include <linux/ptrace.h>
@@ -247,6 +248,49 @@ void user_single_step_siginfo(struct task_struct *tsk,
info->si_addr = (void __user *)regs->nip;
}

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+static void fill_sig_info_pkey(int si_code, siginfo_t *info, unsigned long addr)
+{
+ struct vm_area_struct *vma;
+
+ /* Fault not from Protection Keys: nothing to do */
+ if (si_code != SEGV_PKUERR)
+ return;
+
+ down_read(&current->mm->mmap_sem);
+ /*
+ * we could be racing with pkey_mprotect().
+ * If pkey_mprotect() wins the key value could
+ * get modified...xxx
+ */
+ vma = find_vma(current->mm, addr);
+ up_read(&current->mm->mmap_sem);
+
+ /*
+ * force_sig_info_fault() is called from a number of
+ * contexts, some of which have a VMA and some of which
+ * do not. The Pkey-fault handling happens after we have a
+ * valid VMA, so we should never reach this without a
+ * valid VMA.
+ */
+ if (!vma) {
+ WARN_ONCE(1, "Pkey fault with no VMA passed in");
+ info->si_pkey = 0;
+ return;
+ }
+
+ /*
+ * We could report the incorrect key because of the reason
+ * explained above.
+ *
+ * si_pkey should be thought of as a strong hint, but not
+ * an absolute guarantee because of the race explained
+ * above.
+ */
+ info->si_pkey = vma_pkey(vma);
+}
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
{
siginfo_t info;
@@ -274,6 +318,11 @@ void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
info.si_signo = signr;
info.si_code = code;
info.si_addr = (void __user *) addr;
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ fill_sig_info_pkey(code, &info, addr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
force_sig_info(signr, &info, current);
}

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index c31624f..dd448d2 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -453,6 +453,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
flags & FAULT_FLAG_INSTRUCTION,
0)) {
+
+ /* our caller may not have saved the amr. Let's save it */
+ get_paca()->paca_amr = read_amr();
+
code = SEGV_PKUERR;
goto bad_area;
}
--
1.8.3.1

2017-06-17 03:54:16

by Ram Pai

[permalink] [raw]
Subject: [RFC v2 07/12] powerpc: Macro the mask used for checking DSI exception

Replace the magic number used to check for a DSI exception
with a meaningful value.
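
For reference (derived from the definitions below), the new symbolic
mask works out to the same value the old magic number encoded; andis.
tests the upper 16 bits of the register, so only the mask's high
halfword matters:

  DSISR_BIT32 | DSISR_PAGEATTR_CONFLT | DSISR_BADACCESS | DSISR_BIT43
    = 0x80000000 | 0x20000000 | 0x04000000 | 0x00100000
    = 0xa4100000

i.e. DSISR_PAGE_FAULT_MASK@h is 0xa410, matching the immediate that was
previously hard-coded in do_hash_page.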

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/include/asm/reg.h | 9 ++++++++-
arch/powerpc/kernel/exceptions-64s.S | 2 +-
2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 7e50e47..2dcb8a1 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -272,16 +272,23 @@
#define SPRN_DAR 0x013 /* Data Address Register */
#define SPRN_DBCR 0x136 /* e300 Data Breakpoint Control Reg */
#define SPRN_DSISR 0x012 /* Data Storage Interrupt Status Register */
+#define DSISR_BIT32 0x80000000 /* not defined */
#define DSISR_NOHPTE 0x40000000 /* no translation found */
+#define DSISR_PAGEATTR_CONFLT 0x20000000 /* page attribute conflict */
+#define DSISR_BIT35 0x10000000 /* not defined */
#define DSISR_PROTFAULT 0x08000000 /* protection fault */
#define DSISR_BADACCESS 0x04000000 /* bad access to CI or G */
#define DSISR_ISSTORE 0x02000000 /* access was a store */
#define DSISR_DABRMATCH 0x00400000 /* hit data breakpoint */
-#define DSISR_NOSEGMENT 0x00200000 /* SLB miss */
#define DSISR_KEYFAULT 0x00200000 /* Key fault */
+#define DSISR_BIT43 0x00100000 /* not defined */
#define DSISR_UNSUPP_MMU 0x00080000 /* Unsupported MMU config */
#define DSISR_SET_RC 0x00040000 /* Failed setting of R/C bits */
#define DSISR_PGDIRFAULT 0x00020000 /* Fault on page directory */
+#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
+ DSISR_PAGEATTR_CONFLT | \
+ DSISR_BADACCESS | \
+ DSISR_BIT43)
#define SPRN_TBRL 0x10C /* Time Base Read Lower Register (user, R/O) */
#define SPRN_TBRU 0x10D /* Time Base Read Upper Register (user, R/O) */
#define SPRN_CIR 0x11B /* Chip Information Register (hyper, R/0) */
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index ae418b8..3fd0528 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1411,7 +1411,7 @@ USE_TEXT_SECTION()
.balign IFETCH_ALIGN_BYTES
do_hash_page:
#ifdef CONFIG_PPC_STD_MMU_64
- andis. r0,r4,0xa410 /* weird error? */
+ andis. r0,r4,DSISR_PAGE_FAULT_MASK@h
bne- handle_page_fault /* if not, try to insert a HPTE */
andis. r0,r4,DSISR_DABRMATCH@h
bne- handle_dabr_fault
--
1.8.3.1

2017-06-17 03:53:11

by Ram Pai

[permalink] [raw]
Subject: [RFC v2 04/12] powerpc: store and restore the pkey state across context switches.

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/include/asm/processor.h | 5 +++++
arch/powerpc/kernel/process.c | 18 ++++++++++++++++++
2 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index a2123f2..1f714df 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -310,6 +310,11 @@ struct thread_struct {
struct thread_vr_state ckvr_state; /* Checkpointed VR state */
unsigned long ckvrsave; /* Checkpointed VRSAVE */
#endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ unsigned long amr;
+ unsigned long iamr;
+ unsigned long uamor;
+#endif
#ifdef CONFIG_KVM_BOOK3S_32_HANDLER
void* kvm_shadow_vcpu; /* KVM internal data */
#endif /* CONFIG_KVM_BOOK3S_32_HANDLER */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index baae104..37d001a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1096,6 +1096,11 @@ static inline void save_sprs(struct thread_struct *t)
t->tar = mfspr(SPRN_TAR);
}
#endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ t->amr = mfspr(SPRN_AMR);
+ t->iamr = mfspr(SPRN_IAMR);
+ t->uamor = mfspr(SPRN_UAMOR);
+#endif
}

static inline void restore_sprs(struct thread_struct *old_thread,
@@ -1131,6 +1136,14 @@ static inline void restore_sprs(struct thread_struct *old_thread,
mtspr(SPRN_TAR, new_thread->tar);
}
#endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ if (old_thread->amr != new_thread->amr)
+ mtspr(SPRN_AMR, new_thread->amr);
+ if (old_thread->iamr != new_thread->iamr)
+ mtspr(SPRN_IAMR, new_thread->iamr);
+ if (old_thread->uamor != new_thread->uamor)
+ mtspr(SPRN_UAMOR, new_thread->uamor);
+#endif
}

struct task_struct *__switch_to(struct task_struct *prev,
@@ -1686,6 +1699,11 @@ void start_thread(struct pt_regs *regs, unsigned long start, unsigned long sp)
current->thread.tm_texasr = 0;
current->thread.tm_tfiar = 0;
#endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ current->thread.amr = 0x0ul;
+ current->thread.iamr = 0x0ul;
+ current->thread.uamor = 0x0ul;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
}
EXPORT_SYMBOL(start_thread);

--
1.8.3.1

2017-06-17 03:54:46

by Ram Pai

[permalink] [raw]
Subject: [RFC v2 06/12] powerpc: Program HPTE key protection bits.

Map the PTE protection key bits to the HPTE key protection bits
while creating HPTE entries.
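
As a worked example (illustration only): a pte with H_PAGE_PKEY_BIT1 and
H_PAGE_PKEY_BIT3 set makes calc_pte_to_hpte_pkey_bits() below add

  HPTE_R_KEY_BIT1 | HPTE_R_KEY_BIT3
    = 0x1000000000000000 | 0x0000000000000400
    = 0x1000000000000400

to the rflags, i.e. one bit lands in HPTE_R_KEY_HI and one in
HPTE_R_KEY_LO; the key bits are scattered across the two HPTE key
fields rather than re-encoded.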

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 +++++
arch/powerpc/include/asm/pkeys.h | 7 +++++++
arch/powerpc/mm/hash_utils_64.c | 5 +++++
3 files changed, 17 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index cfb8169..3d7872c 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -90,6 +90,8 @@
#define HPTE_R_PP0 ASM_CONST(0x8000000000000000)
#define HPTE_R_TS ASM_CONST(0x4000000000000000)
#define HPTE_R_KEY_HI ASM_CONST(0x3000000000000000)
+#define HPTE_R_KEY_BIT0 ASM_CONST(0x2000000000000000)
+#define HPTE_R_KEY_BIT1 ASM_CONST(0x1000000000000000)
#define HPTE_R_RPN_SHIFT 12
#define HPTE_R_RPN ASM_CONST(0x0ffffffffffff000)
#define HPTE_R_RPN_3_0 ASM_CONST(0x01fffffffffff000)
@@ -104,6 +106,9 @@
#define HPTE_R_C ASM_CONST(0x0000000000000080)
#define HPTE_R_R ASM_CONST(0x0000000000000100)
#define HPTE_R_KEY_LO ASM_CONST(0x0000000000000e00)
+#define HPTE_R_KEY_BIT2 ASM_CONST(0x0000000000000800)
+#define HPTE_R_KEY_BIT3 ASM_CONST(0x0000000000000400)
+#define HPTE_R_KEY_BIT4 ASM_CONST(0x0000000000000200)

#define HPTE_V_1TB_SEG ASM_CONST(0x4000000000000000)
#define HPTE_V_VRMA_MASK ASM_CONST(0x4001ffffff000000)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 0f3dca8..9b6820d 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -27,6 +27,13 @@
((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) | \
((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL))

+#define calc_pte_to_hpte_pkey_bits(pteflags) \
+ (((pteflags & H_PAGE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL) | \
+ ((pteflags & H_PAGE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) | \
+ ((pteflags & H_PAGE_PKEY_BIT2) ? HPTE_R_KEY_BIT2 : 0x0UL) | \
+ ((pteflags & H_PAGE_PKEY_BIT3) ? HPTE_R_KEY_BIT3 : 0x0UL) | \
+ ((pteflags & H_PAGE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL))
+
/*
* Bits are in BE format.
* NOTE: key 31, 1, 0 are not used.
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index c0f4b46..7d974cd 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -35,6 +35,7 @@
#include <linux/memblock.h>
#include <linux/context_tracking.h>
#include <linux/libfdt.h>
+#include <linux/pkeys.h>

#include <asm/debugfs.h>
#include <asm/processor.h>
@@ -230,6 +231,10 @@ unsigned long htab_convert_pte_flags(unsigned long pteflags)
*/
rflags |= HPTE_R_M;

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ rflags |= calc_pte_to_hpte_pkey_bits(pteflags);
+#endif
+
return rflags;
}

--
1.8.3.1

2017-06-17 03:54:47

by Ram Pai

[permalink] [raw]
Subject: [RFC v2 03/12] powerpc: Implement sys_pkey_alloc and sys_pkey_free system call.

Sys_pkey_alloc() allocates and returns an available pkey.
Sys_pkey_free() frees up the pkey.

A total of 32 keys are supported on powerpc. However, pkeys 0, 1
and 31 are reserved, so effectively we have 29 pkeys.
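
A minimal userspace sketch of the new calls (illustration only; it
assumes the syscall numbers 384/385 added by this patch and the
PKEY_DISABLE_WRITE flag from mman-common.h):

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef PKEY_DISABLE_WRITE
#define PKEY_DISABLE_WRITE	0x2
#endif

int main(void)
{
	/* allocate a key with write access disabled by default */
	long pkey = syscall(384 /* __NR_pkey_alloc */, 0, PKEY_DISABLE_WRITE);

	if (pkey < 0) {
		perror("pkey_alloc");
		return 1;
	}

	/* ... hand the key to a mapping with pkey_mprotect() ... */

	syscall(385 /* __NR_pkey_free */, pkey);
	return 0;
}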

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/Kconfig | 15 ++++
arch/powerpc/include/asm/book3s/64/mmu.h | 10 +++
arch/powerpc/include/asm/book3s/64/pgtable.h | 62 ++++++++++++++
arch/powerpc/include/asm/pkeys.h | 124 +++++++++++++++++++++++++++
arch/powerpc/include/asm/systbl.h | 2 +
arch/powerpc/include/asm/unistd.h | 4 +-
arch/powerpc/include/uapi/asm/unistd.h | 2 +
arch/powerpc/mm/Makefile | 1 +
arch/powerpc/mm/mmu_context_book3s64.c | 5 ++
arch/powerpc/mm/pkeys.c | 88 +++++++++++++++++++
include/linux/mm.h | 31 ++++---
include/uapi/asm-generic/mman-common.h | 2 +-
12 files changed, 331 insertions(+), 15 deletions(-)
create mode 100644 arch/powerpc/include/asm/pkeys.h
create mode 100644 arch/powerpc/mm/pkeys.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f7c8f99..b6960617 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -871,6 +871,21 @@ config SECCOMP

If unsure, say Y. Only embedded should say N here.

+config PPC64_MEMORY_PROTECTION_KEYS
+ prompt "PowerPC Memory Protection Keys"
+ def_bool y
+ # Note: only available in 64-bit mode
+ depends on PPC64 && PPC_64K_PAGES
+ select ARCH_USES_HIGH_VMA_FLAGS
+ select ARCH_HAS_PKEYS
+ ---help---
+ Memory Protection Keys provides a mechanism for enforcing
+ page-based protections, but without requiring modification of the
+ page tables when an application changes protection domains.
+
+ For details, see Documentation/powerpc/protection-keys.txt
+
+ If unsure, say y.
endmenu

config ISA_DMA_API
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 77529a3..0c0a2a8 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -108,6 +108,16 @@ struct patb_entry {
#ifdef CONFIG_SPAPR_TCE_IOMMU
struct list_head iommu_group_mem_list;
#endif
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ /*
+ * Each bit represents one protection key.
+ * bit set -> key allocated
+ * bit unset -> key available for allocation
+ */
+ u32 pkey_allocation_map;
+ s16 execute_only_pkey; /* key holding execute-only protection */
+#endif
} mm_context_t;

/*
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 85bc987..87e9a89 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -428,6 +428,68 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
pte_update(mm, addr, ptep, 0, _PAGE_PRIVILEGED, 1);
}

+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+
+#include <asm/reg.h>
+static inline u64 read_amr(void)
+{
+ return mfspr(SPRN_AMR);
+}
+static inline void write_amr(u64 value)
+{
+ mtspr(SPRN_AMR, value);
+}
+static inline u64 read_iamr(void)
+{
+ return mfspr(SPRN_IAMR);
+}
+static inline void write_iamr(u64 value)
+{
+ mtspr(SPRN_IAMR, value);
+}
+static inline u64 read_uamor(void)
+{
+ return mfspr(SPRN_UAMOR);
+}
+static inline void write_uamor(u64 value)
+{
+ mtspr(SPRN_UAMOR, value);
+}
+
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+static inline u64 read_amr(void)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+ return -1;
+}
+static inline void write_amr(u64 value)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+static inline u64 read_uamor(void)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+ return -1;
+}
+static inline void write_uamor(u64 value)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+static inline u64 read_iamr(void)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+ return -1;
+}
+static inline void write_iamr(u64 value)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
new file mode 100644
index 0000000..7bc8746
--- /dev/null
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -0,0 +1,124 @@
+#ifndef _ASM_PPC64_PKEYS_H
+#define _ASM_PPC64_PKEYS_H
+
+
+#define arch_max_pkey() 32
+
+#define AMR_AD_BIT 0x1UL
+#define AMR_WD_BIT 0x2UL
+#define IAMR_EX_BIT 0x1UL
+#define AMR_BITS_PER_PKEY 2
+#define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | \
+ VM_PKEY_BIT1 | \
+ VM_PKEY_BIT2 | \
+ VM_PKEY_BIT3 | \
+ VM_PKEY_BIT4)
+
+/*
+ * Bits are in BE format.
+ * NOTE: key 31, 1, 0 are not used.
+ * key 0 is used by default. It gives read/write/execute permission.
+ * key 31 is reserved by the hypervisor.
+ * key 1 is recommended to be not used.
+ * PowerISA(3.0) page 1015, programming note.
+ */
+#define PKEY_INITIAL_ALLOCAION 0xc0000001
+
+#define pkeybit_mask(pkey) (0x1 << (arch_max_pkey() - pkey - 1))
+
+#define mm_pkey_allocation_map(mm) (mm->context.pkey_allocation_map)
+
+#define mm_set_pkey_allocated(mm, pkey) { \
+ mm_pkey_allocation_map(mm) |= pkeybit_mask(pkey); \
+}
+
+#define mm_set_pkey_free(mm, pkey) { \
+ mm_pkey_allocation_map(mm) &= ~pkeybit_mask(pkey); \
+}
+
+#define mm_set_pkey_is_allocated(mm, pkey) \
+ (mm_pkey_allocation_map(mm) & pkeybit_mask(pkey))
+
+#define mm_set_pkey_is_reserved(mm, pkey) (PKEY_INITIAL_ALLOCAION & \
+ pkeybit_mask(pkey))
+
+static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
+{
+ /* a reserved key is never considered as 'explicitly allocated' */
+ return (!mm_set_pkey_is_reserved(mm, pkey) &&
+ mm_set_pkey_is_allocated(mm, pkey));
+}
+
+/*
+ * Returns a positive, 5-bit key on success, or -1 on failure.
+ */
+static inline int mm_pkey_alloc(struct mm_struct *mm)
+{
+ /*
+ * Note: this is the one and only place we make sure
+ * that the pkey is valid as far as the hardware is
+ * concerned. The rest of the kernel trusts that
+ * only good, valid pkeys come out of here.
+ */
+ u32 all_pkeys_mask = (u32)(~(0x0));
+ int ret;
+
+ /*
+ * Are we out of pkeys? We must handle this specially
+ * because ffz() behavior is undefined if there are no
+ * zeros.
+ */
+ if (mm_pkey_allocation_map(mm) == all_pkeys_mask)
+ return -1;
+
+ ret = arch_max_pkey() -
+ ffz((u32)mm_pkey_allocation_map(mm))
+ - 1;
+ mm_set_pkey_allocated(mm, ret);
+ return ret;
+}
+
+static inline int mm_pkey_free(struct mm_struct *mm, int pkey)
+{
+ if (!mm_pkey_is_allocated(mm, pkey))
+ return -EINVAL;
+
+ mm_set_pkey_free(mm, pkey);
+
+ return 0;
+}
+
+/*
+ * Try to dedicate one of the protection keys to be used as an
+ * execute-only protection key.
+ */
+extern int __execute_only_pkey(struct mm_struct *mm);
+static inline int execute_only_pkey(struct mm_struct *mm)
+{
+ return __execute_only_pkey(mm);
+}
+
+extern int __arch_override_mprotect_pkey(struct vm_area_struct *vma,
+ int prot, int pkey);
+static inline int arch_override_mprotect_pkey(struct vm_area_struct *vma,
+ int prot, int pkey)
+{
+ return __arch_override_mprotect_pkey(vma, prot, pkey);
+}
+
+extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+ unsigned long init_val);
+static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+ unsigned long init_val)
+{
+ return __arch_set_user_pkey_access(tsk, pkey, init_val);
+}
+
+static inline void pkey_mm_init(struct mm_struct *mm)
+{
+ mm_pkey_allocation_map(mm) = PKEY_INITIAL_ALLOCAION;
+ /* -1 means unallocated or invalid */
+ mm->context.execute_only_pkey = -1;
+}
+
+#endif /*_ASM_PPC64_PKEYS_H */
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 1c94708..22dd776 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -388,3 +388,5 @@
COMPAT_SYS_SPU(pwritev2)
SYSCALL(kexec_file_load)
SYSCALL(statx)
+SYSCALL(pkey_alloc)
+SYSCALL(pkey_free)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 9ba11db..e0273bc 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,13 +12,11 @@
#include <uapi/asm/unistd.h>


-#define NR_syscalls 384
+#define NR_syscalls 386

#define __NR__exit __NR_exit

#define __IGNORE_pkey_mprotect
-#define __IGNORE_pkey_alloc
-#define __IGNORE_pkey_free

#ifndef __ASSEMBLY__

diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index b85f142..7993a07 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -394,5 +394,7 @@
#define __NR_pwritev2 381
#define __NR_kexec_file_load 382
#define __NR_statx 383
+#define __NR_pkey_alloc 384
+#define __NR_pkey_free 385

#endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 7414034..8cc2ff1 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -45,3 +45,4 @@ obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o
obj-$(CONFIG_SPAPR_TCE_IOMMU) += mmu_context_iommu.o
obj-$(CONFIG_PPC_PTDUMP) += dump_linuxpagetables.o
obj-$(CONFIG_PPC_HTDUMP) += dump_hashpagetable.o
+obj-$(CONFIG_PPC64_MEMORY_PROTECTION_KEYS) += pkeys.o
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index c6dca2a..2da9931 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -16,6 +16,7 @@
#include <linux/string.h>
#include <linux/types.h>
#include <linux/mm.h>
+#include <linux/pkeys.h>
#include <linux/spinlock.h>
#include <linux/idr.h>
#include <linux/export.h>
@@ -120,6 +121,10 @@ static int hash__init_new_context(struct mm_struct *mm)

subpage_prot_init_new_context(mm);

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ pkey_mm_init(mm);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
return index;
}

diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
new file mode 100644
index 0000000..b97366e
--- /dev/null
+++ b/arch/powerpc/mm/pkeys.c
@@ -0,0 +1,88 @@
+/*
+ * PowerPC Memory Protection Keys management
+ * Copyright (c) 2015, Intel Corporation.
+ * Copyright (c) 2017, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+#include <linux/pkeys.h> /* PKEY_* */
+#include <uapi/asm-generic/mman-common.h>
+
+
+/*
+ * set the access right in AMR IAMR and UAMOR register
+ * for @pkey to that specified in @init_val.
+ */
+int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+ unsigned long init_val)
+{
+ u64 old_amr, old_uamor, old_iamr;
+ int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+ u64 new_amr_bits = 0x0ul;
+ u64 new_iamr_bits = 0x0ul;
+ u64 new_uamor_bits = 0x3ul;
+
+ /* Set the bits we need in AMR: */
+ if (init_val & PKEY_DISABLE_ACCESS)
+ new_amr_bits |= AMR_AD_BIT;
+ if (init_val & PKEY_DISABLE_WRITE)
+ new_amr_bits |= AMR_WD_BIT;
+
+ /*
+ * By default execute is disabled.
+ * To enable execute, PKEY_ENABLE_EXECUTE
+ * needs to be specified.
+ */
+ if ((init_val & PKEY_DISABLE_EXECUTE))
+ new_iamr_bits |= IAMR_EX_BIT;
+
+ /* Shift the bits in to the correct place in AMR for pkey: */
+ new_amr_bits <<= pkey_shift;
+ new_iamr_bits <<= pkey_shift;
+ new_uamor_bits <<= pkey_shift;
+
+ /* Get old AMR and mask off any old bits in place: */
+ old_amr = read_amr();
+ old_amr &= ~((u64)(AMR_AD_BIT|AMR_WD_BIT) << pkey_shift);
+
+ old_iamr = read_iamr();
+ old_iamr &= ~(0x3ul << pkey_shift);
+
+ old_uamor = read_uamor();
+ old_uamor &= ~(0x3ul << pkey_shift);
+
+ /* Write old part along with new part: */
+ write_amr(old_amr | new_amr_bits);
+ write_iamr(old_iamr | new_iamr_bits);
+ write_uamor(old_uamor | new_uamor_bits);
+
+ return 0;
+}
+
+int __execute_only_pkey(struct mm_struct *mm)
+{
+ return -1;
+}
+
+/*
+ * This should only be called for *plain* mprotect calls.
+ */
+int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot,
+ int pkey)
+{
+ /*
+ * Is this an mprotect_pkey() call? If so, never
+ * override the value that came from the user.
+ */
+ if (pkey != -1)
+ return pkey;
+
+ return 0;
+}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7cb17c6..34ddac7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -204,26 +204,35 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */

#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
-#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit arch */
#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
+#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */

#if defined(CONFIG_X86)
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
-#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
-# define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
-# define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */
-# define VM_PKEY_BIT1 VM_HIGH_ARCH_1
-# define VM_PKEY_BIT2 VM_HIGH_ARCH_2
-# define VM_PKEY_BIT3 VM_HIGH_ARCH_3
-#endif
+#if defined(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) \
+ || defined(CONFIG_PPC64_MEMORY_PROTECTION_KEYS)
+#define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
+#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 5-bit value */
+#define VM_PKEY_BIT1 VM_HIGH_ARCH_1
+#define VM_PKEY_BIT2 VM_HIGH_ARCH_2
+#define VM_PKEY_BIT3 VM_HIGH_ARCH_3
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
#elif defined(CONFIG_PPC)
+#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 5-bit value */
+#define VM_PKEY_BIT1 VM_HIGH_ARCH_1
+#define VM_PKEY_BIT2 VM_HIGH_ARCH_2
+#define VM_PKEY_BIT3 VM_HIGH_ARCH_3
+#define VM_PKEY_BIT4 VM_HIGH_ARCH_4 /* intel does not use this bit */
+ /* but reserved for future expansion */
# define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */
#elif defined(CONFIG_PARISC)
# define VM_GROWSUP VM_ARCH_1
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 8c27db0..b13ecc6 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -76,5 +76,5 @@
#define PKEY_DISABLE_WRITE 0x2
#define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
PKEY_DISABLE_WRITE)
-
+#define PKEY_DISABLE_EXECUTE 0x4
#endif /* __ASM_GENERIC_MMAN_COMMON_H */
--
1.8.3.1

2017-06-17 03:55:26

by Ram Pai

[permalink] [raw]
Subject: [RFC v2 02/12] powerpc: Free up four 64K PTE bits in 64K backed hpte pages.

Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
in the 64K backed hpte pages. This, along with the earlier
patch, will entirely free up the four bits from the 64K PTE.

This patch makes the following changes to the 64K PTE that is
backed by a 64K hpte:

H_PAGE_F_SECOND which occupied bit 4 moves to the second part
of the pte.
H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the
second part of the pte.

Since bit 7 is now freed up, we move H_PAGE_BUSY from bit 9
to bit 7. This minimizes gaps so that contiguous bits
can be allocated if needed in the future.

The second part of the PTE will hold
(H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63.
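
To illustrate the layout described above (a sketch only; it mirrors
the set_hidx_slot() helper from the previous patch): for a 64K-backed
page the subpage index is 0, so the 4-bit slot value -- H_PAGE_F_SECOND
as the top bit of the nibble, H_PAGE_F_GIX as the low three -- lands
in the least significant nibble, i.e. IBM bits 60-63, of the second
half of the PTE:

	/* illustrative helper only */
	static unsigned long pack_hidx(unsigned long hidx_word,
				       unsigned int subpg_index, /* 0 for 64K */
				       unsigned long slot)	 /* 0..15 */
	{
		hidx_word &= ~(0xfUL << (subpg_index << 2));	 /* clear old nibble */
		return hidx_word | (slot << (subpg_index << 2)); /* install new one  */
	}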

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/include/asm/book3s/64/hash-64k.h | 26 ++++++++------------------
arch/powerpc/mm/hash64_64k.c | 16 +++++++---------
arch/powerpc/mm/hugetlbpage-hash64.c | 16 ++++++----------
3 files changed, 21 insertions(+), 37 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 0eb3c89..2fa5c60 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -12,12 +12,8 @@
*/
#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
-#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
-#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
-#define H_PAGE_F_GIX_SHIFT 56

-
-#define H_PAGE_BUSY _RPAGE_RPN42 /* software: PTE & hash are busy */
+#define H_PAGE_BUSY _RPAGE_RPN44 /* software: PTE & hash are busy */
#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */

/*
@@ -56,24 +52,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
unsigned long *hidxp;

rpte.pte = pte;
- rpte.hidx = 0;
- if (pte_val(pte) & H_PAGE_COMBO) {
- /*
- * Make sure we order the hidx load against the H_PAGE_COMBO
- * check. The store side ordering is done in __hash_page_4K
- */
- smp_rmb();
- hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
- rpte.hidx = *hidxp;
- }
+ /*
+ * The store side ordering is done in __hash_page_4K
+ */
+ smp_rmb();
+ hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
+ rpte.hidx = *hidxp;
return rpte;
}

static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
{
- if ((pte_val(rpte.pte) & H_PAGE_COMBO))
- return (rpte.hidx >> (index<<2)) & 0xf;
- return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
+ return ((rpte.hidx >> (index<<2)) & 0xfUL);
}

static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
index 3702a3c..1c25ec2 100644
--- a/arch/powerpc/mm/hash64_64k.c
+++ b/arch/powerpc/mm/hash64_64k.c
@@ -211,6 +211,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
unsigned long vsid, pte_t *ptep, unsigned long trap,
unsigned long flags, int ssize)
{
+ real_pte_t rpte;
unsigned long hpte_group;
unsigned long rflags, pa;
unsigned long old_pte, new_pte;
@@ -247,6 +248,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
} while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));

rflags = htab_convert_pte_flags(new_pte);
+ rpte = __real_pte(__pte(old_pte), ptep);

if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -254,16 +256,13 @@ int __hash_page_64K(unsigned long ea, unsigned long access,

vpn = hpt_vpn(ea, vsid, ssize);
if (unlikely(old_pte & H_PAGE_HASHPTE)) {
+ unsigned long gslot;
+
/*
* There MIGHT be an HPTE for this pte
*/
- hash = hpt_hash(vpn, shift, ssize);
- if (old_pte & H_PAGE_F_SECOND)
- hash = ~hash;
- slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
- slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
-
- if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
+ gslot = get_hidx_gslot(vpn, shift, ssize, rpte, 0);
+ if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, MMU_PAGE_64K,
MMU_PAGE_64K, ssize,
flags) == -1)
old_pte &= ~_PAGE_HPTEFLAGS;
@@ -313,8 +312,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
return -1;
}

- new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
- (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+ set_hidx_slot(ptep, rpte, 0, slot);
new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
}
*ptep = __pte(new_pte & ~H_PAGE_BUSY);
diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c
index a84bb44..239ca86 100644
--- a/arch/powerpc/mm/hugetlbpage-hash64.c
+++ b/arch/powerpc/mm/hugetlbpage-hash64.c
@@ -22,6 +22,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
pte_t *ptep, unsigned long trap, unsigned long flags,
int ssize, unsigned int shift, unsigned int mmu_psize)
{
+ real_pte_t rpte;
unsigned long vpn;
unsigned long old_pte, new_pte;
unsigned long rflags, pa, sz;
@@ -61,6 +62,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
} while(!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));

rflags = htab_convert_pte_flags(new_pte);
+ rpte = __real_pte(__pte(old_pte), ptep);

sz = ((1UL) << shift);
if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -71,15 +73,10 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
/* Check if pte already has an hpte (case 2) */
if (unlikely(old_pte & H_PAGE_HASHPTE)) {
/* There MIGHT be an HPTE for this pte */
- unsigned long hash, slot;
+ unsigned long gslot;

- hash = hpt_hash(vpn, shift, ssize);
- if (old_pte & H_PAGE_F_SECOND)
- hash = ~hash;
- slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
- slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
-
- if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, mmu_psize,
+ gslot = get_hidx_gslot(vpn, shift, ssize, rpte, 0);
+ if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, mmu_psize,
mmu_psize, ssize, flags) == -1)
old_pte &= ~_PAGE_HPTEFLAGS;
}
@@ -106,8 +103,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
return -1;
}

- new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
- (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+ new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
}

/*
--
1.8.3.1

2017-06-19 11:04:48

by Michael Ellerman

[permalink] [raw]
Subject: Re: [RFC v2 12/12]selftest: Updated protection key selftest

Ram Pai <[email protected]> writes:

> Added test support for the PowerPC implementation of protection keys.
>
> Signed-off-by: Ram Pai <[email protected]>
> ---
> tools/testing/selftests/vm/Makefile | 1 +
> tools/testing/selftests/vm/pkey-helpers.h | 365 +++++++
> tools/testing/selftests/vm/protection_keys.c | 1451 +++++++++++++++++++++++++
> tools/testing/selftests/x86/Makefile | 2 +-
> tools/testing/selftests/x86/pkey-helpers.h | 219 ----
> tools/testing/selftests/x86/protection_keys.c | 1395 ------------------------

Please split the move and the addition of the powerpc code into two
separate patches (move first). That way we can actually see what you're
doing to add powerpc support.

cheers

2017-06-19 11:06:17

by Michael Ellerman

[permalink] [raw]
Subject: Re: [RFC v2 10/12] powerpc: Read AMR only if pkey-violation caused the exception.

Ram Pai <[email protected]> writes:

> Signed-off-by: Ram Pai <[email protected]>
> ---
> arch/powerpc/kernel/exceptions-64s.S | 16 ++++++++++------
> 1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> index 8db9ef8..a4de1b4 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -493,13 +493,15 @@ EXC_COMMON_BEGIN(data_access_common)
> ld r12,_MSR(r1)
> ld r3,PACA_EXGEN+EX_DAR(r13)
> lwz r4,PACA_EXGEN+EX_DSISR(r13)
> + std r3,_DAR(r1)
> + std r4,_DSISR(r1)
> #ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + andis. r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
> + beq+ 1f

This seems to be incremental on top of one of your other patches.

But I don't see why, can you please just squash this into whatever patch
adds this code in the first place.

cheers

2017-06-19 12:18:05

by Michael Ellerman

[permalink] [raw]
Subject: Re: [RFC v2 03/12] powerpc: Implement sys_pkey_alloc and sys_pkey_free system call.

Hi Ram,

Ram Pai <[email protected]> writes:
> Sys_pkey_alloc() allocates and returns available pkey
> Sys_pkey_free() frees up the pkey.
>
> Total 32 keys are supported on powerpc. However pkey 0,1 and 31
> are reserved. So effectively we have 29 pkeys.
>
> Signed-off-by: Ram Pai <[email protected]>
> ---
> include/linux/mm.h | 31 ++++---
> include/uapi/asm-generic/mman-common.h | 2 +-

Those changes need to be split out and acked by mm folks.

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7cb17c6..34ddac7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -204,26 +204,35 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
> #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
>
> #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
> -#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */
> -#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */
> -#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
> -#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
> +#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit arch */
> +#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit arch */
> +#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit arch */
> +#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit arch */

Please don't change the comments, it makes the diff harder to read.

You're actually just adding this AFAICS:

> +#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit arch */

> #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
> #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
> #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
> #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
> +#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
> #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>
> #if defined(CONFIG_X86)
^
> # define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
> -#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
> -# define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
> -# define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */
> -# define VM_PKEY_BIT1 VM_HIGH_ARCH_1
> -# define VM_PKEY_BIT2 VM_HIGH_ARCH_2
> -# define VM_PKEY_BIT3 VM_HIGH_ARCH_3
> -#endif
> +#if defined(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) \
> + || defined(CONFIG_PPC64_MEMORY_PROTECTION_KEYS)
> +#define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
> +#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 5-bit value */
^ 4?
> +#define VM_PKEY_BIT1 VM_HIGH_ARCH_1
> +#define VM_PKEY_BIT2 VM_HIGH_ARCH_2
> +#define VM_PKEY_BIT3 VM_HIGH_ARCH_3
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */

That appears to be inside an #if defined(CONFIG_X86) ?

> #elif defined(CONFIG_PPC)
^
Should be CONFIG_PPC64_MEMORY_PROTECTION_KEYS no?

> +#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 5-bit value */
> +#define VM_PKEY_BIT1 VM_HIGH_ARCH_1
> +#define VM_PKEY_BIT2 VM_HIGH_ARCH_2
> +#define VM_PKEY_BIT3 VM_HIGH_ARCH_3
> +#define VM_PKEY_BIT4 VM_HIGH_ARCH_4 /* intel does not use this bit */
> + /* but reserved for future expansion */

But this hunk is for PPC ?

Is it OK for the other arches & generic code to add another VM_PKEY_BIT4 ?

Do you need to update show_smap_vma_flags() ?

> # define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */
> #elif defined(CONFIG_PARISC)
> # define VM_GROWSUP VM_ARCH_1

> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 8c27db0..b13ecc6 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -76,5 +76,5 @@
> #define PKEY_DISABLE_WRITE 0x2
> #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
> PKEY_DISABLE_WRITE)
> -
> +#define PKEY_DISABLE_EXECUTE 0x4

How can you set that if it's not in PKEY_ACCESS_MASK?

See:

SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
{
int pkey;
int ret;

/* No flags supported yet. */
if (flags)
return -EINVAL;
/* check for unsupported init values */
if (init_val & ~PKEY_ACCESS_MASK)
return -EINVAL;
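
One possible way out, sketched here purely for illustration and not
taken from this series, would be to let the arch extend the mask from
its own uapi header, e.g. on powerpc:

	#define PKEY_DISABLE_EXECUTE	0x4
	#undef  PKEY_ACCESS_MASK
	#define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
					 PKEY_DISABLE_WRITE  |\
					 PKEY_DISABLE_EXECUTE)

so pkey_alloc() would accept the new bit there while other arches keep
the generic mask.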


cheers

2017-06-19 17:59:36

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 10/12] powerpc: Read AMR only if pkey-violation caused the exception.

On Mon, Jun 19, 2017 at 09:06:13PM +1000, Michael Ellerman wrote:
> Ram Pai <[email protected]> writes:
>
> > Signed-off-by: Ram Pai <[email protected]>
> > ---
> > arch/powerpc/kernel/exceptions-64s.S | 16 ++++++++++------
> > 1 file changed, 10 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> > index 8db9ef8..a4de1b4 100644
> > --- a/arch/powerpc/kernel/exceptions-64s.S
> > +++ b/arch/powerpc/kernel/exceptions-64s.S
> > @@ -493,13 +493,15 @@ EXC_COMMON_BEGIN(data_access_common)
> > ld r12,_MSR(r1)
> > ld r3,PACA_EXGEN+EX_DAR(r13)
> > lwz r4,PACA_EXGEN+EX_DSISR(r13)
> > + std r3,_DAR(r1)
> > + std r4,_DSISR(r1)
> > #ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > + andis. r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
> > + beq+ 1f
>
> This seems to be incremental on top of one of your other patches.
>
> But I don't see why, can you please just squash this into whatever patch
> adds this code in the first place.

It was an optimization added later. But yes it can be squashed into an
earlier patch.

RP

2017-06-20 05:11:18

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC v2 00/12] powerpc: Memory Protection Keys

On Fri, 2017-06-16 at 20:52 -0700, Ram Pai wrote:
> Memory protection keys enable applications to protect its
> address space from inadvertent access or corruption from
> itself.

I presume by itself you mean protection between threads?

>
> The overall idea:
>
> A process allocates a key and associates it with
> a address range within its address space.

OK, so this is per VMA?

> The process than can dynamically set read/write
> permissions on the key without involving the
> kernel.

This bit is not clear, how can the key be set without
involving the kernel? I presume you mean the key is set
in the PTE's and the access protection values can be
set without involving the kernel?

Any code that violates the permissions
> off the address space; as defined by its associated
> key, will receive a segmentation fault.
>
> This patch series enables the feature on PPC64.
> It is enabled on HPTE 64K-page platform.
>
> ISA3.0 section 5.7.13 describes the detailed specifications.
>
>
> Testing:
> This patch series has passed all the protection key
> tests available in the selftests directory.
> The tests are updated to work on both x86 and powerpc.

Balbir

2017-06-20 06:06:07

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RFC v2 00/12] powerpc: Memory Protection Keys

On 06/20/2017 10:40 AM, Balbir Singh wrote:
> On Fri, 2017-06-16 at 20:52 -0700, Ram Pai wrote:
>> Memory protection keys enable applications to protect its
>> address space from inadvertent access or corruption from
>> itself.
>
> I presume by itself you mean protection between threads?

Between threads due to race conditions or from the same thread
because of programming error.

>
>>
>> The overall idea:
>>
>> A process allocates a key and associates it with
>> a address range within its address space.
>
> OK, so this is per VMA?

Yeah, but the same key can be given to multiple VMAs. Any
change will affect every VMA that was tagged with it.

>
>> The process than can dynamically set read/write
>> permissions on the key without involving the
>> kernel.
>
> This bit is not clear, how can the key be set without
> involving the kernel? I presume you mean the key is set

With the pkey_mprotect() system call, all the affected PTEs get
tagged once. Switching the permission happens just by
writing into the register on the fly.

> in the PTE's and the access protection values can be
> set without involving the kernel?

PTE setting happens once; the access protection values can be
changed on the fly through the register.
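
A rough userspace sketch of that flow (illustrative only; pkey_set()
stands in for a selftest-style wrapper that writes the AMR/PKRU, and
ptr/len are assumed to come from an earlier mmap()):

	int pkey = syscall(SYS_pkey_alloc, 0, PKEY_DISABLE_WRITE);

	/* one-time tagging of the PTEs/VMA -- involves the kernel */
	syscall(SYS_pkey_mprotect, ptr, len, PROT_READ | PROT_WRITE, pkey);

	pkey_set(pkey, 0);                  /* register write only */
	ptr[0] = 1;                         /* no kernel involved  */
	pkey_set(pkey, PKEY_DISABLE_WRITE); /* register write only */

	syscall(SYS_pkey_free, pkey);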

2017-06-20 06:18:37

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RFC v2 11/12]Documentation: Documentation updates.

On 06/17/2017 09:22 AM, Ram Pai wrote:
> The Documentaton file is moved from x86 into the generic area,
> since this feature is now supported by more than one archs.
>
> Signed-off-by: Ram Pai <[email protected]>
> ---
> Documentation/vm/protection-keys.txt | 110 ++++++++++++++++++++++++++++++++++
> Documentation/x86/protection-keys.txt | 85 --------------------------

I am not sure whether this is a good idea. There might be
specifics for each architecture which need to be detailed
again in this new generic one.

> 2 files changed, 110 insertions(+), 85 deletions(-)
> create mode 100644 Documentation/vm/protection-keys.txt
> delete mode 100644 Documentation/x86/protection-keys.txt
>
> diff --git a/Documentation/vm/protection-keys.txt b/Documentation/vm/protection-keys.txt
> new file mode 100644
> index 0000000..b49e6bb
> --- /dev/null
> +++ b/Documentation/vm/protection-keys.txt
> @@ -0,0 +1,110 @@
> +Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
> +found in newer generations of Intel CPUs and on PowerPC CPUs.
> +
> +Memory Protection Keys provides a mechanism for enforcing page-based
> +protections, but without requiring modification of the page tables
> +when an application changes protection domains.

Should the resultant access through protection keys be a
subset of the protection bits enabled through the original PTE
PROT format ? Are the semantics exactly the same on x86
and powerpc ?

> +
> +
> +On Intel:
> +
> +It works by dedicating 4 previously ignored bits in each page table
> +entry to a "protection key", giving 16 possible keys.
> +
> +There is also a new user-accessible register (PKRU) with two separate
> +bits (Access Disable and Write Disable) for each key. Being a CPU
> +register, PKRU is inherently thread-local, potentially giving each
> +thread a different set of protections from every other thread.
> +
> +There are two new instructions (RDPKRU/WRPKRU) for reading and writing
> +to the new register. The feature is only available in 64-bit mode,
> +even though there is theoretically space in the PAE PTEs. These
> +permissions are enforced on data access only and have no effect on
> +instruction fetches.
> +
> +
> +On PowerPC:
> +
> +It works by dedicating 5 page table entry bits to a "protection key",
> +giving 32 possible keys.
> +
> +There is a user-accessible register (AMR) with two separate bits
> +(Access Disable and Write Disable) for each key. Being a CPU
> +register, AMR is inherently thread-local, potentially giving each
> +thread a different set of protections from every other thread.

Small nit. Space needed here.

> +NOTE: Disabling read permission does not disable
> +write and vice-versa.
> +
> +The feature is available on 64-bit HPTE mode only.
> +
> +'mtspr 0xd, mem' writes into the AMR register.
> +'mfspr mem, 0xd' reads the AMR register.
> +
> +Permissions are enforced on data access only and have no effect on
> +instruction fetches.
> +
> +=========================== Syscalls ===========================
> +
> +There are 3 system calls which directly interact with pkeys:
> +
> + int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
> + int pkey_free(int pkey);
> + int pkey_mprotect(unsigned long start, size_t len,
> + unsigned long prot, int pkey);
> +
> +Before a pkey can be used, it must first be allocated with
> +pkey_alloc(). An application calls the WRPKRU instruction
> +directly in order to change access permissions to memory covered
> +with a key. In this example WRPKRU is wrapped by a C function
> +called pkey_set().
> +
> + int real_prot = PROT_READ|PROT_WRITE;
> + pkey = pkey_alloc(0, PKEY_DENY_WRITE);
> + ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
> + ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
> + ... application runs here
> +
> +Now, if the application needs to update the data at 'ptr', it can
> +gain access, do the update, then remove its write access:
> +
> + pkey_set(pkey, 0); // clear PKEY_DENY_WRITE
> + *ptr = foo; // assign something
> + pkey_set(pkey, PKEY_DENY_WRITE); // set PKEY_DENY_WRITE again
> +
> +Now when it frees the memory, it will also free the pkey since it
> +is no longer in use:
> +
> + munmap(ptr, PAGE_SIZE);
> + pkey_free(pkey);
> +
> +(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
> + An example implementation can be found in
> + tools/testing/selftests/x86/protection_keys.c)
> +
> +=========================== Behavior ===========================
> +
> +The kernel attempts to make protection keys consistent with the
> +behavior of a plain mprotect(). For instance if you do this:
> +
> + mprotect(ptr, size, PROT_NONE);
> + something(ptr);
> +
> +you can expect the same effects with protection keys when doing this:
> +
> + pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ);
> + pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
> + something(ptr);
> +
> +That should be true whether something() is a direct access to 'ptr'
> +like:
> +
> + *ptr = foo;
> +
> +or when the kernel does the access on the application's behalf like
> +with a read():
> +
> + read(fd, ptr, 1);
> +
> +The kernel will send a SIGSEGV in both cases, but si_code will be set
> +to SEGV_PKUERR when violating protection keys versus SEGV_ACCERR when
> +the plain mprotect() permissions are violated.
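
A minimal sketch of a handler for the above (assuming the si_pkey
field added by this series and SA_SIGINFO; printf() is not
async-signal-safe and is used only for illustration):

	static void segv_handler(int sig, siginfo_t *si, void *ctx)
	{
		if (si->si_code == SEGV_PKUERR)		/* key violation    */
			printf("pkey fault: key=%d addr=%p\n",
					si->si_pkey, si->si_addr);
		else if (si->si_code == SEGV_ACCERR)	/* plain mprotect() */
			printf("access fault at %p\n", si->si_addr);
	}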

I guess the right thing would be to have three files

* Documentation/vm/protection-keys.txt

- Generic interface, system calls
- Signal handling, error codes
- Semantics of programming with an example

* Documentation/x86/protection-keys.txt

- Number of active protections keys inside an address space
- X86 protection key instruction details
- PTE protection bits placement details
- Page fault handling
- Implementation details a bit ?

* Documentation/powerpc/protection-keys.txt

- Number of active protections keys inside an address space
- Powerpc instructions details
- PTE protection bits placement details
- Page fault handling
- Implementation details a bit ?

2017-06-20 06:26:22

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RFC v2 12/12]selftest: Updated protection key selftest

On 06/17/2017 09:22 AM, Ram Pai wrote:
> Added test support for the PowerPC implementation of protection keys.
>
> Signed-off-by: Ram Pai <[email protected]>

First of all, there are a lot of instances where we use *pkru*-
named functions on power, even though the real implementations
take care of doing the appropriate things. That looks pretty
hacky. We need to change them to generic names first before
adding both the x86 and powerpc procedures inside them.
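
Something along these lines, purely as an illustration of the
renaming (the powerpc read is shown with inline asm since AMR is
SPR 0xd; the x86 side assumes the existing __rdpkru() helper in the
selftest):

	static inline unsigned long read_pkey_reg(void)
	{
	#if defined(__powerpc64__)
		unsigned long amr;

		asm volatile("mfspr %0, 0xd" : "=r"(amr));
		return amr;
	#else
		return __rdpkru();
	#endif
	}

with the rest of the test calling read_pkey_reg()/write_pkey_reg()
instead of the pkru-specific names.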

2017-06-20 06:47:39

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RFC v2 10/12] powerpc: Read AMR only if pkey-violation caused the exception.

On 06/19/2017 11:29 PM, Ram Pai wrote:
> On Mon, Jun 19, 2017 at 09:06:13PM +1000, Michael Ellerman wrote:
>> Ram Pai <[email protected]> writes:
>>
>>> Signed-off-by: Ram Pai <[email protected]>
>>> ---
>>> arch/powerpc/kernel/exceptions-64s.S | 16 ++++++++++------
>>> 1 file changed, 10 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
>>> index 8db9ef8..a4de1b4 100644
>>> --- a/arch/powerpc/kernel/exceptions-64s.S
>>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>>> @@ -493,13 +493,15 @@ EXC_COMMON_BEGIN(data_access_common)
>>> ld r12,_MSR(r1)
>>> ld r3,PACA_EXGEN+EX_DAR(r13)
>>> lwz r4,PACA_EXGEN+EX_DSISR(r13)
>>> + std r3,_DAR(r1)
>>> + std r4,_DSISR(r1)
>>> #ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> + andis. r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
>>> + beq+ 1f
>>
>> This seems to be incremental on top of one of your other patches.
>>
>> But I don't see why, can you please just squash this into whatever patch
>> adds this code in the first place.
>
> It was an optimization added later. But yes it can be squashed into an
> earlier patch.

Could you please explain what optimization this achieves ?

2017-06-20 06:55:52

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RFC v2 09/12] powerpc: Deliver SEGV signal on pkey violation.

On 06/17/2017 09:22 AM, Ram Pai wrote:
> The value of the AMR register at the time of exception
> is made available in gp_regs[PT_AMR] of the siginfo.
>
> This field can be used to reprogram the permission bits of
> any valid pkey.
>
> Similarly the value of the pkey, whose protection got violated,
> is made available at si_pkey field of the siginfo structure.
>
> Signed-off-by: Ram Pai <[email protected]>
> ---
> arch/powerpc/include/asm/paca.h | 1 +
> arch/powerpc/include/uapi/asm/ptrace.h | 3 ++-
> arch/powerpc/kernel/asm-offsets.c | 5 ++++
> arch/powerpc/kernel/exceptions-64s.S | 8 ++++++
> arch/powerpc/kernel/signal_32.c | 14 ++++++++++
> arch/powerpc/kernel/signal_64.c | 14 ++++++++++
> arch/powerpc/kernel/traps.c | 49 ++++++++++++++++++++++++++++++++++
> arch/powerpc/mm/fault.c | 4 +++
> 8 files changed, 97 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index 1c09f8f..a41afd3 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -92,6 +92,7 @@ struct paca_struct {
> struct dtl_entry *dispatch_log_end;
> #endif /* CONFIG_PPC_STD_MMU_64 */
> u64 dscr_default; /* per-CPU default DSCR */
> + u64 paca_amr; /* value of amr at exception */
>
> #ifdef CONFIG_PPC_STD_MMU_64
> /*
> diff --git a/arch/powerpc/include/uapi/asm/ptrace.h b/arch/powerpc/include/uapi/asm/ptrace.h
> index 8036b38..7ec2428 100644
> --- a/arch/powerpc/include/uapi/asm/ptrace.h
> +++ b/arch/powerpc/include/uapi/asm/ptrace.h
> @@ -108,8 +108,9 @@ struct pt_regs {
> #define PT_DAR 41
> #define PT_DSISR 42
> #define PT_RESULT 43
> -#define PT_DSCR 44
> #define PT_REGS_COUNT 44
> +#define PT_DSCR 44
> +#define PT_AMR 45

PT_REGS_COUNT is not getting incremented even though one more
element has been added to the set ?

>
> #define PT_FPR0 48 /* each FP reg occupies 2 slots in this space */
>
> diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
> index 709e234..17f5d8a 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -241,6 +241,11 @@ int main(void)
> OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
> OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
> OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
> +
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + OFFSET(PACA_AMR, paca_struct, paca_amr);
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +

So we now have a place in PACA for AMR.

> OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime);
> OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user);
> OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime);
> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> index 3fd0528..8db9ef8 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -493,6 +493,10 @@ EXC_COMMON_BEGIN(data_access_common)
> ld r12,_MSR(r1)
> ld r3,PACA_EXGEN+EX_DAR(r13)
> lwz r4,PACA_EXGEN+EX_DSISR(r13)
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + mfspr r5,SPRN_AMR
> + std r5,PACA_AMR(r13)
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> li r5,0x300
> std r3,_DAR(r1)
> std r4,_DSISR(r1)
> @@ -561,6 +565,10 @@ EXC_COMMON_BEGIN(instruction_access_common)
> ld r12,_MSR(r1)
> ld r3,_NIP(r1)
> andis. r4,r12,0x5820
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + mfspr r5,SPRN_AMR
> + std r5,PACA_AMR(r13)
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */

This saves the AMR context on page faults; it seems to change
again in the next patch, based on whether any key was active at
that point and whether the fault happened because of key
enforcement ?

> li r5,0x400
> std r3,_DAR(r1)
> std r4,_DSISR(r1)
> diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
> index 97bb138..059766a 100644
> --- a/arch/powerpc/kernel/signal_32.c
> +++ b/arch/powerpc/kernel/signal_32.c
> @@ -500,6 +500,11 @@ static int save_user_regs(struct pt_regs *regs, struct mcontext __user *frame,
> (unsigned long) &frame->tramp[2]);
> }
>
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + if (__put_user(get_paca()->paca_amr, &frame->mc_gregs[PT_AMR]))
> + return 1;
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
> return 0;
> }
>
> @@ -661,6 +666,9 @@ static long restore_user_regs(struct pt_regs *regs,
> long err;
> unsigned int save_r2 = 0;
> unsigned long msr;
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + unsigned long amr;
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> #ifdef CONFIG_VSX
> int i;
> #endif
> @@ -750,6 +758,12 @@ static long restore_user_regs(struct pt_regs *regs,
> return 1;
> #endif /* CONFIG_SPE */
>
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + err |= __get_user(amr, &sr->mc_gregs[PT_AMR]);
> + if (!err && amr != get_paca()->paca_amr)
> + write_amr(amr);
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
> return 0;
> }
>
> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> index c83c115..35df2e4 100644
> --- a/arch/powerpc/kernel/signal_64.c
> +++ b/arch/powerpc/kernel/signal_64.c
> @@ -174,6 +174,10 @@ static long setup_sigcontext(struct sigcontext __user *sc,
> if (set != NULL)
> err |= __put_user(set->sig[0], &sc->oldmask);
>
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + err |= __put_user(get_paca()->paca_amr, &sc->gp_regs[PT_AMR]);
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
> return err;
> }
>
> @@ -327,6 +331,9 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
> unsigned long save_r13 = 0;
> unsigned long msr;
> struct pt_regs *regs = tsk->thread.regs;
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + unsigned long amr;
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> #ifdef CONFIG_VSX
> int i;
> #endif
> @@ -406,6 +413,13 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
> tsk->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = 0;
> }
> #endif
> +
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + err |= __get_user(amr, &sc->gp_regs[PT_AMR]);
> + if (!err && amr != get_paca()->paca_amr)
> + write_amr(amr);
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
> return err;
> }
>
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index d4e545d..cc4bde8b 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -20,6 +20,7 @@
> #include <linux/sched/debug.h>
> #include <linux/kernel.h>
> #include <linux/mm.h>
> +#include <linux/pkeys.h>
> #include <linux/stddef.h>
> #include <linux/unistd.h>
> #include <linux/ptrace.h>
> @@ -247,6 +248,49 @@ void user_single_step_siginfo(struct task_struct *tsk,
> info->si_addr = (void __user *)regs->nip;
> }
>
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> +static void fill_sig_info_pkey(int si_code, siginfo_t *info, unsigned long addr)
> +{
> + struct vm_area_struct *vma;
> +
> + /* Fault not from Protection Keys: nothing to do */
> + if (si_code != SEGV_PKUERR)
> + return;

Shouldn't this have been checked in the caller ?

> +
> + down_read(&current->mm->mmap_sem);
> + /*
> + * we could be racing with pkey_mprotect().
> + * If pkey_mprotect() wins the key value could
> + * get modified...xxx
> + */
> + vma = find_vma(current->mm, addr);
> + up_read(&current->mm->mmap_sem);
> +
> + /*
> + * force_sig_info_fault() is called from a number of
> + * contexts, some of which have a VMA and some of which
> + * do not. The Pkey-fault handing happens after we have a
> + * valid VMA, so we should never reach this without a
> + * valid VMA.
> + */

Also because pkeys can only be used from user space, where we
will definitely have a VMA associated with the address.

> + if (!vma) {
> + WARN_ONCE(1, "Pkey fault with no VMA passed in");
> + info->si_pkey = 0;
> + return;
> + }
> +
> + /*
> + * We could report the incorrect key because of the reason
> + * explained above.

What if we hold mm->mmap_sem a little longer, until we update
info->si_pkey ? Is there still a chance that the pkey would have
changed by the time siginfo returns to user space ? I am still
wondering whether there is a way to hold up VMA changes to be on
the safer side. Does the race condition exist on x86 as well ?

> + *
> + * si_pkey should be thought off as a strong hint, but not
> + * an absolutely guarantee because of the race explained
> + * above.
> + */
> + info->si_pkey = vma_pkey(vma);
> +}
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
> void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
> {
> siginfo_t info;
> @@ -274,6 +318,11 @@ void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
> info.si_signo = signr;
> info.si_code = code;
> info.si_addr = (void __user *) addr;
> +
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + fill_sig_info_pkey(code, &info, addr);
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
> force_sig_info(signr, &info, current);
> }
>
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index c31624f..dd448d2 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -453,6 +453,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
> if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> flags & FAULT_FLAG_INSTRUCTION,
> 0)) {
> +
> + /* our caller may not have saved the amr. Lets save it */
> + get_paca()->paca_amr = read_amr();
> +

Something is not right here. The PACA save should have happened before
we come here. Why say the caller might not have saved the AMR ? Is
there a path where that is possible ?

2017-06-20 07:25:00

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RFC v2 08/12] powerpc: Handle exceptions caused by violation of pkey protection.

On 06/17/2017 09:22 AM, Ram Pai wrote:
> Handle Data and Instruction exceptions caused by memory
> protection-key.
>
> Signed-off-by: Ram Pai <[email protected]>
> (cherry picked from commit a5e5217619a0c475fe0cacc3b0cf1d3d33c79a09)

Which tree does this commit belong to ?

>
> Conflicts:
> arch/powerpc/include/asm/reg.h
> arch/powerpc/kernel/exceptions-64s.S
> ---
> arch/powerpc/include/asm/mmu_context.h | 12 +++++
> arch/powerpc/include/asm/pkeys.h | 9 ++++
> arch/powerpc/include/asm/reg.h | 7 +--
> arch/powerpc/mm/fault.c | 21 +++++++-
> arch/powerpc/mm/pkeys.c | 90 ++++++++++++++++++++++++++++++++++
> 5 files changed, 134 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index da7e943..71fffe0 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
> {
> }
>
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> +bool arch_pte_access_permitted(pte_t pte, bool write);
> +bool arch_vma_access_permitted(struct vm_area_struct *vma,
> + bool write, bool execute, bool foreign);
> +#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +static inline bool arch_pte_access_permitted(pte_t pte, bool write)
> +{
> + /* by default, allow everything */
> + return true;
> +}
> static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
> bool write, bool execute, bool foreign)
> {
> /* by default, allow everything */
> return true;
> }

Right, these are the two functions the core VM expects the
arch to provide.

> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
> #endif /* __KERNEL__ */
> #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
> diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
> index 9b6820d..405e7db 100644
> --- a/arch/powerpc/include/asm/pkeys.h
> +++ b/arch/powerpc/include/asm/pkeys.h
> @@ -14,6 +14,15 @@
> VM_PKEY_BIT3 | \
> VM_PKEY_BIT4)
>
> +static inline u16 pte_flags_to_pkey(unsigned long pte_flags)
> +{
> + return ((pte_flags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0) |
> + ((pte_flags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0) |
> + ((pte_flags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0) |
> + ((pte_flags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0) |
> + ((pte_flags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0);
> +}

Add defines for the 0x1, 0x2, 0x4, 0x8, etc. values above ?
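
Perhaps something like this (names made up here, only to illustrate
the suggestion):

	#define PKEY_VAL_BIT0	0x01UL	/* set from H_PAGE_PKEY_BIT4 */
	#define PKEY_VAL_BIT1	0x02UL	/* set from H_PAGE_PKEY_BIT3 */
	#define PKEY_VAL_BIT2	0x04UL	/* set from H_PAGE_PKEY_BIT2 */
	#define PKEY_VAL_BIT3	0x08UL	/* set from H_PAGE_PKEY_BIT1 */
	#define PKEY_VAL_BIT4	0x10UL	/* set from H_PAGE_PKEY_BIT0 */

so the ?: chain above reads as names rather than magic numbers.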

> +
> #define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
> ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) | \
> ((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) | \
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 2dcb8a1..a11977f 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -285,9 +285,10 @@
> #define DSISR_UNSUPP_MMU 0x00080000 /* Unsupported MMU config */
> #define DSISR_SET_RC 0x00040000 /* Failed setting of R/C bits */
> #define DSISR_PGDIRFAULT 0x00020000 /* Fault on page directory */
> -#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
> - DSISR_PAGEATTR_CONFLT | \
> - DSISR_BADACCESS | \
> +#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
> + DSISR_PAGEATTR_CONFLT | \
> + DSISR_BADACCESS | \
> + DSISR_KEYFAULT | \
> DSISR_BIT43)

This should have been cleaned up before adding the new
DSISR_KEYFAULT reason code to it. But I guess it's
okay.

> #define SPRN_TBRL 0x10C /* Time Base Read Lower Register (user, R/O) */
> #define SPRN_TBRU 0x10D /* Time Base Read Upper Register (user, R/O) */
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 3a7d580..c31624f 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -216,9 +216,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
> * bits we are interested in. But there are some bits which
> * indicate errors in DSISR but can validly be set in SRR1.
> */
> - if (trap == 0x400)
> + if (trap == 0x400) {
> error_code &= 0x48200000;
> - else
> + flags |= FAULT_FLAG_INSTRUCTION;
> + } else
> is_write = error_code & DSISR_ISSTORE;
> #else

Why add FAULT_FLAG_INSTRUCTION here ?

> is_write = error_code & ESR_DST;
> @@ -261,6 +262,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
> }
> #endif
>
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + if (error_code & DSISR_KEYFAULT) {
> + code = SEGV_PKUERR;
> + goto bad_area_nosemaphore;
> + }
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
> /* We restore the interrupt state now */
> if (!arch_irq_disabled_regs(regs))
> local_irq_enable();
> @@ -441,6 +449,15 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
> WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
> #endif /* CONFIG_PPC_STD_MMU */
>
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> + flags & FAULT_FLAG_INSTRUCTION,
> + 0)) {
> + code = SEGV_PKUERR;
> + goto bad_area;
> + }
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +

I am wondering why both of the above checks are required:

* DSISR should contain DSISR_KEYFAULT

* the VMA pkey values indicate whether they match the fault cause


> /*
> * If for any reason at all we couldn't handle the fault,
> * make sure we exit gracefully rather than endlessly redo
> diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
> index 11a32b3..439241a 100644
> --- a/arch/powerpc/mm/pkeys.c
> +++ b/arch/powerpc/mm/pkeys.c
> @@ -27,6 +27,37 @@ static inline bool pkey_allows_readwrite(int pkey)
> return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
> }
>
> +static inline bool pkey_allows_read(int pkey)
> +{
> + int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
> +
> + if (!(read_uamor() & (0x3ul << pkey_shift)))
> + return true;
> +
> + return !(read_amr() & (AMR_AD_BIT << pkey_shift));
> +}

Get read_amr() into a local variable and save some cycles if we
have to do it again.
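
For instance, something in this direction (a sketch only, not taken
from the series) reads each register at most once per check:

	static bool pkey_access_permitted(int pkey, bool write, bool execute)
	{
		int pkey_shift = (arch_max_pkey() - pkey - 1) * AMR_BITS_PER_PKEY;
		u64 amr;

		if (!(read_uamor() & (0x3ul << pkey_shift)))
			return true;	/* key is not user-managed */

		if (execute)
			return !(read_iamr() & (IAMR_EX_BIT << pkey_shift));

		amr = read_amr();	/* read once, test both bits */
		if (amr & (AMR_AD_BIT << pkey_shift))
			return false;
		if (write && (amr & (AMR_WD_BIT << pkey_shift)))
			return false;

		return true;
	}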

> +
> +static inline bool pkey_allows_write(int pkey)
> +{
> + int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
> +
> + if (!(read_uamor() & (0x3ul << pkey_shift)))
> + return true;
> +
> + return !(read_amr() & (AMR_WD_BIT << pkey_shift));
> +}
> +

Ditto

> +static inline bool pkey_allows_execute(int pkey)
> +{
> + int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
> +
> + if (!(read_uamor() & (0x3ul << pkey_shift)))
> + return true;
> +
> + return !(read_iamr() & (IAMR_EX_BIT << pkey_shift));
> +}

Ditto

> +
> +
> /*
> * set the access right in AMR IAMR and UAMOR register
> * for @pkey to that specified in @init_val.
> @@ -175,3 +206,62 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot,
> */
> return vma_pkey(vma);
> }
> +
> +bool arch_pte_access_permitted(pte_t pte, bool write)
> +{
> + int pkey = pte_flags_to_pkey(pte_val(pte));
> +
> + if (!pkey_allows_read(pkey))
> + return false;
> + if (write && !pkey_allows_write(pkey))
> + return false;
> + return true;
> +}
> +
> +/*
> + * We only want to enforce protection keys on the current process
> + * because we effectively have no access to AMR/IAMR for other
> + * processes or any way to tell *which * AMR/IAMR in a threaded
> + * process we could use.
> + *
> + * So do not enforce things if the VMA is not from the current
> + * mm, or if we are in a kernel thread.
> + */
> +static inline bool vma_is_foreign(struct vm_area_struct *vma)
> +{
> + if (!current->mm)
> + return true;
> + /*
> + * if the VMA is from another process, then AMR/IAMR has no
> + * relevance and should not be enforced.
> + */
> + if (current->mm != vma->vm_mm)
> + return true;
> +
> + return false;
> +}
> +

This seems pretty generic, should not be moved to core MM ?

2017-06-20 08:14:50

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RFC v2 07/12] powerpc: Macro the mask used for checking DSI exception

On 06/17/2017 09:22 AM, Ram Pai wrote:
> Replace the magic number used to check for DSI exception
> with a meaningful value.
>
> Signed-off-by: Ram Pai <[email protected]>
> ---
> arch/powerpc/include/asm/reg.h | 9 ++++++++-
> arch/powerpc/kernel/exceptions-64s.S | 2 +-
> 2 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 7e50e47..2dcb8a1 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -272,16 +272,23 @@
> #define SPRN_DAR 0x013 /* Data Address Register */
> #define SPRN_DBCR 0x136 /* e300 Data Breakpoint Control Reg */
> #define SPRN_DSISR 0x012 /* Data Storage Interrupt Status Register */
> +#define DSISR_BIT32 0x80000000 /* not defined */
> #define DSISR_NOHPTE 0x40000000 /* no translation found */
> +#define DSISR_PAGEATTR_CONFLT 0x20000000 /* page attribute conflict */
> +#define DSISR_BIT35 0x10000000 /* not defined */
> #define DSISR_PROTFAULT 0x08000000 /* protection fault */
> #define DSISR_BADACCESS 0x04000000 /* bad access to CI or G */
> #define DSISR_ISSTORE 0x02000000 /* access was a store */
> #define DSISR_DABRMATCH 0x00400000 /* hit data breakpoint */
> -#define DSISR_NOSEGMENT 0x00200000 /* SLB miss */
> #define DSISR_KEYFAULT 0x00200000 /* Key fault */
> +#define DSISR_BIT43 0x00100000 /* not defined */
> #define DSISR_UNSUPP_MMU 0x00080000 /* Unsupported MMU config */
> #define DSISR_SET_RC 0x00040000 /* Failed setting of R/C bits */
> #define DSISR_PGDIRFAULT 0x00020000 /* Fault on page directory */
> +#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
> + DSISR_PAGEATTR_CONFLT | \
> + DSISR_BADACCESS | \
> + DSISR_BIT43)

Sorry, missed this one. It seems there are a couple of unnecessary
line changes in the subsequent patch, which adds the new PKEY
reason code.

-#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
- DSISR_PAGEATTR_CONFLT | \
- DSISR_BADACCESS | \
+#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
+ DSISR_PAGEATTR_CONFLT | \
+ DSISR_BADACCESS | \
+ DSISR_KEYFAULT | \
DSISR_BIT43)



2017-06-20 08:22:04

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RFC v2 06/12] powerpc: Program HPTE key protection bits.

On 06/17/2017 09:22 AM, Ram Pai wrote:
> Map the PTE protection key bits to the HPTE key protection bits,
> while creatiing HPTE entries.
>
> Signed-off-by: Ram Pai <[email protected]>
> ---
> arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 +++++
> arch/powerpc/include/asm/pkeys.h | 7 +++++++
> arch/powerpc/mm/hash_utils_64.c | 5 +++++
> 3 files changed, 17 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> index cfb8169..3d7872c 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> @@ -90,6 +90,8 @@
> #define HPTE_R_PP0 ASM_CONST(0x8000000000000000)
> #define HPTE_R_TS ASM_CONST(0x4000000000000000)
> #define HPTE_R_KEY_HI ASM_CONST(0x3000000000000000)
> +#define HPTE_R_KEY_BIT0 ASM_CONST(0x2000000000000000)
> +#define HPTE_R_KEY_BIT1 ASM_CONST(0x1000000000000000)
> #define HPTE_R_RPN_SHIFT 12
> #define HPTE_R_RPN ASM_CONST(0x0ffffffffffff000)
> #define HPTE_R_RPN_3_0 ASM_CONST(0x01fffffffffff000)
> @@ -104,6 +106,9 @@
> #define HPTE_R_C ASM_CONST(0x0000000000000080)
> #define HPTE_R_R ASM_CONST(0x0000000000000100)
> #define HPTE_R_KEY_LO ASM_CONST(0x0000000000000e00)
> +#define HPTE_R_KEY_BIT2 ASM_CONST(0x0000000000000800)
> +#define HPTE_R_KEY_BIT3 ASM_CONST(0x0000000000000400)
> +#define HPTE_R_KEY_BIT4 ASM_CONST(0x0000000000000200)
>

Should we indicate/document how these 5 bits are not contiguous
in the HPTE format for any given real page ?
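
A short comment next to the defines could cover it, something like
(IBM bit numbers within the second HPTE doubleword; wording is only
illustrative):

	/*
	 * The 5 HPTE key bits are not contiguous:
	 *   key bits 0-1 live in HPTE_R_KEY_HI (bits 2-3),
	 *   key bits 2-4 live in HPTE_R_KEY_LO (bits 52-54).
	 */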

> #define HPTE_V_1TB_SEG ASM_CONST(0x4000000000000000)
> #define HPTE_V_VRMA_MASK ASM_CONST(0x4001ffffff000000)
> diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
> index 0f3dca8..9b6820d 100644
> --- a/arch/powerpc/include/asm/pkeys.h
> +++ b/arch/powerpc/include/asm/pkeys.h
> @@ -27,6 +27,13 @@
> ((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) | \
> ((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL))
>
> +#define calc_pte_to_hpte_pkey_bits(pteflags) \
> + (((pteflags & H_PAGE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL) | \
> + ((pteflags & H_PAGE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) | \
> + ((pteflags & H_PAGE_PKEY_BIT2) ? HPTE_R_KEY_BIT2 : 0x0UL) | \
> + ((pteflags & H_PAGE_PKEY_BIT3) ? HPTE_R_KEY_BIT3 : 0x0UL) | \
> + ((pteflags & H_PAGE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL))
> +

We can drop calc_ in here. pte_to_hpte_pkey_bits should be
sufficient.

2017-06-20 09:57:16

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC v2 00/12] powerpc: Memory Protection Keys

On Tue, 2017-06-20 at 15:10 +1000, Balbir Singh wrote:
> On Fri, 2017-06-16 at 20:52 -0700, Ram Pai wrote:
> > Memory protection keys enable applications to protect its
> > address space from inadvertent access or corruption from
> > itself.
>
> I presume by itself you mean protection between threads?

Not necessarily. You could have for example a JIT that
when it runs the JITed code, only "opens" the keys for
the VM itself, preventing the JITed code from "leaking out"

There are plenty of other usages...
>
> > The overall idea:
> >
> > A process allocates a key and associates it with
> > a address range within its address space.
>
> OK, so this is per VMA?
>
> > The process than can dynamically set read/write
> > permissions on the key without involving the
> > kernel.
>
> This bit is not clear, how can the key be set without
> involving the kernel? I presume you mean the key is set
> in the PTE's and the access protection values can be
> set without involving the kernel?
>
> Any code that violates the permissions
> > off the address space; as defined by its associated
> > key, will receive a segmentation fault.
> >
> > This patch series enables the feature on PPC64.
> > It is enabled on HPTE 64K-page platform.
> >
> > ISA3.0 section 5.7.13 describes the detailed specifications.
> >
> >
> > Testing:
> > This patch series has passed all the protection key
> > tests available in the selftests directory.
> > The tests are updated to work on both x86 and powerpc.
>
> Balbir

2017-06-20 10:20:40

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

On 06/17/2017 09:22 AM, Ram Pai wrote:
> Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
> in the 4K backed hpte pages. These bits continue to be used
> for 64K backed hpte pages in this patch, but will be freed
> up in the next patch.

The numbering 3, 4, 5 and 6 is in BE (IBM bit) format, I believe; I was
initially trying to read it from right to left, as we normally do in
the kernel, and was getting confused. So basically these bits (which
are only applicable for 64K mappings, IIUC) are going to be freed up
from the PTE format.

#define _RPAGE_RSV1 0x1000000000000000UL
#define _RPAGE_RSV2 0x0800000000000000UL
#define _RPAGE_RSV3 0x0400000000000000UL
#define _RPAGE_RSV4 0x0200000000000000UL

As you have mentioned before this feature is available for 64K
page size only and not for 4K mappings. So I assume we support
both the combinations.

* 64K mapping on 64K
* 64K mapping on 4K

These are the current users of the above bits

#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */

>
> The patch does the following change to the 64K PTE format
>
> H_PAGE_BUSY moves from bit 3 to bit 9

and what is in there on bit 9 now ? This ?

#define _RPAGE_SW2 0x00400

which is used as

#define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */

which will not be required any more ?

> H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> of the pte.
> H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the
> second part of the pte.
>
> the four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> are initialized to 0xF, indicating an invalid slot. If a hpte
> gets cached in a 0xF slot(i.e 7th slot of secondary), it is
> released immediately. In other words, even though 0xF is a

Does releasing it immediately mean we attempt again for a new hash slot ?

> valid slot we discard and consider it as an invalid
> slot;i.e hpte_soft_invalid(). This gives us an opportunity to not
> depend on a bit in the primary PTE in order to determine the
> validity of a slot.

So we have to see the slot number in the second half for each PTE to
figure out if it has got a valid slot in the hash page table.

>
> When we release a hpte in the 0xF slot we also release a
> legitimate primary slot and unmap that entry. This is to
> ensure that we do get a legitimate non-0xF slot the next time we
> retry for a slot.

Okay.

>
> Though treating the 0xF slot as invalid reduces the number of available
> slots and may have an effect on performance, the probability
> of hitting a 0xF is extremely low.

Why do you say that ? I thought every slot number has the same probability
of being hit by the hash function.

>
> Compared to the current scheme, the above described scheme reduces
> the number of false hash table updates significantly and has the

How does it reduce false hash table updates ?

> added advantage of releasing four valuable PTE bits for other
> purpose.
>
> This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> Ellermen and myself.
>
> 4K PTE format remain unchanged currently.
>
> Signed-off-by: Ram Pai <[email protected]>
> ---
> arch/powerpc/include/asm/book3s/64/hash-4k.h | 20 +++++++
> arch/powerpc/include/asm/book3s/64/hash-64k.h | 32 +++++++----
> arch/powerpc/include/asm/book3s/64/hash.h | 15 +++--
> arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 ++
> arch/powerpc/mm/dump_linuxpagetables.c | 3 +-
> arch/powerpc/mm/hash64_4k.c | 14 ++---
> arch/powerpc/mm/hash64_64k.c | 81 ++++++++++++---------------
> arch/powerpc/mm/hash_utils_64.c | 30 +++++++---
> 8 files changed, 122 insertions(+), 78 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> index b4b5e6b..5ef1d81 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> @@ -16,6 +16,18 @@
> #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
> #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
>
> +
> +/*
> + * Only supported by 4k linux page size
> + */
> +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> +#define H_PAGE_F_GIX_SHIFT 56
> +
> +#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> +
> +

So we moved the common 64K definitions here.


> /* PTE flags to conserve for HPTE identification */
> #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
> H_PAGE_F_SECOND | H_PAGE_F_GIX)
> @@ -48,6 +60,14 @@ static inline int hash__hugepd_ok(hugepd_t hpd)
> }
> #endif
>
> +static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> + unsigned int subpg_index, unsigned long slot)
> +{
> + return (slot << H_PAGE_F_GIX_SHIFT) &
> + (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> +}

Why are we passing the first 3 arguments to the function if we never
use them inside ? Is the caller expected to take care of them ?

> +
> +
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
> static inline char *get_hpte_slot_array(pmd_t *pmdp)
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> index 9732837..0eb3c89 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> @@ -10,23 +10,25 @@
> * 64k aligned address free up few of the lower bits of RPN for us
> * We steal that here. For more deatils look at pte_pfn/pfn_pte()
> */
> -#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> -#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
> +#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> +#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */

It's the same thing; this changes nothing.

> +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> +#define H_PAGE_F_GIX_SHIFT 56
> +
> +
> +#define H_PAGE_BUSY _RPAGE_RPN42 /* software: PTE & hash are busy */
> +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */

H_PAGE_BUSY seems to be differently defined here.

> +
> /*
> * We need to differentiate between explicit huge page and THP huge
> * page, since THP huge page also need to track real subpage details
> */
> #define H_PAGE_THP_HUGE H_PAGE_4K_PFN
>
> -/*
> - * Used to track subpage group valid if H_PAGE_COMBO is set
> - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
> - */
> -#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)

Is H_PAGE_COMBO_VALID not given an alternative definition ?

> -
> /* PTE flags to conserve for HPTE identification */
> -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
> - H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
> +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO)
> +

Slot information has moved to the second half, hence _PAGE_HPTEFLAGS
need not carry that.

> /*
> * we support 16 fragments per PTE page of 64K size.
> */
> @@ -74,6 +76,16 @@ static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
> return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
> }
>
> +static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> + unsigned int subpg_index, unsigned long slot)
> +{
> + unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> +
> + rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> + *hidxp = rpte.hidx | (slot << (subpg_index << 2));
> + return 0x0UL;
> +}

New method to insert the slot information in the second half.

> +
> #define __rpte_to_pte(r) ((r).pte)
> extern bool __rpte_sub_valid(real_pte_t rpte, unsigned long index);
> /*
> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> index 4e957b0..e7cf03a 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> @@ -8,11 +8,8 @@
> *
> */
> #define H_PTE_NONE_MASK _PAGE_HPTEFLAGS
> -#define H_PAGE_F_GIX_SHIFT 56
> -#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> -#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> -#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> -#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */

Removing the common definitions.

> +
> +#define INIT_HIDX (~0x0UL)
>
> #ifdef CONFIG_PPC_64K_PAGES
> #include <asm/book3s/64/hash-64k.h>
> @@ -160,6 +157,14 @@ static inline int hash__pte_none(pte_t pte)
> return (pte_val(pte) & ~H_PTE_NONE_MASK) == 0;
> }
>
> +static inline bool hpte_soft_invalid(unsigned long slot)
> +{
> + return ((slot & 0xfUL) == 0xfUL);
> +}
> +
> +unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
> + int ssize, real_pte_t rpte, unsigned int subpg_index);
> +
> /* This low level function performs the actual PTE insertion
> * Setting the PTE depends on the MMU type and other factors. It's
> * an horrible mess that I'm not going to try to clean up now but
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> index 6981a52..cfb8169 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> @@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
> extern int __hash_page_64K(unsigned long ea, unsigned long access,
> unsigned long vsid, pte_t *ptep, unsigned long trap,
> unsigned long flags, int ssize);
> +extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> + unsigned int subpg_index, unsigned long slot);
> +extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> + int ssize, real_pte_t rpte, unsigned int subpg_index);

I wonder what purpose set_hidx_slot() defined previously, served.

> +
> struct mm_struct;
> unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
> extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
> diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
> index 44fe483..b832ed3 100644
> --- a/arch/powerpc/mm/dump_linuxpagetables.c
> +++ b/arch/powerpc/mm/dump_linuxpagetables.c
> @@ -213,7 +213,7 @@ struct flag_info {
> .val = H_PAGE_4K_PFN,
> .set = "4K_pfn",
> }, {
> -#endif
> +#else
> .mask = H_PAGE_F_GIX,
> .val = H_PAGE_F_GIX,
> .set = "f_gix",
> @@ -224,6 +224,7 @@ struct flag_info {
> .val = H_PAGE_F_SECOND,
> .set = "f_second",
> }, {
> +#endif /* CONFIG_PPC_64K_PAGES */

Are we adding H_PAGE_F_GIX as an element for 4K mapping ?

> #endif
> .mask = _PAGE_SPECIAL,
> .val = _PAGE_SPECIAL,
> diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
> index 6fa450c..c673829 100644
> --- a/arch/powerpc/mm/hash64_4k.c
> +++ b/arch/powerpc/mm/hash64_4k.c
> @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> pte_t *ptep, unsigned long trap, unsigned long flags,
> int ssize, int subpg_prot)
> {
> + real_pte_t rpte;
> unsigned long hpte_group;
> unsigned long rflags, pa;
> unsigned long old_pte, new_pte;
> @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> * need to add in 0x1 if it's a read-only user page
> */
> rflags = htab_convert_pte_flags(new_pte);
> + rpte = __real_pte(__pte(old_pte), ptep);
>
> if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
> !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> @@ -64,13 +66,10 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> /*
> * There MIGHT be an HPTE for this pte
> */
> - hash = hpt_hash(vpn, shift, ssize);
> - if (old_pte & H_PAGE_F_SECOND)
> - hash = ~hash;
> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> + unsigned long gslot = get_hidx_gslot(vpn, shift,
> + ssize, rpte, 0);

I am wondering why there is a 'g' before the slot in all these
functions.

Its already too much of changes in a single patch. Being a single
logical change it needs to be inside a single change but then we
need much more description in the commit message for some one to
understand what all changed and how.

2017-06-20 10:52:48

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RFC v2 02/12] powerpc: Free up four 64K PTE bits in 64K backed hpte pages.

On 06/17/2017 09:22 AM, Ram Pai wrote:
> Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
> in the 64K backed hpte pages. This along with the earlier
> patch will entirely free up the four bits from 64K PTE.
>
> This patch does the following change to 64K PTE that is
> backed by 64K hpte.
>
> H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> of the pte.
> H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the
> second part of the pte.
>
> since bit 7 is now freed up, we move H_PAGE_BUSY from bit 9
> to bit 7. Trying to minimize gaps so that contiguous bits
> can be allocated if needed in the future.
>
> The second part of the PTE will hold
> (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63.

I still dont understand how we freed up the 5th bit which is
used in the 5th patch. Was that bit never used for any thing
on 64K page size (64K and 4K mappings) ?

+#define _RPAGE_RSV5 0x00040UL

+#define H_PAGE_PKEY_BIT0 _RPAGE_RSV1
+#define H_PAGE_PKEY_BIT1 _RPAGE_RSV2
+#define H_PAGE_PKEY_BIT2 _RPAGE_RSV3
+#define H_PAGE_PKEY_BIT3 _RPAGE_RSV4
+#define H_PAGE_PKEY_BIT4 _RPAGE_RSV5

2017-06-20 22:45:56

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 03/12] powerpc: Implement sys_pkey_alloc and sys_pkey_free system call.

On Mon, Jun 19, 2017 at 10:18:01PM +1000, Michael Ellerman wrote:
> Hi Ram,
>
> Ram Pai <[email protected]> writes:
> > Sys_pkey_alloc() allocates and returns available pkey
> > Sys_pkey_free() frees up the pkey.
> >
> > Total 32 keys are supported on powerpc. However pkey 0,1 and 31
> > are reserved. So effectively we have 29 pkeys.
> >
> > Signed-off-by: Ram Pai <[email protected]>
> > ---
> > include/linux/mm.h | 31 ++++---
> > include/uapi/asm-generic/mman-common.h | 2 +-
>
> Those changes need to be split out and acked by mm folks.
>
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 7cb17c6..34ddac7 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -204,26 +204,35 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
> > #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
> >
> > #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
> > -#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */
> > -#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */
> > -#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
> > -#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
> > +#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit arch */
> > +#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit arch */
> > +#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit arch */
> > +#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit arch */
>
> Please don't change the comments, it makes the diff harder to read.

The lines were exceeding 80 columns. I tried to compress the comments
without losing meaning. Will restore them.

>
> You're actually just adding this AFAICS:
>
> > +#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit arch */
>
> > #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
> > #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
> > #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
> > #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
> > +#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
> > #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
> >
> > #if defined(CONFIG_X86)
> ^
> > # define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
> > -#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
> > -# define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
> > -# define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */
> > -# define VM_PKEY_BIT1 VM_HIGH_ARCH_1
> > -# define VM_PKEY_BIT2 VM_HIGH_ARCH_2
> > -# define VM_PKEY_BIT3 VM_HIGH_ARCH_3
> > -#endif
> > +#if defined(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) \
> > + || defined(CONFIG_PPC64_MEMORY_PROTECTION_KEYS)
> > +#define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
> > +#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 5-bit value */
> ^ 4?
> > +#define VM_PKEY_BIT1 VM_HIGH_ARCH_1
> > +#define VM_PKEY_BIT2 VM_HIGH_ARCH_2
> > +#define VM_PKEY_BIT3 VM_HIGH_ARCH_3
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>
> That appears to be inside an #if defined(CONFIG_X86) ?
>
> > #elif defined(CONFIG_PPC)
> ^
> Should be CONFIG_PPC64_MEMORY_PROTECTION_KEYS no?

Its a little garbled. Will fix it.
>
> > +#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 5-bit value */
> > +#define VM_PKEY_BIT1 VM_HIGH_ARCH_1
> > +#define VM_PKEY_BIT2 VM_HIGH_ARCH_2
> > +#define VM_PKEY_BIT3 VM_HIGH_ARCH_3
> > +#define VM_PKEY_BIT4 VM_HIGH_ARCH_4 /* intel does not use this bit */
> > + /* but reserved for future expansion */
>
> But this hunk is for PPC ?
>
> Is it OK for the other arches & generic code to add another VM_PKEY_BIT4 ?

No, it has to be PPC-specific.

>
> Do you need to update show_smap_vma_flags() ?
>
> > # define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */
> > #elif defined(CONFIG_PARISC)
> > # define VM_GROWSUP VM_ARCH_1
>
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 8c27db0..b13ecc6 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -76,5 +76,5 @@
> > #define PKEY_DISABLE_WRITE 0x2
> > #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
> > PKEY_DISABLE_WRITE)
> > -
> > +#define PKEY_DISABLE_EXECUTE 0x4
>
> How you can set that if it's not in PKEY_ACCESS_MASK?

I was wondering how to handle this. x86 does not support this flag.
However, powerpc has the ability to enable/disable execute permission
on a key. It cannot be done from userspace, but it can be done through
the sys_mprotect_pkey() syscall. Initially I was thinking of not
enabling it on powerpc as well, but then I think we should not block
the hardware feature from being used. I will make PKEY_DISABLE_EXECUTE
part of PKEY_ACCESS_MASK and have powerpc handle it. I will also patch
x86 to return an error if the flag is provided.

Makes sense?
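
For concreteness, a minimal sketch of the asm-generic change I have in
mind (just a sketch of the planned direction, not the final hunk):

#define PKEY_DISABLE_ACCESS	0x1
#define PKEY_DISABLE_WRITE	0x2
#define PKEY_DISABLE_EXECUTE	0x4
#define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
				 PKEY_DISABLE_WRITE |\
				 PKEY_DISABLE_EXECUTE)

x86's sys_pkey_alloc() would then explicitly reject
PKEY_DISABLE_EXECUTE, since its hardware cannot honour it.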


Thanks for your comments,
RP

>
> See:
>
> SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
> {
> int pkey;
> int ret;
>
> /* No flags supported yet. */
> if (flags)
> return -EINVAL;
> /* check for unsupported init values */
> if (init_val & ~PKEY_ACCESS_MASK)
> return -EINVAL;
>
>
> cheers

--
Ram Pai

2017-06-20 23:24:14

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

On Tue, Jun 20, 2017 at 03:50:25PM +0530, Anshuman Khandual wrote:
> On 06/17/2017 09:22 AM, Ram Pai wrote:
> > Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
> > in the 4K backed hpte pages. These bits continue to be used
> > for 64K backed hpte pages in this patch, but will be freed
> > up in the next patch.
>
> The counting 3, 4, 5 and 6 are in BE format I believe, I was
> initially trying to see that from right to left as we normally
> do in the kernel and was getting confused. So basically these
> bits (which are only applicable for 64K mapping IIUC) are going
> to be freed up from the PTE format.
>
> #define _RPAGE_RSV1 0x1000000000000000UL
> #define _RPAGE_RSV2 0x0800000000000000UL
> #define _RPAGE_RSV3 0x0400000000000000UL
> #define _RPAGE_RSV4 0x0200000000000000UL
>
> As you have mentioned before this feature is available for 64K
> page size only and not for 4K mappings. So I assume we support
> both the combinations.
>
> * 64K mapping on 64K
> * 64K mapping on 4K

yes.

>
> These are the current users of the above bits
>
> #define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> #define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> #define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> #define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
>
> >
> > The patch does the following change to the 64K PTE format
> >
> > H_PAGE_BUSY moves from bit 3 to bit 9
>
> and what is in there on bit 9 now ? This ?
>
> #define _RPAGE_SW2 0x00400
>
> which is used as
>
> #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */
>
> which will not be required any more ?

I think you are reading bit 9 from right to left. The bit 9 I refer to
is counted from left to right, using the same numbering convention that
ISA 3.0 uses. I know it is confusing; I will add a note in the comment
of this patch that the bits are to be read the big-endian way.

BTW: bit 9 is not used currently, so this patch uses it. But this is a
temporary move; H_PAGE_BUSY will move to bit 7 in the next patch.

It had to stay at bit 9 because bit 7 is not yet entirely freed up; it
is still used by 64K PTEs backed by 64K hptes.
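
To make the numbering concrete: ISA (big-endian) bit n of a 64-bit word
is the mask 1UL << (63 - n). A small sketch, with an illustrative
ISA_BIT() helper (the name is made up here, only to show the mapping):

/* big-endian / ISA numbering: bit 0 is the most significant bit */
#define ISA_BIT(n)	(1UL << (63 - (n)))

/* ISA_BIT(3) == 0x1000000000000000UL == _RPAGE_RSV1 */
/* ISA_BIT(9) == 0x0040000000000000UL, currently unused, which is why
 * H_PAGE_BUSY can temporarily live there. */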

>
> > H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> > of the pte.
> > H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the
> > second part of the pte.
> >
> > the four bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> > is initialized to 0xF indicating an invalid slot. If a hpte
> > gets cached in a 0xF slot(i.e 7th slot of secondary), it is
> > released immediately. In other words, even though 0xF is a
>
> Release immediately means we attempt again for a new hash slot ?

yes.

>
> > valid slot we discard and consider it as an invalid
> > slot;i.e hpte_soft_invalid(). This gives us an opportunity to not
> > depend on a bit in the primary PTE in order to determine the
> > validity of a slot.
>
> So we have to see the slot number in the second half for each PTE to
> figure out if it has got a valid slot in the hash page table.

yes.

>
> >
> > When we release a hpte in the 0xF slot we also release a
> > legitimate primary slot and unmap that entry. This is to
> > ensure that we do get a legimate non-0xF slot the next time we
> > retry for a slot.
>
> Okay.
>
> >
> > Though treating 0xF slot as invalid reduces the number of available
> > slots and may have an effect on the performance, the probabilty
> > of hitting a 0xF is extermely low.
>
> Why you say that ? I thought every slot number has the same probability
> of hit from the hash function.

Every hash bucket has the same probability. But the slots within a
bucket are filled sequentially, so it takes 15 hptes hashing to the
same bucket before we ever reach the 15th slot of the secondary.

>
> >
> > Compared to the current scheme, the above described scheme reduces
> > the number of false hash table updates significantly and has the
>
> How it reduces false hash table updates ?

Earlier, we had one bit allocated in the first part of the 64K PTE for
four consecutive 4K hptes. If any one of the 4K hptes got hashed in,
the bit got set. Which means any time we faulted on one of the
remaining three 4K hptes, we saw the bit already set and erroneously
tried to update that hpte. So we had a 75% update error rate.
Functionally not bad, but bad from a performance point of view.

With the current scheme, we decide whether a 4K slot is valid by
looking at its value rather than depending on a bit in the main PTE. So
there is no chance of being misled, and hence no chance of trying to
update an invalid hpte. This should improve performance and at the same
time give us four valuable PTE bits.
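
A minimal sketch of the validity test under the new scheme (it reuses
hpte_soft_invalid() and __rpte_to_hidx() from this series; the wrapper
name below is made up purely for illustration):

/* A 4K subpage is treated as not-yet-hashed when its hidx nibble still
 * holds the 0xF "soft invalid" pattern, instead of relying on a bit
 * shared by four subpages in the primary PTE.
 */
static inline bool subpage_is_hashed(real_pte_t rpte, unsigned int subpg_index)
{
	unsigned long hidx = __rpte_to_hidx(rpte, subpg_index);

	return !hpte_soft_invalid(hidx);
}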


>
> > added advantage of releasing four valuable PTE bits for other
> > purpose.
> >
> > This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> > Ellermen and myself.
> >
> > 4K PTE format remain unchanged currently.
> >
> > Signed-off-by: Ram Pai <[email protected]>
> > ---
> > arch/powerpc/include/asm/book3s/64/hash-4k.h | 20 +++++++
> > arch/powerpc/include/asm/book3s/64/hash-64k.h | 32 +++++++----
> > arch/powerpc/include/asm/book3s/64/hash.h | 15 +++--
> > arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 ++
> > arch/powerpc/mm/dump_linuxpagetables.c | 3 +-
> > arch/powerpc/mm/hash64_4k.c | 14 ++---
> > arch/powerpc/mm/hash64_64k.c | 81 ++++++++++++---------------
> > arch/powerpc/mm/hash_utils_64.c | 30 +++++++---
> > 8 files changed, 122 insertions(+), 78 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > index b4b5e6b..5ef1d81 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > @@ -16,6 +16,18 @@
> > #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
> > #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
> >
> > +
> > +/*
> > + * Only supported by 4k linux page size
> > + */
> > +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> > +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> > +#define H_PAGE_F_GIX_SHIFT 56
> > +
> > +#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> > +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> > +
> > +
>
> So we moved the common 64K definitions here.

yes.
>
>
> > /* PTE flags to conserve for HPTE identification */
> > #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
> > H_PAGE_F_SECOND | H_PAGE_F_GIX)
> > @@ -48,6 +60,14 @@ static inline int hash__hugepd_ok(hugepd_t hpd)
> > }
> > #endif
> >
> > +static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> > + unsigned int subpg_index, unsigned long slot)
> > +{
> > + return (slot << H_PAGE_F_GIX_SHIFT) &
> > + (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > +}
>
> Why we are passing the first 3 arguments of the function if we never
> use it inside. Is the caller expected to take care of it ?

Trying to keep the same prototype for the 4K-PTE and 64K-PTE cases.
Otherwise the caller has to wonder which parameter scheme to use.
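
As a sketch, the intended common call site then looks the same in both
paths (illustrative only; the real hunks are in the __hash_page_4K()
and __hash_page_64K() changes):

/* 4K-backed: returns the F_SECOND/F_GIX bits to fold into the PTE.
 * 64K-backed: stores the slot in the second half and returns 0.
 * Either way the caller just ORs the result in.
 */
new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot);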

>
> > +
> > +
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >
> > static inline char *get_hpte_slot_array(pmd_t *pmdp)
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> > index 9732837..0eb3c89 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> > @@ -10,23 +10,25 @@
> > * 64k aligned address free up few of the lower bits of RPN for us
> > * We steal that here. For more deatils look at pte_pfn/pfn_pte()
> > */
> > -#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> > -#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
> > +#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> > +#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
>
> Its the same thing, changes nothing.

It fixes a space/tab problem.

>
> > +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> > +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> > +#define H_PAGE_F_GIX_SHIFT 56
> > +
> > +
> > +#define H_PAGE_BUSY _RPAGE_RPN42 /* software: PTE & hash are busy */
> > +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
>
> H_PAGE_BUSY seems to be differently defined here.

Yes, it is using two different bits depending on the 4K hpte vs 64K
hpte case. But in the next patch they will all be the same and
consistent.

>
> > +
> > /*
> > * We need to differentiate between explicit huge page and THP huge
> > * page, since THP huge page also need to track real subpage details
> > */
> > #define H_PAGE_THP_HUGE H_PAGE_4K_PFN
> >
> > -/*
> > - * Used to track subpage group valid if H_PAGE_COMBO is set
> > - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
> > - */
> > -#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
>
> H_PAGE_COMBO_VALID is not defined alternately ?

It is not needed anymore.

>
> > -
> > /* PTE flags to conserve for HPTE identification */
> > -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
> > - H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
> > +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO)
> > +
>
> Slot information has moved to the second half, hence _PAGE_HPTEFLAGS
> need not carry that.

yes.

>
> > /*
> > * we support 16 fragments per PTE page of 64K size.
> > */
> > @@ -74,6 +76,16 @@ static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
> > return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
> > }
> >
> > +static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> > + unsigned int subpg_index, unsigned long slot)
> > +{
> > + unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> > +
> > + rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> > + *hidxp = rpte.hidx | (slot << (subpg_index << 2));
> > + return 0x0UL;
> > +}
>
> New method to insert the slot information in the second half.

Yes. Well, it is basically trying to reduce code redundancy. Too many
places were using exactly the same code to accomplish the same thing,
so it makes sense to bring it all into one place.

>
> > +
> > #define __rpte_to_pte(r) ((r).pte)
> > extern bool __rpte_sub_valid(real_pte_t rpte, unsigned long index);
> > /*
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> > index 4e957b0..e7cf03a 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> > @@ -8,11 +8,8 @@
> > *
> > */
> > #define H_PTE_NONE_MASK _PAGE_HPTEFLAGS
> > -#define H_PAGE_F_GIX_SHIFT 56
> > -#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> > -#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> > -#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> > -#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
>
> Removing the common definitions.
>
> > +
> > +#define INIT_HIDX (~0x0UL)
> >
> > #ifdef CONFIG_PPC_64K_PAGES
> > #include <asm/book3s/64/hash-64k.h>
> > @@ -160,6 +157,14 @@ static inline int hash__pte_none(pte_t pte)
> > return (pte_val(pte) & ~H_PTE_NONE_MASK) == 0;
> > }
> >
> > +static inline bool hpte_soft_invalid(unsigned long slot)
> > +{
> > + return ((slot & 0xfUL) == 0xfUL);
> > +}
> > +
> > +unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
> > + int ssize, real_pte_t rpte, unsigned int subpg_index);
> > +
> > /* This low level function performs the actual PTE insertion
> > * Setting the PTE depends on the MMU type and other factors. It's
> > * an horrible mess that I'm not going to try to clean up now but
> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > index 6981a52..cfb8169 100644
> > --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > @@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
> > extern int __hash_page_64K(unsigned long ea, unsigned long access,
> > unsigned long vsid, pte_t *ptep, unsigned long trap,
> > unsigned long flags, int ssize);
> > +extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> > + unsigned int subpg_index, unsigned long slot);
> > +extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> > + int ssize, real_pte_t rpte, unsigned int subpg_index);
>
> I wonder what purpose set_hidx_slot() defined previously, served.
>
> > +
> > struct mm_struct;
> > unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
> > extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
> > diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
> > index 44fe483..b832ed3 100644
> > --- a/arch/powerpc/mm/dump_linuxpagetables.c
> > +++ b/arch/powerpc/mm/dump_linuxpagetables.c
> > @@ -213,7 +213,7 @@ struct flag_info {
> > .val = H_PAGE_4K_PFN,
> > .set = "4K_pfn",
> > }, {
> > -#endif
> > +#else
> > .mask = H_PAGE_F_GIX,
> > .val = H_PAGE_F_GIX,
> > .set = "f_gix",
> > @@ -224,6 +224,7 @@ struct flag_info {
> > .val = H_PAGE_F_SECOND,
> > .set = "f_second",
> > }, {
> > +#endif /* CONFIG_PPC_64K_PAGES */
>
> Are we adding H_PAGE_F_GIX as an element for 4K mapping ?

I think there is a mistake here.
In the next patch, when these bits are divorced from 64K PTEs entirely,
we will not need the above code for 64K PTEs.
But good catch. Will fix the error in this patch.

>
> > #endif
> > .mask = _PAGE_SPECIAL,
> > .val = _PAGE_SPECIAL,
> > diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
> > index 6fa450c..c673829 100644
> > --- a/arch/powerpc/mm/hash64_4k.c
> > +++ b/arch/powerpc/mm/hash64_4k.c
> > @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > pte_t *ptep, unsigned long trap, unsigned long flags,
> > int ssize, int subpg_prot)
> > {
> > + real_pte_t rpte;
> > unsigned long hpte_group;
> > unsigned long rflags, pa;
> > unsigned long old_pte, new_pte;
> > @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > * need to add in 0x1 if it's a read-only user page
> > */
> > rflags = htab_convert_pte_flags(new_pte);
> > + rpte = __real_pte(__pte(old_pte), ptep);
> >
> > if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
> > !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> > @@ -64,13 +66,10 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > /*
> > * There MIGHT be an HPTE for this pte
> > */
> > - hash = hpt_hash(vpn, shift, ssize);
> > - if (old_pte & H_PAGE_F_SECOND)
> > - hash = ~hash;
> > - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> > + unsigned long gslot = get_hidx_gslot(vpn, shift,
> > + ssize, rpte, 0);
>
> I am wondering why there is a 'g' before the slot in all these
> functions.

Right. Even I was confused initially. :)

The hash table slots are organized as one big table. Eight consecutive
entries in that table form a bucket. The term slot refers to a slot
within a bucket, while the term gslot (global slot) refers to an entry
in the table as a whole. Roughly speaking, slot 2 in bucket 2 will be
gslot 2*8+2 = 18.
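
Roughly, ignoring the primary/secondary wrinkle that get_hidx_gslot()
also handles, the arithmetic is (sketch only):

/* HPTES_PER_GROUP is 8: a slot indexes within one bucket (group),
 * a gslot is the global index into the whole hash table.
 */
gslot = (hash & htab_hash_mask) * HPTES_PER_GROUP + slot;
/* e.g. bucket 2, slot 2  ->  gslot 2*8 + 2 = 18 */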

>
> Its already too much of changes in a single patch. Being a single
> logical change it needs to be inside a single change but then we
> need much more description in the commit message for some one to
> understand what all changed and how.

I have further broken down this patch: one patch introduces
get_hidx_gslot() and another introduces set_hidx_slot(). Hopefully that
reduces the patch to a graspable size. Let me know,


Thanks for your valuable comments,
RP


--
Ram Pai

2017-06-20 23:25:23

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 02/12] powerpc: Free up four 64K PTE bits in 64K backed hpte pages.

On Tue, Jun 20, 2017 at 04:21:45PM +0530, Anshuman Khandual wrote:
> On 06/17/2017 09:22 AM, Ram Pai wrote:
> > Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
> > in the 64K backed hpte pages. This along with the earlier
> > patch will entirely free up the four bits from 64K PTE.
> >
> > This patch does the following change to 64K PTE that is
> > backed by 64K hpte.
> >
> > H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> > of the pte.
> > H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the
> > second part of the pte.
> >
> > since bit 7 is now freed up, we move H_PAGE_BUSY from bit 9
> > to bit 7. Trying to minimize gaps so that contiguous bits
> > can be allocated if needed in the future.
> >
> > The second part of the PTE will hold
> > (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63.
>
> I still dont understand how we freed up the 5th bit which is
> used in the 5th patch. Was that bit never used for any thing
> on 64K page size (64K and 4K mappings) ?

Yes, it was not used. So I gladly used it :-)


RP

2017-06-20 23:27:00

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 06/12] powerpc: Program HPTE key protection bits.

On Tue, Jun 20, 2017 at 01:51:45PM +0530, Anshuman Khandual wrote:
> On 06/17/2017 09:22 AM, Ram Pai wrote:
> > Map the PTE protection key bits to the HPTE key protection bits,
> > while creatiing HPTE entries.
> >
> > Signed-off-by: Ram Pai <[email protected]>
> > ---
> > arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 +++++
> > arch/powerpc/include/asm/pkeys.h | 7 +++++++
> > arch/powerpc/mm/hash_utils_64.c | 5 +++++
> > 3 files changed, 17 insertions(+)
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > index cfb8169..3d7872c 100644
> > --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > @@ -90,6 +90,8 @@
> > #define HPTE_R_PP0 ASM_CONST(0x8000000000000000)
> > #define HPTE_R_TS ASM_CONST(0x4000000000000000)
> > #define HPTE_R_KEY_HI ASM_CONST(0x3000000000000000)
> > +#define HPTE_R_KEY_BIT0 ASM_CONST(0x2000000000000000)
> > +#define HPTE_R_KEY_BIT1 ASM_CONST(0x1000000000000000)
> > #define HPTE_R_RPN_SHIFT 12
> > #define HPTE_R_RPN ASM_CONST(0x0ffffffffffff000)
> > #define HPTE_R_RPN_3_0 ASM_CONST(0x01fffffffffff000)
> > @@ -104,6 +106,9 @@
> > #define HPTE_R_C ASM_CONST(0x0000000000000080)
> > #define HPTE_R_R ASM_CONST(0x0000000000000100)
> > #define HPTE_R_KEY_LO ASM_CONST(0x0000000000000e00)
> > +#define HPTE_R_KEY_BIT2 ASM_CONST(0x0000000000000800)
> > +#define HPTE_R_KEY_BIT3 ASM_CONST(0x0000000000000400)
> > +#define HPTE_R_KEY_BIT4 ASM_CONST(0x0000000000000200)
> >
>
> Should we indicate/document how these 5 bits are not contiguous
> in the HPTE format for any given real page ?

I can, but it's all well documented in the ISA. In fact all the bits
and macros are a one-to-one translation from the ISA.

>
> > #define HPTE_V_1TB_SEG ASM_CONST(0x4000000000000000)
> > #define HPTE_V_VRMA_MASK ASM_CONST(0x4001ffffff000000)
> > diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
> > index 0f3dca8..9b6820d 100644
> > --- a/arch/powerpc/include/asm/pkeys.h
> > +++ b/arch/powerpc/include/asm/pkeys.h
> > @@ -27,6 +27,13 @@
> > ((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) | \
> > ((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL))
> >
> > +#define calc_pte_to_hpte_pkey_bits(pteflags) \
> > + (((pteflags & H_PAGE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL) | \
> > + ((pteflags & H_PAGE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) | \
> > + ((pteflags & H_PAGE_PKEY_BIT2) ? HPTE_R_KEY_BIT2 : 0x0UL) | \
> > + ((pteflags & H_PAGE_PKEY_BIT3) ? HPTE_R_KEY_BIT3 : 0x0UL) | \
> > + ((pteflags & H_PAGE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL))
> > +
>
> We can drop calc_ in here. pte_to_hpte_pkey_bits should be
> sufficient.

ok. will do.

thanks for your comments,
RP

2017-06-20 23:28:28

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 07/12] powerpc: Macro the mask used for checking DSI exception

On Tue, Jun 20, 2017 at 01:44:25PM +0530, Anshuman Khandual wrote:
> On 06/17/2017 09:22 AM, Ram Pai wrote:
> > Replace the magic number used to check for DSI exception
> > with a meaningful value.
> >
> > Signed-off-by: Ram Pai <[email protected]>
> > ---
> > arch/powerpc/include/asm/reg.h | 9 ++++++++-
> > arch/powerpc/kernel/exceptions-64s.S | 2 +-
> > 2 files changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> > index 7e50e47..2dcb8a1 100644
> > --- a/arch/powerpc/include/asm/reg.h
> > +++ b/arch/powerpc/include/asm/reg.h
> > @@ -272,16 +272,23 @@
> > #define SPRN_DAR 0x013 /* Data Address Register */
> > #define SPRN_DBCR 0x136 /* e300 Data Breakpoint Control Reg */
> > #define SPRN_DSISR 0x012 /* Data Storage Interrupt Status Register */
> > +#define DSISR_BIT32 0x80000000 /* not defined */
> > #define DSISR_NOHPTE 0x40000000 /* no translation found */
> > +#define DSISR_PAGEATTR_CONFLT 0x20000000 /* page attribute conflict */
> > +#define DSISR_BIT35 0x10000000 /* not defined */
> > #define DSISR_PROTFAULT 0x08000000 /* protection fault */
> > #define DSISR_BADACCESS 0x04000000 /* bad access to CI or G */
> > #define DSISR_ISSTORE 0x02000000 /* access was a store */
> > #define DSISR_DABRMATCH 0x00400000 /* hit data breakpoint */
> > -#define DSISR_NOSEGMENT 0x00200000 /* SLB miss */
> > #define DSISR_KEYFAULT 0x00200000 /* Key fault */
> > +#define DSISR_BIT43 0x00100000 /* not defined */
> > #define DSISR_UNSUPP_MMU 0x00080000 /* Unsupported MMU config */
> > #define DSISR_SET_RC 0x00040000 /* Failed setting of R/C bits */
> > #define DSISR_PGDIRFAULT 0x00020000 /* Fault on page directory */
> > +#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
> > + DSISR_PAGEATTR_CONFLT | \
> > + DSISR_BADACCESS | \
> > + DSISR_BIT43)
>
> Sorry missed this one. Seems like there are couple of unnecessary
> line additions in the subsequent patch which adds the new PKEY
> reason code.
>
> -#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
> - DSISR_PAGEATTR_CONFLT | \
> - DSISR_BADACCESS | \
> +#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
> + DSISR_PAGEATTR_CONFLT | \
> + DSISR_BADACCESS | \
> + DSISR_KEYFAULT | \
> DSISR_BIT43)

I like to see them separately, one per line. But then you are right,
that is not the convention in this file. So I will change it
accordingly.

thanks,
RP
>
>

--
Ram Pai

2017-06-20 23:43:17

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 08/12] powerpc: Handle exceptions caused by violation of pkey protection.

On Tue, Jun 20, 2017 at 12:54:45PM +0530, Anshuman Khandual wrote:
> On 06/17/2017 09:22 AM, Ram Pai wrote:
> > Handle Data and Instruction exceptions caused by memory
> > protection-key.
> >
> > Signed-off-by: Ram Pai <[email protected]>
> > (cherry picked from commit a5e5217619a0c475fe0cacc3b0cf1d3d33c79a09)

Sorry, it was residue of a bad cleanup. It got cherry-picked from my
own internal branch, but then I forgot to delete that line.

>
> To which tree this commit belongs to ?
>
> >
> > Conflicts:
> > arch/powerpc/include/asm/reg.h
> > arch/powerpc/kernel/exceptions-64s.S

Same here. These two lines are residue from patching up my tree with
commits from other internal branches.

> > ---
> > arch/powerpc/include/asm/mmu_context.h | 12 +++++
> > arch/powerpc/include/asm/pkeys.h | 9 ++++
> > arch/powerpc/include/asm/reg.h | 7 +--
> > arch/powerpc/mm/fault.c | 21 +++++++-
> > arch/powerpc/mm/pkeys.c | 90 ++++++++++++++++++++++++++++++++++
> > 5 files changed, 134 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> > index da7e943..71fffe0 100644
> > --- a/arch/powerpc/include/asm/mmu_context.h
> > +++ b/arch/powerpc/include/asm/mmu_context.h
> > @@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
> > {
> > }
> >
> > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > +bool arch_pte_access_permitted(pte_t pte, bool write);
> > +bool arch_vma_access_permitted(struct vm_area_struct *vma,
> > + bool write, bool execute, bool foreign);
> > +#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> > +static inline bool arch_pte_access_permitted(pte_t pte, bool write)
> > +{
> > + /* by default, allow everything */
> > + return true;
> > +}
> > static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
> > bool write, bool execute, bool foreign)
> > {
> > /* by default, allow everything */
> > return true;
> > }
>
> Right, these are the two functions the core VM expects the
> arch to provide.
>
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> > +
> > #endif /* __KERNEL__ */
> > #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
> > diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
> > index 9b6820d..405e7db 100644
> > --- a/arch/powerpc/include/asm/pkeys.h
> > +++ b/arch/powerpc/include/asm/pkeys.h
> > @@ -14,6 +14,15 @@
> > VM_PKEY_BIT3 | \
> > VM_PKEY_BIT4)
> >
> > +static inline u16 pte_flags_to_pkey(unsigned long pte_flags)
> > +{
> > + return ((pte_flags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0) |
> > + ((pte_flags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0) |
> > + ((pte_flags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0) |
> > + ((pte_flags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0) |
> > + ((pte_flags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0);
> > +}
>
> Add defines for the above 0x1, 0x2, 0x4, 0x8 etc ?

Hmm... not sure it will make the code any better.

>
> > +
> > #define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
> > ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) | \
> > ((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) | \
> > diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> > index 2dcb8a1..a11977f 100644
> > --- a/arch/powerpc/include/asm/reg.h
> > +++ b/arch/powerpc/include/asm/reg.h
> > @@ -285,9 +285,10 @@
> > #define DSISR_UNSUPP_MMU 0x00080000 /* Unsupported MMU config */
> > #define DSISR_SET_RC 0x00040000 /* Failed setting of R/C bits */
> > #define DSISR_PGDIRFAULT 0x00020000 /* Fault on page directory */
> > -#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
> > - DSISR_PAGEATTR_CONFLT | \
> > - DSISR_BADACCESS | \
> > +#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
> > + DSISR_PAGEATTR_CONFLT | \
> > + DSISR_BADACCESS | \
> > + DSISR_KEYFAULT | \
> > DSISR_BIT43)
>
> This should have been cleaned up before adding new
> DSISR_KEYFAULT reason code into it. But I guess its
> okay.
>
> > #define SPRN_TBRL 0x10C /* Time Base Read Lower Register (user, R/O) */
> > #define SPRN_TBRU 0x10D /* Time Base Read Upper Register (user, R/O) */
> > diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> > index 3a7d580..c31624f 100644
> > --- a/arch/powerpc/mm/fault.c
> > +++ b/arch/powerpc/mm/fault.c
> > @@ -216,9 +216,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
> > * bits we are interested in. But there are some bits which
> > * indicate errors in DSISR but can validly be set in SRR1.
> > */
> > - if (trap == 0x400)
> > + if (trap == 0x400) {
> > error_code &= 0x48200000;
> > - else
> > + flags |= FAULT_FLAG_INSTRUCTION;
> > + } else
> > is_write = error_code & DSISR_ISSTORE;
> > #else
>
> Why adding the FAULT_FLAG_INSTRUCTION here ?

Later in this code, this flag is checked to see if execute protection
was violated.
>
> > is_write = error_code & ESR_DST;
> > @@ -261,6 +262,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
> > }
> > #endif
> >
> > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > + if (error_code & DSISR_KEYFAULT) {
> > + code = SEGV_PKUERR;
> > + goto bad_area_nosemaphore;
> > + }
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> > +
> > /* We restore the interrupt state now */
> > if (!arch_irq_disabled_regs(regs))
> > local_irq_enable();
> > @@ -441,6 +449,15 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
> > WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
> > #endif /* CONFIG_PPC_STD_MMU */
> >
> > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > + if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> > + flags & FAULT_FLAG_INSTRUCTION,
> > + 0)) {
> > + code = SEGV_PKUERR;
> > + goto bad_area;
> > + }
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> > +
>
> I am wondering why both the above checks are required ?

Yes, good question. There are two cases here.

a) When the pte does not yet have a hashed hpte.

In this case the fault occurs because the hpte is not yet mapped.
However, the access may also have violated the protection permissions
of the key associated with that address. So we need a software check
to determine whether a key was violated.

if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,...

handles this case.


b) When the hpte is hashed for the pte and the keys are programmed into
the hpte.

In this case the hardware senses the key protection fault, and we just
have to check whether that is the case.

if (error_code & DSISR_KEYFAULT) {....

handles this case.
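
Putting the two cases side by side, the fault path effectively does the
following (a condensed sketch of the hunks quoted above):

	/* case (b): the hpte carried the key bits and the hardware
	 * flagged the violation for us.
	 */
	if (error_code & DSISR_KEYFAULT) {
		code = SEGV_PKUERR;
		goto bad_area_nosemaphore;
	}
	...
	/* case (a): the page was not hashed yet, so the hardware could
	 * not have detected the violation; check the vma's key in
	 * software instead.
	 */
	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
				       flags & FAULT_FLAG_INSTRUCTION, 0)) {
		code = SEGV_PKUERR;
		goto bad_area;
	}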


>
> * DSISR should contains DSISR_KEYFAULT
>
> * VMA pkey values whether they matched the fault cause
>
>
> > /*
> > * If for any reason at all we couldn't handle the fault,
> > * make sure we exit gracefully rather than endlessly redo
> > diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
> > index 11a32b3..439241a 100644
> > --- a/arch/powerpc/mm/pkeys.c
> > +++ b/arch/powerpc/mm/pkeys.c
> > @@ -27,6 +27,37 @@ static inline bool pkey_allows_readwrite(int pkey)
> > return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
> > }
> >
> > +static inline bool pkey_allows_read(int pkey)
> > +{
> > + int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
> > +
> > + if (!(read_uamor() & (0x3ul << pkey_shift)))
> > + return true;
> > +
> > + return !(read_amr() & (AMR_AD_BIT << pkey_shift));
> > +}
>
> Get read_amr() into a local variable and save some cycles if we
> have to do it again.

No, not really. The AMR can be changed by the process in userspace, so
anything that we cache can go stale.
Or maybe I do not understand your comment.


RP.

2017-06-20 23:56:26

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 09/12] powerpc: Deliver SEGV signal on pkey violation.

On Tue, Jun 20, 2017 at 12:24:53PM +0530, Anshuman Khandual wrote:
> On 06/17/2017 09:22 AM, Ram Pai wrote:
> > The value of the AMR register at the time of exception
> > is made available in gp_regs[PT_AMR] of the siginfo.
> >
> > This field can be used to reprogram the permission bits of
> > any valid pkey.
> >
> > Similarly the value of the pkey, whose protection got violated,
> > is made available at si_pkey field of the siginfo structure.
> >
> > Signed-off-by: Ram Pai <[email protected]>
> > ---
> > arch/powerpc/include/asm/paca.h | 1 +
> > arch/powerpc/include/uapi/asm/ptrace.h | 3 ++-
> > arch/powerpc/kernel/asm-offsets.c | 5 ++++
> > arch/powerpc/kernel/exceptions-64s.S | 8 ++++++
> > arch/powerpc/kernel/signal_32.c | 14 ++++++++++
> > arch/powerpc/kernel/signal_64.c | 14 ++++++++++
> > arch/powerpc/kernel/traps.c | 49 ++++++++++++++++++++++++++++++++++
> > arch/powerpc/mm/fault.c | 4 +++
> > 8 files changed, 97 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> > index 1c09f8f..a41afd3 100644
> > --- a/arch/powerpc/include/asm/paca.h
> > +++ b/arch/powerpc/include/asm/paca.h
> > @@ -92,6 +92,7 @@ struct paca_struct {
> > struct dtl_entry *dispatch_log_end;
> > #endif /* CONFIG_PPC_STD_MMU_64 */
> > u64 dscr_default; /* per-CPU default DSCR */
> > + u64 paca_amr; /* value of amr at exception */
> >
> > #ifdef CONFIG_PPC_STD_MMU_64
> > /*
> > diff --git a/arch/powerpc/include/uapi/asm/ptrace.h b/arch/powerpc/include/uapi/asm/ptrace.h
> > index 8036b38..7ec2428 100644
> > --- a/arch/powerpc/include/uapi/asm/ptrace.h
> > +++ b/arch/powerpc/include/uapi/asm/ptrace.h
> > @@ -108,8 +108,9 @@ struct pt_regs {
> > #define PT_DAR 41
> > #define PT_DSISR 42
> > #define PT_RESULT 43
> > -#define PT_DSCR 44
> > #define PT_REGS_COUNT 44
> > +#define PT_DSCR 44
> > +#define PT_AMR 45
>
> PT_REGS_COUNT is not getting incremented even after adding
> one more element into the pack ?

Correct. There are 48 entries in the gp_regs table AFAICT; only the
first 45 are exposed through pt_regs as well as gp_regs. The remaining
ones are exposed through gp_regs only.

>
> >
> > #define PT_FPR0 48 /* each FP reg occupies 2 slots in this space */
> >
> > diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
> > index 709e234..17f5d8a 100644
> > --- a/arch/powerpc/kernel/asm-offsets.c
> > +++ b/arch/powerpc/kernel/asm-offsets.c
> > @@ -241,6 +241,11 @@ int main(void)
> > OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
> > OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
> > OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
> > +
> > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > + OFFSET(PACA_AMR, paca_struct, paca_amr);
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> > +
>
> So we now have a place in PACA for AMR.

yes.

>
> > OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime);
> > OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user);
> > OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime);
> > diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> > index 3fd0528..8db9ef8 100644
> > --- a/arch/powerpc/kernel/exceptions-64s.S
> > +++ b/arch/powerpc/kernel/exceptions-64s.S
> > @@ -493,6 +493,10 @@ EXC_COMMON_BEGIN(data_access_common)
> > ld r12,_MSR(r1)
> > ld r3,PACA_EXGEN+EX_DAR(r13)
> > lwz r4,PACA_EXGEN+EX_DSISR(r13)
> > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > + mfspr r5,SPRN_AMR
> > + std r5,PACA_AMR(r13)
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> > li r5,0x300
> > std r3,_DAR(r1)
> > std r4,_DSISR(r1)
> > @@ -561,6 +565,10 @@ EXC_COMMON_BEGIN(instruction_access_common)
> > ld r12,_MSR(r1)
> > ld r3,_NIP(r1)
> > andis. r4,r12,0x5820
> > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > + mfspr r5,SPRN_AMR
> > + std r5,PACA_AMR(r13)
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>
> Saving the AMR context on page faults, this seems to be
> changing in the next patch again based on whether any
> key was active at that point and fault happened for the
> key enforcement ?

Yes. I am going to merge the next patch with this patch.


>
> > li r5,0x400
> > std r3,_DAR(r1)
> > std r4,_DSISR(r1)
> > diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
> > index 97bb138..059766a 100644
> > --- a/arch/powerpc/kernel/signal_32.c
> > +++ b/arch/powerpc/kernel/signal_32.c
> > @@ -500,6 +500,11 @@ static int save_user_regs(struct pt_regs *regs, struct mcontext __user *frame,
> > (unsigned long) &frame->tramp[2]);
> > }
> >
> > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > + if (__put_user(get_paca()->paca_amr, &frame->mc_gregs[PT_AMR]))
> > + return 1;
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> > +
> > return 0;
> > }
> >
> > @@ -661,6 +666,9 @@ static long restore_user_regs(struct pt_regs *regs,
> > long err;
> > unsigned int save_r2 = 0;
> > unsigned long msr;
> > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > + unsigned long amr;
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> > #ifdef CONFIG_VSX
> > int i;
> > #endif
> > @@ -750,6 +758,12 @@ static long restore_user_regs(struct pt_regs *regs,
> > return 1;
> > #endif /* CONFIG_SPE */
> >
> > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > + err |= __get_user(amr, &sr->mc_gregs[PT_AMR]);
> > + if (!err && amr != get_paca()->paca_amr)
> > + write_amr(amr);
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> > +
> > return 0;
> > }
> >
> > diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> > index c83c115..35df2e4 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -174,6 +174,10 @@ static long setup_sigcontext(struct sigcontext __user *sc,
> > if (set != NULL)
> > err |= __put_user(set->sig[0], &sc->oldmask);
> >
> > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > + err |= __put_user(get_paca()->paca_amr, &sc->gp_regs[PT_AMR]);
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> > +
> > return err;
> > }
> >
> > @@ -327,6 +331,9 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
> > unsigned long save_r13 = 0;
> > unsigned long msr;
> > struct pt_regs *regs = tsk->thread.regs;
> > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > + unsigned long amr;
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> > #ifdef CONFIG_VSX
> > int i;
> > #endif
> > @@ -406,6 +413,13 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
> > tsk->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = 0;
> > }
> > #endif
> > +
> > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > + err |= __get_user(amr, &sc->gp_regs[PT_AMR]);
> > + if (!err && amr != get_paca()->paca_amr)
> > + write_amr(amr);
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> > +
> > return err;
> > }
> >
> > diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> > index d4e545d..cc4bde8b 100644
> > --- a/arch/powerpc/kernel/traps.c
> > +++ b/arch/powerpc/kernel/traps.c
> > @@ -20,6 +20,7 @@
> > #include <linux/sched/debug.h>
> > #include <linux/kernel.h>
> > #include <linux/mm.h>
> > +#include <linux/pkeys.h>
> > #include <linux/stddef.h>
> > #include <linux/unistd.h>
> > #include <linux/ptrace.h>
> > @@ -247,6 +248,49 @@ void user_single_step_siginfo(struct task_struct *tsk,
> > info->si_addr = (void __user *)regs->nip;
> > }
> >
> > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > +static void fill_sig_info_pkey(int si_code, siginfo_t *info, unsigned long addr)
> > +{
> > + struct vm_area_struct *vma;
> > +
> > + /* Fault not from Protection Keys: nothing to do */
> > + if (si_code != SEGV_PKUERR)
> > + return;
>
> Should have checked this in the caller ?

Maybe. Currently there is only one caller of this function, so either
way is OK. But if more callers show up later, having the check here
reduces the burden on them.


>
> > +
> > + down_read(&current->mm->mmap_sem);
> > + /*
> > + * we could be racing with pkey_mprotect().
> > + * If pkey_mprotect() wins the key value could
> > + * get modified...xxx
> > + */
> > + vma = find_vma(current->mm, addr);
> > + up_read(&current->mm->mmap_sem);
> > +
> > + /*
> > + * force_sig_info_fault() is called from a number of
> > + * contexts, some of which have a VMA and some of which
> > + * do not. The Pkey-fault handing happens after we have a
> > + * valid VMA, so we should never reach this without a
> > + * valid VMA.
> > + */
>
> Also because pkey can only be used from user space when we will
> definitely have a VMA associated with it.
>
> > + if (!vma) {
> > + WARN_ONCE(1, "Pkey fault with no VMA passed in");
> > + info->si_pkey = 0;
> > + return;
> > + }
> > +
> > + /*
> > + * We could report the incorrect key because of the reason
> > + * explained above.
>
> What if we hold mm->mmap_sem for some more time till we update
> info->si_pkey ? Is there still a chance that pkey would have
> changed by the time siginfo returns to user space ? I am still
> wondering is there way to hold up VMA changes to be on safer
> side. Is the race conditions exists on x86 as well ?
>
> > + *
> > + * si_pkey should be thought off as a strong hint, but not
> > + * an absolutely guarantee because of the race explained
> > + * above.
> > + */
> > + info->si_pkey = vma_pkey(vma);
> > +}
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> > +
> > void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
> > {
> > siginfo_t info;
> > @@ -274,6 +318,11 @@ void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
> > info.si_signo = signr;
> > info.si_code = code;
> > info.si_addr = (void __user *) addr;
> > +
> > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > + fill_sig_info_pkey(code, &info, addr);
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> > +
> > force_sig_info(signr, &info, current);
> > }
> >
> > diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> > index c31624f..dd448d2 100644
> > --- a/arch/powerpc/mm/fault.c
> > +++ b/arch/powerpc/mm/fault.c
> > @@ -453,6 +453,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
> > if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> > flags & FAULT_FLAG_INSTRUCTION,
> > 0)) {
> > +
> > + /* our caller may not have saved the amr. Lets save it */
> > + get_paca()->paca_amr = read_amr();
> > +
>
> Something is not right here. PACA save should have happened before we
> come here. Why say the caller might not have saved the AMR ? Is there
> a path when its possible ?

This is a case where the fault is caused by the page not yet being
hashed, but the access could at the same time have violated a
protection key. Since the page is not hashed yet, there is no hardware
protection-key fault, and hence the caller would not have saved the
AMR. We do it here to catch that case.


RP
--
Ram Pai

2017-06-20 23:57:14

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 10/12] powerpc: Read AMR only if pkey-violation caused the exception.

On Mon, Jun 19, 2017 at 09:06:13PM +1000, Michael Ellerman wrote:
> Ram Pai <[email protected]> writes:
>
> > Signed-off-by: Ram Pai <[email protected]>
> > ---
> > arch/powerpc/kernel/exceptions-64s.S | 16 ++++++++++------
> > 1 file changed, 10 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> > index 8db9ef8..a4de1b4 100644
> > --- a/arch/powerpc/kernel/exceptions-64s.S
> > +++ b/arch/powerpc/kernel/exceptions-64s.S
> > @@ -493,13 +493,15 @@ EXC_COMMON_BEGIN(data_access_common)
> > ld r12,_MSR(r1)
> > ld r3,PACA_EXGEN+EX_DAR(r13)
> > lwz r4,PACA_EXGEN+EX_DSISR(r13)
> > + std r3,_DAR(r1)
> > + std r4,_DSISR(r1)
> > #ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > + andis. r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
> > + beq+ 1f
>
> This seems to be incremental on top of one of your other patches.
>
> But I don't see why, can you please just squash this into whatever patch
> adds this code in the first place.

Yes, squash it is.

The next version of my patch will have it squashed.
RP

2017-06-20 23:59:14

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 10/12] powerpc: Read AMR only if pkey-violation caused the exception.

On Tue, Jun 20, 2017 at 12:16:40PM +0530, Anshuman Khandual wrote:
> On 06/19/2017 11:29 PM, Ram Pai wrote:
> > On Mon, Jun 19, 2017 at 09:06:13PM +1000, Michael Ellerman wrote:
> >> Ram Pai <[email protected]> writes:
> >>
> >>> Signed-off-by: Ram Pai <[email protected]>
> >>> ---
> >>> arch/powerpc/kernel/exceptions-64s.S | 16 ++++++++++------
> >>> 1 file changed, 10 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> >>> index 8db9ef8..a4de1b4 100644
> >>> --- a/arch/powerpc/kernel/exceptions-64s.S
> >>> +++ b/arch/powerpc/kernel/exceptions-64s.S
> >>> @@ -493,13 +493,15 @@ EXC_COMMON_BEGIN(data_access_common)
> >>> ld r12,_MSR(r1)
> >>> ld r3,PACA_EXGEN+EX_DAR(r13)
> >>> lwz r4,PACA_EXGEN+EX_DSISR(r13)
> >>> + std r3,_DAR(r1)
> >>> + std r4,_DSISR(r1)
> >>> #ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> >>> + andis. r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
> >>> + beq+ 1f
> >>
> >> This seems to be incremental on top of one of your other patches.
> >>
> >> But I don't see why, can you please just squash this into whatever patch
> >> adds this code in the first place.
> >
> > It was an optimization added later. But yes it can be squashed into an
> > earlier patch.
>
> Could you please explain what is the optimization this achieves ?

We don't want to read the AMR if it is not a key protection fault. This
is a very hot path; a few cycles saved can accumulate into significant
savings overall.

RP

2017-06-21 00:04:46

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 11/12]Documentation: Documentation updates.

On Tue, Jun 20, 2017 at 11:48:23AM +0530, Anshuman Khandual wrote:
> On 06/17/2017 09:22 AM, Ram Pai wrote:
> > The Documentaton file is moved from x86 into the generic area,
> > since this feature is now supported by more than one archs.
> >
> > Signed-off-by: Ram Pai <[email protected]>
> > ---
> > Documentation/vm/protection-keys.txt | 110 ++++++++++++++++++++++++++++++++++
> > Documentation/x86/protection-keys.txt | 85 --------------------------
>
> I am not sure whether this is a good idea. There might be
> specifics for each architecture which need to be detailed
> again in this new generic one.
>
> > 2 files changed, 110 insertions(+), 85 deletions(-)
> > create mode 100644 Documentation/vm/protection-keys.txt
> > delete mode 100644 Documentation/x86/protection-keys.txt
> >
> > diff --git a/Documentation/vm/protection-keys.txt b/Documentation/vm/protection-keys.txt
> > new file mode 100644
> > index 0000000..b49e6bb
> > --- /dev/null
> > +++ b/Documentation/vm/protection-keys.txt
> > @@ -0,0 +1,110 @@
> > +Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
> > +found in new generation of intel CPUs on PowerPC CPUs.
> > +
> > +Memory Protection Keys provides a mechanism for enforcing page-based
> > +protections, but without requiring modification of the page tables
> > +when an application changes protection domains.
>
> Should the resultant access through protection keys be a
> subset of the protection bits enabled through the original PTE
> PROT format ? Are the semantics exactly the same on x86
> and powerpc ?

The protection key takes precedence over the protection set through
mprotect().
Yes, we maintain the same semantics on both x86 and powerpc.
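
To make the precedence concrete, here is a small illustration (mine, not
part of the patch). It assumes libc wrappers for the pkey syscalls are
available and uses the flag names from the documentation text quoted
below; treat it as a sketch, not the series' own test code.

	#include <sys/mman.h>

	int main(void)
	{
		int pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ);
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

		/* plain protections still say read/write is fine ...        */
		pkey_mprotect(p, 4096, PROT_READ | PROT_WRITE, pkey);

		/* ... but the key wins: SIGSEGV with si_code == SEGV_PKUERR */
		p[0] = 1;

		pkey_free(pkey);
		return 0;
	}
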
>
> > +
> > +
> > +On Intel:
> > +
> > +It works by dedicating 4 previously ignored bits in each page table
> > +entry to a "protection key", giving 16 possible keys.
> > +
> > +There is also a new user-accessible register (PKRU) with two separate
> > +bits (Access Disable and Write Disable) for each key. Being a CPU
> > +register, PKRU is inherently thread-local, potentially giving each
> > +thread a different set of protections from every other thread.
> > +
> > +There are two new instructions (RDPKRU/WRPKRU) for reading and writing
> > +to the new register. The feature is only available in 64-bit mode,
> > +even though there is theoretically space in the PAE PTEs. These
> > +permissions are enforced on data access only and have no effect on
> > +instruction fetches.
> > +
> > +
> > +On PowerPC:
> > +
> > +It works by dedicating 5 page table entry bits to a "protection key",
> > +giving 32 possible keys.
> > +
> > +There is a user-accessible register (AMR) with two separate bits
> > +(Access Disable and Write Disable) for each key. Being a CPU
> > +register, AMR is inherently thread-local, potentially giving each
> > +thread a different set of protections from every other thread.
>
> Small nit. Space needed here.
>
> > +NOTE: Disabling read permission does not disable
> > +write and vice-versa.
> > +
> > +The feature is available on 64-bit HPTE mode only.
> > +
> > +'mtspr 0xd, Rx' writes the contents of register Rx into the AMR register
> > +'mfspr Rx, 0xd' reads the AMR register into register Rx.
> > +
> > +Permissions are enforced on data access only and have no effect on
> > +instruction fetches.
> > +
> > +=========================== Syscalls ===========================
> > +
> > +There are 3 system calls which directly interact with pkeys:
> > +
> > + int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
> > + int pkey_free(int pkey);
> > + int pkey_mprotect(unsigned long start, size_t len,
> > + unsigned long prot, int pkey);
> > +
> > +Before a pkey can be used, it must first be allocated with
> > +pkey_alloc(). An application calls the WRPKRU instruction
> > +directly in order to change access permissions to memory covered
> > +with a key. In this example WRPKRU is wrapped by a C function
> > +called pkey_set().
> > +
> > + int real_prot = PROT_READ|PROT_WRITE;
> > + pkey = pkey_alloc(0, PKEY_DENY_WRITE);
> > + ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
> > + ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
> > + ... application runs here
> > +
> > +Now, if the application needs to update the data at 'ptr', it can
> > +gain access, do the update, then remove its write access:
> > +
> > + pkey_set(pkey, 0); // clear PKEY_DENY_WRITE
> > + *ptr = foo; // assign something
> > + pkey_set(pkey, PKEY_DENY_WRITE); // set PKEY_DENY_WRITE again
> > +
> > +Now when it frees the memory, it will also free the pkey since it
> > +is no longer in use:
> > +
> > + munmap(ptr, PAGE_SIZE);
> > + pkey_free(pkey);
> > +
> > +(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
> > + An example implementation can be found in
> > + tools/testing/selftests/x86/protection_keys.c)
> > +
> > +=========================== Behavior ===========================
> > +
> > +The kernel attempts to make protection keys consistent with the
> > +behavior of a plain mprotect(). For instance if you do this:
> > +
> > + mprotect(ptr, size, PROT_NONE);
> > + something(ptr);
> > +
> > +you can expect the same effects with protection keys when doing this:
> > +
> > + pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ);
> > + pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
> > + something(ptr);
> > +
> > +That should be true whether something() is a direct access to 'ptr'
> > +like:
> > +
> > + *ptr = foo;
> > +
> > +or when the kernel does the access on the application's behalf like
> > +with a read():
> > +
> > + read(fd, ptr, 1);
> > +
> > +The kernel will send a SIGSEGV in both cases, but si_code will be set
> > +to SEGV_PKUERR when violating protection keys versus SEGV_ACCERR when
> > +the plain mprotect() permissions are violated.
>
> I guess the right thing would be to have three files
>
> * Documentation/vm/protection-keys.txt
>
> - Generic interface, system calls
> - Signal handling, error codes
> - Semantics of programming with an example
>
> * Documentation/x86/protection-keys.txt
>
> - Number of active protections keys inside an address space
> - X86 protection key instruction details
> - PTE protection bits placement details
> - Page fault handling
> - Implementation details a bit ?
>
> * Documentation/powerpc/protection-keys.txt
>
> - Number of active protections keys inside an address space
> - Powerpc instructions details
> - PTE protection bits placement details
> - Page fault handling
> - Implementation details a bit ?

I see the value of your suggestion. This is something that will touch
at least two architectures. I want to hear some more input before I make the
changes.

Dave Hansen: would like to hear your ideas.

RP

2017-06-21 00:11:09

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 12/12]selftest: Updated protection key selftest

On Tue, Jun 20, 2017 at 11:56:04AM +0530, Anshuman Khandual wrote:
> On 06/17/2017 09:22 AM, Ram Pai wrote:
> > Added test support for the PowerPC implementation of protection keys.
> >
> > Signed-off-by: Ram Pai <[email protected]>
>
> First of all, there are a lot of instances where we use *pkru*
> named functions on power even the real implementations have
> taken care of doing appropriate things. That looks pretty
> hacky. We need to change them to generic names first before
> adding both x86 and powerpc procedures inside it.

I have abstracted out the arch-specific code. References to
pkru should now be confined to x86 code only.

The patch, I acknowledge, is not easily reviewable.
As Michael Ellerman suggested, I will break it into two patches:
one moves the file and the second does the code changes. That way
it will be easier to review.

RP

2017-06-21 03:18:36

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RFC v2 09/12] powerpc: Deliver SEGV signal on pkey violation.

On 06/21/2017 05:26 AM, Ram Pai wrote:
> On Tue, Jun 20, 2017 at 12:24:53PM +0530, Anshuman Khandual wrote:
>> On 06/17/2017 09:22 AM, Ram Pai wrote:
>>> The value of the AMR register at the time of exception
>>> is made available in gp_regs[PT_AMR] of the siginfo.
>>>
>>> This field can be used to reprogram the permission bits of
>>> any valid pkey.
>>>
>>> Similarly the value of the pkey, whose protection got violated,
>>> is made available at si_pkey field of the siginfo structure.
>>>
>>> Signed-off-by: Ram Pai <[email protected]>
>>> ---
>>> arch/powerpc/include/asm/paca.h | 1 +
>>> arch/powerpc/include/uapi/asm/ptrace.h | 3 ++-
>>> arch/powerpc/kernel/asm-offsets.c | 5 ++++
>>> arch/powerpc/kernel/exceptions-64s.S | 8 ++++++
>>> arch/powerpc/kernel/signal_32.c | 14 ++++++++++
>>> arch/powerpc/kernel/signal_64.c | 14 ++++++++++
>>> arch/powerpc/kernel/traps.c | 49 ++++++++++++++++++++++++++++++++++
>>> arch/powerpc/mm/fault.c | 4 +++
>>> 8 files changed, 97 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
>>> index 1c09f8f..a41afd3 100644
>>> --- a/arch/powerpc/include/asm/paca.h
>>> +++ b/arch/powerpc/include/asm/paca.h
>>> @@ -92,6 +92,7 @@ struct paca_struct {
>>> struct dtl_entry *dispatch_log_end;
>>> #endif /* CONFIG_PPC_STD_MMU_64 */
>>> u64 dscr_default; /* per-CPU default DSCR */
>>> + u64 paca_amr; /* value of amr at exception */
>>>
>>> #ifdef CONFIG_PPC_STD_MMU_64
>>> /*
>>> diff --git a/arch/powerpc/include/uapi/asm/ptrace.h b/arch/powerpc/include/uapi/asm/ptrace.h
>>> index 8036b38..7ec2428 100644
>>> --- a/arch/powerpc/include/uapi/asm/ptrace.h
>>> +++ b/arch/powerpc/include/uapi/asm/ptrace.h
>>> @@ -108,8 +108,9 @@ struct pt_regs {
>>> #define PT_DAR 41
>>> #define PT_DSISR 42
>>> #define PT_RESULT 43
>>> -#define PT_DSCR 44
>>> #define PT_REGS_COUNT 44
>>> +#define PT_DSCR 44
>>> +#define PT_AMR 45
>>
>> PT_REGS_COUNT is not getting incremented even after adding
>> one more element into the pack ?
>
> > Correct. There are 48 entries in the gp_regs table AFAICT; only the first 45
> > are exposed through both pt_regs and gp_regs, the remaining
> > are exposed through gp_regs only.
>
>>
>>>
>>> #define PT_FPR0 48 /* each FP reg occupies 2 slots in this space */
>>>
>>> diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
>>> index 709e234..17f5d8a 100644
>>> --- a/arch/powerpc/kernel/asm-offsets.c
>>> +++ b/arch/powerpc/kernel/asm-offsets.c
>>> @@ -241,6 +241,11 @@ int main(void)
>>> OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
>>> OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
>>> OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
>>> +
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> + OFFSET(PACA_AMR, paca_struct, paca_amr);
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> +
>>
>> So we now have a place in PACA for AMR.
>
> yes.
>
>>
>>> OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime);
>>> OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user);
>>> OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime);
>>> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
>>> index 3fd0528..8db9ef8 100644
>>> --- a/arch/powerpc/kernel/exceptions-64s.S
>>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>>> @@ -493,6 +493,10 @@ EXC_COMMON_BEGIN(data_access_common)
>>> ld r12,_MSR(r1)
>>> ld r3,PACA_EXGEN+EX_DAR(r13)
>>> lwz r4,PACA_EXGEN+EX_DSISR(r13)
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> + mfspr r5,SPRN_AMR
>>> + std r5,PACA_AMR(r13)
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> li r5,0x300
>>> std r3,_DAR(r1)
>>> std r4,_DSISR(r1)
>>> @@ -561,6 +565,10 @@ EXC_COMMON_BEGIN(instruction_access_common)
>>> ld r12,_MSR(r1)
>>> ld r3,_NIP(r1)
>>> andis. r4,r12,0x5820
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> + mfspr r5,SPRN_AMR
>>> + std r5,PACA_AMR(r13)
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>
>> Saving the AMR context on page faults, this seems to be
>> changing in the next patch again based on whether any
>> key was active at that point and fault happened for the
>> key enforcement ?
>
> yes. i am going to merge the next patch with this patch.
>
>
>>
>>> li r5,0x400
>>> std r3,_DAR(r1)
>>> std r4,_DSISR(r1)
>>> diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
>>> index 97bb138..059766a 100644
>>> --- a/arch/powerpc/kernel/signal_32.c
>>> +++ b/arch/powerpc/kernel/signal_32.c
>>> @@ -500,6 +500,11 @@ static int save_user_regs(struct pt_regs *regs, struct mcontext __user *frame,
>>> (unsigned long) &frame->tramp[2]);
>>> }
>>>
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> + if (__put_user(get_paca()->paca_amr, &frame->mc_gregs[PT_AMR]))
>>> + return 1;
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> +
>>> return 0;
>>> }
>>>
>>> @@ -661,6 +666,9 @@ static long restore_user_regs(struct pt_regs *regs,
>>> long err;
>>> unsigned int save_r2 = 0;
>>> unsigned long msr;
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> + unsigned long amr;
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> #ifdef CONFIG_VSX
>>> int i;
>>> #endif
>>> @@ -750,6 +758,12 @@ static long restore_user_regs(struct pt_regs *regs,
>>> return 1;
>>> #endif /* CONFIG_SPE */
>>>
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> + err |= __get_user(amr, &sr->mc_gregs[PT_AMR]);
>>> + if (!err && amr != get_paca()->paca_amr)
>>> + write_amr(amr);
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> +
>>> return 0;
>>> }
>>>
>>> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
>>> index c83c115..35df2e4 100644
>>> --- a/arch/powerpc/kernel/signal_64.c
>>> +++ b/arch/powerpc/kernel/signal_64.c
>>> @@ -174,6 +174,10 @@ static long setup_sigcontext(struct sigcontext __user *sc,
>>> if (set != NULL)
>>> err |= __put_user(set->sig[0], &sc->oldmask);
>>>
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> + err |= __put_user(get_paca()->paca_amr, &sc->gp_regs[PT_AMR]);
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> +
>>> return err;
>>> }
>>>
>>> @@ -327,6 +331,9 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
>>> unsigned long save_r13 = 0;
>>> unsigned long msr;
>>> struct pt_regs *regs = tsk->thread.regs;
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> + unsigned long amr;
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> #ifdef CONFIG_VSX
>>> int i;
>>> #endif
>>> @@ -406,6 +413,13 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
>>> tsk->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = 0;
>>> }
>>> #endif
>>> +
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> + err |= __get_user(amr, &sc->gp_regs[PT_AMR]);
>>> + if (!err && amr != get_paca()->paca_amr)
>>> + write_amr(amr);
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> +
>>> return err;
>>> }
>>>
>>> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
>>> index d4e545d..cc4bde8b 100644
>>> --- a/arch/powerpc/kernel/traps.c
>>> +++ b/arch/powerpc/kernel/traps.c
>>> @@ -20,6 +20,7 @@
>>> #include <linux/sched/debug.h>
>>> #include <linux/kernel.h>
>>> #include <linux/mm.h>
>>> +#include <linux/pkeys.h>
>>> #include <linux/stddef.h>
>>> #include <linux/unistd.h>
>>> #include <linux/ptrace.h>
>>> @@ -247,6 +248,49 @@ void user_single_step_siginfo(struct task_struct *tsk,
>>> info->si_addr = (void __user *)regs->nip;
>>> }
>>>
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> +static void fill_sig_info_pkey(int si_code, siginfo_t *info, unsigned long addr)
>>> +{
>>> + struct vm_area_struct *vma;
>>> +
>>> + /* Fault not from Protection Keys: nothing to do */
>>> + if (si_code != SEGV_PKUERR)
>>> + return;
>>
>> Should have checked this in the caller ?
>
> Maybe. Currently there is only one caller of this function, so either
> way is OK. But if more callers show up later, having the check
> here reduces the burden on the callers.
>
>
>>
>>> +
>>> + down_read(&current->mm->mmap_sem);
>>> + /*
>>> + * we could be racing with pkey_mprotect().
>>> + * If pkey_mprotect() wins the key value could
>>> + * get modified...xxx
>>> + */
>>> + vma = find_vma(current->mm, addr);
>>> + up_read(&current->mm->mmap_sem);
>>> +
>>> + /*
>>> + * force_sig_info_fault() is called from a number of
>>> + * contexts, some of which have a VMA and some of which
>>> + * do not. The Pkey-fault handing happens after we have a
>>> + * valid VMA, so we should never reach this without a
>>> + * valid VMA.
>>> + */
>>
>> Also because pkey can only be used from user space when we will
>> definitely have a VMA associated with it.
>>
>>> + if (!vma) {
>>> + WARN_ONCE(1, "Pkey fault with no VMA passed in");
>>> + info->si_pkey = 0;
>>> + return;
>>> + }
>>> +
>>> + /*
>>> + * We could report the incorrect key because of the reason
>>> + * explained above.
>>
>> What if we hold mm->mmap_sem for some more time till we update
>> info->si_pkey ? Is there still a chance that pkey would have
>> changed by the time siginfo returns to user space ? I am still
>> wondering whether there is a way to hold up VMA changes to be on the safer
>> side. Does the race condition exist on x86 as well ?

Does the race condition exist on x86 systems as well ? Can we hold
the mmap_sem a little longer to improve our chances ?

>>
>>> + *
> >>> + * si_pkey should be thought of as a strong hint, but not
> >>> + * an absolute guarantee because of the race explained
>>> + * above.
>>> + */
>>> + info->si_pkey = vma_pkey(vma);
>>> +}
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> +
>>> void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
>>> {
>>> siginfo_t info;
>>> @@ -274,6 +318,11 @@ void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
>>> info.si_signo = signr;
>>> info.si_code = code;
>>> info.si_addr = (void __user *) addr;
>>> +
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> + fill_sig_info_pkey(code, &info, addr);
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> +
>>> force_sig_info(signr, &info, current);
>>> }
>>>
>>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>>> index c31624f..dd448d2 100644
>>> --- a/arch/powerpc/mm/fault.c
>>> +++ b/arch/powerpc/mm/fault.c
>>> @@ -453,6 +453,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
>>> if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>>> flags & FAULT_FLAG_INSTRUCTION,
>>> 0)) {
>>> +
>>> + /* our caller may not have saved the amr. Lets save it */
>>> + get_paca()->paca_amr = read_amr();
>>> +
>>
>> Something is not right here. PACA save should have happened before we
>> come here. Why say the caller might not have saved the AMR ? Is there
>> a path when its possible ?
>
> This is a case where the fault is caused by the page not yet being
> hashed, but where the access could at the same time have violated a protection key.
> Since the page is not hashed yet, there is no protection-key fault
> and hence the caller would not have saved the AMR. We do it here to
> catch that case.

Is that because of the above optimization you have added in the page
fault path ?

2017-06-21 03:54:53

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RFC v2 08/12] powerpc: Handle exceptions caused by violation of pkey protection.

On 06/21/2017 05:13 AM, Ram Pai wrote:
> On Tue, Jun 20, 2017 at 12:54:45PM +0530, Anshuman Khandual wrote:
>> On 06/17/2017 09:22 AM, Ram Pai wrote:
>>> Handle Data and Instruction exceptions caused by memory
>>> protection-key.
>>>
>>> Signed-off-by: Ram Pai <[email protected]>
>>> (cherry picked from commit a5e5217619a0c475fe0cacc3b0cf1d3d33c79a09)
>
> > Sorry. It was residue of a bad cleanup. It got cherry-picked from my own
> > internal branch, but then I forgot to delete that line.
>
>>
>> To which tree this commit belongs to ?
>>
>>>
>>> Conflicts:
>>> arch/powerpc/include/asm/reg.h
>>> arch/powerpc/kernel/exceptions-64s.S
>
> > Same here. These two lines are residues of patching up my tree with
> > commits from other internal branches.
>
>>> ---
>>> arch/powerpc/include/asm/mmu_context.h | 12 +++++
>>> arch/powerpc/include/asm/pkeys.h | 9 ++++
>>> arch/powerpc/include/asm/reg.h | 7 +--
>>> arch/powerpc/mm/fault.c | 21 +++++++-
>>> arch/powerpc/mm/pkeys.c | 90 ++++++++++++++++++++++++++++++++++
>>> 5 files changed, 134 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
>>> index da7e943..71fffe0 100644
>>> --- a/arch/powerpc/include/asm/mmu_context.h
>>> +++ b/arch/powerpc/include/asm/mmu_context.h
>>> @@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
>>> {
>>> }
>>>
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> +bool arch_pte_access_permitted(pte_t pte, bool write);
>>> +bool arch_vma_access_permitted(struct vm_area_struct *vma,
>>> + bool write, bool execute, bool foreign);
>>> +#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> +static inline bool arch_pte_access_permitted(pte_t pte, bool write)
>>> +{
>>> + /* by default, allow everything */
>>> + return true;
>>> +}
>>> static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
>>> bool write, bool execute, bool foreign)
>>> {
>>> /* by default, allow everything */
>>> return true;
>>> }
>>
>> Right, these are the two functions the core VM expects the
>> arch to provide.
>>
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> +
>>> #endif /* __KERNEL__ */
>>> #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
>>> diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
>>> index 9b6820d..405e7db 100644
>>> --- a/arch/powerpc/include/asm/pkeys.h
>>> +++ b/arch/powerpc/include/asm/pkeys.h
>>> @@ -14,6 +14,15 @@
>>> VM_PKEY_BIT3 | \
>>> VM_PKEY_BIT4)
>>>
>>> +static inline u16 pte_flags_to_pkey(unsigned long pte_flags)
>>> +{
>>> + return ((pte_flags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0) |
>>> + ((pte_flags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0) |
>>> + ((pte_flags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0) |
>>> + ((pte_flags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0) |
>>> + ((pte_flags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0);
>>> +}
>>
>> Add defines for the above 0x1, 0x2, 0x4, 0x8 etc ?
>
> hmm...not sure if it will make the code any better.
>
>>
>>> +
>>> #define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
>>> ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) | \
>>> ((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) | \
>>> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
>>> index 2dcb8a1..a11977f 100644
>>> --- a/arch/powerpc/include/asm/reg.h
>>> +++ b/arch/powerpc/include/asm/reg.h
>>> @@ -285,9 +285,10 @@
>>> #define DSISR_UNSUPP_MMU 0x00080000 /* Unsupported MMU config */
>>> #define DSISR_SET_RC 0x00040000 /* Failed setting of R/C bits */
>>> #define DSISR_PGDIRFAULT 0x00020000 /* Fault on page directory */
>>> -#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
>>> - DSISR_PAGEATTR_CONFLT | \
>>> - DSISR_BADACCESS | \
>>> +#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
>>> + DSISR_PAGEATTR_CONFLT | \
>>> + DSISR_BADACCESS | \
>>> + DSISR_KEYFAULT | \
>>> DSISR_BIT43)
>>
>> This should have been cleaned up before adding new
>> DSISR_KEYFAULT reason code into it. But I guess its
>> okay.
>>
>>> #define SPRN_TBRL 0x10C /* Time Base Read Lower Register (user, R/O) */
>>> #define SPRN_TBRU 0x10D /* Time Base Read Upper Register (user, R/O) */
>>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>>> index 3a7d580..c31624f 100644
>>> --- a/arch/powerpc/mm/fault.c
>>> +++ b/arch/powerpc/mm/fault.c
>>> @@ -216,9 +216,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
>>> * bits we are interested in. But there are some bits which
>>> * indicate errors in DSISR but can validly be set in SRR1.
>>> */
>>> - if (trap == 0x400)
>>> + if (trap == 0x400) {
>>> error_code &= 0x48200000;
>>> - else
>>> + flags |= FAULT_FLAG_INSTRUCTION;
>>> + } else
>>> is_write = error_code & DSISR_ISSTORE;
>>> #else
>>
>> Why adding the FAULT_FLAG_INSTRUCTION here ?
>
> later in this code, this flag is checked to see if execute-protection was
> violated.

'is_exec', which is set for every 0x400 interrupt, can be used for that
purpose ? I guess that's how we have been dealing with generic PROT_EXEC
based faults.

>>
>>> is_write = error_code & ESR_DST;
>>> @@ -261,6 +262,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
>>> }
>>> #endif
>>>
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> + if (error_code & DSISR_KEYFAULT) {
>>> + code = SEGV_PKUERR;
>>> + goto bad_area_nosemaphore;
>>> + }
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> +
>>> /* We restore the interrupt state now */
>>> if (!arch_irq_disabled_regs(regs))
>>> local_irq_enable();
>>> @@ -441,6 +449,15 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
>>> WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
>>> #endif /* CONFIG_PPC_STD_MMU */
>>>
>>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
>>> + if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>>> + flags & FAULT_FLAG_INSTRUCTION,
>>> + 0)) {
>>> + code = SEGV_PKUERR;
>>> + goto bad_area;
>>> + }
>>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>>> +
>>
>> I am wondering why both the above checks are required ?
>
> Yes good question. there are two cases here.
>
> a) when a hpte is not yet hashed to pte.
>
> in this case the fault is because the hpte is not yet mapped.
> However the access may have also violated the protection
> permissions of the key associated with that address. So we need

Both of these cannot be true simultaneously. In this case the
MMU will take a fault because no HPTE is found for the access,
not because of the protection key, irrespective of the pkey value and the type
of the access. Are you saying that DSISR might have both DSISR_NOHPTE
and DSISR_KEYFAULT set for this case ? If not, it's not a good idea
to present SEGV_PKUERR as the reason code during signal delivery.

> to do a software check to determine if a key was violated.
>
> if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,...
>
> handles this case.
>
>
> b) when the hpte is hashed to the pte and keys are programmed into
> the hpte.
>
> in this case the hardware senses the key protection fault
> and we just have to check if that is the case.
>
> if (error_code & DSISR_KEYFAULT) {....
>
> handles this case.

This is correct.

>
>
>>
> >> * DSISR should contain DSISR_KEYFAULT
>>
>> * VMA pkey values whether they matched the fault cause
>>
>>
>>> /*
>>> * If for any reason at all we couldn't handle the fault,
>>> * make sure we exit gracefully rather than endlessly redo
>>> diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
>>> index 11a32b3..439241a 100644
>>> --- a/arch/powerpc/mm/pkeys.c
>>> +++ b/arch/powerpc/mm/pkeys.c
>>> @@ -27,6 +27,37 @@ static inline bool pkey_allows_readwrite(int pkey)
>>> return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
>>> }
>>>
>>> +static inline bool pkey_allows_read(int pkey)
>>> +{
>>> + int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
>>> +
>>> + if (!(read_uamor() & (0x3ul << pkey_shift)))
>>> + return true;
>>> +
>>> + return !(read_amr() & (AMR_AD_BIT << pkey_shift));
>>> +}
>>
>> Get read_amr() into a local variable and save some cycles if we
>> have to do it again.
>
> No, not really. The AMR can be changed by the process in userspace, so anything
> that we cache can go stale.
> Or maybe I do not understand your comment.

I am not suggesting caching the value across calls. Just inside the function:
if a local variable holds the result of read_amr() once, it can be used twice
without reading the register again.
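
Something like this, say (only a sketch that folds the two helpers from the
patch together; it reuses the names the patch already defines -- read_amr(),
read_uamor(), arch_max_pkey(), AMR_BITS_PER_PKEY, AMR_AD_BIT, AMR_WD_BIT):

	static inline void pkey_access_rights(int pkey, bool *read_ok, bool *write_ok)
	{
		int pkey_shift = (arch_max_pkey() - pkey - 1) * AMR_BITS_PER_PKEY;
		u64 amr;

		/* key not enabled in UAMOR: everything is allowed */
		if (!(read_uamor() & (0x3ul << pkey_shift))) {
			*read_ok = *write_ok = true;
			return;
		}

		amr = read_amr();	/* sample the SPR once, use it twice */
		*read_ok  = !(amr & (AMR_AD_BIT << pkey_shift));
		*write_ok = !(amr & (AMR_WD_BIT << pkey_shift));
	}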

2017-06-21 05:35:55

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

On 06/21/2017 04:53 AM, Ram Pai wrote:
> On Tue, Jun 20, 2017 at 03:50:25PM +0530, Anshuman Khandual wrote:
>> On 06/17/2017 09:22 AM, Ram Pai wrote:
>>> Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
>>> in the 4K backed hpte pages. These bits continue to be used
>>> for 64K backed hpte pages in this patch, but will be freed
>>> up in the next patch.
>>
>> The counting 3, 4, 5 and 6 are in BE format I believe, I was
>> initially trying to see that from right to left as we normally
>> do in the kernel and was getting confused. So basically these
>> bits (which are only applicable for 64K mapping IIUC) are going
>> to be freed up from the PTE format.
>>
>> #define _RPAGE_RSV1 0x1000000000000000UL
>> #define _RPAGE_RSV2 0x0800000000000000UL
>> #define _RPAGE_RSV3 0x0400000000000000UL
>> #define _RPAGE_RSV4 0x0200000000000000UL
>>
>> As you have mentioned before this feature is available for 64K
>> page size only and not for 4K mappings. So I assume we support
>> both the combinations.
>>
>> * 64K mapping on 64K
>> * 64K mapping on 4K
>
> yes.
>
>>
>> These are the current users of the above bits
>>
>> #define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
>> #define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
>> #define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
>> #define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
>>
>>>
>>> The patch does the following change to the 64K PTE format
>>>
>>> H_PAGE_BUSY moves from bit 3 to bit 9
>>
>> and what is in there on bit 9 now ? This ?
>>
>> #define _RPAGE_SW2 0x00400
>>
>> which is used as
>>
>> #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */
>>
>> which will not be required any more ?
>
> I think you are reading bit 9 from right to left. The bit 9 I refer to
> is from left to right, using the same numbering convention that ISA 3.0 uses.

Right, my bad. Then it would be this one.

'#define _RPAGE_RPN42 0x0040000000000000UL'

> I know it is confusing, will make a mention in the comment of this
> patch, to read it the big-endian way.

Right.

>
> BTW: bit 9 is not used currently, so this patch uses it. But this is
> a temporary move; H_PAGE_BUSY will move to bit 7 in the next patch.
>
> Had to keep it at bit 9 because bit 7 is not yet entirely freed up; it is
> used by 64K PTEs backed by 64K hptes.

Got it.

>
>>
>>> H_PAGE_F_SECOND which occupied bit 4 moves to the second part
>>> of the pte.
>>> H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the
>>> second part of the pte.
>>>
>>> the four bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
>>> is initialized to 0xF indicating an invalid slot. If a hpte
>>> gets cached in a 0xF slot(i.e 7th slot of secondary), it is
>>> released immediately. In other words, even though 0xF is a
>>
>> Release immediately means we attempt again for a new hash slot ?
>
> yes.
>
>>
>>> valid slot we discard and consider it as an invalid
>>> slot;i.e hpte_soft_invalid(). This gives us an opportunity to not
>>> depend on a bit in the primary PTE in order to determine the
>>> validity of a slot.
>>
>> So we have to see the slot number in the second half for each PTE to
>> figure out if it has got a valid slot in the hash page table.
>
> yes.
>
>>
>>>
>>> When we release a hpte in the 0xF slot we also release a
>>> legitimate primary slot and unmap that entry. This is to
>>> ensure that we do get a legimate non-0xF slot the next time we
>>> retry for a slot.
>>
>> Okay.
>>
>>>
>>> Though treating the 0xF slot as invalid reduces the number of available
>>> slots and may have an effect on performance, the probability
>>> of hitting a 0xF is extremely low.
>>
>> Why you say that ? I thought every slot number has the same probability
>> of hit from the hash function.
>
> Every hash bucket has the same probability. But every slot within the
> hash bucket is filled in sequentially. so it takes 15 hptes to hash to
> the same bucket before we get to the 15th slot in the secondary.

Okay, would the last one be 16th instead ?

>
>>
>>>
>>> Compared to the current scheme, the above described scheme reduces
>>> the number of false hash table updates significantly and has the
>>
>> How it reduces false hash table updates ?
>
> Earlier, we had 1 bit allocated in the first part of the 64K PTE
> for four consecutive 4K hptes. If any one 4K hpte got hashed in,
> the bit got set. Which means any time it faulted on the remaining
> three 4K hptes, we saw the bit already set and tried to erroneously
> update that hpte. So we had a 75% update error rate. Functionally
> not bad, but bad from a performance point of view.

I am a bit out of sync regarding these PTE bits after Aneesh's radix
changes went in :) Will look into this a bit more closely.

>
> With the new scheme, we decide if a 4K slot is valid by looking
> at its value rather than depending on a bit in the main PTE. So
> there is no chance of getting misled, and hence no chance of trying
> to update an invalid hpte. It should improve performance and at the same
> time give us four valuable PTE bits.

I am not sure why you say 'invalid hpte'. IIUC

* We will require 16 '64K on 4K' mappings to actually cover 64K on 64K

* A single (64K on 4K)'s TLB can cover 64K on 64K as long as the TLB is
present and not flushed. That gets us performance. Once flushed, a new
HPTE entry covering new (64K on 4K) is inserted. As long as the PFN
for the 4K is different HPTE will be different and it cannot collide
with any existing ones and create problems (ERAT error ?)

As you are pointing out, I am not sure whether the existing design had
more probability for an invalid HPTE insert. Will look into this in
detail.

>
>
>>
>>> added advantage of releasing four valuable PTE bits for other
>>> purpose.
>>>
>>> This idea was jointly developed by Paul Mackerras, Aneesh, Michael
>>> Ellermen and myself.
>>>
>>> 4K PTE format remain unchanged currently.
>>>
>>> Signed-off-by: Ram Pai <[email protected]>
>>> ---
>>> arch/powerpc/include/asm/book3s/64/hash-4k.h | 20 +++++++
>>> arch/powerpc/include/asm/book3s/64/hash-64k.h | 32 +++++++----
>>> arch/powerpc/include/asm/book3s/64/hash.h | 15 +++--
>>> arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 ++
>>> arch/powerpc/mm/dump_linuxpagetables.c | 3 +-
>>> arch/powerpc/mm/hash64_4k.c | 14 ++---
>>> arch/powerpc/mm/hash64_64k.c | 81 ++++++++++++---------------
>>> arch/powerpc/mm/hash_utils_64.c | 30 +++++++---
>>> 8 files changed, 122 insertions(+), 78 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>>> index b4b5e6b..5ef1d81 100644
>>> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
>>> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>>> @@ -16,6 +16,18 @@
>>> #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
>>> #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
>>>
>>> +
>>> +/*
>>> + * Only supported by 4k linux page size
>>> + */
>>> +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
>>> +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
>>> +#define H_PAGE_F_GIX_SHIFT 56
>>> +
>>> +#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
>>> +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
>>> +
>>> +
>>
>> So we moved the common 64K definitions here.
>
> yes.
>>
>>
>>> /* PTE flags to conserve for HPTE identification */
>>> #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
>>> H_PAGE_F_SECOND | H_PAGE_F_GIX)
>>> @@ -48,6 +60,14 @@ static inline int hash__hugepd_ok(hugepd_t hpd)
>>> }
>>> #endif
>>>
>>> +static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
>>> + unsigned int subpg_index, unsigned long slot)
>>> +{
>>> + return (slot << H_PAGE_F_GIX_SHIFT) &
>>> + (H_PAGE_F_SECOND | H_PAGE_F_GIX);
>>> +}
>>
>> Why we are passing the first 3 arguments of the function if we never
>> use it inside. Is the caller expected to take care of it ?
>
> trying to keep the same prototype for the 4K-pte and 64K-pte cases.
> Otherwise the caller has to wonder which parameter scheme to use.
>
>>
>>> +
>>> +
>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>
>>> static inline char *get_hpte_slot_array(pmd_t *pmdp)
>>> diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
>>> index 9732837..0eb3c89 100644
>>> --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
>>> +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
>>> @@ -10,23 +10,25 @@
>>> * 64k aligned address free up few of the lower bits of RPN for us
>>> * We steal that here. For more deatils look at pte_pfn/pfn_pte()
>>> */
>>> -#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
>>> -#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
>>> +#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
>>> +#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
>>
>> Its the same thing, changes nothing.
>
> it fixes some space/tab problem.
>
>>
>>> +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
>>> +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
>>> +#define H_PAGE_F_GIX_SHIFT 56
>>> +
>>> +
>>> +#define H_PAGE_BUSY _RPAGE_RPN42 /* software: PTE & hash are busy */
>>> +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
>>
>> H_PAGE_BUSY seems to be differently defined here.
>
> Yes. it is using two different bits depending on 4K hpte v/s 64k hpte
> case. But in the next patch all will be same and consistent.
>
>>
>>> +
>>> /*
>>> * We need to differentiate between explicit huge page and THP huge
>>> * page, since THP huge page also need to track real subpage details
>>> */
>>> #define H_PAGE_THP_HUGE H_PAGE_4K_PFN
>>>
>>> -/*
>>> - * Used to track subpage group valid if H_PAGE_COMBO is set
>>> - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
>>> - */
>>> -#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
>>
>> H_PAGE_COMBO_VALID is not defined alternately ?
>
> it is not needed anymore.
>
>>
>>> -
>>> /* PTE flags to conserve for HPTE identification */
>>> -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
>>> - H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
>>> +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO)
>>> +
>>
>> Slot information has moved to the second half, hence _PAGE_HPTEFLAGS
>> need not carry that.
>
> yes.
>
>>
>>> /*
>>> * we support 16 fragments per PTE page of 64K size.
>>> */
>>> @@ -74,6 +76,16 @@ static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
>>> return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
>>> }
>>>
>>> +static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
>>> + unsigned int subpg_index, unsigned long slot)
>>> +{
>>> + unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
>>> +
>>> + rpte.hidx &= ~(0xfUL << (subpg_index << 2));
>>> + *hidxp = rpte.hidx | (slot << (subpg_index << 2));
>>> + return 0x0UL;
>>> +}
>>
>> New method to insert the slot information in the second half.
>
> Yes. Well, it is basically trying to reduce code redundancy. Too many places
> were using exactly the same code to accomplish the same thing. It makes sense to
> bring it all into one place.

Right.

>
>>
>>> +
>>> #define __rpte_to_pte(r) ((r).pte)
>>> extern bool __rpte_sub_valid(real_pte_t rpte, unsigned long index);
>>> /*
>>> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
>>> index 4e957b0..e7cf03a 100644
>>> --- a/arch/powerpc/include/asm/book3s/64/hash.h
>>> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
>>> @@ -8,11 +8,8 @@
>>> *
>>> */
>>> #define H_PTE_NONE_MASK _PAGE_HPTEFLAGS
>>> -#define H_PAGE_F_GIX_SHIFT 56
>>> -#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
>>> -#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
>>> -#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
>>> -#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
>>
>> Removing the common definitions.
>>
>>> +
>>> +#define INIT_HIDX (~0x0UL)
>>>
>>> #ifdef CONFIG_PPC_64K_PAGES
>>> #include <asm/book3s/64/hash-64k.h>
>>> @@ -160,6 +157,14 @@ static inline int hash__pte_none(pte_t pte)
>>> return (pte_val(pte) & ~H_PTE_NONE_MASK) == 0;
>>> }
>>>
>>> +static inline bool hpte_soft_invalid(unsigned long slot)
>>> +{
>>> + return ((slot & 0xfUL) == 0xfUL);
>>> +}
>>> +
>>> +unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
>>> + int ssize, real_pte_t rpte, unsigned int subpg_index);
>>> +
>>> /* This low level function performs the actual PTE insertion
>>> * Setting the PTE depends on the MMU type and other factors. It's
>>> * an horrible mess that I'm not going to try to clean up now but
>>> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>>> index 6981a52..cfb8169 100644
>>> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>>> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>>> @@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
>>> extern int __hash_page_64K(unsigned long ea, unsigned long access,
>>> unsigned long vsid, pte_t *ptep, unsigned long trap,
>>> unsigned long flags, int ssize);
>>> +extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
>>> + unsigned int subpg_index, unsigned long slot);
>>> +extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
>>> + int ssize, real_pte_t rpte, unsigned int subpg_index);
>>
>> I wonder what purpose set_hidx_slot() defined previously, served.
>>
>>> +
>>> struct mm_struct;
>>> unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
>>> extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
>>> diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
>>> index 44fe483..b832ed3 100644
>>> --- a/arch/powerpc/mm/dump_linuxpagetables.c
>>> +++ b/arch/powerpc/mm/dump_linuxpagetables.c
>>> @@ -213,7 +213,7 @@ struct flag_info {
>>> .val = H_PAGE_4K_PFN,
>>> .set = "4K_pfn",
>>> }, {
>>> -#endif
>>> +#else
>>> .mask = H_PAGE_F_GIX,
>>> .val = H_PAGE_F_GIX,
>>> .set = "f_gix",
>>> @@ -224,6 +224,7 @@ struct flag_info {
>>> .val = H_PAGE_F_SECOND,
>>> .set = "f_second",
>>> }, {
>>> +#endif /* CONFIG_PPC_64K_PAGES */
>>
>> Are we adding H_PAGE_F_GIX as an element for 4K mapping ?
>
> I think there is a mistake here.
> In the next patch, when these bits are divorced from
> 64K PTEs entirely, we will not need the above code for 64K PTEs.
> But good catch. Will fix the error in this patch.
>
>>
>>> #endif
>>> .mask = _PAGE_SPECIAL,
>>> .val = _PAGE_SPECIAL,
>>> diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
>>> index 6fa450c..c673829 100644
>>> --- a/arch/powerpc/mm/hash64_4k.c
>>> +++ b/arch/powerpc/mm/hash64_4k.c
>>> @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>>> pte_t *ptep, unsigned long trap, unsigned long flags,
>>> int ssize, int subpg_prot)
>>> {
>>> + real_pte_t rpte;
>>> unsigned long hpte_group;
>>> unsigned long rflags, pa;
>>> unsigned long old_pte, new_pte;
>>> @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>>> * need to add in 0x1 if it's a read-only user page
>>> */
>>> rflags = htab_convert_pte_flags(new_pte);
>>> + rpte = __real_pte(__pte(old_pte), ptep);
>>>
>>> if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
>>> !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
>>> @@ -64,13 +66,10 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>>> /*
>>> * There MIGHT be an HPTE for this pte
>>> */
>>> - hash = hpt_hash(vpn, shift, ssize);
>>> - if (old_pte & H_PAGE_F_SECOND)
>>> - hash = ~hash;
>>> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
>>> - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
>>> + unsigned long gslot = get_hidx_gslot(vpn, shift,
>>> + ssize, rpte, 0);
>>
>> I am wondering why there is a 'g' before the slot in all these
>> functions.
>
> Right. Even I was confused initially. :)
>
> Hash table slots are organized as one big table. 8 consecutive entries
> in that table form a bucket. The term slot is used to refer to a
> slot within the bucket; the term gslot is used to refer to an entry
> in the table. Roughly speaking, slot 2 in bucket 2 will be gslot 2*8+2 = 18.

A global slot then, as it can point anywhere in that two-dimensional (bucket x slot) table ?
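
So roughly this, I suppose (a sketch of the arithmetic only, assuming
HPTES_PER_GROUP is 8 and ignoring the secondary-hash case where the hash
is inverted first):

	unsigned long bucket = hash & htab_hash_mask;		/* which bucket in the hash table  */
	unsigned long gslot  = bucket * HPTES_PER_GROUP + slot;	/* entry in the one big table      */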

>
>>
>> Its already too much of changes in a single patch. Being a single
>> logical change it needs to be inside a single change but then we
>> need much more description in the commit message for some one to
>> understand what all changed and how.
>
> I have further broken down this patch: one patch to introduce get_hidx_gslot(),
> one to introduce set_hidx_slot(). Hopefully that will reduce the size
> of the patch to a graspable level. Let me know,

I did some experiments with the first two patches.

* First of all, the first patch does not compile without this.

--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1612,7 +1612,7 @@ unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, int ssize,
unsigned long flags)
{
- unsigned long hash, index, shift, hidx, gslot;
+ unsigned long index, shift, gslot;
int local = flags & HPTE_LOCAL_UPDATE;

DBG_LOW("flush_hash_page(vpn=%016lx)\n", vpn);

* Though it boots the kernel, the system is kind of unresponsive while attempting
 to compile a kernel. I did not dig further into this, but it seems the
 first patch is not self-sufficient yet.

* With both the first and second patches applied, the kernel boots fine and compiles a kernel.

We need to sort out issues in the first two patches before looking into
the rest of the patch series.

2017-06-21 06:10:19

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 09/12] powerpc: Deliver SEGV signal on pkey violation.

On Wed, Jun 21, 2017 at 08:48:20AM +0530, Anshuman Khandual wrote:
> On 06/21/2017 05:26 AM, Ram Pai wrote:
> > On Tue, Jun 20, 2017 at 12:24:53PM +0530, Anshuman Khandual wrote:
> >> On 06/17/2017 09:22 AM, Ram Pai wrote:
> >>> The value of the AMR register at the time of exception
> >>> is made available in gp_regs[PT_AMR] of the siginfo.
> >>>
> >>> This field can be used to reprogram the permission bits of
> >>> any valid pkey.
> >>>
> >>> Similarly the value of the pkey, whose protection got violated,
> >>> is made available at si_pkey field of the siginfo structure.
> >>>
> >>> Signed-off-by: Ram Pai <[email protected]>
> >>> ---
> >>> arch/powerpc/include/asm/paca.h | 1 +
> >>> arch/powerpc/include/uapi/asm/ptrace.h | 3 ++-
> >>> arch/powerpc/kernel/asm-offsets.c | 5 ++++
> >>> arch/powerpc/kernel/exceptions-64s.S | 8 ++++++
> >>> arch/powerpc/kernel/signal_32.c | 14 ++++++++++
> >>> arch/powerpc/kernel/signal_64.c | 14 ++++++++++
> >>> arch/powerpc/kernel/traps.c | 49 ++++++++++++++++++++++++++++++++++
> >>> arch/powerpc/mm/fault.c | 4 +++
> >>> 8 files changed, 97 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> >>> index 1c09f8f..a41afd3 100644
> >>> --- a/arch/powerpc/include/asm/paca.h
> >>> +++ b/arch/powerpc/include/asm/paca.h
> >>> @@ -92,6 +92,7 @@ struct paca_struct {
> >>> struct dtl_entry *dispatch_log_end;
> >>> #endif /* CONFIG_PPC_STD_MMU_64 */
> >>> u64 dscr_default; /* per-CPU default DSCR */
> >>> + u64 paca_amr; /* value of amr at exception */
> >>>
> >>> #ifdef CONFIG_PPC_STD_MMU_64
> >>> /*
> >>> diff --git a/arch/powerpc/include/uapi/asm/ptrace.h b/arch/powerpc/include/uapi/asm/ptrace.h
> >>> index 8036b38..7ec2428 100644
> >>> --- a/arch/powerpc/include/uapi/asm/ptrace.h
> >>> +++ b/arch/powerpc/include/uapi/asm/ptrace.h
> >>> @@ -108,8 +108,9 @@ struct pt_regs {
> >>> #define PT_DAR 41
> >>> #define PT_DSISR 42
> >>> #define PT_RESULT 43
> >>> -#define PT_DSCR 44
> >>> #define PT_REGS_COUNT 44
> >>> +#define PT_DSCR 44
> >>> +#define PT_AMR 45
> >>
> >> PT_REGS_COUNT is not getting incremented even after adding
> >> one more element into the pack ?
> >
> > Correct. There are 48 entries in the gp_regs table AFAICT; only the first 45
> > are exposed through both pt_regs and gp_regs, the remaining
> > are exposed through gp_regs only.
> >
> >>
> >>>
> >>> #define PT_FPR0 48 /* each FP reg occupies 2 slots in this space */
> >>>
> >>> diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
> >>> index 709e234..17f5d8a 100644
> >>> --- a/arch/powerpc/kernel/asm-offsets.c
> >>> +++ b/arch/powerpc/kernel/asm-offsets.c
> >>> @@ -241,6 +241,11 @@ int main(void)
> >>> OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
> >>> OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
> >>> OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
> >>> +
> >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> >>> + OFFSET(PACA_AMR, paca_struct, paca_amr);
> >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> >>> +
> >>
> >> So we now have a place in PACA for AMR.
> >
> > yes.
> >
> >>
> >>> OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime);
> >>> OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user);
> >>> OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime);
> >>> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> >>> index 3fd0528..8db9ef8 100644
> >>> --- a/arch/powerpc/kernel/exceptions-64s.S
> >>> +++ b/arch/powerpc/kernel/exceptions-64s.S
> >>> @@ -493,6 +493,10 @@ EXC_COMMON_BEGIN(data_access_common)
> >>> ld r12,_MSR(r1)
> >>> ld r3,PACA_EXGEN+EX_DAR(r13)
> >>> lwz r4,PACA_EXGEN+EX_DSISR(r13)
> >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> >>> + mfspr r5,SPRN_AMR
> >>> + std r5,PACA_AMR(r13)
> >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> >>> li r5,0x300
> >>> std r3,_DAR(r1)
> >>> std r4,_DSISR(r1)
> >>> @@ -561,6 +565,10 @@ EXC_COMMON_BEGIN(instruction_access_common)
> >>> ld r12,_MSR(r1)
> >>> ld r3,_NIP(r1)
> >>> andis. r4,r12,0x5820
> >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> >>> + mfspr r5,SPRN_AMR
> >>> + std r5,PACA_AMR(r13)
> >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> >>
> >> Saving the AMR context on page faults, this seems to be
> >> changing in the next patch again based on whether any
> >> key was active at that point and fault happened for the
> >> key enforcement ?
> >
> > yes. i am going to merge the next patch with this patch.
> >
> >
> >>
> >>> li r5,0x400
> >>> std r3,_DAR(r1)
> >>> std r4,_DSISR(r1)
> >>> diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
> >>> index 97bb138..059766a 100644
> >>> --- a/arch/powerpc/kernel/signal_32.c
> >>> +++ b/arch/powerpc/kernel/signal_32.c
> >>> @@ -500,6 +500,11 @@ static int save_user_regs(struct pt_regs *regs, struct mcontext __user *frame,
> >>> (unsigned long) &frame->tramp[2]);
> >>> }
> >>>
> >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> >>> + if (__put_user(get_paca()->paca_amr, &frame->mc_gregs[PT_AMR]))
> >>> + return 1;
> >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> >>> +
> >>> return 0;
> >>> }
> >>>
> >>> @@ -661,6 +666,9 @@ static long restore_user_regs(struct pt_regs *regs,
> >>> long err;
> >>> unsigned int save_r2 = 0;
> >>> unsigned long msr;
> >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> >>> + unsigned long amr;
> >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> >>> #ifdef CONFIG_VSX
> >>> int i;
> >>> #endif
> >>> @@ -750,6 +758,12 @@ static long restore_user_regs(struct pt_regs *regs,
> >>> return 1;
> >>> #endif /* CONFIG_SPE */
> >>>
> >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> >>> + err |= __get_user(amr, &sr->mc_gregs[PT_AMR]);
> >>> + if (!err && amr != get_paca()->paca_amr)
> >>> + write_amr(amr);
> >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> >>> +
> >>> return 0;
> >>> }
> >>>
> >>> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> >>> index c83c115..35df2e4 100644
> >>> --- a/arch/powerpc/kernel/signal_64.c
> >>> +++ b/arch/powerpc/kernel/signal_64.c
> >>> @@ -174,6 +174,10 @@ static long setup_sigcontext(struct sigcontext __user *sc,
> >>> if (set != NULL)
> >>> err |= __put_user(set->sig[0], &sc->oldmask);
> >>>
> >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> >>> + err |= __put_user(get_paca()->paca_amr, &sc->gp_regs[PT_AMR]);
> >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> >>> +
> >>> return err;
> >>> }
> >>>
> >>> @@ -327,6 +331,9 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
> >>> unsigned long save_r13 = 0;
> >>> unsigned long msr;
> >>> struct pt_regs *regs = tsk->thread.regs;
> >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> >>> + unsigned long amr;
> >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> >>> #ifdef CONFIG_VSX
> >>> int i;
> >>> #endif
> >>> @@ -406,6 +413,13 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
> >>> tsk->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = 0;
> >>> }
> >>> #endif
> >>> +
> >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> >>> + err |= __get_user(amr, &sc->gp_regs[PT_AMR]);
> >>> + if (!err && amr != get_paca()->paca_amr)
> >>> + write_amr(amr);
> >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> >>> +
> >>> return err;
> >>> }
> >>>
> >>> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> >>> index d4e545d..cc4bde8b 100644
> >>> --- a/arch/powerpc/kernel/traps.c
> >>> +++ b/arch/powerpc/kernel/traps.c
> >>> @@ -20,6 +20,7 @@
> >>> #include <linux/sched/debug.h>
> >>> #include <linux/kernel.h>
> >>> #include <linux/mm.h>
> >>> +#include <linux/pkeys.h>
> >>> #include <linux/stddef.h>
> >>> #include <linux/unistd.h>
> >>> #include <linux/ptrace.h>
> >>> @@ -247,6 +248,49 @@ void user_single_step_siginfo(struct task_struct *tsk,
> >>> info->si_addr = (void __user *)regs->nip;
> >>> }
> >>>
> >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> >>> +static void fill_sig_info_pkey(int si_code, siginfo_t *info, unsigned long addr)
> >>> +{
> >>> + struct vm_area_struct *vma;
> >>> +
> >>> + /* Fault not from Protection Keys: nothing to do */
> >>> + if (si_code != SEGV_PKUERR)
> >>> + return;
> >>
> >> Should have checked this in the caller ?
> >
> > Maybe. Currently there is only one caller of this function, so either
> > way is OK. But if more callers show up later, having the check
> > here reduces the burden on the callers.
> >
> >
> >>
> >>> +
> >>> + down_read(&current->mm->mmap_sem);
> >>> + /*
> >>> + * we could be racing with pkey_mprotect().
> >>> + * If pkey_mprotect() wins the key value could
> >>> + * get modified...xxx
> >>> + */
> >>> + vma = find_vma(current->mm, addr);
> >>> + up_read(&current->mm->mmap_sem);
> >>> +
> >>> + /*
> >>> + * force_sig_info_fault() is called from a number of
> >>> + * contexts, some of which have a VMA and some of which
> >>> + * do not. The Pkey-fault handing happens after we have a
> >>> + * valid VMA, so we should never reach this without a
> >>> + * valid VMA.
> >>> + */
> >>
> >> Also because pkey can only be used from user space when we will
> >> definitely have a VMA associated with it.
> >>
> >>> + if (!vma) {
> >>> + WARN_ONCE(1, "Pkey fault with no VMA passed in");
> >>> + info->si_pkey = 0;
> >>> + return;
> >>> + }
> >>> +
> >>> + /*
> >>> + * We could report the incorrect key because of the reason
> >>> + * explained above.
> >>
> >> What if we hold mm->mmap_sem for some more time till we update
> >> info->si_pkey ? Is there still a chance that pkey would have
> >> changed by the time siginfo returns to user space ? I am still
> >> wondering whether there is a way to hold up VMA changes to be on the safer
> >> side. Does the race condition exist on x86 as well ?
>
> Does the race condition exist on x86 systems as well ? Can we hold
> the mmap_sem a little longer to improve our chances ?
>

Yes, it exists on x86 as well. We could hold the sem longer, but that
does not entirely solve the problem, and at the same time we would slow down the
process. The foolproof way of solving the problem would be for the hardware to
tell us which key got violated, instead of just telling us that there
was a key violation. Depending on the value in the VMA is always going to
be racy.
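
For what it's worth, a tiny sketch (mine, not part of the series) of how
userspace would consume the hint. The si_pkey accessor name below is an
assumption about how libc eventually exposes the field; the handler is
installed with sigaction() and SA_SIGINFO in the usual way:

	#include <signal.h>

	static volatile sig_atomic_t faulting_pkey = -1;

	static void segv_handler(int sig, siginfo_t *si, void *ctx)
	{
		if (si->si_code == SEGV_PKUERR)
			/* a strong hint, not a guarantee, as discussed above */
			faulting_pkey = si->si_pkey;
	}
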

> >>
> >>> + *
> >>> + * si_pkey should be thought of as a strong hint, but not
> >>> + * an absolute guarantee because of the race explained
> >>> + * above.
> >>> + */
> >>> + info->si_pkey = vma_pkey(vma);
> >>> +}
> >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> >>> +
> >>> void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
> >>> {
> >>> siginfo_t info;
> >>> @@ -274,6 +318,11 @@ void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
> >>> info.si_signo = signr;
> >>> info.si_code = code;
> >>> info.si_addr = (void __user *) addr;
> >>> +
> >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> >>> + fill_sig_info_pkey(code, &info, addr);
> >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> >>> +
> >>> force_sig_info(signr, &info, current);
> >>> }
> >>>
> >>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> >>> index c31624f..dd448d2 100644
> >>> --- a/arch/powerpc/mm/fault.c
> >>> +++ b/arch/powerpc/mm/fault.c
> >>> @@ -453,6 +453,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
> >>> if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> >>> flags & FAULT_FLAG_INSTRUCTION,
> >>> 0)) {
> >>> +
> >>> + /* our caller may not have saved the amr. Lets save it */
> >>> + get_paca()->paca_amr = read_amr();
> >>> +
> >>
> >> Something is not right here. PACA save should have happened before we
> >> come here. Why say the caller might not have saved the AMR ? Is there
> >> a path when its possible ?
> >
> > This is a case where the fault is caused by the page not yet being
> > hashed, but where the access could at the same time have violated a protection key.
> > Since the page is not hashed yet, there is no protection-key fault
> > and hence the caller would not have saved the AMR. We do it here to
> > catch that case.
>
> Is that because of the above optimization you have added in the page
> fault path ?

Yes, exactly.

--
Ram Pai

2017-06-21 06:26:20

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 08/12] powerpc: Handle exceptions caused by violation of pkey protection.

On Wed, Jun 21, 2017 at 09:24:36AM +0530, Anshuman Khandual wrote:
> On 06/21/2017 05:13 AM, Ram Pai wrote:
> > On Tue, Jun 20, 2017 at 12:54:45PM +0530, Anshuman Khandual wrote:
> >> On 06/17/2017 09:22 AM, Ram Pai wrote:
> >>> Handle Data and Instruction exceptions caused by memory
> >>> protection-key.
> >>>
> >>> Signed-off-by: Ram Pai <[email protected]>
> >>> (cherry picked from commit a5e5217619a0c475fe0cacc3b0cf1d3d33c79a09)
> >
> > Sorry. it was residue of a bad cleanup. It got cherry-picked from my own
> > internal branch, but than i forgot to delete that line.
> >
> >>
> >> To which tree this commit belongs to ?
> >>
> >>>
> >>> Conflicts:
> >>> arch/powerpc/include/asm/reg.h
> >>> arch/powerpc/kernel/exceptions-64s.S
> >
> > same here. these two line are some residues of patching-up my tree with
> > commits from other internal branches.
> >
> >>> ---
> >>> arch/powerpc/include/asm/mmu_context.h | 12 +++++
> >>> arch/powerpc/include/asm/pkeys.h | 9 ++++
> >>> arch/powerpc/include/asm/reg.h | 7 +--
> >>> arch/powerpc/mm/fault.c | 21 +++++++-
> >>> arch/powerpc/mm/pkeys.c | 90 ++++++++++++++++++++++++++++++++++
> >>> 5 files changed, 134 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> >>> index da7e943..71fffe0 100644
> >>> --- a/arch/powerpc/include/asm/mmu_context.h
> >>> +++ b/arch/powerpc/include/asm/mmu_context.h
> >>> @@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
> >>> {
> >>> }
> >>>
> >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> >>> +bool arch_pte_access_permitted(pte_t pte, bool write);
> >>> +bool arch_vma_access_permitted(struct vm_area_struct *vma,
> >>> + bool write, bool execute, bool foreign);
> >>> +#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> >>> +static inline bool arch_pte_access_permitted(pte_t pte, bool write)
> >>> +{
> >>> + /* by default, allow everything */
> >>> + return true;
> >>> +}
> >>> static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
> >>> bool write, bool execute, bool foreign)
> >>> {
> >>> /* by default, allow everything */
> >>> return true;
> >>> }
> >>
> >> Right, these are the two functions the core VM expects the
> >> arch to provide.
> >>
> >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> >>> +
> >>> #endif /* __KERNEL__ */
> >>> #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
> >>> diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
> >>> index 9b6820d..405e7db 100644
> >>> --- a/arch/powerpc/include/asm/pkeys.h
> >>> +++ b/arch/powerpc/include/asm/pkeys.h
> >>> @@ -14,6 +14,15 @@
> >>> VM_PKEY_BIT3 | \
> >>> VM_PKEY_BIT4)
> >>>
> >>> +static inline u16 pte_flags_to_pkey(unsigned long pte_flags)
> >>> +{
> >>> + return ((pte_flags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0) |
> >>> + ((pte_flags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0) |
> >>> + ((pte_flags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0) |
> >>> + ((pte_flags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0) |
> >>> + ((pte_flags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0);
> >>> +}
> >>
> >> Add defines for the above 0x1, 0x2, 0x4, 0x8 etc ?
> >
> > hmm...not sure if it will make the code any better.
> >
> >>
> >>> +
> >>> #define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
> >>> ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) | \
> >>> ((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) | \
> >>> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> >>> index 2dcb8a1..a11977f 100644
> >>> --- a/arch/powerpc/include/asm/reg.h
> >>> +++ b/arch/powerpc/include/asm/reg.h
> >>> @@ -285,9 +285,10 @@
> >>> #define DSISR_UNSUPP_MMU 0x00080000 /* Unsupported MMU config */
> >>> #define DSISR_SET_RC 0x00040000 /* Failed setting of R/C bits */
> >>> #define DSISR_PGDIRFAULT 0x00020000 /* Fault on page directory */
> >>> -#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
> >>> - DSISR_PAGEATTR_CONFLT | \
> >>> - DSISR_BADACCESS | \
> >>> +#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
> >>> + DSISR_PAGEATTR_CONFLT | \
> >>> + DSISR_BADACCESS | \
> >>> + DSISR_KEYFAULT | \
> >>> DSISR_BIT43)
> >>
> >> This should have been cleaned up before adding new
> >> DSISR_KEYFAULT reason code into it. But I guess its
> >> okay.
> >>
> >>> #define SPRN_TBRL 0x10C /* Time Base Read Lower Register (user, R/O) */
> >>> #define SPRN_TBRU 0x10D /* Time Base Read Upper Register (user, R/O) */
> >>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> >>> index 3a7d580..c31624f 100644
> >>> --- a/arch/powerpc/mm/fault.c
> >>> +++ b/arch/powerpc/mm/fault.c
> >>> @@ -216,9 +216,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
> >>> * bits we are interested in. But there are some bits which
> >>> * indicate errors in DSISR but can validly be set in SRR1.
> >>> */
> >>> - if (trap == 0x400)
> >>> + if (trap == 0x400) {
> >>> error_code &= 0x48200000;
> >>> - else
> >>> + flags |= FAULT_FLAG_INSTRUCTION;
> >>> + } else
> >>> is_write = error_code & DSISR_ISSTORE;
> >>> #else
> >>
> >> Why adding the FAULT_FLAG_INSTRUCTION here ?
> >
> > later in this code, this flag is checked to see if execute-protection was
> > violated.
>
> 'is_exec' which is set for every 400 interrupt can be used for that
> purpose ? I guess thats how we have been dealing with generic PROT_EXEC
> based faults.
>
This is right. Thanks for pointing it out. Yes 'is_exec' is sufficient to
achieve the purpose.
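
In other words, the later check in do_page_fault() can simply take is_exec
(a sketch of the agreed change; the flags |= FAULT_FLAG_INSTRUCTION hunk in
the 0x400 branch can then be dropped):

	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
				       is_exec, 0)) {
		code = SEGV_PKUERR;
		goto bad_area;
	}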

> >>
> >>> is_write = error_code & ESR_DST;
> >>> @@ -261,6 +262,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
> >>> }

> >>> #endif
> >>>
> >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> >>> + if (error_code & DSISR_KEYFAULT) {
> >>> + code = SEGV_PKUERR;
> >>> + goto bad_area_nosemaphore;
> >>> + }
> >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> >>> +
> >>> /* We restore the interrupt state now */
> >>> if (!arch_irq_disabled_regs(regs))
> >>> local_irq_enable();
> >>> @@ -441,6 +449,15 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
> >>> WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
> >>> #endif /* CONFIG_PPC_STD_MMU */
> >>>
> >>> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> >>> + if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> >>> + flags & FAULT_FLAG_INSTRUCTION,
> >>> + 0)) {
> >>> + code = SEGV_PKUERR;
> >>> + goto bad_area;
> >>> + }
> >>> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> >>> +
> >>
> >> I am wondering why both the above checks are required ?
> >
> > Yes good question. there are two cases here.
> >
> > a) when a hpte is not yet hashed to pte.
> >
> > in this case the fault is because the hpte is not yet mapped.
> > However the access may have also violated the protection
> > permissions of the key associated with that address. So we need
>
> Both of these cannot be possible simultaneously. In this case
> MMU will take a fault because of no HPTE is found for the access
> not for the protection key irrespective of the pkey value and type
> of the access. Are you saying that DSISR might have both DSISR_NOHPTE
> and DSISR_KEYFAULT set for this case ? If not its not a good idea
> to present SEGV_PKUERR as reason code during signal delivery.

Both DSISR_NOHPTE and DSISR_KEYFAULT cannot be set simultaneously;
an HPTE needs to exist before a key can be programmed into it.

However, it is still a key violation if the fault was a DSISR_NOHPTE
and the faulting address has a key in the vma that is violated. There
is a violation, and it needs to be reported as SEGV_PKUERR. The
hardware may not have detected it, but software is still responsible
for detecting and reporting it.
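
Roughly, the software side of that check has to do something like the
following (a minimal sketch using the names from this series; the
execute/IAMR side is omitted and the real code may differ):

/* Sketch: vma-based pkey check for the case where no HPTE existed yet,
 * so the hardware never had a chance to raise DSISR_KEYFAULT. */
static bool sketch_pkey_access_permitted(struct vm_area_struct *vma, bool write)
{
	int pkey = vma_pkey(vma);
	int pkey_shift = (arch_max_pkey() - pkey - 1) * AMR_BITS_PER_PKEY;
	u64 amr = read_amr();

	if (amr & (AMR_AD_BIT << pkey_shift))			/* access disabled */
		return false;
	if (write && (amr & (AMR_WD_BIT << pkey_shift)))	/* write disabled */
		return false;
	return true;
}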


>
> > to a software check to determine if a key was violated.
> >
> > if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,...
> >
> > handles this case.
> >
> >
> > b) when the hpte is hashed to the pte and keys are programmed into
> > the hpte.
> >
> > in this case the hardware senses the key protection fault
> > and we just have to check if that is the case.
> >
> > if (error_code & DSISR_KEYFAULT) {....
> >
> > handles this case.
>
> This is correct.
>
> >
> >
> >>
> >> * DSISR should contains DSISR_KEYFAULT
> >>
> >> * VMA pkey values whether they matched the fault cause
> >>
> >>
> >>> /*
> >>> * If for any reason at all we couldn't handle the fault,
> >>> * make sure we exit gracefully rather than endlessly redo
> >>> diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
> >>> index 11a32b3..439241a 100644
> >>> --- a/arch/powerpc/mm/pkeys.c
> >>> +++ b/arch/powerpc/mm/pkeys.c
> >>> @@ -27,6 +27,37 @@ static inline bool pkey_allows_readwrite(int pkey)
> >>> return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
> >>> }
> >>>
> >>> +static inline bool pkey_allows_read(int pkey)
> >>> +{
> >>> + int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
> >>> +
> >>> + if (!(read_uamor() & (0x3ul << pkey_shift)))
> >>> + return true;
> >>> +
> >>> + return !(read_amr() & (AMR_AD_BIT << pkey_shift));
> >>> +}
> >>
> >> Get read_amr() into a local variable and save some cycles if we
> >> have to do it again.
> >
> > No. not really. the AMR can be changed by the process in userspace. So anything
> > that we cache can go stale.
> > Or maybe i do not understand your comment.
>
> I am not saying to cache the value. Just inside the function, if we have
> a local variable holding read_amr() once, it can be used twice without
> reading the register again. Just inside the function.

Oh! Maybe you are reading read_uamor() and read_amr() as the same thing?
They are two different registers.
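
For what it's worth, a small worked example of the indexing the two reads
share (assuming arch_max_pkey() == 32 and AMR_BITS_PER_PKEY == 2, as
elsewhere in the series):

/*
 * For pkey 2:
 *   pkey_shift = (arch_max_pkey() - pkey - 1) * AMR_BITS_PER_PKEY
 *              = (32 - 2 - 1) * 2 = 58
 *
 * pkey_allows_read() first tests the two UAMOR bits at that shift
 * (0x3ul << 58) to see whether the key is under user control at all,
 * and only then the AD bit at the same shift in the AMR -- one read
 * of each SPR, nothing that could be cached across the two.
 */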

RP


--
Ram Pai

2017-06-21 06:35:10

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

On Wed, Jun 21, 2017 at 11:05:33AM +0530, Anshuman Khandual wrote:
> On 06/21/2017 04:53 AM, Ram Pai wrote:
> > On Tue, Jun 20, 2017 at 03:50:25PM +0530, Anshuman Khandual wrote:
> >> On 06/17/2017 09:22 AM, Ram Pai wrote:
> >>> Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
> >>> in the 4K backed hpte pages. These bits continue to be used
> >>> for 64K backed hpte pages in this patch, but will be freed
> >>> up in the next patch.
> >>
> >> The counting 3, 4, 5 and 6 are in BE format I believe, I was
> >> initially trying to see that from right to left as we normally
> >> do in the kernel and was getting confused. So basically these
> >> bits (which are only applicable for 64K mapping IIUC) are going
> >> to be freed up from the PTE format.
> >>
> >> #define _RPAGE_RSV1 0x1000000000000000UL
> >> #define _RPAGE_RSV2 0x0800000000000000UL
> >> #define _RPAGE_RSV3 0x0400000000000000UL
> >> #define _RPAGE_RSV4 0x0200000000000000UL
> >>
> >> As you have mentioned before this feature is available for 64K
> >> page size only and not for 4K mappings. So I assume we support
> >> both the combinations.
> >>
> >> * 64K mapping on 64K
> >> * 64K mapping on 4K
> >
> > yes.
> >
> >>
> >> These are the current users of the above bits
> >>
> >> #define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> >> #define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> >> #define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> >> #define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> >>
> >>>
> >>> The patch does the following change to the 64K PTE format
> >>>
> >>> H_PAGE_BUSY moves from bit 3 to bit 9
> >>
> >> and what is in there on bit 9 now ? This ?
> >>
> >> #define _RPAGE_SW2 0x00400
> >>
> >> which is used as
> >>
> >> #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */
> >>
> >> which will not be required any more ?
> >
> > i think you are reading bit 9 from right to left. the bit 9 i refer to
> > is from left to right. Using the same numbering convention the ISA3.0 uses.
>
> Right, my bad. Then it would be this one.
>
> '#define _RPAGE_RPN42 0x0040000000000000UL'
>
> > I know it is confusing, will make a mention in the comment of this
> > patch, to read it the big-endian way.
>
> Right.
>
> >
> > BTW: Bit 9 is not used currently. so using it in this patch. But this is
> > a temporary move. the H_PAGE_BUSY will move to bit 7 in the next patch.
> >
> > Had to keep at bit 9, because bit 7 is not yet entirely freed up. it is
> > used by 64K PTE backed by 64k htpe.
>
> Got it.
>
> >
> >>
> >>> H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> >>> of the pte.
> >>> H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the
> >>> second part of the pte.
> >>>
> >>> the four bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> >>> is initialized to 0xF indicating an invalid slot. If a hpte
> >>> gets cached in a 0xF slot(i.e 7th slot of secondary), it is
> >>> released immediately. In other words, even though 0xF is a
> >>
> >> Release immediately means we attempt again for a new hash slot ?
> >
> > yes.
> >
> >>
> >>> valid slot we discard and consider it as an invalid
> >>> slot;i.e hpte_soft_invalid(). This gives us an opportunity to not
> >>> depend on a bit in the primary PTE in order to determine the
> >>> validity of a slot.
> >>
> >> So we have to see the slot number in the second half for each PTE to
> >> figure out if it has got a valid slot in the hash page table.
> >
> > yes.
> >
> >>
> >>>
> >>> When we release a hpte in the 0xF slot we also release a
> >>> legitimate primary slot and unmap that entry. This is to
> >>> ensure that we do get a legimate non-0xF slot the next time we
> >>> retry for a slot.
> >>
> >> Okay.
> >>
> >>>
> >>> Though treating 0xF slot as invalid reduces the number of available
> >>> slots and may have an effect on the performance, the probabilty
> >>> of hitting a 0xF is extermely low.
> >>
> >> Why you say that ? I thought every slot number has the same probability
> >> of hit from the hash function.
> >
> > Every hash bucket has the same probability. But every slot within the
> > hash bucket is filled in sequentially. so it takes 15 hptes to hash to
> > the same bucket before we get to the 15th slot in the secondary.
>
> Okay, would the last one be 16th instead ?
>
> >
> >>
> >>>
> >>> Compared to the current scheme, the above described scheme reduces
> >>> the number of false hash table updates significantly and has the
> >>
> >> How it reduces false hash table updates ?
> >
> > earlier, we had 1 bit allocated in the first-part-of-the 64K-PTE
> > for four consecutive 4K hptes. If any one 4k hpte got hashed-in,
> > the bit got set. Which means anytime it faulted on the remaining
> > three 4k hpte, we saw the bit already set and tried to erroneously
> > update that hpte. So we had a 75% update error rate. Funcationally
> > not bad, but bad from a performance point of view.
>
> I am bit out of sync regarding these PTE bits, after Aneesh's radix
> changes went in :) Will look into this bit closer.
>
> >
> > With the current scheme, we decide if a 4k slot is valid by looking
> > at its value rather than depending on a bit in the main-pte. So
> > there is no chance of getting mislead. And hence no chance of trying
> > to update a invalid hpte. Should improve performance and at the same
> > time give us four valuable PTE bits.
>
> I am not sure why you say 'invalid hpte'. IIUC

I mean to say an entry which does not yet have a mapped hpte.

>
> * We will require 16 '64K on 4K' mappings to actually cover 64K on 64K
>
> * A single (64K on 4K)'s TLB can cover 64K on 64K as long as the TLB is
> present and not flushed. That gets us performance. Once flushed, a new
> HPTE entry covering new (64K on 4K) is inserted. As long as the PFN
> for the 4K is different HPTE will be different and it cannot collide
> with any existing ones and create problems (ERAT error ?)
>
> As you are pointing out, I am not sure whether the existing design had
> more probability for an invalid HPTE insert. Will look into this in
> detail.
>
> >
> >
> >>
> >>> added advantage of releasing four valuable PTE bits for other
> >>> purpose.
> >>>
> >>> This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> >>> Ellermen and myself.
> >>>
> >>> 4K PTE format remain unchanged currently.
> >>>
> >>> Signed-off-by: Ram Pai <[email protected]>
> >>> ---
> >>> arch/powerpc/include/asm/book3s/64/hash-4k.h | 20 +++++++
> >>> arch/powerpc/include/asm/book3s/64/hash-64k.h | 32 +++++++----
> >>> arch/powerpc/include/asm/book3s/64/hash.h | 15 +++--
> >>> arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 ++
> >>> arch/powerpc/mm/dump_linuxpagetables.c | 3 +-
> >>> arch/powerpc/mm/hash64_4k.c | 14 ++---
> >>> arch/powerpc/mm/hash64_64k.c | 81 ++++++++++++---------------
> >>> arch/powerpc/mm/hash_utils_64.c | 30 +++++++---
> >>> 8 files changed, 122 insertions(+), 78 deletions(-)
> >>>
> >>> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> >>> index b4b5e6b..5ef1d81 100644
> >>> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> >>> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> >>> @@ -16,6 +16,18 @@
> >>> #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
> >>> #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
> >>>
> >>> +
> >>> +/*
> >>> + * Only supported by 4k linux page size
> >>> + */
> >>> +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> >>> +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> >>> +#define H_PAGE_F_GIX_SHIFT 56
> >>> +
> >>> +#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> >>> +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> >>> +
> >>> +
> >>
> >> So we moved the common 64K definitions here.
> >
> > yes.
> >>
> >>
> >>> /* PTE flags to conserve for HPTE identification */
> >>> #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
> >>> H_PAGE_F_SECOND | H_PAGE_F_GIX)
> >>> @@ -48,6 +60,14 @@ static inline int hash__hugepd_ok(hugepd_t hpd)
> >>> }
> >>> #endif
> >>>
> >>> +static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> >>> + unsigned int subpg_index, unsigned long slot)
> >>> +{
> >>> + return (slot << H_PAGE_F_GIX_SHIFT) &
> >>> + (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> >>> +}
> >>
> >> Why we are passing the first 3 arguments of the function if we never
> >> use it inside. Is the caller expected to take care of it ?
> >
> > trying to keep the same prototype for the 4K-pte and 64K-pte cases.
> > Otherwise the caller has to wonder which parameter scheme to use.
> >
> >>
> >>> +
> >>> +
> >>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>
> >>> static inline char *get_hpte_slot_array(pmd_t *pmdp)
> >>> diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> >>> index 9732837..0eb3c89 100644
> >>> --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> >>> +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> >>> @@ -10,23 +10,25 @@
> >>> * 64k aligned address free up few of the lower bits of RPN for us
> >>> * We steal that here. For more deatils look at pte_pfn/pfn_pte()
> >>> */
> >>> -#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> >>> -#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
> >>> +#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> >>> +#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
> >>
> >> Its the same thing, changes nothing.
> >
> > it fixes some space/tab problem.
> >
> >>
> >>> +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> >>> +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> >>> +#define H_PAGE_F_GIX_SHIFT 56
> >>> +
> >>> +
> >>> +#define H_PAGE_BUSY _RPAGE_RPN42 /* software: PTE & hash are busy */
> >>> +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> >>
> >> H_PAGE_BUSY seems to be differently defined here.
> >
> > Yes. it is using two different bits depending on 4K hpte v/s 64k hpte
> > case. But in the next patch all will be same and consistent.
> >
> >>
> >>> +
> >>> /*
> >>> * We need to differentiate between explicit huge page and THP huge
> >>> * page, since THP huge page also need to track real subpage details
> >>> */
> >>> #define H_PAGE_THP_HUGE H_PAGE_4K_PFN
> >>>
> >>> -/*
> >>> - * Used to track subpage group valid if H_PAGE_COMBO is set
> >>> - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
> >>> - */
> >>> -#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
> >>
> >> H_PAGE_COMBO_VALID is not defined alternately ?
> >
> > it is not needed anymore.
> >
> >>
> >>> -
> >>> /* PTE flags to conserve for HPTE identification */
> >>> -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
> >>> - H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
> >>> +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO)
> >>> +
> >>
> >> Slot information has moved to the second half, hence _PAGE_HPTEFLAGS
> >> need not carry that.
> >
> > yes.
> >
> >>
> >>> /*
> >>> * we support 16 fragments per PTE page of 64K size.
> >>> */
> >>> @@ -74,6 +76,16 @@ static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
> >>> return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
> >>> }
> >>>
> >>> +static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> >>> + unsigned int subpg_index, unsigned long slot)
> >>> +{
> >>> + unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> >>> +
> >>> + rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> >>> + *hidxp = rpte.hidx | (slot << (subpg_index << 2));
> >>> + return 0x0UL;
> >>> +}
> >>
> >> New method to insert the slot information in the second half.
> >
> > yes. well it basically trying to reduce code redundancy. Too many places
> > using exactly the same code to accomplish the same thing. Makes sense to
> > bring it all in one place.
>
> Right.
>
> >
> >>
> >>> +
> >>> #define __rpte_to_pte(r) ((r).pte)
> >>> extern bool __rpte_sub_valid(real_pte_t rpte, unsigned long index);
> >>> /*
> >>> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> >>> index 4e957b0..e7cf03a 100644
> >>> --- a/arch/powerpc/include/asm/book3s/64/hash.h
> >>> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> >>> @@ -8,11 +8,8 @@
> >>> *
> >>> */
> >>> #define H_PTE_NONE_MASK _PAGE_HPTEFLAGS
> >>> -#define H_PAGE_F_GIX_SHIFT 56
> >>> -#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> >>> -#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> >>> -#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> >>> -#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> >>
> >> Removing the common definitions.
> >>
> >>> +
> >>> +#define INIT_HIDX (~0x0UL)
> >>>
> >>> #ifdef CONFIG_PPC_64K_PAGES
> >>> #include <asm/book3s/64/hash-64k.h>
> >>> @@ -160,6 +157,14 @@ static inline int hash__pte_none(pte_t pte)
> >>> return (pte_val(pte) & ~H_PTE_NONE_MASK) == 0;
> >>> }
> >>>
> >>> +static inline bool hpte_soft_invalid(unsigned long slot)
> >>> +{
> >>> + return ((slot & 0xfUL) == 0xfUL);
> >>> +}
> >>> +
> >>> +unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
> >>> + int ssize, real_pte_t rpte, unsigned int subpg_index);
> >>> +
> >>> /* This low level function performs the actual PTE insertion
> >>> * Setting the PTE depends on the MMU type and other factors. It's
> >>> * an horrible mess that I'm not going to try to clean up now but
> >>> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> >>> index 6981a52..cfb8169 100644
> >>> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> >>> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> >>> @@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
> >>> extern int __hash_page_64K(unsigned long ea, unsigned long access,
> >>> unsigned long vsid, pte_t *ptep, unsigned long trap,
> >>> unsigned long flags, int ssize);
> >>> +extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> >>> + unsigned int subpg_index, unsigned long slot);
> >>> +extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> >>> + int ssize, real_pte_t rpte, unsigned int subpg_index);
> >>
> >> I wonder what purpose set_hidx_slot() defined previously, served.
> >>
> >>> +
> >>> struct mm_struct;
> >>> unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
> >>> extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
> >>> diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
> >>> index 44fe483..b832ed3 100644
> >>> --- a/arch/powerpc/mm/dump_linuxpagetables.c
> >>> +++ b/arch/powerpc/mm/dump_linuxpagetables.c
> >>> @@ -213,7 +213,7 @@ struct flag_info {
> >>> .val = H_PAGE_4K_PFN,
> >>> .set = "4K_pfn",
> >>> }, {
> >>> -#endif
> >>> +#else
> >>> .mask = H_PAGE_F_GIX,
> >>> .val = H_PAGE_F_GIX,
> >>> .set = "f_gix",
> >>> @@ -224,6 +224,7 @@ struct flag_info {
> >>> .val = H_PAGE_F_SECOND,
> >>> .set = "f_second",
> >>> }, {
> >>> +#endif /* CONFIG_PPC_64K_PAGES */
> >>
> >> Are we adding H_PAGE_F_GIX as an element for 4K mapping ?
> >
> > I think there is mistake here.
> > In the next patch when these bits are divorsed from
> > 64K ptes entirely, we will not need the above code for 64K ptes.
> > But good catch. Will fix the error in this patch.
> >
> >>
> >>> #endif
> >>> .mask = _PAGE_SPECIAL,
> >>> .val = _PAGE_SPECIAL,
> >>> diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
> >>> index 6fa450c..c673829 100644
> >>> --- a/arch/powerpc/mm/hash64_4k.c
> >>> +++ b/arch/powerpc/mm/hash64_4k.c
> >>> @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >>> pte_t *ptep, unsigned long trap, unsigned long flags,
> >>> int ssize, int subpg_prot)
> >>> {
> >>> + real_pte_t rpte;
> >>> unsigned long hpte_group;
> >>> unsigned long rflags, pa;
> >>> unsigned long old_pte, new_pte;
> >>> @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >>> * need to add in 0x1 if it's a read-only user page
> >>> */
> >>> rflags = htab_convert_pte_flags(new_pte);
> >>> + rpte = __real_pte(__pte(old_pte), ptep);
> >>>
> >>> if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
> >>> !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> >>> @@ -64,13 +66,10 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >>> /*
> >>> * There MIGHT be an HPTE for this pte
> >>> */
> >>> - hash = hpt_hash(vpn, shift, ssize);
> >>> - if (old_pte & H_PAGE_F_SECOND)
> >>> - hash = ~hash;
> >>> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> >>> - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> >>> + unsigned long gslot = get_hidx_gslot(vpn, shift,
> >>> + ssize, rpte, 0);
> >>
> >> I am wondering why there is a 'g' before the slot in all these
> >> functions.
> >
> > Right. even i was confused initially. :)
> >
> > hash table slots are originized as one big table. 8 consecutive entires
> > in that table form a bucket. the term slot is used to refer to the
> > slot within the bucket. the term gslot is used to refer to an entry
> > in the table. roughly speaking slot 2 in bucket 2, will be gslot 2*8+2=18.
>
> Global slot as it can point any where on that two dimensional table ?
>
> >
> >>
> >> Its already too much of changes in a single patch. Being a single
> >> logical change it needs to be inside a single change but then we
> >> need much more description in the commit message for some one to
> >> understand what all changed and how.
> >
> > I have further broken down this patch, one to introduce get_hidx_gslot()
> > one to introduce set_hidx_slot() . Hopefully that will reduce the size
> > of the patch to graspable level. let me know,
>
> I did some experiments with the first two patches.
>
> * First of all the first patch does not compile without this.

It's a warning that a variable is defined but not used. I have fixed it
in my new patch series, which will be posted soon.

>
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -1612,7 +1612,7 @@ unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
> void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, int ssize,
> unsigned long flags)
> {
> - unsigned long hash, index, shift, hidx, gslot;
> + unsigned long index, shift, gslot;
> int local = flags & HPTE_LOCAL_UPDATE;
>
> DBG_LOW("flush_hash_page(vpn=%016lx)\n", vpn);
>
> * Though it boots the kernel, system is kind of unresponsive while attempting
> to compile a kernel. Though I did not dig further on this, seems like the
> first patch is not self sufficient yet.

I wouldn't have broken the patch into two, because there is too much
coupling between the two. But Aneesh wanted it that way, and it does make
sense to break it up from a review point of view.

>
> * With both first and second patch, the kernel boots fine and compiles a kernel.

Yes, that meets my expectation.

>
> We need to sort out issues in the first two patches before looking into
> the rest of the patch series.

I am not aware of any issues in the first two patches though. Do you see
any?

RP

--
Ram Pai

2017-06-21 06:41:57

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

Ram Pai <[email protected]> writes:

> Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
> in the 4K backed hpte pages. These bits continue to be used
> for 64K backed hpte pages in this patch, but will be freed
> up in the next patch.
>
> The patch does the following change to the 64K PTE format
>
> H_PAGE_BUSY moves from bit 3 to bit 9
> H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> of the pte.
> H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the
> second part of the pte.
>
> the four bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> is initialized to 0xF indicating an invalid slot. If a hpte
> gets cached in a 0xF slot(i.e 7th slot of secondary), it is
> released immediately. In other words, even though 0xF is a
> valid slot we discard and consider it as an invalid
> slot;i.e hpte_soft_invalid(). This gives us an opportunity to not
> depend on a bit in the primary PTE in order to determine the
> validity of a slot.
>
> When we release a hpte in the 0xF slot we also release a
> legitimate primary slot and unmap that entry. This is to
> ensure that we do get a legimate non-0xF slot the next time we
> retry for a slot.
>
> Though treating 0xF slot as invalid reduces the number of available
> slots and may have an effect on the performance, the probabilty
> of hitting a 0xF is extermely low.
>
> Compared to the current scheme, the above described scheme reduces
> the number of false hash table updates significantly and has the
> added advantage of releasing four valuable PTE bits for other
> purpose.
>
> This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> Ellermen and myself.
>
> 4K PTE format remain unchanged currently.
>
> Signed-off-by: Ram Pai <[email protected]>
> ---
> arch/powerpc/include/asm/book3s/64/hash-4k.h | 20 +++++++
> arch/powerpc/include/asm/book3s/64/hash-64k.h | 32 +++++++----
> arch/powerpc/include/asm/book3s/64/hash.h | 15 +++--
> arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 ++
> arch/powerpc/mm/dump_linuxpagetables.c | 3 +-
> arch/powerpc/mm/hash64_4k.c | 14 ++---
> arch/powerpc/mm/hash64_64k.c | 81 ++++++++++++---------------
> arch/powerpc/mm/hash_utils_64.c | 30 +++++++---
> 8 files changed, 122 insertions(+), 78 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> index b4b5e6b..5ef1d81 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> @@ -16,6 +16,18 @@
> #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
> #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
>
> +
> +/*
> + * Only supported by 4k linux page size
> + */
> +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> +#define H_PAGE_F_GIX_SHIFT 56
> +
> +#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> +
> +
> /* PTE flags to conserve for HPTE identification */
> #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
> H_PAGE_F_SECOND | H_PAGE_F_GIX)
> @@ -48,6 +60,14 @@ static inline int hash__hugepd_ok(hugepd_t hpd)
> }
> #endif
>
> +static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> + unsigned int subpg_index, unsigned long slot)
> +{
> + return (slot << H_PAGE_F_GIX_SHIFT) &
> + (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> +}
> +
> +
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
> static inline char *get_hpte_slot_array(pmd_t *pmdp)
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> index 9732837..0eb3c89 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> @@ -10,23 +10,25 @@
> * 64k aligned address free up few of the lower bits of RPN for us
> * We steal that here. For more deatils look at pte_pfn/pfn_pte()
> */
> -#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> -#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
> +#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> +#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
> +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> +#define H_PAGE_F_GIX_SHIFT 56
> +
> +
> +#define H_PAGE_BUSY _RPAGE_RPN42 /* software: PTE & hash are busy */
> +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> +
> /*
> * We need to differentiate between explicit huge page and THP huge
> * page, since THP huge page also need to track real subpage details
> */
> #define H_PAGE_THP_HUGE H_PAGE_4K_PFN
>
> -/*
> - * Used to track subpage group valid if H_PAGE_COMBO is set
> - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
> - */
> -#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
> -
> /* PTE flags to conserve for HPTE identification */
> -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
> - H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
> +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO)

Why in this patch? This is related to the 64K PTE.


> +
> /*
> * we support 16 fragments per PTE page of 64K size.
> */
> @@ -74,6 +76,16 @@ static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
> return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
> }
>
> +static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> + unsigned int subpg_index, unsigned long slot)
> +{
> + unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> +
> + rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> + *hidxp = rpte.hidx | (slot << (subpg_index << 2));
> + return 0x0UL;
> +}
> +
> #define __rpte_to_pte(r) ((r).pte)
> extern bool __rpte_sub_valid(real_pte_t rpte, unsigned long index);
> /*
> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> index 4e957b0..e7cf03a 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> @@ -8,11 +8,8 @@
> *
> */
> #define H_PTE_NONE_MASK _PAGE_HPTEFLAGS
> -#define H_PAGE_F_GIX_SHIFT 56
> -#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> -#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> -#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> -#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> +
> +#define INIT_HIDX (~0x0UL)
>
> #ifdef CONFIG_PPC_64K_PAGES
> #include <asm/book3s/64/hash-64k.h>
> @@ -160,6 +157,14 @@ static inline int hash__pte_none(pte_t pte)
> return (pte_val(pte) & ~H_PTE_NONE_MASK) == 0;
> }
>
> +static inline bool hpte_soft_invalid(unsigned long slot)
> +{
> + return ((slot & 0xfUL) == 0xfUL);
> +}
> +
> +unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
> + int ssize, real_pte_t rpte, unsigned int subpg_index);
> +
> /* This low level function performs the actual PTE insertion
> * Setting the PTE depends on the MMU type and other factors. It's
> * an horrible mess that I'm not going to try to clean up now but
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> index 6981a52..cfb8169 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> @@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
> extern int __hash_page_64K(unsigned long ea, unsigned long access,
> unsigned long vsid, pte_t *ptep, unsigned long trap,
> unsigned long flags, int ssize);
> +extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> + unsigned int subpg_index, unsigned long slot);
> +extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> + int ssize, real_pte_t rpte, unsigned int subpg_index);
> +
> struct mm_struct;
> unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
> extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
> diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
> index 44fe483..b832ed3 100644
> --- a/arch/powerpc/mm/dump_linuxpagetables.c
> +++ b/arch/powerpc/mm/dump_linuxpagetables.c
> @@ -213,7 +213,7 @@ struct flag_info {
> .val = H_PAGE_4K_PFN,
> .set = "4K_pfn",
> }, {
> -#endif
> +#else
> .mask = H_PAGE_F_GIX,
> .val = H_PAGE_F_GIX,
> .set = "f_gix",
> @@ -224,6 +224,7 @@ struct flag_info {
> .val = H_PAGE_F_SECOND,
> .set = "f_second",
> }, {
> +#endif /* CONFIG_PPC_64K_PAGES */
> #endif
> .mask = _PAGE_SPECIAL,
> .val = _PAGE_SPECIAL,
> diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
> index 6fa450c..c673829 100644
> --- a/arch/powerpc/mm/hash64_4k.c
> +++ b/arch/powerpc/mm/hash64_4k.c
> @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> pte_t *ptep, unsigned long trap, unsigned long flags,
> int ssize, int subpg_prot)
> {
> + real_pte_t rpte;
> unsigned long hpte_group;
> unsigned long rflags, pa;
> unsigned long old_pte, new_pte;
> @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> * need to add in 0x1 if it's a read-only user page
> */
> rflags = htab_convert_pte_flags(new_pte);
> + rpte = __real_pte(__pte(old_pte), ptep);
>
> if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
> !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> @@ -64,13 +66,10 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> /*
> * There MIGHT be an HPTE for this pte
> */
> - hash = hpt_hash(vpn, shift, ssize);
> - if (old_pte & H_PAGE_F_SECOND)
> - hash = ~hash;
> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> + unsigned long gslot = get_hidx_gslot(vpn, shift,
> + ssize, rpte, 0);
>
> - if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_4K,
> + if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, MMU_PAGE_4K,
> MMU_PAGE_4K, ssize, flags) == -1)
> old_pte &= ~_PAGE_HPTEFLAGS;
> }
> @@ -118,8 +117,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> return -1;
> }
> new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
> - new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> - (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> + new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
> }
> *ptep = __pte(new_pte & ~H_PAGE_BUSY);
> return 0;



None of the above changes are needed. We are not changing anything w.r.t.
the 4K Linux page table yet, so we can drop this.


> diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
> index 1a68cb1..3702a3c 100644
> --- a/arch/powerpc/mm/hash64_64k.c
> +++ b/arch/powerpc/mm/hash64_64k.c
> @@ -15,34 +15,13 @@
> #include <linux/mm.h>
> #include <asm/machdep.h>
> #include <asm/mmu.h>
> +
> /*
> * index from 0 - 15
> */
> bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
> {
> - unsigned long g_idx;
> - unsigned long ptev = pte_val(rpte.pte);
> -
> - g_idx = (ptev & H_PAGE_COMBO_VALID) >> H_PAGE_F_GIX_SHIFT;
> - index = index >> 2;
> - if (g_idx & (0x1 << index))
> - return true;
> - else
> - return false;
> -}
> -/*
> - * index from 0 - 15
> - */
> -static unsigned long mark_subptegroup_valid(unsigned long ptev, unsigned long index)
> -{
> - unsigned long g_idx;
> -
> - if (!(ptev & H_PAGE_COMBO))
> - return ptev;
> - index = index >> 2;
> - g_idx = 0x1 << index;
> -
> - return ptev | (g_idx << H_PAGE_F_GIX_SHIFT);
> + return !(hpte_soft_invalid(rpte.hidx >> (index << 2)));
> }
>
> int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> @@ -50,10 +29,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> int ssize, int subpg_prot)
> {
> real_pte_t rpte;
> - unsigned long *hidxp;
> unsigned long hpte_group;
> unsigned int subpg_index;
> - unsigned long rflags, pa, hidx;
> + unsigned long rflags, pa;
> unsigned long old_pte, new_pte, subpg_pte;
> unsigned long vpn, hash, slot;
> unsigned long shift = mmu_psize_defs[MMU_PAGE_4K].shift;
> @@ -116,28 +94,23 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> * On hash insert failure we use old pte value and we don't
> * want slot information there if we have a insert failure.
> */
> - old_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
> - new_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
> + old_pte &= ~(H_PAGE_HASHPTE);
> + new_pte &= ~(H_PAGE_HASHPTE);
> goto htab_insert_hpte;
> }
> /*
> * Check for sub page valid and update
> */
> if (__rpte_sub_valid(rpte, subpg_index)) {
> - int ret;
>
> - hash = hpt_hash(vpn, shift, ssize);
> - hidx = __rpte_to_hidx(rpte, subpg_index);
> - if (hidx & _PTEIDX_SECONDARY)
> - hash = ~hash;
> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> - slot += hidx & _PTEIDX_GROUP_IX;
> + unsigned long gslot = get_hidx_gslot(vpn, shift,
> + ssize, rpte, subpg_index);


Converting that to a helper is also not needed in this patch. Leave it as
it is; that is much easier to review.


>
> - ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
> + int ret = mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn,
> MMU_PAGE_4K, MMU_PAGE_4K,
> ssize, flags);
> /*
> - *if we failed because typically the HPTE wasn't really here
> + * if we failed because typically the HPTE wasn't really here
> * we try an insertion.
> */
> if (ret == -1)
> @@ -148,6 +121,15 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> }
>
> htab_insert_hpte:
> +
> + /*
> + * initialize all hidx entries to a invalid value,
> + * the first time the PTE is about to allocate
> + * a 4K hpte
> + */
> + if (!(old_pte & H_PAGE_COMBO))
> + rpte.hidx = INIT_HIDX;
> +
> /*
> * handle H_PAGE_4K_PFN case
> */
> @@ -177,10 +159,20 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> rflags, HPTE_V_SECONDARY,
> MMU_PAGE_4K, MMU_PAGE_4K,
> ssize);
> - if (slot == -1) {
> - if (mftb() & 0x1)
> +
> + if (unlikely(hpte_soft_invalid(slot))) {

Should we name that hpte_slot_invalid() ? ie. s/soft/slot/ ?


> + slot = slot & _PTEIDX_GROUP_IX;
> + mmu_hash_ops.hpte_invalidate(hpte_group+slot, vpn,
> + MMU_PAGE_4K, MMU_PAGE_4K,
> + ssize, flags);

What is the last arg, flags, here? I guess we need to pass 0 there?
We can't do a local = 1 invalidate, because we don't know whether
anybody really accessed this address in between and already has the
entry in the TLB.
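
i.e. something like this (a sketch of the call being suggested, with a
literal 0 so the invalidate is always global):

	/* 0xF landed in the reserved slot: evict it again, and make the
	 * invalidate global because another CPU may already have the
	 * translation in its TLB. */
	slot &= _PTEIDX_GROUP_IX;
	mmu_hash_ops.hpte_invalidate(hpte_group + slot, vpn,
				     MMU_PAGE_4K, MMU_PAGE_4K,
				     ssize, 0);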


> + }
> +
> + if (unlikely(slot == -1 || hpte_soft_invalid(slot))) {
> +

Can you add a comment explaining that an invalid slot always results in
a removal from the primary group? Also, do we want to store the result of
that invalid-slot check in a variable instead of evaluating the conditional
again and again? This is a hot path.
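
Roughly along these lines, perhaps (a sketch of the restructuring, reusing
the calls from the hunk above; not the final code):

	bool soft_invalid = hpte_soft_invalid(slot);

	if (unlikely(soft_invalid)) {
		/* evict the entry that landed in the reserved 0xF slot */
		mmu_hash_ops.hpte_invalidate(hpte_group +
					     (slot & _PTEIDX_GROUP_IX),
					     vpn, MMU_PAGE_4K, MMU_PAGE_4K,
					     ssize, 0);
	}

	if (unlikely(slot == -1 || soft_invalid)) {
		/* an invalid slot always forces a removal from the
		 * primary group before we retry the insert */
		if (soft_invalid || (mftb() & 0x1))
			hpte_group = ((hash & htab_hash_mask) *
				      HPTES_PER_GROUP) & ~0x7UL;
		mmu_hash_ops.hpte_remove(hpte_group);
	}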

> + if (hpte_soft_invalid(slot) || (mftb() & 0x1))
> hpte_group = ((hash & htab_hash_mask) *
> HPTES_PER_GROUP) & ~0x7UL;
> +
> mmu_hash_ops.hpte_remove(hpte_group);
> /*
> * FIXME!! Should be try the group from which we removed ?
> @@ -204,11 +196,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> * Since we have H_PAGE_BUSY set on ptep, we can be sure
> * nobody is undating hidx.
> */
> - hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> - rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> - *hidxp = rpte.hidx | (slot << (subpg_index << 2));
> - new_pte = mark_subptegroup_valid(new_pte, subpg_index);
> - new_pte |= H_PAGE_HASHPTE;
> + new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot);
> + new_pte |= H_PAGE_HASHPTE;
> +
> /*
> * check __real_pte for details on matching smp_rmb()
> */
> @@ -322,9 +312,10 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> MMU_PAGE_64K, MMU_PAGE_64K, old_pte);
> return -1;
> }
> - new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
> +
> new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> - (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> + (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> + new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;

What is this change? I guess we want this in the second patch?


> }
> *ptep = __pte(new_pte & ~H_PAGE_BUSY);
> return 0;
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index f2095ce..c0f4b46 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -975,8 +975,9 @@ void __init hash__early_init_devtree(void)
>
> void __init hash__early_init_mmu(void)
> {
> +#ifndef CONFIG_PPC_64K_PAGES
> /*
> - * We have code in __hash_page_64K() and elsewhere, which assumes it can
> + * We have code in __hash_page_4K() and elsewhere, which assumes it can
> * do the following:
> * new_pte |= (slot << H_PAGE_F_GIX_SHIFT) & (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> *
> @@ -987,6 +988,7 @@ void __init hash__early_init_mmu(void)
> * with a BUILD_BUG_ON().
> */
> BUILD_BUG_ON(H_PAGE_F_SECOND != (1ul << (H_PAGE_F_GIX_SHIFT + 3)));
> +#endif /* CONFIG_PPC_64K_PAGES */
>
> htab_init_page_sizes();
>
> @@ -1589,29 +1591,39 @@ static inline void tm_flush_hash_page(int local)
> }
> #endif
>
> +unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
> + int ssize, real_pte_t rpte, unsigned int subpg_index)
> +{
> + unsigned long hash, slot, hidx;
> +
> + hash = hpt_hash(vpn, shift, ssize);
> + hidx = __rpte_to_hidx(rpte, subpg_index);
> + if (hidx & _PTEIDX_SECONDARY)
> + hash = ~hash;
> + slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> + slot += hidx & _PTEIDX_GROUP_IX;
> + return slot;
> +}


We don't need this helper for this patch series ?

> +
> +
> /* WARNING: This is called from hash_low_64.S, if you change this prototype,
> * do not forget to update the assembly call site !
> */
> void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, int ssize,
> unsigned long flags)
> {
> - unsigned long hash, index, shift, hidx, slot;
> + unsigned long hash, index, shift, hidx, gslot;
> int local = flags & HPTE_LOCAL_UPDATE;
>
> DBG_LOW("flush_hash_page(vpn=%016lx)\n", vpn);
> pte_iterate_hashed_subpages(pte, psize, vpn, index, shift) {
> - hash = hpt_hash(vpn, shift, ssize);
> - hidx = __rpte_to_hidx(pte, index);
> - if (hidx & _PTEIDX_SECONDARY)
> - hash = ~hash;
> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> - slot += hidx & _PTEIDX_GROUP_IX;
> + gslot = get_hidx_gslot(vpn, shift, ssize, pte, index);
> DBG_LOW(" sub %ld: hash=%lx, hidx=%lx\n", index, slot, hidx);
> /*
> * We use same base page size and actual psize, because we don't
> * use these functions for hugepage
> */
> - mmu_hash_ops.hpte_invalidate(slot, vpn, psize, psize,
> + mmu_hash_ops.hpte_invalidate(gslot, vpn, psize, psize,
> ssize, local);
> } pte_iterate_hashed_end();
>
And if we avoid adding that helper, changes like this can be avoided in
the patch.


-aneesh

2017-06-21 06:51:26

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC v2 02/12] powerpc: Free up four 64K PTE bits in 64K backed hpte pages.

Ram Pai <[email protected]> writes:

> Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
> in the 64K backed hpte pages. This along with the earlier
> patch will entirely free up the four bits from 64K PTE.
>
> This patch does the following change to 64K PTE that is
> backed by 64K hpte.
>
> H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> of the pte.
> H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the
> second part of the pte.
>
> since bit 7 is now freed up, we move H_PAGE_BUSY from bit 9
> to bit 7. Trying to minimize gaps so that contiguous bits
> can be allocated if needed in the future.
>
> The second part of the PTE will hold
> (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63.


This patch will be really simple if you don't use the get_hidx_gslot() helper.
>
> Signed-off-by: Ram Pai <[email protected]>
> ---
> arch/powerpc/include/asm/book3s/64/hash-64k.h | 26 ++++++++------------------
> arch/powerpc/mm/hash64_64k.c | 16 +++++++---------
> arch/powerpc/mm/hugetlbpage-hash64.c | 16 ++++++----------
> 3 files changed, 21 insertions(+), 37 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> index 0eb3c89..2fa5c60 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> @@ -12,12 +12,8 @@
> */
> #define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> #define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
> -#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> -#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> -#define H_PAGE_F_GIX_SHIFT 56
>
> -
> -#define H_PAGE_BUSY _RPAGE_RPN42 /* software: PTE & hash are busy */
> +#define H_PAGE_BUSY _RPAGE_RPN44 /* software: PTE & hash are busy */
> #define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
>
> /*
> @@ -56,24 +52,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
> unsigned long *hidxp;
>
> rpte.pte = pte;
> - rpte.hidx = 0;
> - if (pte_val(pte) & H_PAGE_COMBO) {
> - /*
> - * Make sure we order the hidx load against the H_PAGE_COMBO
> - * check. The store side ordering is done in __hash_page_4K
> - */
> - smp_rmb();
> - hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> - rpte.hidx = *hidxp;
> - }
> + /*
> + * The store side ordering is done in __hash_page_4K
> + */


This is not just __hash_page_4K related any more, and you need to explain
the store-side ordering in more detail. Are we doing this correctly now?
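
For reference, the pairing in question is roughly the following (a sketch
only; the writer side sits in the __hash_page_*() paths once the slot is
known, and its exact placement in this series is what needs spelling out):

	/* writer side (hash fault path), sketch: */
	*hidxp = rpte.hidx | (slot << (subpg_index << 2));
	smp_wmb();				/* order the hidx store ... */
	*ptep = __pte(new_pte & ~H_PAGE_BUSY);	/* ... before publishing the PTE */

	/* reader side (__real_pte), sketch: */
	rpte.pte = pte;
	smp_rmb();				/* pairs with the smp_wmb() above */
	rpte.hidx = *(unsigned long *)(ptep + PTRS_PER_PTE);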

> + smp_rmb();
> + hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> + rpte.hidx = *hidxp;
> return rpte;
> }
>
> static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
> {
> - if ((pte_val(rpte.pte) & H_PAGE_COMBO))
> - return (rpte.hidx >> (index<<2)) & 0xf;
> - return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
> + return ((rpte.hidx >> (index<<2)) & 0xfUL);
> }
>
> static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
> index 3702a3c..1c25ec2 100644
> --- a/arch/powerpc/mm/hash64_64k.c
> +++ b/arch/powerpc/mm/hash64_64k.c
> @@ -211,6 +211,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> unsigned long vsid, pte_t *ptep, unsigned long trap,
> unsigned long flags, int ssize)
> {
> + real_pte_t rpte;
> unsigned long hpte_group;
> unsigned long rflags, pa;
> unsigned long old_pte, new_pte;
> @@ -247,6 +248,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> } while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
>
> rflags = htab_convert_pte_flags(new_pte);
> + rpte = __real_pte(__pte(old_pte), ptep);
>
> if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
> !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> @@ -254,16 +256,13 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
>
> vpn = hpt_vpn(ea, vsid, ssize);
> if (unlikely(old_pte & H_PAGE_HASHPTE)) {
> + unsigned long gslot;
> +
> /*
> * There MIGHT be an HPTE for this pte
> */
> - hash = hpt_hash(vpn, shift, ssize);
> - if (old_pte & H_PAGE_F_SECOND)
> - hash = ~hash;
> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> -
> - if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
> + gslot = get_hidx_gslot(vpn, shift, ssize, rpte, 0);
> + if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, MMU_PAGE_64K,
> MMU_PAGE_64K, ssize,
> flags) == -1)
> old_pte &= ~_PAGE_HPTEFLAGS;
> @@ -313,8 +312,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> return -1;
> }
>
> - new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> - (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> + set_hidx_slot(ptep, rpte, 0, slot);
> new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
> }
> *ptep = __pte(new_pte & ~H_PAGE_BUSY);
> diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c
> index a84bb44..239ca86 100644
> --- a/arch/powerpc/mm/hugetlbpage-hash64.c
> +++ b/arch/powerpc/mm/hugetlbpage-hash64.c
> @@ -22,6 +22,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> pte_t *ptep, unsigned long trap, unsigned long flags,
> int ssize, unsigned int shift, unsigned int mmu_psize)
> {
> + real_pte_t rpte;
> unsigned long vpn;
> unsigned long old_pte, new_pte;
> unsigned long rflags, pa, sz;
> @@ -61,6 +62,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> } while(!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
>
> rflags = htab_convert_pte_flags(new_pte);
> + rpte = __real_pte(__pte(old_pte), ptep);
>
> sz = ((1UL) << shift);
> if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> @@ -71,15 +73,10 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> /* Check if pte already has an hpte (case 2) */
> if (unlikely(old_pte & H_PAGE_HASHPTE)) {
> /* There MIGHT be an HPTE for this pte */
> - unsigned long hash, slot;
> + unsigned long gslot;
>
> - hash = hpt_hash(vpn, shift, ssize);
> - if (old_pte & H_PAGE_F_SECOND)
> - hash = ~hash;
> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> -
> - if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, mmu_psize,
> + gslot = get_hidx_gslot(vpn, shift, ssize, rpte, 0);
> + if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, mmu_psize,
> mmu_psize, ssize, flags) == -1)
> old_pte &= ~_PAGE_HPTEFLAGS;
> }
> @@ -106,8 +103,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> return -1;
> }
>
> - new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> - (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> + new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
> }
>
> /*
> --
> 1.8.3.1

2017-06-21 06:54:53

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC v2 02/12] powerpc: Free up four 64K PTE bits in 64K backed hpte pages.

Ram Pai <[email protected]> writes:

....

> diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c
> index a84bb44..239ca86 100644
> --- a/arch/powerpc/mm/hugetlbpage-hash64.c
> +++ b/arch/powerpc/mm/hugetlbpage-hash64.c
> @@ -22,6 +22,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> pte_t *ptep, unsigned long trap, unsigned long flags,
> int ssize, unsigned int shift, unsigned int mmu_psize)
> {
> + real_pte_t rpte;
> unsigned long vpn;
> unsigned long old_pte, new_pte;
> unsigned long rflags, pa, sz;
> @@ -61,6 +62,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> } while(!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
>
> rflags = htab_convert_pte_flags(new_pte);
> + rpte = __real_pte(__pte(old_pte), ptep);
>
> sz = ((1UL) << shift);
> if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> @@ -71,15 +73,10 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> /* Check if pte already has an hpte (case 2) */
> if (unlikely(old_pte & H_PAGE_HASHPTE)) {
> /* There MIGHT be an HPTE for this pte */
> - unsigned long hash, slot;
> + unsigned long gslot;
>
> - hash = hpt_hash(vpn, shift, ssize);
> - if (old_pte & H_PAGE_F_SECOND)
> - hash = ~hash;
> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> -
> - if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, mmu_psize,
> + gslot = get_hidx_gslot(vpn, shift, ssize, rpte, 0);
> + if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, mmu_psize,
> mmu_psize, ssize, flags) == -1)
> old_pte &= ~_PAGE_HPTEFLAGS;
> }
> @@ -106,8 +103,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> return -1;
> }
>
> - new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> - (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> + new_pte |= set_hidx_slot(ptep, rpte, 0, slot);

We don't really need rpte here. We just need to track one entry
here. Maybe it becomes simpler if we use different helpers for 4K hptes
and the others?

-aneesh

2017-06-21 07:17:15

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC v2 05/12] powerpc: Implementation for sys_mprotect_pkey() system call.

Ram Pai <[email protected]> writes:

....

>
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> +
> /*
> * This file is included by linux/mman.h, so we can't use cacl_vm_prot_bits()
> * here. How important is the optimization?
> */
> -static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
> - unsigned long pkey)
> -{
> - return (prot & PROT_SAO) ? VM_SAO : 0;
> -}
> -#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
> +#define arch_calc_vm_prot_bits(prot, key) ( \
> + ((prot) & PROT_SAO ? VM_SAO : 0) | \
> + pkey_to_vmflag_bits(key))
> +#define arch_vm_get_page_prot(vm_flags) __pgprot( \
> + ((vm_flags) & VM_SAO ? _PAGE_SAO : 0) | \
> + vmflag_to_page_pkey_bits(vm_flags))

Can we avoid converting the static inlines back to macros? They lose type checking.
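
A rough sketch of what keeping the static inlines might look like,
reusing the pkey helpers introduced by the patch (the exact placement
in the header is an assumption, not part of the posted series):

static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
						   unsigned long pkey)
{
	/* same logic as the macro, but with type checking on the arguments */
	return ((prot & PROT_SAO) ? VM_SAO : 0) | pkey_to_vmflag_bits(pkey);
}
#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)

static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
{
	return __pgprot(((vm_flags & VM_SAO) ? _PAGE_SAO : 0) |
			vmflag_to_page_pkey_bits(vm_flags));
}
#define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)
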
> +
> +#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> +
> +#define arch_calc_vm_prot_bits(prot, key) ( \
> + ((prot) & PROT_SAO ? VM_SAO : 0))
> +#define arch_vm_get_page_prot(vm_flags) __pgprot( \
> + ((vm_flags) & VM_SAO ? _PAGE_SAO : 0))
> +
> +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
>
> -static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
> -{
> - return (vm_flags & VM_SAO) ? __pgprot(_PAGE_SAO) : __pgprot(0);
> -}
> -#define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)
>
> static inline bool arch_validate_prot(unsigned long prot)
> {

-aneesh

2017-06-21 07:26:03

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC v2 07/12] powerpc: Macro the mask used for checking DSI exception

Ram Pai <[email protected]> writes:

> Replace the magic number used to check for DSI exception
> with a meaningful value.
>
> Signed-off-by: Ram Pai <[email protected]>
> ---
> arch/powerpc/include/asm/reg.h | 9 ++++++++-
> arch/powerpc/kernel/exceptions-64s.S | 2 +-
> 2 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 7e50e47..2dcb8a1 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -272,16 +272,23 @@
> #define SPRN_DAR 0x013 /* Data Address Register */
> #define SPRN_DBCR 0x136 /* e300 Data Breakpoint Control Reg */
> #define SPRN_DSISR 0x012 /* Data Storage Interrupt Status Register */
> +#define DSISR_BIT32 0x80000000 /* not defined */
> #define DSISR_NOHPTE 0x40000000 /* no translation found */
> +#define DSISR_PAGEATTR_CONFLT 0x20000000 /* page attribute conflict */
> +#define DSISR_BIT35 0x10000000 /* not defined */
> #define DSISR_PROTFAULT 0x08000000 /* protection fault */
> #define DSISR_BADACCESS 0x04000000 /* bad access to CI or G */
> #define DSISR_ISSTORE 0x02000000 /* access was a store */
> #define DSISR_DABRMATCH 0x00400000 /* hit data breakpoint */
> -#define DSISR_NOSEGMENT 0x00200000 /* SLB miss */
> #define DSISR_KEYFAULT 0x00200000 /* Key fault */
> +#define DSISR_BIT43 0x00100000 /* not defined */
> #define DSISR_UNSUPP_MMU 0x00080000 /* Unsupported MMU config */
> #define DSISR_SET_RC 0x00040000 /* Failed setting of R/C bits */
> #define DSISR_PGDIRFAULT 0x00020000 /* Fault on page directory */
> +#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
> + DSISR_PAGEATTR_CONFLT | \
> + DSISR_BADACCESS | \
> + DSISR_BIT43)
> #define SPRN_TBRL 0x10C /* Time Base Read Lower Register (user, R/O) */
> #define SPRN_TBRU 0x10D /* Time Base Read Upper Register (user, R/O) */
> #define SPRN_CIR 0x11B /* Chip Information Register (hyper, R/0) */
> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> index ae418b8..3fd0528 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -1411,7 +1411,7 @@ USE_TEXT_SECTION()
> .balign IFETCH_ALIGN_BYTES
> do_hash_page:
> #ifdef CONFIG_PPC_STD_MMU_64
> - andis. r0,r4,0xa410 /* weird error? */
> + andis. r0,r4,DSISR_PAGE_FAULT_MASK@h
> bne- handle_page_fault /* if not, try to insert a HPTE */
> andis. r0,r4,DSISR_DABRMATCH@h
> bne- handle_dabr_fault


Thanks for doing this. I always wondered what that 0xa410 indicates. Now
that it is documented, I am wondering: are those the only DSISR values
that we want to check early? You also added a few bit positions that are
expected to carry the value 0, but then excluded BIT35. Any reason?

-aneesh

2017-06-21 09:17:41

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 07/12] powerpc: Macro the mask used for checking DSI exception

On Wed, Jun 21, 2017 at 12:55:42PM +0530, Aneesh Kumar K.V wrote:
> Ram Pai <[email protected]> writes:
>
> > Replace the magic number used to check for DSI exception
> > with a meaningful value.
> >
> > Signed-off-by: Ram Pai <[email protected]>
> > ---
> > arch/powerpc/include/asm/reg.h | 9 ++++++++-
> > arch/powerpc/kernel/exceptions-64s.S | 2 +-
> > 2 files changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> > index 7e50e47..2dcb8a1 100644
> > --- a/arch/powerpc/include/asm/reg.h
> > +++ b/arch/powerpc/include/asm/reg.h
> > @@ -272,16 +272,23 @@
> > #define SPRN_DAR 0x013 /* Data Address Register */
> > #define SPRN_DBCR 0x136 /* e300 Data Breakpoint Control Reg */
> > #define SPRN_DSISR 0x012 /* Data Storage Interrupt Status Register */
> > +#define DSISR_BIT32 0x80000000 /* not defined */
> > #define DSISR_NOHPTE 0x40000000 /* no translation found */
> > +#define DSISR_PAGEATTR_CONFLT 0x20000000 /* page attribute conflict */
> > +#define DSISR_BIT35 0x10000000 /* not defined */
> > #define DSISR_PROTFAULT 0x08000000 /* protection fault */
> > #define DSISR_BADACCESS 0x04000000 /* bad access to CI or G */
> > #define DSISR_ISSTORE 0x02000000 /* access was a store */
> > #define DSISR_DABRMATCH 0x00400000 /* hit data breakpoint */
> > -#define DSISR_NOSEGMENT 0x00200000 /* SLB miss */
> > #define DSISR_KEYFAULT 0x00200000 /* Key fault */
> > +#define DSISR_BIT43 0x00100000 /* not defined */
> > #define DSISR_UNSUPP_MMU 0x00080000 /* Unsupported MMU config */
> > #define DSISR_SET_RC 0x00040000 /* Failed setting of R/C bits */
> > #define DSISR_PGDIRFAULT 0x00020000 /* Fault on page directory */
> > +#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
> > + DSISR_PAGEATTR_CONFLT | \
> > + DSISR_BADACCESS | \
> > + DSISR_BIT43)
> > #define SPRN_TBRL 0x10C /* Time Base Read Lower Register (user, R/O) */
> > #define SPRN_TBRU 0x10D /* Time Base Read Upper Register (user, R/O) */
> > #define SPRN_CIR 0x11B /* Chip Information Register (hyper, R/0) */
> > diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> > index ae418b8..3fd0528 100644
> > --- a/arch/powerpc/kernel/exceptions-64s.S
> > +++ b/arch/powerpc/kernel/exceptions-64s.S
> > @@ -1411,7 +1411,7 @@ USE_TEXT_SECTION()
> > .balign IFETCH_ALIGN_BYTES
> > do_hash_page:
> > #ifdef CONFIG_PPC_STD_MMU_64
> > - andis. r0,r4,0xa410 /* weird error? */
> > + andis. r0,r4,DSISR_PAGE_FAULT_MASK@h
> > bne- handle_page_fault /* if not, try to insert a HPTE */
> > andis. r0,r4,DSISR_DABRMATCH@h
> > bne- handle_dabr_fault
>
>
> Thanks for doing this. I always wondered what that 0xa410 indicates. Now
> that it is documented, I am wondering: are those the only DSISR values
> that we want to check early? You also added a few bit positions that are
> expected to carry the value 0, but then excluded BIT35. Any reason?

I did not look deeply into why the exact number 0xa410 was used in the
past. I built the macro DSISR_PAGE_FAULT_MASK using whatever bits make
up 0xa410. BIT35, if added to DSISR_PAGE_FAULT_MASK, would make it
0xb410, so I did not consider it.
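
For what it's worth, the arithmetic is easy to double-check. A tiny
standalone sanity check (not kernel code; the constants are copied from
the DSISR definitions quoted above):

#include <assert.h>

int main(void)
{
	unsigned int mask = 0x80000000u   /* DSISR_BIT32 */
			  | 0x20000000u   /* DSISR_PAGEATTR_CONFLT */
			  | 0x04000000u   /* DSISR_BADACCESS */
			  | 0x00100000u;  /* DSISR_BIT43 */

	/* the upper halfword is the old magic number */
	assert((mask >> 16) == 0xa410);

	/* including BIT35 (0x10000000) would change it to 0xb410 */
	assert(((mask | 0x10000000u) >> 16) == 0xb410);
	return 0;
}
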

However, the macro for BIT35 is already defined in this patch, if that is what you were
looking for.
+#define DSISR_BIT35 0x10000000 /* not defined */

RP

2017-06-21 09:30:23

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

On Wed, Jun 21, 2017 at 12:11:32PM +0530, Aneesh Kumar K.V wrote:
> Ram Pai <[email protected]> writes:
>
> > Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
> > in the 4K backed hpte pages. These bits continue to be used
> > for 64K backed hpte pages in this patch, but will be freed
> > up in the next patch.
> >
> > The patch does the following change to the 64K PTE format
> >
> > H_PAGE_BUSY moves from bit 3 to bit 9
> > H_PAGE_F_SECOND, which occupied bit 4, moves to the second part
> > of the pte.
> > H_PAGE_F_GIX, which occupied bits 5, 6 and 7, also moves to the
> > second part of the pte.
> >
> > the four bits (H_PAGE_F_SECOND | H_PAGE_F_GIX) that represent a slot
> > are initialized to 0xF, indicating an invalid slot. If an hpte
> > gets cached in a 0xF slot (i.e. the 7th slot of the secondary), it is
> > released immediately. In other words, even though 0xF is a
> > valid slot, we discard it and consider it an invalid
> > slot; i.e. hpte_soft_invalid(). This gives us an opportunity to not
> > depend on a bit in the primary PTE in order to determine the
> > validity of a slot.
> >
> > When we release an hpte in the 0xF slot we also release a
> > legitimate primary slot and unmap that entry. This is to
> > ensure that we do get a legitimate non-0xF slot the next time we
> > retry for a slot.
> >
> > Though treating the 0xF slot as invalid reduces the number of
> > available slots and may have an effect on performance, the
> > probability of hitting a 0xF is extremely low.
> >
> > Compared to the current scheme, the above described scheme reduces
> > the number of false hash table updates significantly and has the
> > added advantage of releasing four valuable PTE bits for other
> > purpose.
> >
> > This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> > Ellerman and myself.
> >
> > The 4K PTE format remains unchanged currently.
> >
> > Signed-off-by: Ram Pai <[email protected]>
> > ---
> > arch/powerpc/include/asm/book3s/64/hash-4k.h | 20 +++++++
> > arch/powerpc/include/asm/book3s/64/hash-64k.h | 32 +++++++----
> > arch/powerpc/include/asm/book3s/64/hash.h | 15 +++--
> > arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 ++
> > arch/powerpc/mm/dump_linuxpagetables.c | 3 +-
> > arch/powerpc/mm/hash64_4k.c | 14 ++---
> > arch/powerpc/mm/hash64_64k.c | 81 ++++++++++++---------------
> > arch/powerpc/mm/hash_utils_64.c | 30 +++++++---
> > 8 files changed, 122 insertions(+), 78 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > index b4b5e6b..5ef1d81 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > @@ -16,6 +16,18 @@
> > #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
> > #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
> >
> > +
> > +/*
> > + * Only supported by 4k linux page size
> > + */
> > +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> > +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> > +#define H_PAGE_F_GIX_SHIFT 56
> > +
> > +#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> > +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> > +
> > +
> > /* PTE flags to conserve for HPTE identification */
> > #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
> > H_PAGE_F_SECOND | H_PAGE_F_GIX)
> > @@ -48,6 +60,14 @@ static inline int hash__hugepd_ok(hugepd_t hpd)
> > }
> > #endif
> >
> > +static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> > + unsigned int subpg_index, unsigned long slot)
> > +{
> > + return (slot << H_PAGE_F_GIX_SHIFT) &
> > + (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > +}
> > +
> > +
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >
> > static inline char *get_hpte_slot_array(pmd_t *pmdp)
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> > index 9732837..0eb3c89 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> > @@ -10,23 +10,25 @@
> > * 64k aligned address free up few of the lower bits of RPN for us
> > * We steal that here. For more deatils look at pte_pfn/pfn_pte()
> > */
> > -#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> > -#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
> > +#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> > +#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
> > +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> > +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> > +#define H_PAGE_F_GIX_SHIFT 56
> > +
> > +
> > +#define H_PAGE_BUSY _RPAGE_RPN42 /* software: PTE & hash are busy */
> > +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> > +
> > /*
> > * We need to differentiate between explicit huge page and THP huge
> > * page, since THP huge page also need to track real subpage details
> > */
> > #define H_PAGE_THP_HUGE H_PAGE_4K_PFN
> >
> > -/*
> > - * Used to track subpage group valid if H_PAGE_COMBO is set
> > - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
> > - */
> > -#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
> > -
> > /* PTE flags to conserve for HPTE identification */
> > -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
> > - H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
> > +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO)
>
> Why in this patch ? This is related to 64K pte
>

Yes, it's in the wrong patch. I have fixed it in my new series.

>
> > +
> > /*
> > * we support 16 fragments per PTE page of 64K size.
> > */
> > @@ -74,6 +76,16 @@ static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
> > return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
> > }
> >
> > +static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> > + unsigned int subpg_index, unsigned long slot)
> > +{
> > + unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> > +
> > + rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> > + *hidxp = rpte.hidx | (slot << (subpg_index << 2));
> > + return 0x0UL;
> > +}
> > +
> > #define __rpte_to_pte(r) ((r).pte)
> > extern bool __rpte_sub_valid(real_pte_t rpte, unsigned long index);
> > /*
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> > index 4e957b0..e7cf03a 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> > @@ -8,11 +8,8 @@
> > *
> > */
> > #define H_PTE_NONE_MASK _PAGE_HPTEFLAGS
> > -#define H_PAGE_F_GIX_SHIFT 56
> > -#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> > -#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> > -#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> > -#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> > +
> > +#define INIT_HIDX (~0x0UL)
> >
> > #ifdef CONFIG_PPC_64K_PAGES
> > #include <asm/book3s/64/hash-64k.h>
> > @@ -160,6 +157,14 @@ static inline int hash__pte_none(pte_t pte)
> > return (pte_val(pte) & ~H_PTE_NONE_MASK) == 0;
> > }
> >
> > +static inline bool hpte_soft_invalid(unsigned long slot)
> > +{
> > + return ((slot & 0xfUL) == 0xfUL);
> > +}
> > +
> > +unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
> > + int ssize, real_pte_t rpte, unsigned int subpg_index);
> > +
> > /* This low level function performs the actual PTE insertion
> > * Setting the PTE depends on the MMU type and other factors. It's
> > * an horrible mess that I'm not going to try to clean up now but
> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > index 6981a52..cfb8169 100644
> > --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > @@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
> > extern int __hash_page_64K(unsigned long ea, unsigned long access,
> > unsigned long vsid, pte_t *ptep, unsigned long trap,
> > unsigned long flags, int ssize);
> > +extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> > + unsigned int subpg_index, unsigned long slot);
> > +extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> > + int ssize, real_pte_t rpte, unsigned int subpg_index);
> > +
> > struct mm_struct;
> > unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
> > extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
> > diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
> > index 44fe483..b832ed3 100644
> > --- a/arch/powerpc/mm/dump_linuxpagetables.c
> > +++ b/arch/powerpc/mm/dump_linuxpagetables.c
> > @@ -213,7 +213,7 @@ struct flag_info {
> > .val = H_PAGE_4K_PFN,
> > .set = "4K_pfn",
> > }, {
> > -#endif
> > +#else
> > .mask = H_PAGE_F_GIX,
> > .val = H_PAGE_F_GIX,
> > .set = "f_gix",
> > @@ -224,6 +224,7 @@ struct flag_info {
> > .val = H_PAGE_F_SECOND,
> > .set = "f_second",
> > }, {
> > +#endif /* CONFIG_PPC_64K_PAGES */
> > #endif
> > .mask = _PAGE_SPECIAL,
> > .val = _PAGE_SPECIAL,
> > diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
> > index 6fa450c..c673829 100644
> > --- a/arch/powerpc/mm/hash64_4k.c
> > +++ b/arch/powerpc/mm/hash64_4k.c
> > @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > pte_t *ptep, unsigned long trap, unsigned long flags,
> > int ssize, int subpg_prot)
> > {
> > + real_pte_t rpte;
> > unsigned long hpte_group;
> > unsigned long rflags, pa;
> > unsigned long old_pte, new_pte;
> > @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > * need to add in 0x1 if it's a read-only user page
> > */
> > rflags = htab_convert_pte_flags(new_pte);
> > + rpte = __real_pte(__pte(old_pte), ptep);
> >
> > if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
> > !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> > @@ -64,13 +66,10 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > /*
> > * There MIGHT be an HPTE for this pte
> > */
> > - hash = hpt_hash(vpn, shift, ssize);
> > - if (old_pte & H_PAGE_F_SECOND)
> > - hash = ~hash;
> > - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> > + unsigned long gslot = get_hidx_gslot(vpn, shift,
> > + ssize, rpte, 0);
> >
> > - if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_4K,
> > + if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, MMU_PAGE_4K,
> > MMU_PAGE_4K, ssize, flags) == -1)
> > old_pte &= ~_PAGE_HPTEFLAGS;
> > }
> > @@ -118,8 +117,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > return -1;
> > }
> > new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
> > - new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> > - (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > + new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
> > }
> > *ptep = __pte(new_pte & ~H_PAGE_BUSY);
> > return 0;
>
>
>
> None of the above changes are needed. We are not changing anything w.r.t
> 4k linux page table yet. So we can drop this.
>

I have put these changes in a separate patch. If needed it can be pulled
in. But its not mandatory. Would be nice to have though, since it
reduces a bunch of lines.


>
> > diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
> > index 1a68cb1..3702a3c 100644
> > --- a/arch/powerpc/mm/hash64_64k.c
> > +++ b/arch/powerpc/mm/hash64_64k.c
> > @@ -15,34 +15,13 @@
> > #include <linux/mm.h>
> > #include <asm/machdep.h>
> > #include <asm/mmu.h>
> > +
> > /*
> > * index from 0 - 15
> > */
> > bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
> > {
> > - unsigned long g_idx;
> > - unsigned long ptev = pte_val(rpte.pte);
> > -
> > - g_idx = (ptev & H_PAGE_COMBO_VALID) >> H_PAGE_F_GIX_SHIFT;
> > - index = index >> 2;
> > - if (g_idx & (0x1 << index))
> > - return true;
> > - else
> > - return false;
> > -}
> > -/*
> > - * index from 0 - 15
> > - */
> > -static unsigned long mark_subptegroup_valid(unsigned long ptev, unsigned long index)
> > -{
> > - unsigned long g_idx;
> > -
> > - if (!(ptev & H_PAGE_COMBO))
> > - return ptev;
> > - index = index >> 2;
> > - g_idx = 0x1 << index;
> > -
> > - return ptev | (g_idx << H_PAGE_F_GIX_SHIFT);
> > + return !(hpte_soft_invalid(rpte.hidx >> (index << 2)));
> > }
> >
> > int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > @@ -50,10 +29,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > int ssize, int subpg_prot)
> > {
> > real_pte_t rpte;
> > - unsigned long *hidxp;
> > unsigned long hpte_group;
> > unsigned int subpg_index;
> > - unsigned long rflags, pa, hidx;
> > + unsigned long rflags, pa;
> > unsigned long old_pte, new_pte, subpg_pte;
> > unsigned long vpn, hash, slot;
> > unsigned long shift = mmu_psize_defs[MMU_PAGE_4K].shift;
> > @@ -116,28 +94,23 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > * On hash insert failure we use old pte value and we don't
> > * want slot information there if we have a insert failure.
> > */
> > - old_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
> > - new_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
> > + old_pte &= ~(H_PAGE_HASHPTE);
> > + new_pte &= ~(H_PAGE_HASHPTE);
> > goto htab_insert_hpte;
> > }
> > /*
> > * Check for sub page valid and update
> > */
> > if (__rpte_sub_valid(rpte, subpg_index)) {
> > - int ret;
> >
> > - hash = hpt_hash(vpn, shift, ssize);
> > - hidx = __rpte_to_hidx(rpte, subpg_index);
> > - if (hidx & _PTEIDX_SECONDARY)
> > - hash = ~hash;
> > - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > - slot += hidx & _PTEIDX_GROUP_IX;
> > + unsigned long gslot = get_hidx_gslot(vpn, shift,
> > + ssize, rpte, subpg_index);
>
>
> Converting that to a helper is also not needed in this patch. Leave it as
> it is. It is much easier to review.
>

OK. But don't we want to reduce a bunch of lines?

>
> >
> > - ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
> > + int ret = mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn,
> > MMU_PAGE_4K, MMU_PAGE_4K,
> > ssize, flags);
> > /*
> > - *if we failed because typically the HPTE wasn't really here
> > + * if we failed because typically the HPTE wasn't really here
> > * we try an insertion.
> > */
> > if (ret == -1)
> > @@ -148,6 +121,15 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > }
> >
> > htab_insert_hpte:
> > +
> > + /*
> > + * initialize all hidx entries to a invalid value,
> > + * the first time the PTE is about to allocate
> > + * a 4K hpte
> > + */
> > + if (!(old_pte & H_PAGE_COMBO))
> > + rpte.hidx = INIT_HIDX;
> > +
> > /*
> > * handle H_PAGE_4K_PFN case
> > */
> > @@ -177,10 +159,20 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > rflags, HPTE_V_SECONDARY,
> > MMU_PAGE_4K, MMU_PAGE_4K,
> > ssize);
> > - if (slot == -1) {
> > - if (mftb() & 0x1)
> > +
> > + if (unlikely(hpte_soft_invalid(slot))) {
>
> Should we name that hpte_slot_invalid() ? ie. s/soft/slot/ ?

I intentionally used the word soft, since for the hardware it is a
valid slot. The *soft*ware considers it invalid; hence the word
*soft*.

>
>
> > + slot = slot & _PTEIDX_GROUP_IX;
> > + mmu_hash_ops.hpte_invalidate(hpte_group+slot, vpn,
> > + MMU_PAGE_4K, MMU_PAGE_4K,
> > + ssize, flags);
>
> What is the last arg, flags, here? I guess we need to pass 0 there?
> We can't do a local = 1 invalidate, because we don't know whether
> anybody really accessed this address in between and got the entry
> into the TLB.

OK, I think you are right; it should be 0.
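
That is, the invalidate in the 0xF-slot cleanup path would become
something like this (a sketch based on the hunk quoted above, with the
last argument hard-coded to 0 so the flush is never treated as local):

	slot = slot & _PTEIDX_GROUP_IX;
	mmu_hash_ops.hpte_invalidate(hpte_group + slot, vpn,
				     MMU_PAGE_4K, MMU_PAGE_4K,
				     ssize, 0 /* not a local-only flush */);
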

>
>
> > + }
> > +
> > + if (unlikely(slot == -1 || hpte_soft_invalid(slot))) {
> > +
>
> Can you add a comment explaining that an invalid slot always results in
> removing from the primary? Also, do we want to store the invalid-slot
> detail in a variable instead of doing that conditional again and
> again? This is a hot path.
>

will do.

> > + if (hpte_soft_invalid(slot) || (mftb() & 0x1))
> > hpte_group = ((hash & htab_hash_mask) *
> > HPTES_PER_GROUP) & ~0x7UL;
> > +
> > mmu_hash_ops.hpte_remove(hpte_group);
> > /*
> > * FIXME!! Should be try the group from which we removed ?
> > @@ -204,11 +196,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > * Since we have H_PAGE_BUSY set on ptep, we can be sure
> > * nobody is undating hidx.
> > */
> > - hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> > - rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> > - *hidxp = rpte.hidx | (slot << (subpg_index << 2));
> > - new_pte = mark_subptegroup_valid(new_pte, subpg_index);
> > - new_pte |= H_PAGE_HASHPTE;
> > + new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot);
> > + new_pte |= H_PAGE_HASHPTE;
> > +
> > /*
> > * check __real_pte for details on matching smp_rmb()
> > */
> > @@ -322,9 +312,10 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> > MMU_PAGE_64K, MMU_PAGE_64K, old_pte);
> > return -1;
> > }
> > - new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
> > +
> > new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> > - (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > + (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > + new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
>
> What is this change? I guess we want this in the second patch?

Yes, I have moved it to the second patch.

>
>
> > }
> > *ptep = __pte(new_pte & ~H_PAGE_BUSY);
> > return 0;
> > diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> > index f2095ce..c0f4b46 100644
> > --- a/arch/powerpc/mm/hash_utils_64.c
> > +++ b/arch/powerpc/mm/hash_utils_64.c
> > @@ -975,8 +975,9 @@ void __init hash__early_init_devtree(void)
> >
> > void __init hash__early_init_mmu(void)
> > {
> > +#ifndef CONFIG_PPC_64K_PAGES
> > /*
> > - * We have code in __hash_page_64K() and elsewhere, which assumes it can
> > + * We have code in __hash_page_4K() and elsewhere, which assumes it can
> > * do the following:
> > * new_pte |= (slot << H_PAGE_F_GIX_SHIFT) & (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > *
> > @@ -987,6 +988,7 @@ void __init hash__early_init_mmu(void)
> > * with a BUILD_BUG_ON().
> > */
> > BUILD_BUG_ON(H_PAGE_F_SECOND != (1ul << (H_PAGE_F_GIX_SHIFT + 3)));
> > +#endif /* CONFIG_PPC_64K_PAGES */
> >
> > htab_init_page_sizes();
> >
> > @@ -1589,29 +1591,39 @@ static inline void tm_flush_hash_page(int local)
> > }
> > #endif
> >
> > +unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
> > + int ssize, real_pte_t rpte, unsigned int subpg_index)
> > +{
> > + unsigned long hash, slot, hidx;
> > +
> > + hash = hpt_hash(vpn, shift, ssize);
> > + hidx = __rpte_to_hidx(rpte, subpg_index);
> > + if (hidx & _PTEIDX_SECONDARY)
> > + hash = ~hash;
> > + slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > + slot += hidx & _PTEIDX_GROUP_IX;
> > + return slot;
> > +}
>
>
> We don't need this helper for this patch series ?
>

The helpers will now be moved into independent patches; they can be
applied if needed.

> > +
> > +
> > /* WARNING: This is called from hash_low_64.S, if you change this prototype,
> > * do not forget to update the assembly call site !
> > */
> > void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, int ssize,
> > unsigned long flags)
> > {
> > - unsigned long hash, index, shift, hidx, slot;
> > + unsigned long hash, index, shift, hidx, gslot;
> > int local = flags & HPTE_LOCAL_UPDATE;
> >
> > DBG_LOW("flush_hash_page(vpn=%016lx)\n", vpn);
> > pte_iterate_hashed_subpages(pte, psize, vpn, index, shift) {
> > - hash = hpt_hash(vpn, shift, ssize);
> > - hidx = __rpte_to_hidx(pte, index);
> > - if (hidx & _PTEIDX_SECONDARY)
> > - hash = ~hash;
> > - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > - slot += hidx & _PTEIDX_GROUP_IX;
> > + gslot = get_hidx_gslot(vpn, shift, ssize, pte, index);
> > DBG_LOW(" sub %ld: hash=%lx, hidx=%lx\n", index, slot, hidx);
> > /*
> > * We use same base page size and actual psize, because we don't
> > * use these functions for hugepage
> > */
> > - mmu_hash_ops.hpte_invalidate(slot, vpn, psize, psize,
> > + mmu_hash_ops.hpte_invalidate(gslot, vpn, psize, psize,
> > ssize, local);
> > } pte_iterate_hashed_end();
> >
> And if we avoid adding that helper, changes like this can be avoided in
> the patch.
>
>
> -aneesh

2017-06-21 20:14:36

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 02/12] powerpc: Free up four 64K PTE bits in 64K backed hpte pages.

On Wed, Jun 21, 2017 at 12:24:34PM +0530, Aneesh Kumar K.V wrote:
> Ram Pai <[email protected]> writes:
>
> ....
>
> > diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c
> > index a84bb44..239ca86 100644
> > --- a/arch/powerpc/mm/hugetlbpage-hash64.c
> > +++ b/arch/powerpc/mm/hugetlbpage-hash64.c
> > @@ -22,6 +22,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> > pte_t *ptep, unsigned long trap, unsigned long flags,
> > int ssize, unsigned int shift, unsigned int mmu_psize)
> > {
> > + real_pte_t rpte;
> > unsigned long vpn;
> > unsigned long old_pte, new_pte;
> > unsigned long rflags, pa, sz;
> > @@ -61,6 +62,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> > } while(!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
> >
> > rflags = htab_convert_pte_flags(new_pte);
> > + rpte = __real_pte(__pte(old_pte), ptep);
> >
> > sz = ((1UL) << shift);
> > if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> > @@ -71,15 +73,10 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> > /* Check if pte already has an hpte (case 2) */
> > if (unlikely(old_pte & H_PAGE_HASHPTE)) {
> > /* There MIGHT be an HPTE for this pte */
> > - unsigned long hash, slot;
> > + unsigned long gslot;
> >
> > - hash = hpt_hash(vpn, shift, ssize);
> > - if (old_pte & H_PAGE_F_SECOND)
> > - hash = ~hash;
> > - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> > -
> > - if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, mmu_psize,
> > + gslot = get_hidx_gslot(vpn, shift, ssize, rpte, 0);
> > + if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, mmu_psize,
> > mmu_psize, ssize, flags) == -1)
> > old_pte &= ~_PAGE_HPTEFLAGS;
> > }
> > @@ -106,8 +103,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> > return -1;
> > }
> >
> > - new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> > - (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > + new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
>
> We don't really need rpte here. We just need to track one entry
> here. Maybe it becomes simpler if we use different helpers for 4K hptes
> and the others?

Actually, we do need rpte here. The hidx for these 64K-hpte-backed PTEs
is now stored in the second half of the PTE.
I have abstracted the helpers so that the caller need not
know the location of the hidx. It comes in really handy.
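
For reference, the two set_hidx_slot() variants from the hunks quoted
earlier in the thread, side by side. The 4K-linux-page variant packs
the slot into the PTE's own H_PAGE_F_SECOND/H_PAGE_F_GIX bits, while
the 64K variant writes it into the hidx array in the second half of
the PTE page, which is why callers only deal with rpte:

/* hash-4k.h: the slot lives in the PTE's software bits */
static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
		unsigned int subpg_index, unsigned long slot)
{
	return (slot << H_PAGE_F_GIX_SHIFT) &
		(H_PAGE_F_SECOND | H_PAGE_F_GIX);
}

/* hash-64k.h: the slot lives in the second half of the PTE page */
static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
		unsigned int subpg_index, unsigned long slot)
{
	unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);

	rpte.hidx &= ~(0xfUL << (subpg_index << 2));
	*hidxp = rpte.hidx | (slot << (subpg_index << 2));
	return 0x0UL;
}
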

RP

2017-06-22 09:07:54

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

On 06/17/2017 09:22 AM, Ram Pai wrote:
> Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
> in the 4K backed hpte pages. These bits continue to be used
> for 64K backed hpte pages in this patch, but will be freed
> up in the next patch.
>
> The patch does the following change to the 64K PTE format
>
> H_PAGE_BUSY moves from bit 3 to bit 9
> H_PAGE_F_SECOND, which occupied bit 4, moves to the second part
> of the pte.
> H_PAGE_F_GIX, which occupied bits 5, 6 and 7, also moves to the
> second part of the pte.
>
> the four bits (H_PAGE_F_SECOND | H_PAGE_F_GIX) that represent a slot
> are initialized to 0xF, indicating an invalid slot. If an hpte
> gets cached in a 0xF slot (i.e. the 7th slot of the secondary), it is
> released immediately. In other words, even though 0xF is a
> valid slot, we discard it and consider it an invalid
> slot; i.e. hpte_soft_invalid(). This gives us an opportunity to not
> depend on a bit in the primary PTE in order to determine the
> validity of a slot.
>
> When we release an hpte in the 0xF slot we also release a
> legitimate primary slot and unmap that entry. This is to
> ensure that we do get a legitimate non-0xF slot the next time we
> retry for a slot.
>
> Though treating the 0xF slot as invalid reduces the number of
> available slots and may have an effect on performance, the
> probability of hitting a 0xF is extremely low.
>
> Compared to the current scheme, the above described scheme reduces
> the number of false hash table updates significantly and has the
> added advantage of releasing four valuable PTE bits for other
> purpose.
>
> This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> Ellerman and myself.
>
> The 4K PTE format remains unchanged currently.

I scanned through the PTE format again for hash 64K and 4K. It seems
to me that there might be 5 free bits already present in the PTE
format. I might have seriously misunderstood something here :) Please
correct me if that is not the case. _RPAGE_RPN*, I think, is applicable
only to the hash page table format and will not be available for radix
later.

+#define _PAGE_FREE_1 0x0000000000000040UL /* Not used */
+#define _RPAGE_SW0 0x2000000000000000UL /* Not used */
+#define _RPAGE_SW1 0x0000000000000800UL /* Not used */
+#define _RPAGE_RPN42 0x0040000000000000UL /* Not used */
+#define _RPAGE_RPN41 0x0020000000000000UL /* Not used */


2017-06-22 16:21:05

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

On Thu, Jun 22, 2017 at 02:37:27PM +0530, Anshuman Khandual wrote:
> On 06/17/2017 09:22 AM, Ram Pai wrote:
> > Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
> > in the 4K backed hpte pages. These bits continue to be used
> > for 64K backed hpte pages in this patch, but will be freed
> > up in the next patch.
> >
> > The patch does the following change to the 64K PTE format
> >
> > H_PAGE_BUSY moves from bit 3 to bit 9
> > H_PAGE_F_SECOND, which occupied bit 4, moves to the second part
> > of the pte.
> > H_PAGE_F_GIX, which occupied bits 5, 6 and 7, also moves to the
> > second part of the pte.
> >
> > the four bits (H_PAGE_F_SECOND | H_PAGE_F_GIX) that represent a slot
> > are initialized to 0xF, indicating an invalid slot. If an hpte
> > gets cached in a 0xF slot (i.e. the 7th slot of the secondary), it is
> > released immediately. In other words, even though 0xF is a
> > valid slot, we discard it and consider it an invalid
> > slot; i.e. hpte_soft_invalid(). This gives us an opportunity to not
> > depend on a bit in the primary PTE in order to determine the
> > validity of a slot.
> >
> > When we release an hpte in the 0xF slot we also release a
> > legitimate primary slot and unmap that entry. This is to
> > ensure that we do get a legitimate non-0xF slot the next time we
> > retry for a slot.
> >
> > Though treating the 0xF slot as invalid reduces the number of
> > available slots and may have an effect on performance, the
> > probability of hitting a 0xF is extremely low.
> >
> > Compared to the current scheme, the above described scheme reduces
> > the number of false hash table updates significantly and has the
> > added advantage of releasing four valuable PTE bits for other
> > purpose.
> >
> > This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> > Ellerman and myself.
> >
> > The 4K PTE format remains unchanged currently.
>
> I scanned through the PTE format again for hash 64K and 4K. It seems
> to me that there might be 5 free bits already present in the PTE
> format. I might have seriously misunderstood something here :) Please
> correct me if that is not the case. _RPAGE_RPN*, I think, is applicable
> only to the hash page table format and will not be available for radix
> later.
>
> +#define _PAGE_FREE_1 0x0000000000000040UL /* Not used */
> +#define _RPAGE_SW0 0x2000000000000000UL /* Not used */
> +#define _RPAGE_SW1 0x0000000000000800UL /* Not used */
> +#define _RPAGE_RPN42 0x0040000000000000UL /* Not used */
> +#define _RPAGE_RPN41 0x0020000000000000UL /* Not used */
>

The bits were chosen to future-proof for the radix implementation.
_RPAGE_SW* would eat into what is available for software in the future,
and these key bits will certainly be something that the radix
hardware reads in the future.

The _RPAGE_RPN* bits cannot be relied on for radix.

In the end, the bits that we chose (H_PAGE_F_SECOND | H_PAGE_F_GIX) had
the best potential for giving us the highest number of free bits with
relatively little effort.

RP