2017-06-06 01:05:46

by Ram Pai

Subject: [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys

Memory protection keys enable an application to protect its
address space from inadvertent access or corruption by
itself.

The overall idea:

A process allocates a key and associates it with
an address range within its address space.
The process can then dynamically set read/write
permissions on the key without involving the
kernel. Any code that violates the permissions
of the address space, as defined by its associated
key, will receive a segmentation fault.
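
A rough user-space sketch of that flow, for illustration only (the
wrapper names below are made up; the syscall numbers 384/385/386 are
the ones wired up later in this series, and the exact user-space
mechanism for updating the AMR is left as a comment):

#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int sys_pkey_alloc(unsigned long flags, unsigned long access_rights)
{
        return syscall(384, flags, access_rights);      /* __NR_pkey_alloc */
}

static int sys_pkey_mprotect(void *addr, size_t len, int prot, int pkey)
{
        return syscall(386, addr, len, prot, pkey);     /* __NR_pkey_mprotect */
}

int main(void)
{
        char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        int pkey = sys_pkey_alloc(0, 0);

        /* tag the mapping with the key */
        sys_pkey_mprotect(buf, 4096, PROT_READ | PROT_WRITE, pkey);

        /*
         * From here on the application flips the AD/WD bits for this
         * key in the AMR from user space, without a syscall; an access
         * that violates the current AMR setting results in SIGSEGV.
         */
        strcpy(buf, "hello");
        return 0;
}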

This patch series enables the feature on PPC64.
It is enabled on the HPTE 64K-page platform.

ISA3.0 section 5.7.13 describes the detailed specifications.

Testing:
This patch series has passed all the protection key
tests available in the selftests directory. Though
the tests are written for x86, I have updated them
to cater to powerpc. Will send that patch
separately, along with documentation updates.

Thanks-to: Dave Hansen, Aneesh, Paul Mackerras,
Michael Ellerman :)

Ram Pai (7):
Free up four PTE bits to accommodate memory keys
Implement sys_pkey_alloc and sys_pkey_free system call.
store and restore the key state across context switches.
Implementation for sys_mprotect_pkey() system call.
Program HPTE key protection bits.
Handle exceptions caused by violation of key protection.
Deliver SEGV signal on protection key violation.

arch/powerpc/Kconfig | 15 ++
arch/powerpc/include/asm/book3s/64/hash-4k.h | 12 ++
arch/powerpc/include/asm/book3s/64/hash-64k.h | 38 ++--
arch/powerpc/include/asm/book3s/64/hash.h | 8 +-
arch/powerpc/include/asm/book3s/64/mmu-hash.h | 10 +
arch/powerpc/include/asm/book3s/64/mmu.h | 10 +
arch/powerpc/include/asm/book3s/64/pgtable.h | 84 +++++++-
arch/powerpc/include/asm/mman.h | 29 +--
arch/powerpc/include/asm/mmu_context.h | 12 ++
arch/powerpc/include/asm/pkeys.h | 159 +++++++++++++++
arch/powerpc/include/asm/processor.h | 5 +
arch/powerpc/include/asm/reg.h | 10 +-
arch/powerpc/include/asm/systbl.h | 3 +
arch/powerpc/include/asm/unistd.h | 6 +-
arch/powerpc/include/uapi/asm/ptrace.h | 5 +-
arch/powerpc/include/uapi/asm/unistd.h | 3 +
arch/powerpc/kernel/asm-offsets.c | 1 +
arch/powerpc/kernel/exceptions-64s.S | 10 +-
arch/powerpc/kernel/process.c | 18 ++
arch/powerpc/kernel/signal_32.c | 18 +-
arch/powerpc/kernel/signal_64.c | 11 ++
arch/powerpc/kernel/traps.c | 49 +++++
arch/powerpc/mm/Makefile | 1 +
arch/powerpc/mm/dump_linuxpagetables.c | 3 +-
arch/powerpc/mm/fault.c | 21 +-
arch/powerpc/mm/hash64_4k.c | 12 +-
arch/powerpc/mm/hash64_64k.c | 73 +++----
arch/powerpc/mm/hash_utils_64.c | 43 ++++-
arch/powerpc/mm/hugetlbpage-hash64.c | 16 +-
arch/powerpc/mm/mmu_context_book3s64.c | 5 +
arch/powerpc/mm/pkeys.c | 267 ++++++++++++++++++++++++++
include/linux/mm.h | 32 +--
include/uapi/asm-generic/mman-common.h | 2 +-
33 files changed, 856 insertions(+), 135 deletions(-)
create mode 100644 arch/powerpc/include/asm/pkeys.h
create mode 100644 arch/powerpc/mm/pkeys.c

--
1.8.3.1


2017-06-06 01:05:51

by Ram Pai

Subject: [RFC PATCH 1/7 v1] powerpc: Free up four PTE bits to accommodate memory keys

Rearrange PTE bits to free up bits 3, 4, 5 and 6 for
memory keys. Bits 3, 4, 5, 6 and 57 shall be used for memory
keys.

The patch makes the following changes to the 64K PTE format:

H_PAGE_BUSY moves from bit 3 to bit 7.
H_PAGE_F_SECOND, which occupied bit 4, moves to the second part
of the PTE.
H_PAGE_F_GIX, which occupied bits 5, 6 and 7, also moves to the
second part of the PTE.

The second part of the PTE will hold
one (H_PAGE_F_SECOND|H_PAGE_F_GIX) for a 64K-page-backed PTE,
and sixteen (H_PAGE_F_SECOND|H_PAGE_F_GIX) for a 4K-backed
PTE.

The four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
are initialized to 0xF, indicating an invalid slot. If a hash page
does get allocated in the 0xF slot, it is released and not used. In
other words, even though 0xF is a valid slot, we discard it and
treat it as an invalid slot (HPTE_SOFT_INVALID). This gives us the
opportunity to not depend on a bit in the primary PTE in order to
determine the validity of a slot.
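
To make the encoding concrete, here is a minimal sketch of how a
subpage's slot is read and validated, mirroring the __rpte_to_hidx()
and HPTE_SOFT_INVALID() helpers this patch introduces (illustrative
only, not the kernel code itself):

/*
 * 'hidx' is the second half of the PTE: sixteen 4-bit slot values,
 * one per 4K subpage; 'index' is the subpage number (0..15).
 */
static inline unsigned long subpage_slot(unsigned long hidx, unsigned int index)
{
        return (hidx >> (index << 2)) & 0xfUL;
}

static inline int subpage_slot_valid(unsigned long hidx, unsigned int index)
{
        /* 0xF is HPTE_SOFT_INVALID: no usable HPTE for this subpage */
        return subpage_slot(hidx, index) != 0xfUL;
}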

When we release a 0xF slot we also release a legitimate primary
slot and unmap that entry. This is to ensure that we do get
a legitimate non-0xF slot the next time we retry for a slot.

Though treating the 0xF slot as invalid reduces the number of
available slots and may have an effect on performance, the
probability of hitting a 0xF slot is extremely low.

Compared to the current scheme, the scheme described above reduces
the number of false hash table updates significantly and has the
added advantage of releasing four valuable PTE bits for other
purposes.

This idea was jointly developed by Paul Mackerras, Aneesh, Michael
Ellerman and myself.

The 4K PTE format remains unchanged for now.

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/include/asm/book3s/64/hash-4k.h | 12 +++++
arch/powerpc/include/asm/book3s/64/hash-64k.h | 38 ++++++--------
arch/powerpc/include/asm/book3s/64/hash.h | 8 ++-
arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 ++
arch/powerpc/mm/dump_linuxpagetables.c | 3 +-
arch/powerpc/mm/hash64_4k.c | 12 ++---
arch/powerpc/mm/hash64_64k.c | 73 +++++++++------------------
arch/powerpc/mm/hash_utils_64.c | 38 +++++++++++++-
arch/powerpc/mm/hugetlbpage-hash64.c | 16 +++---
9 files changed, 107 insertions(+), 98 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index b4b5e6b..223d318 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -16,6 +16,18 @@
#define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
#define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)

+
+/*
+ * Only supported by 4k linux page size
+ */
+#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
+#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
+#define H_PAGE_F_GIX_SHIFT 56
+
+#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
+#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
+
+
/* PTE flags to conserve for HPTE identification */
#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
H_PAGE_F_SECOND | H_PAGE_F_GIX)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 9732837..87ee22d 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -10,23 +10,21 @@
* 64k aligned address free up few of the lower bits of RPN for us
* We steal that here. For more deatils look at pte_pfn/pfn_pte()
*/
-#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
-#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
+#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
+#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
+
+#define H_PAGE_BUSY _RPAGE_RPN44 /* software: PTE & hash are busy */
+#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
+
/*
* We need to differentiate between explicit huge page and THP huge
* page, since THP huge page also need to track real subpage details
*/
#define H_PAGE_THP_HUGE H_PAGE_4K_PFN

-/*
- * Used to track subpage group valid if H_PAGE_COMBO is set
- * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
- */
-#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
-
/* PTE flags to conserve for HPTE identification */
-#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
- H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
+#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE)
+
/*
* we support 16 fragments per PTE page of 64K size.
*/
@@ -54,24 +52,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
unsigned long *hidxp;

rpte.pte = pte;
- rpte.hidx = 0;
- if (pte_val(pte) & H_PAGE_COMBO) {
- /*
- * Make sure we order the hidx load against the H_PAGE_COMBO
- * check. The store side ordering is done in __hash_page_4K
- */
- smp_rmb();
- hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
- rpte.hidx = *hidxp;
- }
+ /*
+ * The store side ordering is done in __hash_page_4K
+ */
+ smp_rmb();
+ hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
+ rpte.hidx = *hidxp;
return rpte;
}

static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
{
- if ((pte_val(rpte.pte) & H_PAGE_COMBO))
- return (rpte.hidx >> (index<<2)) & 0xf;
- return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
+ return ((rpte.hidx >> (index<<2)) & 0xfUL);
}

#define __rpte_to_pte(r) ((r).pte)
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index 4e957b0..7742782 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -8,11 +8,9 @@
*
*/
#define H_PTE_NONE_MASK _PAGE_HPTEFLAGS
-#define H_PAGE_F_GIX_SHIFT 56
-#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
-#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
-#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
-#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
+
+#define INIT_HIDX (~0x0UL)
+#define HPTE_SOFT_INVALID(slot) ((slot & 0xfUL) == 0xfUL)

#ifdef CONFIG_PPC_64K_PAGES
#include <asm/book3s/64/hash-64k.h>
diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 6981a52..cfb8169 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
extern int __hash_page_64K(unsigned long ea, unsigned long access,
unsigned long vsid, pte_t *ptep, unsigned long trap,
unsigned long flags, int ssize);
+extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+ unsigned int subpg_index, unsigned long slot);
+extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
+ int ssize, real_pte_t rpte, unsigned int subpg_index);
+
struct mm_struct;
unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
index 44fe483..b832ed3 100644
--- a/arch/powerpc/mm/dump_linuxpagetables.c
+++ b/arch/powerpc/mm/dump_linuxpagetables.c
@@ -213,7 +213,7 @@ struct flag_info {
.val = H_PAGE_4K_PFN,
.set = "4K_pfn",
}, {
-#endif
+#else
.mask = H_PAGE_F_GIX,
.val = H_PAGE_F_GIX,
.set = "f_gix",
@@ -224,6 +224,7 @@ struct flag_info {
.val = H_PAGE_F_SECOND,
.set = "f_second",
}, {
+#endif /* CONFIG_PPC_64K_PAGES */
#endif
.mask = _PAGE_SPECIAL,
.val = _PAGE_SPECIAL,
diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
index 6fa450c..5b16416 100644
--- a/arch/powerpc/mm/hash64_4k.c
+++ b/arch/powerpc/mm/hash64_4k.c
@@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
pte_t *ptep, unsigned long trap, unsigned long flags,
int ssize, int subpg_prot)
{
+ real_pte_t rpte;
unsigned long hpte_group;
unsigned long rflags, pa;
unsigned long old_pte, new_pte;
@@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
* need to add in 0x1 if it's a read-only user page
*/
rflags = htab_convert_pte_flags(new_pte);
+ rpte = __real_pte(__pte(old_pte), ptep);

if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -64,12 +66,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
/*
* There MIGHT be an HPTE for this pte
*/
- hash = hpt_hash(vpn, shift, ssize);
- if (old_pte & H_PAGE_F_SECOND)
- hash = ~hash;
- slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
- slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
-
+ slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_4K,
MMU_PAGE_4K, ssize, flags) == -1)
old_pte &= ~_PAGE_HPTEFLAGS;
@@ -118,8 +115,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
return -1;
}
new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
- new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
- (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+ new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
}
*ptep = __pte(new_pte & ~H_PAGE_BUSY);
return 0;
diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
index 1a68cb1..feb90f1 100644
--- a/arch/powerpc/mm/hash64_64k.c
+++ b/arch/powerpc/mm/hash64_64k.c
@@ -15,34 +15,13 @@
#include <linux/mm.h>
#include <asm/machdep.h>
#include <asm/mmu.h>
-/*
- * index from 0 - 15
- */
-bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
-{
- unsigned long g_idx;
- unsigned long ptev = pte_val(rpte.pte);

- g_idx = (ptev & H_PAGE_COMBO_VALID) >> H_PAGE_F_GIX_SHIFT;
- index = index >> 2;
- if (g_idx & (0x1 << index))
- return true;
- else
- return false;
-}
/*
* index from 0 - 15
*/
-static unsigned long mark_subptegroup_valid(unsigned long ptev, unsigned long index)
+bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
{
- unsigned long g_idx;
-
- if (!(ptev & H_PAGE_COMBO))
- return ptev;
- index = index >> 2;
- g_idx = 0x1 << index;
-
- return ptev | (g_idx << H_PAGE_F_GIX_SHIFT);
+ return !(HPTE_SOFT_INVALID(rpte.hidx >> (index << 2)));
}

int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
@@ -50,10 +29,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
int ssize, int subpg_prot)
{
real_pte_t rpte;
- unsigned long *hidxp;
unsigned long hpte_group;
unsigned int subpg_index;
- unsigned long rflags, pa, hidx;
+ unsigned long rflags, pa;
unsigned long old_pte, new_pte, subpg_pte;
unsigned long vpn, hash, slot;
unsigned long shift = mmu_psize_defs[MMU_PAGE_4K].shift;
@@ -116,8 +94,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
* On hash insert failure we use old pte value and we don't
* want slot information there if we have a insert failure.
*/
- old_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
- new_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
+ old_pte &= ~(H_PAGE_HASHPTE);
+ new_pte &= ~(H_PAGE_HASHPTE);
+ rpte.hidx = INIT_HIDX;
goto htab_insert_hpte;
}
/*
@@ -126,18 +105,12 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
if (__rpte_sub_valid(rpte, subpg_index)) {
int ret;

- hash = hpt_hash(vpn, shift, ssize);
- hidx = __rpte_to_hidx(rpte, subpg_index);
- if (hidx & _PTEIDX_SECONDARY)
- hash = ~hash;
- slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
- slot += hidx & _PTEIDX_GROUP_IX;
-
+ slot = get_hidx_slot(vpn, shift, ssize, rpte, subpg_index);
ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
MMU_PAGE_4K, MMU_PAGE_4K,
ssize, flags);
/*
- *if we failed because typically the HPTE wasn't really here
+ * if we failed because typically the HPTE wasn't really here
* we try an insertion.
*/
if (ret == -1)
@@ -177,8 +150,14 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
rflags, HPTE_V_SECONDARY,
MMU_PAGE_4K, MMU_PAGE_4K,
ssize);
- if (slot == -1) {
- if (mftb() & 0x1)
+ if (unlikely(HPTE_SOFT_INVALID(slot)))
+ mmu_hash_ops.hpte_invalidate(slot, vpn,
+ MMU_PAGE_4K, MMU_PAGE_4K,
+ ssize, (flags & HPTE_LOCAL_UPDATE));
+
+ if (unlikely(slot == -1 || HPTE_SOFT_INVALID(slot))) {
+ if (HPTE_SOFT_INVALID(slot) || (mftb() & 0x1))
+
hpte_group = ((hash & htab_hash_mask) *
HPTES_PER_GROUP) & ~0x7UL;
mmu_hash_ops.hpte_remove(hpte_group);
@@ -204,11 +183,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
* Since we have H_PAGE_BUSY set on ptep, we can be sure
* nobody is undating hidx.
*/
- hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
- rpte.hidx &= ~(0xfUL << (subpg_index << 2));
- *hidxp = rpte.hidx | (slot << (subpg_index << 2));
- new_pte = mark_subptegroup_valid(new_pte, subpg_index);
- new_pte |= H_PAGE_HASHPTE;
+ new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot);
+ new_pte |= H_PAGE_HASHPTE;
+
/*
* check __real_pte for details on matching smp_rmb()
*/
@@ -221,6 +198,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
unsigned long vsid, pte_t *ptep, unsigned long trap,
unsigned long flags, int ssize)
{
+ real_pte_t rpte;
unsigned long hpte_group;
unsigned long rflags, pa;
unsigned long old_pte, new_pte;
@@ -257,6 +235,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
} while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));

rflags = htab_convert_pte_flags(new_pte);
+ rpte = __real_pte(__pte(old_pte), ptep);

if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -267,12 +246,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
/*
* There MIGHT be an HPTE for this pte
*/
- hash = hpt_hash(vpn, shift, ssize);
- if (old_pte & H_PAGE_F_SECOND)
- hash = ~hash;
- slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
- slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
-
+ slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
MMU_PAGE_64K, ssize,
flags) == -1)
@@ -322,9 +296,8 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
MMU_PAGE_64K, MMU_PAGE_64K, old_pte);
return -1;
}
+ set_hidx_slot(ptep, rpte, 0, slot);
new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
- new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
- (H_PAGE_F_SECOND | H_PAGE_F_GIX);
}
*ptep = __pte(new_pte & ~H_PAGE_BUSY);
return 0;
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index f2095ce..b405657 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -975,8 +975,9 @@ void __init hash__early_init_devtree(void)

void __init hash__early_init_mmu(void)
{
+#ifndef CONFIG_PPC_64K_PAGES
/*
- * We have code in __hash_page_64K() and elsewhere, which assumes it can
+ * We have code in __hash_page_4K() and elsewhere, which assumes it can
* do the following:
* new_pte |= (slot << H_PAGE_F_GIX_SHIFT) & (H_PAGE_F_SECOND | H_PAGE_F_GIX);
*
@@ -987,6 +988,7 @@ void __init hash__early_init_mmu(void)
* with a BUILD_BUG_ON().
*/
BUILD_BUG_ON(H_PAGE_F_SECOND != (1ul << (H_PAGE_F_GIX_SHIFT + 3)));
+#endif /* CONFIG_PPC_64K_PAGES */

htab_init_page_sizes();

@@ -1589,6 +1591,40 @@ static inline void tm_flush_hash_page(int local)
}
#endif

+#ifdef CONFIG_PPC_64K_PAGES
+unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+ unsigned int subpg_index, unsigned long slot)
+{
+ unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
+
+ rpte.hidx &= ~(0xfUL << (subpg_index << 2));
+ *hidxp = rpte.hidx | (slot << (subpg_index << 2));
+ return 0x0UL;
+}
+#else /* CONFIG_PPC_64K_PAGES */
+unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+ unsigned int subpg_index, unsigned long slot)
+{
+ return (slot << H_PAGE_F_GIX_SHIFT) &
+ (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+}
+#endif /* CONFIG_PPC_64K_PAGES */
+
+unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
+ int ssize, real_pte_t rpte, unsigned int subpg_index)
+{
+ unsigned long hash, slot, hidx;
+
+ hash = hpt_hash(vpn, shift, ssize);
+ hidx = __rpte_to_hidx(rpte, subpg_index);
+ if (hidx & _PTEIDX_SECONDARY)
+ hash = ~hash;
+ slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+ slot += hidx & _PTEIDX_GROUP_IX;
+ return slot;
+}
+
+
/* WARNING: This is called from hash_low_64.S, if you change this prototype,
* do not forget to update the assembly call site !
*/
diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c
index a84bb44..25a50eb 100644
--- a/arch/powerpc/mm/hugetlbpage-hash64.c
+++ b/arch/powerpc/mm/hugetlbpage-hash64.c
@@ -14,7 +14,7 @@
#include <asm/cacheflush.h>
#include <asm/machdep.h>

-extern long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
+long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
unsigned long pa, unsigned long rlags,
unsigned long vflags, int psize, int ssize);

@@ -22,6 +22,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
pte_t *ptep, unsigned long trap, unsigned long flags,
int ssize, unsigned int shift, unsigned int mmu_psize)
{
+ real_pte_t rpte;
unsigned long vpn;
unsigned long old_pte, new_pte;
unsigned long rflags, pa, sz;
@@ -61,6 +62,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
} while(!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));

rflags = htab_convert_pte_flags(new_pte);
+ rpte = __real_pte(__pte(old_pte), ptep);

sz = ((1UL) << shift);
if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -71,14 +73,9 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
/* Check if pte already has an hpte (case 2) */
if (unlikely(old_pte & H_PAGE_HASHPTE)) {
/* There MIGHT be an HPTE for this pte */
- unsigned long hash, slot;
-
- hash = hpt_hash(vpn, shift, ssize);
- if (old_pte & H_PAGE_F_SECOND)
- hash = ~hash;
- slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
- slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
+ unsigned long slot;

+ slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, mmu_psize,
mmu_psize, ssize, flags) == -1)
old_pte &= ~_PAGE_HPTEFLAGS;
@@ -106,8 +103,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
return -1;
}

- new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
- (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+ new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
}

/*
--
1.8.3.1

2017-06-06 01:05:59

by Ram Pai

Subject: [RFC PATCH 6/7 v1] powerpc: Handle exceptions caused by violation of key protection.

Handle data and instruction access exceptions caused by memory
protection-key violations.

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/include/asm/mmu_context.h | 12 +++++
arch/powerpc/include/asm/pkeys.h | 9 ++++
arch/powerpc/include/asm/reg.h | 10 +++-
arch/powerpc/kernel/exceptions-64s.S | 2 +-
arch/powerpc/mm/fault.c | 21 +++++++-
arch/powerpc/mm/pkeys.c | 90 ++++++++++++++++++++++++++++++++++
6 files changed, 140 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index da7e943..71fffe0 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
{
}

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+bool arch_pte_access_permitted(pte_t pte, bool write);
+bool arch_vma_access_permitted(struct vm_area_struct *vma,
+ bool write, bool execute, bool foreign);
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+ /* by default, allow everything */
+ return true;
+}
static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
bool write, bool execute, bool foreign)
{
/* by default, allow everything */
return true;
}
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
#endif /* __KERNEL__ */
#endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 9b6820d..405e7db 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -14,6 +14,15 @@
VM_PKEY_BIT3 | \
VM_PKEY_BIT4)

+static inline u16 pte_flags_to_pkey(unsigned long pte_flags)
+{
+ return ((pte_flags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0) |
+ ((pte_flags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0) |
+ ((pte_flags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0) |
+ ((pte_flags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0) |
+ ((pte_flags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0);
+}
+
#define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) | \
((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) | \
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 7e50e47..c1ec880 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -272,16 +272,24 @@
#define SPRN_DAR 0x013 /* Data Address Register */
#define SPRN_DBCR 0x136 /* e300 Data Breakpoint Control Reg */
#define SPRN_DSISR 0x012 /* Data Storage Interrupt Status Register */
+#define DSISR_BIT32 0x80000000 /* not defined */
#define DSISR_NOHPTE 0x40000000 /* no translation found */
+#define DSISR_PAGEATTR_CONFLT 0x20000000 /* page attribute conflict */
+#define DSISR_BIT35 0x10000000 /* not defined */
#define DSISR_PROTFAULT 0x08000000 /* protection fault */
#define DSISR_BADACCESS 0x04000000 /* bad access to CI or G */
#define DSISR_ISSTORE 0x02000000 /* access was a store */
#define DSISR_DABRMATCH 0x00400000 /* hit data breakpoint */
-#define DSISR_NOSEGMENT 0x00200000 /* SLB miss */
#define DSISR_KEYFAULT 0x00200000 /* Key fault */
+#define DSISR_BIT43 0x00100000 /* Not defined */
#define DSISR_UNSUPP_MMU 0x00080000 /* Unsupported MMU config */
#define DSISR_SET_RC 0x00040000 /* Failed setting of R/C bits */
#define DSISR_PGDIRFAULT 0x00020000 /* Fault on page directory */
+#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
+ DSISR_PAGEATTR_CONFLT | \
+ DSISR_BADACCESS | \
+ DSISR_KEYFAULT | \
+ DSISR_BIT43)
#define SPRN_TBRL 0x10C /* Time Base Read Lower Register (user, R/O) */
#define SPRN_TBRU 0x10D /* Time Base Read Upper Register (user, R/O) */
#define SPRN_CIR 0x11B /* Chip Information Register (hyper, R/0) */
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index ae418b8..5226a9d 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1411,7 +1411,7 @@ USE_TEXT_SECTION()
.balign IFETCH_ALIGN_BYTES
do_hash_page:
#ifdef CONFIG_PPC_STD_MMU_64
- andis. r0,r4,0xa410 /* weird error? */
+ andis. r0,r4,DSISR_PAGE_FAULT_MASK@h /* weird error? */
bne- handle_page_fault /* if not, try to insert a HPTE */
andis. r0,r4,DSISR_DABRMATCH@h
bne- handle_dabr_fault
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 3a7d580..c31624f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -216,9 +216,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
* bits we are interested in. But there are some bits which
* indicate errors in DSISR but can validly be set in SRR1.
*/
- if (trap == 0x400)
+ if (trap == 0x400) {
error_code &= 0x48200000;
- else
+ flags |= FAULT_FLAG_INSTRUCTION;
+ } else
is_write = error_code & DSISR_ISSTORE;
#else
is_write = error_code & ESR_DST;
@@ -261,6 +262,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
}
#endif

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ if (error_code & DSISR_KEYFAULT) {
+ code = SEGV_PKUERR;
+ goto bad_area_nosemaphore;
+ }
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
/* We restore the interrupt state now */
if (!arch_irq_disabled_regs(regs))
local_irq_enable();
@@ -441,6 +449,15 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
#endif /* CONFIG_PPC_STD_MMU */

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+ flags & FAULT_FLAG_INSTRUCTION,
+ 0)) {
+ code = SEGV_PKUERR;
+ goto bad_area;
+ }
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 11a32b3..a9545f1 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -27,6 +27,37 @@ static inline bool pkey_allows_readwrite(int pkey)
return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
}

+static inline bool pkey_allows_read(int pkey)
+{
+ int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+
+ if (!(read_uamor() & (0x3ul << pkey_shift)))
+ return true;
+
+ return !(read_amr() & (AMR_AD_BIT << pkey_shift));
+}
+
+static inline bool pkey_allows_write(int pkey)
+{
+ int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+
+ if (!(read_uamor() & (0x3ul << pkey_shift)))
+ return true;
+
+ return !(read_amr() & (AMR_WD_BIT << pkey_shift));
+}
+
+static inline bool pkey_allows_execute(int pkey)
+{
+ int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+
+ if (!(read_uamor() & (0x3ul << pkey_shift)))
+ return true;
+
+ return !(read_iamr() & (IAMR_EX_BIT << pkey_shift));
+}
+
+
/*
* set the access right in AMR IAMR and UAMOR register
* for @pkey to that specified in @init_val.
@@ -175,3 +206,62 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot,
*/
return vma_pkey(vma);
}
+
+bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+ int pkey = pte_flags_to_pkey(pte_val(pte));
+
+ if (!pkey_allows_read(pkey))
+ return false;
+ if (write && !pkey_allows_write(pkey))
+ return false;
+ return true;
+}
+
+/*
+ * We only want to enforce protection keys on the current process
+ * because we effectively have no access to PKRU for other
+ * processes or any way to tell *which * PKRU in a threaded
+ * process we could use.
+ *
+ * So do not enforce things if the VMA is not from the current
+ * mm, or if we are in a kernel thread.
+ */
+static inline bool vma_is_foreign(struct vm_area_struct *vma)
+{
+ if (!current->mm)
+ return true;
+ /*
+ * if the VMA is from another process, then PKRU has no
+ * relevance and should not be enforced.
+ */
+ if (current->mm != vma->vm_mm)
+ return true;
+
+ return false;
+}
+
+bool arch_vma_access_permitted(struct vm_area_struct *vma,
+ bool write, bool execute, bool foreign)
+{
+ int pkey;
+ /* allow access if the VMA is not one from this process */
+ if (foreign || vma_is_foreign(vma))
+ return true;
+
+ pkey = vma_pkey(vma);
+
+ if (!pkey)
+ return true;
+
+ if (execute)
+ return pkey_allows_execute(pkey);
+
+ if (!pkey_allows_read(pkey))
+ return false;
+
+ if (write)
+ return pkey_allows_write(pkey);
+
+ return true;
+}
--
1.8.3.1

2017-06-06 01:05:57

by Ram Pai

Subject: [RFC PATCH 5/7 v1] powerpc: Program HPTE key protection bits.

Map the PTE protection key bits to the HPTE key protection bits
while creating HPTE entries.
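
For example, with the masks defined below: a PTE carrying
H_PAGE_PKEY_BIT0 and H_PAGE_PKEY_BIT4 yields HPTE_R_KEY_BIT0 (part of
HPTE_R_KEY_HI) and HPTE_R_KEY_BIT4 (part of HPTE_R_KEY_LO) in the
second doubleword of the HPTE; the key bits simply transfer 1:1 from
the PTE into the two HPTE key fields.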

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 +++++
arch/powerpc/include/asm/pkeys.h | 7 +++++++
arch/powerpc/mm/hash_utils_64.c | 5 +++++
3 files changed, 17 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index cfb8169..3d7872c 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -90,6 +90,8 @@
#define HPTE_R_PP0 ASM_CONST(0x8000000000000000)
#define HPTE_R_TS ASM_CONST(0x4000000000000000)
#define HPTE_R_KEY_HI ASM_CONST(0x3000000000000000)
+#define HPTE_R_KEY_BIT0 ASM_CONST(0x2000000000000000)
+#define HPTE_R_KEY_BIT1 ASM_CONST(0x1000000000000000)
#define HPTE_R_RPN_SHIFT 12
#define HPTE_R_RPN ASM_CONST(0x0ffffffffffff000)
#define HPTE_R_RPN_3_0 ASM_CONST(0x01fffffffffff000)
@@ -104,6 +106,9 @@
#define HPTE_R_C ASM_CONST(0x0000000000000080)
#define HPTE_R_R ASM_CONST(0x0000000000000100)
#define HPTE_R_KEY_LO ASM_CONST(0x0000000000000e00)
+#define HPTE_R_KEY_BIT2 ASM_CONST(0x0000000000000800)
+#define HPTE_R_KEY_BIT3 ASM_CONST(0x0000000000000400)
+#define HPTE_R_KEY_BIT4 ASM_CONST(0x0000000000000200)

#define HPTE_V_1TB_SEG ASM_CONST(0x4000000000000000)
#define HPTE_V_VRMA_MASK ASM_CONST(0x4001ffffff000000)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 0f3dca8..9b6820d 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -27,6 +27,13 @@
((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) | \
((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL))

+#define calc_pte_to_hpte_pkey_bits(pteflags) \
+ (((pteflags & H_PAGE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL) | \
+ ((pteflags & H_PAGE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) | \
+ ((pteflags & H_PAGE_PKEY_BIT2) ? HPTE_R_KEY_BIT2 : 0x0UL) | \
+ ((pteflags & H_PAGE_PKEY_BIT3) ? HPTE_R_KEY_BIT3 : 0x0UL) | \
+ ((pteflags & H_PAGE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL))
+
/*
* Bits are in BE format.
* NOTE: key 31, 1, 0 are not used.
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index b405657..2276392 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -35,6 +35,7 @@
#include <linux/memblock.h>
#include <linux/context_tracking.h>
#include <linux/libfdt.h>
+#include <linux/pkeys.h>

#include <asm/debugfs.h>
#include <asm/processor.h>
@@ -230,6 +231,10 @@ unsigned long htab_convert_pte_flags(unsigned long pteflags)
*/
rflags |= HPTE_R_M;

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ rflags |= calc_pte_to_hpte_pkey_bits(pteflags);
+#endif
+
return rflags;
}

--
1.8.3.1

2017-06-06 01:06:05

by Ram Pai

Subject: [RFC PATCH 4/7 v1] powerpc: Implementation for sys_mprotect_pkey() system call.

This system call associates the pkey with the PTEs of all
pages corresponding to the given address range.
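
As a worked example of the translation (using the macros added
below): for key 6 (0b00110), pkey_to_vmflag_bits() sets VM_PKEY_BIT1
and VM_PKEY_BIT2 in the vma flags, and vmflag_to_page_pkey_bits()
then maps those to H_PAGE_PKEY_BIT3 and H_PAGE_PKEY_BIT2 in the PTE.
In general, bit i of the key lands in VM_PKEY_BITi and then in
H_PAGE_PKEY_BIT(4-i).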

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/include/asm/book3s/64/pgtable.h | 22 ++++++-
arch/powerpc/include/asm/mman.h | 29 +++++----
arch/powerpc/include/asm/pkeys.h | 21 ++++++-
arch/powerpc/include/asm/systbl.h | 1 +
arch/powerpc/include/asm/unistd.h | 4 +-
arch/powerpc/include/uapi/asm/unistd.h | 1 +
arch/powerpc/mm/pkeys.c | 93 +++++++++++++++++++++++++++-
include/linux/mm.h | 1 +
8 files changed, 154 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 87e9a89..bc845cd 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -37,6 +37,7 @@
#define _RPAGE_RSV2 0x0800000000000000UL
#define _RPAGE_RSV3 0x0400000000000000UL
#define _RPAGE_RSV4 0x0200000000000000UL
+#define _RPAGE_RSV5 0x00040UL

#define _PAGE_PTE 0x4000000000000000UL /* distinguishes PTEs from pointers */
#define _PAGE_PRESENT 0x8000000000000000UL /* pte contains a translation */
@@ -56,6 +57,20 @@
/* Max physical address bit as per radix table */
#define _RPAGE_PA_MAX 57

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+#define H_PAGE_PKEY_BIT0 _RPAGE_RSV1
+#define H_PAGE_PKEY_BIT1 _RPAGE_RSV2
+#define H_PAGE_PKEY_BIT2 _RPAGE_RSV3
+#define H_PAGE_PKEY_BIT3 _RPAGE_RSV4
+#define H_PAGE_PKEY_BIT4 _RPAGE_RSV5
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+#define H_PAGE_PKEY_BIT0 0
+#define H_PAGE_PKEY_BIT1 0
+#define H_PAGE_PKEY_BIT2 0
+#define H_PAGE_PKEY_BIT3 0
+#define H_PAGE_PKEY_BIT4 0
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
/*
* Max physical address bit we will use for now.
*
@@ -122,7 +137,12 @@
#define PAGE_PROT_BITS (_PAGE_SAO | _PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT | \
H_PAGE_4K_PFN | _PAGE_PRIVILEGED | _PAGE_ACCESSED | \
_PAGE_READ | _PAGE_WRITE | _PAGE_DIRTY | _PAGE_EXEC | \
- _PAGE_SOFT_DIRTY)
+ _PAGE_SOFT_DIRTY | \
+ H_PAGE_PKEY_BIT0 | \
+ H_PAGE_PKEY_BIT1 | \
+ H_PAGE_PKEY_BIT2 | \
+ H_PAGE_PKEY_BIT3 | \
+ H_PAGE_PKEY_BIT4)
/*
* We define 2 sets of base prot bits, one for basic pages (ie,
* cacheable kernel and user pages) and one for non cacheable
diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 30922f6..14cc1aa 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -13,24 +13,31 @@

#include <asm/cputable.h>
#include <linux/mm.h>
+#include <linux/pkeys.h>
#include <asm/cpu_has_feature.h>

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+
/*
* This file is included by linux/mman.h, so we can't use cacl_vm_prot_bits()
* here. How important is the optimization?
*/
-static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
- unsigned long pkey)
-{
- return (prot & PROT_SAO) ? VM_SAO : 0;
-}
-#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
+#define arch_calc_vm_prot_bits(prot, key) ( \
+ ((prot) & PROT_SAO ? VM_SAO : 0) | \
+ pkey_to_vmflag_bits(key))
+#define arch_vm_get_page_prot(vm_flags) __pgprot( \
+ ((vm_flags) & VM_SAO ? _PAGE_SAO : 0) | \
+ vmflag_to_page_pkey_bits(vm_flags))
+
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+#define arch_calc_vm_prot_bits(prot, key) ( \
+ ((prot) & PROT_SAO ? VM_SAO : 0))
+#define arch_vm_get_page_prot(vm_flags) __pgprot( \
+ ((vm_flags) & VM_SAO ? _PAGE_SAO : 0))
+
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */

-static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
-{
- return (vm_flags & VM_SAO) ? __pgprot(_PAGE_SAO) : __pgprot(0);
-}
-#define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)

static inline bool arch_validate_prot(unsigned long prot)
{
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 7bc8746..0f3dca8 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -14,6 +14,19 @@
VM_PKEY_BIT3 | \
VM_PKEY_BIT4)

+#define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
+ ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) | \
+ ((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) | \
+ ((key & 0x8UL) ? VM_PKEY_BIT3 : 0x0UL) | \
+ ((key & 0x10UL) ? VM_PKEY_BIT4 : 0x0UL))
+
+#define vmflag_to_page_pkey_bits(vm_flags) \
+ (((vm_flags & VM_PKEY_BIT0) ? H_PAGE_PKEY_BIT4 : 0x0UL)| \
+ ((vm_flags & VM_PKEY_BIT1) ? H_PAGE_PKEY_BIT3 : 0x0UL) | \
+ ((vm_flags & VM_PKEY_BIT2) ? H_PAGE_PKEY_BIT2 : 0x0UL) | \
+ ((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) | \
+ ((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL))
+
/*
* Bits are in BE format.
* NOTE: key 31, 1, 0 are not used.
@@ -42,6 +55,12 @@
#define mm_set_pkey_is_reserved(mm, pkey) (PKEY_INITIAL_ALLOCAION & \
pkeybit_mask(pkey))

+
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+ return (vma->vm_flags & ARCH_VM_PKEY_FLAGS) >> VM_PKEY_SHIFT;
+}
+
static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
{
/* a reserved key is never considered as 'explicitly allocated' */
@@ -114,7 +133,7 @@ static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
return __arch_set_user_pkey_access(tsk, pkey, init_val);
}

-static inline pkey_mm_init(struct mm_struct *mm)
+static inline void pkey_mm_init(struct mm_struct *mm)
{
mm_pkey_allocation_map(mm) = PKEY_INITIAL_ALLOCAION;
/* -1 means unallocated or invalid */
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 22dd776..b33b551 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -390,3 +390,4 @@
SYSCALL(statx)
SYSCALL(pkey_alloc)
SYSCALL(pkey_free)
+SYSCALL(pkey_mprotect)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index e0273bc..daf1ba9 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,12 +12,10 @@
#include <uapi/asm/unistd.h>


-#define NR_syscalls 386
+#define NR_syscalls 387

#define __NR__exit __NR_exit

-#define __IGNORE_pkey_mprotect
-
#ifndef __ASSEMBLY__

#include <linux/types.h>
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index 7993a07..71ae45e 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -396,5 +396,6 @@
#define __NR_statx 383
#define __NR_pkey_alloc 384
#define __NR_pkey_free 385
+#define __NR_pkey_mprotect 386

#endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index b97366e..11a32b3 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -15,6 +15,17 @@
#include <linux/pkeys.h> /* PKEY_* */
#include <uapi/asm-generic/mman-common.h>

+#define pkeyshift(pkey) ((arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY)
+
+static inline bool pkey_allows_readwrite(int pkey)
+{
+ int pkey_shift = pkeyshift(pkey);
+
+ if (!(read_uamor() & (0x3UL << pkey_shift)))
+ return true;
+
+ return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
+}

/*
* set the access right in AMR IAMR and UAMOR register
@@ -68,7 +79,60 @@ int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,

int __execute_only_pkey(struct mm_struct *mm)
{
- return -1;
+ bool need_to_set_mm_pkey = false;
+ int execute_only_pkey = mm->context.execute_only_pkey;
+ int ret;
+
+ /* Do we need to assign a pkey for mm's execute-only maps? */
+ if (execute_only_pkey == -1) {
+ /* Go allocate one to use, which might fail */
+ execute_only_pkey = mm_pkey_alloc(mm);
+ if (execute_only_pkey < 0)
+ return -1;
+ need_to_set_mm_pkey = true;
+ }
+
+ /*
+ * We do not want to go through the relatively costly
+ * dance to set AMR if we do not need to. Check it
+ * first and assume that if the execute-only pkey is
+ * readwrite-disabled than we do not have to set it
+ * ourselves.
+ */
+ if (!need_to_set_mm_pkey &&
+ !pkey_allows_readwrite(execute_only_pkey))
+ return execute_only_pkey;
+
+ /*
+ * Set up AMR so that it denies access for everything
+ * other than execution.
+ */
+ ret = __arch_set_user_pkey_access(current, execute_only_pkey,
+ (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
+ /*
+ * If the AMR-set operation failed somehow, just return
+ * 0 and effectively disable execute-only support.
+ */
+ if (ret) {
+ mm_set_pkey_free(mm, execute_only_pkey);
+ return -1;
+ }
+
+ /* We got one, store it and use it from here on out */
+ if (need_to_set_mm_pkey)
+ mm->context.execute_only_pkey = execute_only_pkey;
+ return execute_only_pkey;
+}
+
+static inline bool vma_is_pkey_exec_only(struct vm_area_struct *vma)
+{
+ /* Do this check first since the vm_flags should be hot */
+ if ((vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) != VM_EXEC)
+ return false;
+ if (vma_pkey(vma) != vma->vm_mm->context.execute_only_pkey)
+ return false;
+
+ return true;
}

/*
@@ -84,5 +148,30 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot,
if (pkey != -1)
return pkey;

- return 0;
+ /*
+ * Look for a protection-key-drive execute-only mapping
+ * which is now being given permissions that are not
+ * execute-only. Move it back to the default pkey.
+ */
+ if (vma_is_pkey_exec_only(vma) &&
+ (prot & (PROT_READ|PROT_WRITE))) {
+ return 0;
+ }
+ /*
+ * The mapping is execute-only. Go try to get the
+ * execute-only protection key. If we fail to do that,
+ * fall through as if we do not have execute-only
+ * support.
+ */
+ if (prot == PROT_EXEC) {
+ pkey = execute_only_pkey(vma->vm_mm);
+ if (pkey > 0)
+ return pkey;
+ }
+ /*
+ * This is a vanilla, non-pkey mprotect (or we failed to
+ * setup execute-only), inherit the pkey from the VMA we
+ * are working on.
+ */
+ return vma_pkey(vma);
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 34ddac7..5399031 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -227,6 +227,7 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
#define VM_PKEY_BIT3 VM_HIGH_ARCH_3
#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
#elif defined(CONFIG_PPC)
+#define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 5-bit value */
#define VM_PKEY_BIT1 VM_HIGH_ARCH_1
#define VM_PKEY_BIT2 VM_HIGH_ARCH_2
--
1.8.3.1

2017-06-06 01:06:32

by Ram Pai

Subject: [RFC PATCH 7/7 v1] powerpc: Deliver SEGV signal on protection key violation.

The value of the AMR register at the time of the exception
is made available in gp_regs[PT_AMR] of the signal context.

This field can be used to reprogram the permission bits of
any valid protection key.

Similarly, the value of the key whose protection was violated
is made available in the si_pkey field of the siginfo structure.
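
A rough sketch of how a handler might consume this (illustrative
only: SEGV_PKUERR, si_pkey, PT_AMR and the AMR bit layout come from
this series; the ucontext/gp_regs access path and the si_pkey
accessor in the libc headers are assumptions):

#include <signal.h>
#include <ucontext.h>

#define PT_AMR 45       /* pt_regs slot added by this patch */

static void segv_handler(int sig, siginfo_t *si, void *uc_)
{
        ucontext_t *uc = uc_;
        unsigned long *amr = (unsigned long *)&uc->uc_mcontext.gp_regs[PT_AMR];

        if (si->si_code != SEGV_PKUERR)
                return;                 /* not a key fault */

        /*
         * si_pkey is a (possibly racy) hint about which key faulted.
         * Clearing its AD/WD bits in the saved AMR re-enables access
         * once the handler returns through sigreturn.
         */
        *amr &= ~(0x3UL << ((31 - si->si_pkey) * 2));
}

static int install_handler(void)
{
        struct sigaction sa = { .sa_sigaction = segv_handler,
                                .sa_flags = SA_SIGINFO };
        return sigaction(SIGSEGV, &sa, NULL);
}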

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/include/uapi/asm/ptrace.h | 5 +++-
arch/powerpc/kernel/asm-offsets.c | 1 +
arch/powerpc/kernel/exceptions-64s.S | 8 ++++++
arch/powerpc/kernel/signal_32.c | 18 ++++++++++---
arch/powerpc/kernel/signal_64.c | 11 ++++++++
arch/powerpc/kernel/traps.c | 49 ++++++++++++++++++++++++++++++++++
6 files changed, 88 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/uapi/asm/ptrace.h b/arch/powerpc/include/uapi/asm/ptrace.h
index 8036b38..109d0c2 100644
--- a/arch/powerpc/include/uapi/asm/ptrace.h
+++ b/arch/powerpc/include/uapi/asm/ptrace.h
@@ -49,6 +49,8 @@ struct pt_regs {
unsigned long dar; /* Fault registers */
unsigned long dsisr; /* on 4xx/Book-E used for ESR */
unsigned long result; /* Result of a system call */
+ unsigned long dscr; /* contents of the DSCR register */
+ unsigned long amr; /* contents of AMR register */
};

#endif /* __ASSEMBLY__ */
@@ -109,7 +111,8 @@ struct pt_regs {
#define PT_DSISR 42
#define PT_RESULT 43
#define PT_DSCR 44
-#define PT_REGS_COUNT 44
+#define PT_AMR 45
+#define PT_REGS_COUNT 45

#define PT_FPR0 48 /* each FP reg occupies 2 slots in this space */

diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 709e234..3818dc5 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -301,6 +301,7 @@ int main(void)
STACK_PT_REGS_OFFSET(ORIG_GPR3, orig_gpr3);
STACK_PT_REGS_OFFSET(RESULT, result);
STACK_PT_REGS_OFFSET(_TRAP, trap);
+ STACK_PT_REGS_OFFSET(_AMR, amr);
#ifndef CONFIG_PPC64
/*
* The PowerPC 400-class & Book-E processors have neither the DAR
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 5226a9d..9872dd3 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -493,6 +493,10 @@ EXC_COMMON_BEGIN(data_access_common)
ld r12,_MSR(r1)
ld r3,PACA_EXGEN+EX_DAR(r13)
lwz r4,PACA_EXGEN+EX_DSISR(r13)
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ mfspr r5,SPRN_AMR
+ std r5,_AMR(r1)
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
li r5,0x300
std r3,_DAR(r1)
std r4,_DSISR(r1)
@@ -561,6 +565,10 @@ EXC_COMMON_BEGIN(instruction_access_common)
ld r12,_MSR(r1)
ld r3,_NIP(r1)
andis. r4,r12,0x5820
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ mfspr r5,SPRN_AMR
+ std r5,_AMR(r1)
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
li r5,0x400
std r3,_DAR(r1)
std r4,_DSISR(r1)
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 97bb138..f317638 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -141,9 +141,11 @@ static inline int save_general_regs(struct pt_regs *regs,

WARN_ON(!FULL_REGS(regs));

- for (i = 0; i <= PT_RESULT; i ++) {
+ for (i = 0; i <= PT_REGS_COUNT; i++) {
if (i == 14 && !FULL_REGS(regs))
i = 32;
+ if (i == PT_DSCR)
+ continue;
if (__put_user((unsigned int)gregs[i], &frame->mc_gregs[i]))
return -EFAULT;
}
@@ -156,8 +158,8 @@ static inline int restore_general_regs(struct pt_regs *regs,
elf_greg_t64 *gregs = (elf_greg_t64 *)regs;
int i;

- for (i = 0; i <= PT_RESULT; i++) {
- if ((i == PT_MSR) || (i == PT_SOFTE))
+ for (i = 0; i <= PT_REGS_COUNT; i++) {
+ if ((i == PT_MSR) || (i == PT_SOFTE) || (i == PT_DSCR))
continue;
if (__get_user(gregs[i], &sr->mc_gregs[i]))
return -EFAULT;
@@ -661,6 +663,9 @@ static long restore_user_regs(struct pt_regs *regs,
long err;
unsigned int save_r2 = 0;
unsigned long msr;
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ unsigned long amr;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
#ifdef CONFIG_VSX
int i;
#endif
@@ -750,6 +755,13 @@ static long restore_user_regs(struct pt_regs *regs,
return 1;
#endif /* CONFIG_SPE */

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ amr = regs->amr;
+ err |= __get_user(regs->amr, &sr->mc_gregs[PT_AMR]);
+ if (!err && amr != regs->amr)
+ mtspr(SPRN_AMR, regs->amr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
return 0;
}

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index c83c115..d86d232 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -327,6 +327,9 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
unsigned long save_r13 = 0;
unsigned long msr;
struct pt_regs *regs = tsk->thread.regs;
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ unsigned long amr;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
#ifdef CONFIG_VSX
int i;
#endif
@@ -406,6 +409,14 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
tsk->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = 0;
}
#endif
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ amr = regs->amr;
+ err |= __get_user(regs->amr, &sc->gp_regs[PT_AMR]);
+ if (!err && amr != regs->amr)
+ mtspr(SPRN_AMR, regs->amr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
return err;
}

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index d4e545d..cc4bde8b 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -20,6 +20,7 @@
#include <linux/sched/debug.h>
#include <linux/kernel.h>
#include <linux/mm.h>
+#include <linux/pkeys.h>
#include <linux/stddef.h>
#include <linux/unistd.h>
#include <linux/ptrace.h>
@@ -247,6 +248,49 @@ void user_single_step_siginfo(struct task_struct *tsk,
info->si_addr = (void __user *)regs->nip;
}

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+static void fill_sig_info_pkey(int si_code, siginfo_t *info, unsigned long addr)
+{
+ struct vm_area_struct *vma;
+
+ /* Fault not from Protection Keys: nothing to do */
+ if (si_code != SEGV_PKUERR)
+ return;
+
+ down_read(&current->mm->mmap_sem);
+ /*
+ * we could be racing with pkey_mprotect().
+ * If pkey_mprotect() wins the key value could
+ * get modified...xxx
+ */
+ vma = find_vma(current->mm, addr);
+ up_read(&current->mm->mmap_sem);
+
+ /*
+ * force_sig_info_fault() is called from a number of
+ * contexts, some of which have a VMA and some of which
+ * do not. The Pkey-fault handing happens after we have a
+ * valid VMA, so we should never reach this without a
+ * valid VMA.
+ */
+ if (!vma) {
+ WARN_ONCE(1, "Pkey fault with no VMA passed in");
+ info->si_pkey = 0;
+ return;
+ }
+
+ /*
+ * We could report the incorrect key because of the reason
+ * explained above.
+ *
+ * si_pkey should be thought off as a strong hint, but not
+ * an absolutely guarantee because of the race explained
+ * above.
+ */
+ info->si_pkey = vma_pkey(vma);
+}
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
{
siginfo_t info;
@@ -274,6 +318,11 @@ void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
info.si_signo = signr;
info.si_code = code;
info.si_addr = (void __user *) addr;
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ fill_sig_info_pkey(code, &info, addr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
force_sig_info(signr, &info, current);
}

--
1.8.3.1

2017-06-06 01:06:54

by Ram Pai

Subject: [RFC PATCH 3/7 v1] powerpc: store and restore the key state across context switches.

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/include/asm/processor.h | 5 +++++
arch/powerpc/kernel/process.c | 18 ++++++++++++++++++
2 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index a2123f2..1f714df 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -310,6 +310,11 @@ struct thread_struct {
struct thread_vr_state ckvr_state; /* Checkpointed VR state */
unsigned long ckvrsave; /* Checkpointed VRSAVE */
#endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ unsigned long amr;
+ unsigned long iamr;
+ unsigned long uamor;
+#endif
#ifdef CONFIG_KVM_BOOK3S_32_HANDLER
void* kvm_shadow_vcpu; /* KVM internal data */
#endif /* CONFIG_KVM_BOOK3S_32_HANDLER */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index baae104..37d001a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1096,6 +1096,11 @@ static inline void save_sprs(struct thread_struct *t)
t->tar = mfspr(SPRN_TAR);
}
#endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ t->amr = mfspr(SPRN_AMR);
+ t->iamr = mfspr(SPRN_IAMR);
+ t->uamor = mfspr(SPRN_UAMOR);
+#endif
}

static inline void restore_sprs(struct thread_struct *old_thread,
@@ -1131,6 +1136,14 @@ static inline void restore_sprs(struct thread_struct *old_thread,
mtspr(SPRN_TAR, new_thread->tar);
}
#endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ if (old_thread->amr != new_thread->amr)
+ mtspr(SPRN_AMR, new_thread->amr);
+ if (old_thread->iamr != new_thread->iamr)
+ mtspr(SPRN_IAMR, new_thread->iamr);
+ if (old_thread->uamor != new_thread->uamor)
+ mtspr(SPRN_UAMOR, new_thread->uamor);
+#endif
}

struct task_struct *__switch_to(struct task_struct *prev,
@@ -1686,6 +1699,11 @@ void start_thread(struct pt_regs *regs, unsigned long start, unsigned long sp)
current->thread.tm_texasr = 0;
current->thread.tm_tfiar = 0;
#endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ current->thread.amr = 0x0ul;
+ current->thread.iamr = 0x0ul;
+ current->thread.uamor = 0x0ul;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
}
EXPORT_SYMBOL(start_thread);

--
1.8.3.1

2017-06-06 01:07:11

by Ram Pai

Subject: [RFC PATCH 2/7 v1] powerpc: Implement sys_pkey_alloc and sys_pkey_free system call.

sys_pkey_alloc() allocates and returns an available pkey.
sys_pkey_free() frees up the key.

A total of 32 keys are supported on powerpc. However, keys 0, 1
and 31 are reserved, so effectively we have 29 keys.
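
To make the reservation concrete (with the bitmap macros added
below): pkeybit_mask(pkey) is 1 << (31 - pkey), so key 0 corresponds
to 0x80000000, key 1 to 0x40000000 and key 31 to 0x1. Marking those
three as pre-allocated gives PKEY_INITIAL_ALLOCAION = 0xc0000001,
leaving 29 keys for mm_pkey_alloc() to hand out.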

Signed-off-by: Ram Pai <[email protected]>
---
arch/powerpc/Kconfig | 15 ++++
arch/powerpc/include/asm/book3s/64/mmu.h | 10 +++
arch/powerpc/include/asm/book3s/64/pgtable.h | 62 ++++++++++++++
arch/powerpc/include/asm/pkeys.h | 124 +++++++++++++++++++++++++++
arch/powerpc/include/asm/systbl.h | 2 +
arch/powerpc/include/asm/unistd.h | 4 +-
arch/powerpc/include/uapi/asm/unistd.h | 2 +
arch/powerpc/mm/Makefile | 1 +
arch/powerpc/mm/mmu_context_book3s64.c | 5 ++
arch/powerpc/mm/pkeys.c | 88 +++++++++++++++++++
include/linux/mm.h | 31 ++++---
include/uapi/asm-generic/mman-common.h | 2 +-
12 files changed, 331 insertions(+), 15 deletions(-)
create mode 100644 arch/powerpc/include/asm/pkeys.h
create mode 100644 arch/powerpc/mm/pkeys.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f7c8f99..b6960617 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -871,6 +871,21 @@ config SECCOMP

If unsure, say Y. Only embedded should say N here.

+config PPC64_MEMORY_PROTECTION_KEYS
+ prompt "PowerPC Memory Protection Keys"
+ def_bool y
+ # Note: only available in 64-bit mode
+ depends on PPC64 && PPC_64K_PAGES
+ select ARCH_USES_HIGH_VMA_FLAGS
+ select ARCH_HAS_PKEYS
+ ---help---
+ Memory Protection Keys provides a mechanism for enforcing
+ page-based protections, but without requiring modification of the
+ page tables when an application changes protection domains.
+
+ For details, see Documentation/powerpc/protection-keys.txt
+
+ If unsure, say y.
endmenu

config ISA_DMA_API
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 77529a3..0c0a2a8 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -108,6 +108,16 @@ struct patb_entry {
#ifdef CONFIG_SPAPR_TCE_IOMMU
struct list_head iommu_group_mem_list;
#endif
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ /*
+ * Each bit represents one protection key.
+ * bit set -> key allocated
+ * bit unset -> key available for allocation
+ */
+ u32 pkey_allocation_map;
+ s16 execute_only_pkey; /* key holding execute-only protection */
+#endif
} mm_context_t;

/*
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 85bc987..87e9a89 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -428,6 +428,68 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
pte_update(mm, addr, ptep, 0, _PAGE_PRIVILEGED, 1);
}

+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+
+#include <asm/reg.h>
+static inline u64 read_amr(void)
+{
+ return mfspr(SPRN_AMR);
+}
+static inline void write_amr(u64 value)
+{
+ mtspr(SPRN_AMR, value);
+}
+static inline u64 read_iamr(void)
+{
+ return mfspr(SPRN_IAMR);
+}
+static inline void write_iamr(u64 value)
+{
+ mtspr(SPRN_IAMR, value);
+}
+static inline u64 read_uamor(void)
+{
+ return mfspr(SPRN_UAMOR);
+}
+static inline void write_uamor(u64 value)
+{
+ mtspr(SPRN_UAMOR, value);
+}
+
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+static inline u64 read_amr(void)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+ return -1;
+}
+static inline void write_amr(u64 value)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+static inline u64 read_uamor(void)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+ return -1;
+}
+static inline void write_uamor(u64 value)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+static inline u64 read_iamr(void)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+ return -1;
+}
+static inline void write_iamr(u64 value)
+{
+ WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
new file mode 100644
index 0000000..7bc8746
--- /dev/null
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -0,0 +1,124 @@
+#ifndef _ASM_PPC64_PKEYS_H
+#define _ASM_PPC64_PKEYS_H
+
+
+#define arch_max_pkey() 32
+
+#define AMR_AD_BIT 0x1UL
+#define AMR_WD_BIT 0x2UL
+#define IAMR_EX_BIT 0x1UL
+#define AMR_BITS_PER_PKEY 2
+#define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | \
+ VM_PKEY_BIT1 | \
+ VM_PKEY_BIT2 | \
+ VM_PKEY_BIT3 | \
+ VM_PKEY_BIT4)
+
+/*
+ * Bits are in BE format.
+ * NOTE: keys 0, 1 and 31 are not available for allocation.
+ * key 0 is used by default; it gives read/write/execute permission.
+ * key 31 is reserved by the hypervisor.
+ * key 1 is recommended not to be used.
+ * PowerISA(3.0) page 1015, programming note.
+ */
+#define PKEY_INITIAL_ALLOCATION 0xc0000001
+
+#define pkeybit_mask(pkey) (0x1 << (arch_max_pkey() - pkey - 1))
+
+#define mm_pkey_allocation_map(mm) (mm->context.pkey_allocation_map)
+
+#define mm_set_pkey_allocated(mm, pkey) { \
+ mm_pkey_allocation_map(mm) |= pkeybit_mask(pkey); \
+}
+
+#define mm_set_pkey_free(mm, pkey) { \
+ mm_pkey_allocation_map(mm) &= ~pkeybit_mask(pkey); \
+}
+
+#define mm_set_pkey_is_allocated(mm, pkey) \
+ (mm_pkey_allocation_map(mm) & pkeybit_mask(pkey))
+
+#define mm_set_pkey_is_reserved(mm, pkey) (PKEY_INITIAL_ALLOCATION & \
+ pkeybit_mask(pkey))
+
+static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
+{
+ /* a reserved key is never considered as 'explicitly allocated' */
+ return (!mm_set_pkey_is_reserved(mm, pkey) &&
+ mm_set_pkey_is_allocated(mm, pkey));
+}
+
+/*
+ * Returns a positive, 5-bit key on success, or -1 on failure.
+ */
+static inline int mm_pkey_alloc(struct mm_struct *mm)
+{
+ /*
+ * Note: this is the one and only place we make sure
+ * that the pkey is valid as far as the hardware is
+ * concerned. The rest of the kernel trusts that
+ * only good, valid pkeys come out of here.
+ */
+ u32 all_pkeys_mask = (u32)(~(0x0));
+ int ret;
+
+ /*
+ * Are we out of pkeys? We must handle this specially
+ * because ffz() behavior is undefined if there are no
+ * zeros.
+ */
+ if (mm_pkey_allocation_map(mm) == all_pkeys_mask)
+ return -1;
+
+ ret = arch_max_pkey() -
+ ffz((u32)mm_pkey_allocation_map(mm))
+ - 1;
+ mm_set_pkey_allocated(mm, ret);
+ return ret;
+}
+
+static inline int mm_pkey_free(struct mm_struct *mm, int pkey)
+{
+ if (!mm_pkey_is_allocated(mm, pkey))
+ return -EINVAL;
+
+ mm_set_pkey_free(mm, pkey);
+
+ return 0;
+}
+
+/*
+ * Try to dedicate one of the protection keys to be used as an
+ * execute-only protection key.
+ */
+extern int __execute_only_pkey(struct mm_struct *mm);
+static inline int execute_only_pkey(struct mm_struct *mm)
+{
+ return __execute_only_pkey(mm);
+}
+
+extern int __arch_override_mprotect_pkey(struct vm_area_struct *vma,
+ int prot, int pkey);
+static inline int arch_override_mprotect_pkey(struct vm_area_struct *vma,
+ int prot, int pkey)
+{
+ return __arch_override_mprotect_pkey(vma, prot, pkey);
+}
+
+extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+ unsigned long init_val);
+static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+ unsigned long init_val)
+{
+ return __arch_set_user_pkey_access(tsk, pkey, init_val);
+}
+
+static inline void pkey_mm_init(struct mm_struct *mm)
+{
+ mm_pkey_allocation_map(mm) = PKEY_INITIAL_ALLOCATION;
+ /* -1 means unallocated or invalid */
+ mm->context.execute_only_pkey = -1;
+}
+
+#endif /*_ASM_PPC64_PKEYS_H */
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 1c94708..22dd776 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -388,3 +388,5 @@
COMPAT_SYS_SPU(pwritev2)
SYSCALL(kexec_file_load)
SYSCALL(statx)
+SYSCALL(pkey_alloc)
+SYSCALL(pkey_free)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 9ba11db..e0273bc 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,13 +12,11 @@
#include <uapi/asm/unistd.h>


-#define NR_syscalls 384
+#define NR_syscalls 386

#define __NR__exit __NR_exit

#define __IGNORE_pkey_mprotect
-#define __IGNORE_pkey_alloc
-#define __IGNORE_pkey_free

#ifndef __ASSEMBLY__

diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index b85f142..7993a07 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -394,5 +394,7 @@
#define __NR_pwritev2 381
#define __NR_kexec_file_load 382
#define __NR_statx 383
+#define __NR_pkey_alloc 384
+#define __NR_pkey_free 385

#endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 7414034..8cc2ff1 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -45,3 +45,4 @@ obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o
obj-$(CONFIG_SPAPR_TCE_IOMMU) += mmu_context_iommu.o
obj-$(CONFIG_PPC_PTDUMP) += dump_linuxpagetables.o
obj-$(CONFIG_PPC_HTDUMP) += dump_hashpagetable.o
+obj-$(CONFIG_PPC64_MEMORY_PROTECTION_KEYS) += pkeys.o
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index c6dca2a..2da9931 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -16,6 +16,7 @@
#include <linux/string.h>
#include <linux/types.h>
#include <linux/mm.h>
+#include <linux/pkeys.h>
#include <linux/spinlock.h>
#include <linux/idr.h>
#include <linux/export.h>
@@ -120,6 +121,10 @@ static int hash__init_new_context(struct mm_struct *mm)

subpage_prot_init_new_context(mm);

+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+ pkey_mm_init(mm);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
return index;
}

diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
new file mode 100644
index 0000000..b97366e
--- /dev/null
+++ b/arch/powerpc/mm/pkeys.c
@@ -0,0 +1,88 @@
+/*
+ * PowerPC Memory Protection Keys management
+ * Copyright (c) 2015, Intel Corporation.
+ * Copyright (c) 2017, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+#include <linux/pkeys.h> /* PKEY_* */
+#include <uapi/asm-generic/mman-common.h>
+
+
+/*
+ * Set the access rights in the AMR, IAMR and UAMOR registers
+ * for @pkey to those specified in @init_val.
+ */
+int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+ unsigned long init_val)
+{
+ u64 old_amr, old_uamor, old_iamr;
+ int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+ u64 new_amr_bits = 0x0ul;
+ u64 new_iamr_bits = 0x0ul;
+ u64 new_uamor_bits = 0x3ul;
+
+ /* Set the bits we need in AMR: */
+ if (init_val & PKEY_DISABLE_ACCESS)
+ new_amr_bits |= AMR_AD_BIT;
+ if (init_val & PKEY_DISABLE_WRITE)
+ new_amr_bits |= AMR_WD_BIT;
+
+ /*
+ * Execute permission is left enabled unless
+ * PKEY_DISABLE_EXECUTE is specified in
+ * init_val.
+ */
+ if ((init_val & PKEY_DISABLE_EXECUTE))
+ new_iamr_bits |= IAMR_EX_BIT;
+
+ /* Shift the bits in to the correct place in AMR for pkey: */
+ new_amr_bits <<= pkey_shift;
+ new_iamr_bits <<= pkey_shift;
+ new_uamor_bits <<= pkey_shift;
+
+ /* Get old AMR and mask off any old bits in place: */
+ old_amr = read_amr();
+ old_amr &= ~((u64)(AMR_AD_BIT|AMR_WD_BIT) << pkey_shift);
+
+ old_iamr = read_iamr();
+ old_iamr &= ~(0x3ul << pkey_shift);
+
+ old_uamor = read_uamor();
+ old_uamor &= ~(0x3ul << pkey_shift);
+
+ /* Write old part along with new part: */
+ write_amr(old_amr | new_amr_bits);
+ write_iamr(old_iamr | new_iamr_bits);
+ write_uamor(old_uamor | new_uamor_bits);
+
+ return 0;
+}
+
+int __execute_only_pkey(struct mm_struct *mm)
+{
+ return -1;
+}
+
+/*
+ * This should only be called for *plain* mprotect calls.
+ */
+int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot,
+ int pkey)
+{
+ /*
+ * Is this an mprotect_pkey() call? If so, never
+ * override the value that came from the user.
+ */
+ if (pkey != -1)
+ return pkey;
+
+ return 0;
+}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7cb17c6..34ddac7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -204,26 +204,35 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */

#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
-#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit arch */
#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
+#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */

#if defined(CONFIG_X86)
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
-#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
-# define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
-# define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */
-# define VM_PKEY_BIT1 VM_HIGH_ARCH_1
-# define VM_PKEY_BIT2 VM_HIGH_ARCH_2
-# define VM_PKEY_BIT3 VM_HIGH_ARCH_3
-#endif
+#if defined(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) \
+ || defined(CONFIG_PPC64_MEMORY_PROTECTION_KEYS)
+#define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
+#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 5-bit value */
+#define VM_PKEY_BIT1 VM_HIGH_ARCH_1
+#define VM_PKEY_BIT2 VM_HIGH_ARCH_2
+#define VM_PKEY_BIT3 VM_HIGH_ARCH_3
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
#elif defined(CONFIG_PPC)
+#define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 5-bit value */
+#define VM_PKEY_BIT1 VM_HIGH_ARCH_1
+#define VM_PKEY_BIT2 VM_HIGH_ARCH_2
+#define VM_PKEY_BIT3 VM_HIGH_ARCH_3
+#define VM_PKEY_BIT4 VM_HIGH_ARCH_4 /* intel does not use this bit */
+ /* but reserved for future expansion */
# define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */
#elif defined(CONFIG_PARISC)
# define VM_GROWSUP VM_ARCH_1
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 8c27db0..b13ecc6 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -76,5 +76,5 @@
#define PKEY_DISABLE_WRITE 0x2
#define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
PKEY_DISABLE_WRITE)
-
+#define PKEY_DISABLE_EXECUTE 0x4
#endif /* __ASM_GENERIC_MMAN_COMMON_H */
--
1.8.3.1

2017-06-12 06:58:58

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys

Ram Pai <[email protected]> writes:

> Rearrange PTE bits to free up bits 3, 4, 5 and 6 for
> memory keys. Bit 3, 4, 5, 6 and 57 shall be used for memory
> keys.
>
> The patch does the following change to the 64K PTE format
>
> H_PAGE_BUSY moves from bit 3 to bit 7
> H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> of the pte.
> H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the
> second part of the pte.
>
> The second part of the PTE will hold
> a (H_PAGE_F_SECOND|H_PAGE_F_GIX) for 64K page backed pte,
> and sixteen (H_PAGE_F_SECOND|H_PAGE_F_GIX) for 4k backed
> pte.
>
> the four bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> is initialized to 0xF indicating a invalid slot. if a hashpage does
> get allocated to the 0xF slot, it is released and not used. In
> other words, even though 0xF is a valid slot we discard it and
> consider it as invalid slot(HPTE_SOFT_INVALID). This gives us an
> opportunity to not depend on a bit in the primary PTE in order to
> determine the validity of a slot.

Do we need to do this for 64K hptes ? H_PAGE_HASHPTE indicates whether a
slot is valid or not. For 4K hptes, we do need this right ? ie, only
when H_PAGE_COMBO is set we need to consider 0xf as an invalid slot


>
> When we release a 0xF slot we also release a legitimate primary
> slot and unmap that entry. This is to ensure that we do get
> a legimate non-0xF slot the next time we retry for a slot.

Can you explain this more ? what is a primary slot here ?

>
> Though treating 0xF slot as invalid reduces the number of available
> slots and make have a effect on the performance, the probabilty
> of hitting a 0xF is extermely low.
>
> Compared to the current scheme, the above described scheme reduces
> the number of false hash table updates significantly and has the
> added advantage of releasing four valuable PTE bits for other
> purpose.
>
> This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> Ellermen and myself.
>
> 4K PTE format remain unchanged currently.
>
> Signed-off-by: Ram Pai <[email protected]>
> ---
> arch/powerpc/include/asm/book3s/64/hash-4k.h | 12 +++++
> arch/powerpc/include/asm/book3s/64/hash-64k.h | 38 ++++++--------
> arch/powerpc/include/asm/book3s/64/hash.h | 8 ++-
> arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 ++
> arch/powerpc/mm/dump_linuxpagetables.c | 3 +-
> arch/powerpc/mm/hash64_4k.c | 12 ++---
> arch/powerpc/mm/hash64_64k.c | 73 +++++++++------------------
> arch/powerpc/mm/hash_utils_64.c | 38 +++++++++++++-
> arch/powerpc/mm/hugetlbpage-hash64.c | 16 +++---
> 9 files changed, 107 insertions(+), 98 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> index b4b5e6b..223d318 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> @@ -16,6 +16,18 @@
> #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
> #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
>
> +
> +/*
> + * Only supported by 4k linux page size

that comment is confusing, you can drop that, it is part of hash-4k.h
which means these defines are local to 4k linux page size config.

> + */
> +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> +#define H_PAGE_F_GIX_SHIFT 56
> +
> +#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> +
> +
> /* PTE flags to conserve for HPTE identification */
> #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
> H_PAGE_F_SECOND | H_PAGE_F_GIX)
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> index 9732837..87ee22d 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> @@ -10,23 +10,21 @@
> * 64k aligned address free up few of the lower bits of RPN for us
> * We steal that here. For more deatils look at pte_pfn/pfn_pte()
> */
> -#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> -#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
> +#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> +#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
> +
> +#define H_PAGE_BUSY _RPAGE_RPN44 /* software: PTE & hash are busy */
> +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> +
> /*
> * We need to differentiate between explicit huge page and THP huge
> * page, since THP huge page also need to track real subpage details
> */
> #define H_PAGE_THP_HUGE H_PAGE_4K_PFN
>
> -/*
> - * Used to track subpage group valid if H_PAGE_COMBO is set
> - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
> - */
> -#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
> -
> /* PTE flags to conserve for HPTE identification */
> -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
> - H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
> +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE)
> +

You drop H_PAGE_COMBO here. Is that intentional ?

We have code which does

if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
MMU_PAGE_64K, ssize,
flags) == -1)
old_pte &= ~_PAGE_HPTEFLAGS;


I guess they expect slot details to be cleared. With the above
_PAGE_HPTEFLAGS that is not true.


> /*
> * we support 16 fragments per PTE page of 64K size.
> */
> @@ -54,24 +52,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
> unsigned long *hidxp;
>
> rpte.pte = pte;
> - rpte.hidx = 0;
> - if (pte_val(pte) & H_PAGE_COMBO) {
> - /*
> - * Make sure we order the hidx load against the H_PAGE_COMBO
> - * check. The store side ordering is done in __hash_page_4K
> - */
> - smp_rmb();
> - hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> - rpte.hidx = *hidxp;
> - }
> + /*
> + * The store side ordering is done in __hash_page_4K
> + */
> + smp_rmb();
> + hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> + rpte.hidx = *hidxp;
> return rpte;
> }
>
> static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
> {
> - if ((pte_val(rpte.pte) & H_PAGE_COMBO))
> - return (rpte.hidx >> (index<<2)) & 0xf;
> - return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
> + return ((rpte.hidx >> (index<<2)) & 0xfUL);
> }
>
> #define __rpte_to_pte(r) ((r).pte)
> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> index 4e957b0..7742782 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> @@ -8,11 +8,9 @@
> *
> */
> #define H_PTE_NONE_MASK _PAGE_HPTEFLAGS
> -#define H_PAGE_F_GIX_SHIFT 56
> -#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> -#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> -#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> -#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> +
> +#define INIT_HIDX (~0x0UL)

But then you use it like

rpte.hidx = INIT_HIDX;

A better would be INIT_HIDX = 0xF ?. Also may be INIT is the wrong name
there ?

> +#define HPTE_SOFT_INVALID(slot) ((slot & 0xfUL) == 0xfUL)

Can you do that as a static inline ?

>
> #ifdef CONFIG_PPC_64K_PAGES
> #include <asm/book3s/64/hash-64k.h>
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> index 6981a52..cfb8169 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> @@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
> extern int __hash_page_64K(unsigned long ea, unsigned long access,
> unsigned long vsid, pte_t *ptep, unsigned long trap,
> unsigned long flags, int ssize);
> +extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> + unsigned int subpg_index, unsigned long slot);
> +extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> + int ssize, real_pte_t rpte, unsigned int subpg_index);
> +
> struct mm_struct;
> unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
> extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
> diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
> index 44fe483..b832ed3 100644
> --- a/arch/powerpc/mm/dump_linuxpagetables.c
> +++ b/arch/powerpc/mm/dump_linuxpagetables.c
> @@ -213,7 +213,7 @@ struct flag_info {
> .val = H_PAGE_4K_PFN,
> .set = "4K_pfn",
> }, {
> -#endif
> +#else
> .mask = H_PAGE_F_GIX,
> .val = H_PAGE_F_GIX,
> .set = "f_gix",
> @@ -224,6 +224,7 @@ struct flag_info {
> .val = H_PAGE_F_SECOND,
> .set = "f_second",
> }, {
> +#endif /* CONFIG_PPC_64K_PAGES */
> #endif
> .mask = _PAGE_SPECIAL,
> .val = _PAGE_SPECIAL,
> diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
> index 6fa450c..5b16416 100644
> --- a/arch/powerpc/mm/hash64_4k.c
> +++ b/arch/powerpc/mm/hash64_4k.c
> @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> pte_t *ptep, unsigned long trap, unsigned long flags,
> int ssize, int subpg_prot)
> {
> + real_pte_t rpte;
> unsigned long hpte_group;
> unsigned long rflags, pa;
> unsigned long old_pte, new_pte;
> @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> * need to add in 0x1 if it's a read-only user page
> */
> rflags = htab_convert_pte_flags(new_pte);
> + rpte = __real_pte(__pte(old_pte), ptep);
>
> if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
> !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> @@ -64,12 +66,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> /*
> * There MIGHT be an HPTE for this pte
> */
> - hash = hpt_hash(vpn, shift, ssize);
> - if (old_pte & H_PAGE_F_SECOND)
> - hash = ~hash;
> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> -
> + slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);


This confused me, the rest of the code assume slot as the 4 bit value.
But get_hidx_slot is not exactly returning that.



> if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_4K,
> MMU_PAGE_4K, ssize, flags) == -1)
> old_pte &= ~_PAGE_HPTEFLAGS;
> @@ -118,8 +115,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> return -1;
> }
> new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
> - new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> - (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> + new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
> }
> *ptep = __pte(new_pte & ~H_PAGE_BUSY);
> return 0;
> diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
> index 1a68cb1..feb90f1 100644
> --- a/arch/powerpc/mm/hash64_64k.c
> +++ b/arch/powerpc/mm/hash64_64k.c
> @@ -15,34 +15,13 @@
> #include <linux/mm.h>
> #include <asm/machdep.h>
> #include <asm/mmu.h>
> -/*
> - * index from 0 - 15
> - */
> -bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
> -{
> - unsigned long g_idx;
> - unsigned long ptev = pte_val(rpte.pte);
>
> - g_idx = (ptev & H_PAGE_COMBO_VALID) >> H_PAGE_F_GIX_SHIFT;
> - index = index >> 2;
> - if (g_idx & (0x1 << index))
> - return true;
> - else
> - return false;
> -}
> /*
> * index from 0 - 15
> */
> -static unsigned long mark_subptegroup_valid(unsigned long ptev, unsigned long index)
> +bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
> {
> - unsigned long g_idx;
> -
> - if (!(ptev & H_PAGE_COMBO))
> - return ptev;
> - index = index >> 2;
> - g_idx = 0x1 << index;
> -
> - return ptev | (g_idx << H_PAGE_F_GIX_SHIFT);
> + return !(HPTE_SOFT_INVALID(rpte.hidx >> (index << 2)));
> }
>
> int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> @@ -50,10 +29,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> int ssize, int subpg_prot)
> {
> real_pte_t rpte;
> - unsigned long *hidxp;
> unsigned long hpte_group;
> unsigned int subpg_index;
> - unsigned long rflags, pa, hidx;
> + unsigned long rflags, pa;
> unsigned long old_pte, new_pte, subpg_pte;
> unsigned long vpn, hash, slot;
> unsigned long shift = mmu_psize_defs[MMU_PAGE_4K].shift;
> @@ -116,8 +94,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> * On hash insert failure we use old pte value and we don't
> * want slot information there if we have a insert failure.
> */
> - old_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
> - new_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
> + old_pte &= ~(H_PAGE_HASHPTE);
> + new_pte &= ~(H_PAGE_HASHPTE);
> + rpte.hidx = INIT_HIDX;
> goto htab_insert_hpte;
> }
> /*
> @@ -126,18 +105,12 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> if (__rpte_sub_valid(rpte, subpg_index)) {
> int ret;
>
> - hash = hpt_hash(vpn, shift, ssize);
> - hidx = __rpte_to_hidx(rpte, subpg_index);
> - if (hidx & _PTEIDX_SECONDARY)
> - hash = ~hash;
> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> - slot += hidx & _PTEIDX_GROUP_IX;
> -
> + slot = get_hidx_slot(vpn, shift, ssize, rpte, subpg_index);
> ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
> MMU_PAGE_4K, MMU_PAGE_4K,
> ssize, flags);
> /*
> - *if we failed because typically the HPTE wasn't really here
> + * if we failed because typically the HPTE wasn't really here
> * we try an insertion.
> */
> if (ret == -1)
> @@ -177,8 +150,14 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> rflags, HPTE_V_SECONDARY,
> MMU_PAGE_4K, MMU_PAGE_4K,
> ssize);
> - if (slot == -1) {
> - if (mftb() & 0x1)
> + if (unlikely(HPTE_SOFT_INVALID(slot)))
> + mmu_hash_ops.hpte_invalidate(slot, vpn,
> + MMU_PAGE_4K, MMU_PAGE_4K,
> + ssize, (flags & HPTE_LOCAL_UPDATE));

Did this work ? invalidate first arg is the offset of hpte from the htab
base. where as insert return 4 bit slot number within group. I guess we
need to do a cleanup patch before to differentiate between slot number
within a group and slot number in hash page table. May be name slot
number within a group as gslot ?


> +
> + if (unlikely(slot == -1 || HPTE_SOFT_INVALID(slot))) {
> + if (HPTE_SOFT_INVALID(slot) || (mftb() & 0x1))
> +

So we always try to remove from primary if we got 0xf slot by insert ?


> hpte_group = ((hash & htab_hash_mask) *
> HPTES_PER_GROUP) & ~0x7UL;
> mmu_hash_ops.hpte_remove(hpte_group);
> @@ -204,11 +183,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> * Since we have H_PAGE_BUSY set on ptep, we can be sure
> * nobody is undating hidx.
> */
> - hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> - rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> - *hidxp = rpte.hidx | (slot << (subpg_index << 2));
> - new_pte = mark_subptegroup_valid(new_pte, subpg_index);
> - new_pte |= H_PAGE_HASHPTE;
> + new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot);
> + new_pte |= H_PAGE_HASHPTE;
> +
> /*
> * check __real_pte for details on matching smp_rmb()
> */
> @@ -221,6 +198,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> unsigned long vsid, pte_t *ptep, unsigned long trap,
> unsigned long flags, int ssize)
> {
> + real_pte_t rpte;
> unsigned long hpte_group;
> unsigned long rflags, pa;
> unsigned long old_pte, new_pte;
> @@ -257,6 +235,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> } while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
>
> rflags = htab_convert_pte_flags(new_pte);
> + rpte = __real_pte(__pte(old_pte), ptep);
>
> if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
> !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> @@ -267,12 +246,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> /*
> * There MIGHT be an HPTE for this pte
> */
> - hash = hpt_hash(vpn, shift, ssize);
> - if (old_pte & H_PAGE_F_SECOND)
> - hash = ~hash;
> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> -
> + slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
> if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
> MMU_PAGE_64K, ssize,
> flags) == -1)
> @@ -322,9 +296,8 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> MMU_PAGE_64K, MMU_PAGE_64K, old_pte);
> return -1;
> }
> + set_hidx_slot(ptep, rpte, 0, slot);
> new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
> - new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> - (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> }
> *ptep = __pte(new_pte & ~H_PAGE_BUSY);
> return 0;
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index f2095ce..b405657 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -975,8 +975,9 @@ void __init hash__early_init_devtree(void)
>
> void __init hash__early_init_mmu(void)
> {
> +#ifndef CONFIG_PPC_64K_PAGES
> /*
> - * We have code in __hash_page_64K() and elsewhere, which assumes it can
> + * We have code in __hash_page_4K() and elsewhere, which assumes it can
> * do the following:
> * new_pte |= (slot << H_PAGE_F_GIX_SHIFT) & (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> *
> @@ -987,6 +988,7 @@ void __init hash__early_init_mmu(void)
> * with a BUILD_BUG_ON().
> */
> BUILD_BUG_ON(H_PAGE_F_SECOND != (1ul << (H_PAGE_F_GIX_SHIFT + 3)));
> +#endif /* CONFIG_PPC_64K_PAGES */
>
> htab_init_page_sizes();
>
> @@ -1589,6 +1591,40 @@ static inline void tm_flush_hash_page(int local)
> }
> #endif
>
> +#ifdef CONFIG_PPC_64K_PAGES
> +unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> + unsigned int subpg_index, unsigned long slot)
> +{
> + unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> +
> + rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> + *hidxp = rpte.hidx | (slot << (subpg_index << 2));
> + return 0x0UL;
> +}
> +#else /* CONFIG_PPC_64K_PAGES */
> +unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> + unsigned int subpg_index, unsigned long slot)
> +{
> + return (slot << H_PAGE_F_GIX_SHIFT) &
> + (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> +}

Move them to header files so that we can avoid #ifdef here. Here slot is
4 bit value


> +#endif /* CONFIG_PPC_64K_PAGES */
> +
> +unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> + int ssize, real_pte_t rpte, unsigned int subpg_index)
> +{
> + unsigned long hash, slot, hidx;
> +
> + hash = hpt_hash(vpn, shift, ssize);
> + hidx = __rpte_to_hidx(rpte, subpg_index);
> + if (hidx & _PTEIDX_SECONDARY)
> + hash = ~hash;
> + slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> + slot += hidx & _PTEIDX_GROUP_IX;
> + return slot;

Here slot is not. Can you rename this function ?

> +}
> +
> +
> /* WARNING: This is called from hash_low_64.S, if you change this prototype,
> * do not forget to update the assembly call site !
> */
> diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c
> index a84bb44..25a50eb 100644
> --- a/arch/powerpc/mm/hugetlbpage-hash64.c
> +++ b/arch/powerpc/mm/hugetlbpage-hash64.c
> @@ -14,7 +14,7 @@
> #include <asm/cacheflush.h>
> #include <asm/machdep.h>
>
> -extern long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
> +long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
> unsigned long pa, unsigned long rlags,
> unsigned long vflags, int psize, int ssize);
>
> @@ -22,6 +22,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> pte_t *ptep, unsigned long trap, unsigned long flags,
> int ssize, unsigned int shift, unsigned int mmu_psize)
> {
> + real_pte_t rpte;
> unsigned long vpn;
> unsigned long old_pte, new_pte;
> unsigned long rflags, pa, sz;
> @@ -61,6 +62,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> } while(!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
>
> rflags = htab_convert_pte_flags(new_pte);
> + rpte = __real_pte(__pte(old_pte), ptep);
>
> sz = ((1UL) << shift);
> if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> @@ -71,14 +73,9 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> /* Check if pte already has an hpte (case 2) */
> if (unlikely(old_pte & H_PAGE_HASHPTE)) {
> /* There MIGHT be an HPTE for this pte */
> - unsigned long hash, slot;
> -
> - hash = hpt_hash(vpn, shift, ssize);
> - if (old_pte & H_PAGE_F_SECOND)
> - hash = ~hash;
> - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> + unsigned long slot;
>
> + slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
> if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, mmu_psize,
> mmu_psize, ssize, flags) == -1)
> old_pte &= ~_PAGE_HPTEFLAGS;
> @@ -106,8 +103,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> return -1;
> }
>
> - new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> - (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> + new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
> }
>
> /*
> --
> 1.8.3.1

2017-06-12 22:20:28

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys

On Mon, Jun 12, 2017 at 12:27:44PM +0530, Aneesh Kumar K.V wrote:
> Ram Pai <[email protected]> writes:
>
> > Rearrange PTE bits to free up bits 3, 4, 5 and 6 for
> > memory keys. Bit 3, 4, 5, 6 and 57 shall be used for memory
> > keys.
> >
> > The patch does the following change to the 64K PTE format
> >
> > H_PAGE_BUSY moves from bit 3 to bit 7
> > H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> > of the pte.
> > H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the
> > second part of the pte.
> >
> > The second part of the PTE will hold
> > a (H_PAGE_F_SECOND|H_PAGE_F_GIX) for 64K page backed pte,
> > and sixteen (H_PAGE_F_SECOND|H_PAGE_F_GIX) for 4k backed
> > pte.
> >
> > the four bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> > is initialized to 0xF indicating a invalid slot. if a hashpage does
> > get allocated to the 0xF slot, it is released and not used. In
> > other words, even though 0xF is a valid slot we discard it and
> > consider it as invalid slot(HPTE_SOFT_INVALID). This gives us an
> > opportunity to not depend on a bit in the primary PTE in order to
> > determine the validity of a slot.
>
> Do we need to do this for 64K hptes ? H_PAGE_HASHPTE indicates whether a
> slot is valid or not. For 4K hptes, we do need this right ? ie, only
> when H_PAGE_COMBO is set we need to consider 0xf as an invalid slot

for 64k hptes; you are right, we do not use 0xF to
track the validity of a slot. We just depend on H_PAGE_HASHPTE flag.

for 4k hptes, we need to depend on both H_PAGE_HASHPTE as well as the
value of the slot. H_PAGE_HASHPTE tells if there exists any valid
4k HPTEs, and the 4-bit values in the second-part-of-the-pte tells
us if they are valid values.

However in either case we do not need H_PAGE_COMBO. That flag is not
used for ptes. But we continue to use that flag for pmd to track
hugepages, which is why I have not entirely divorced H_PAGE_COMBO from
the 64K pagesize case.
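
Roughly, the combined check is (a sketch; the hidx part mirrors
__rpte_sub_valid() in the patch):

static inline bool subpage_hpte_valid(unsigned long ptev, unsigned long hidx,
                                      unsigned long index)
{
        /* no 4k HPTE exists for this PTE at all */
        if (!(ptev & H_PAGE_HASHPTE))
                return false;
        /* 0xF in the per-subpage 4-bit field means "no valid HPTE here" */
        return ((hidx >> (index << 2)) & 0xfUL) != 0xfUL;
}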

>
>
> >
> > When we release a 0xF slot we also release a legitimate primary
> > slot and unmap that entry. This is to ensure that we do get
> > a legimate non-0xF slot the next time we retry for a slot.
>
> Can you explain this more ? what is a primary slot here ?

I may not be using the right terminology here. But when I say slot, i
mean the four bits that tell the position of the hpte in the hash
buckets. Bit 0 indicates whether it is in the primary or the secondary hash
bucket, and bits 1, 2 and 3 indicate the entry within that bucket.

So when I say primary slot, I mean an entry in the primary hash bucket.

The idea is: when hpte_insert returns an hpte which is cached in the 7th slot
of the secondary bucket, i.e. 0xF, we discard it, and also release a
random entry from the primary bucket, so that on retry we can get that
entry.
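
Put differently, the 4-bit value decomposes the same way get_hidx_slot()
consumes it (a sketch using the existing _PTEIDX_* masks):

/* hidx: the 4-bit value stored in the second part of the PTE */
static inline unsigned long hidx_to_hash_slot(unsigned long hash,
                                              unsigned long hidx)
{
        if (hidx & _PTEIDX_SECONDARY)      /* "bit 0": secondary hash bucket */
                hash = ~hash;
        return ((hash & htab_hash_mask) * HPTES_PER_GROUP) +
                (hidx & _PTEIDX_GROUP_IX); /* "bits 1,2,3": entry in bucket */
}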


>
> >
> > Though treating 0xF slot as invalid reduces the number of available
> > slots and make have a effect on the performance, the probabilty
> > of hitting a 0xF is extermely low.
> >
> > Compared to the current scheme, the above described scheme reduces
> > the number of false hash table updates significantly and has the
> > added advantage of releasing four valuable PTE bits for other
> > purpose.
> >
> > This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> > Ellermen and myself.
> >
> > 4K PTE format remain unchanged currently.
> >
> > Signed-off-by: Ram Pai <[email protected]>
> > ---
> > arch/powerpc/include/asm/book3s/64/hash-4k.h | 12 +++++
> > arch/powerpc/include/asm/book3s/64/hash-64k.h | 38 ++++++--------
> > arch/powerpc/include/asm/book3s/64/hash.h | 8 ++-
> > arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 ++
> > arch/powerpc/mm/dump_linuxpagetables.c | 3 +-
> > arch/powerpc/mm/hash64_4k.c | 12 ++---
> > arch/powerpc/mm/hash64_64k.c | 73 +++++++++------------------
> > arch/powerpc/mm/hash_utils_64.c | 38 +++++++++++++-
> > arch/powerpc/mm/hugetlbpage-hash64.c | 16 +++---
> > 9 files changed, 107 insertions(+), 98 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > index b4b5e6b..223d318 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > @@ -16,6 +16,18 @@
> > #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
> > #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
> >
> > +
> > +/*
> > + * Only supported by 4k linux page size
>
> that comment is confusing, you can drop that, it is part of hash-4k.h
> which means these defines are local to 4k linux page size config.
>
> > + */

right. ok.

> > +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> > +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> > +#define H_PAGE_F_GIX_SHIFT 56
> > +
> > +#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> > +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> > +
> > +
> > /* PTE flags to conserve for HPTE identification */
> > #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
> > H_PAGE_F_SECOND | H_PAGE_F_GIX)
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> > index 9732837..87ee22d 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> > @@ -10,23 +10,21 @@
> > * 64k aligned address free up few of the lower bits of RPN for us
> > * We steal that here. For more deatils look at pte_pfn/pfn_pte()
> > */
> > -#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> > -#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
> > +#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> > +#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
> > +
> > +#define H_PAGE_BUSY _RPAGE_RPN44 /* software: PTE & hash are busy */
> > +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> > +
> > /*
> > * We need to differentiate between explicit huge page and THP huge
> > * page, since THP huge page also need to track real subpage details
> > */
> > #define H_PAGE_THP_HUGE H_PAGE_4K_PFN
> >
> > -/*
> > - * Used to track subpage group valid if H_PAGE_COMBO is set
> > - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
> > - */
> > -#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
> > -
> > /* PTE flags to conserve for HPTE identification */
> > -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
> > - H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
> > +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE)
> > +
>
> You drop H_PAGE_COMBO here. Is that intentional ?

Yes, that is intentional. Do you see a downside? We don't depend on
H_PAGE_COMBO for the 64K pagesize PTE anymore.

>
> We have code which does
>
> if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
> MMU_PAGE_64K, ssize,
> flags) == -1)
> old_pte &= ~_PAGE_HPTEFLAGS;
>
>
> I guess they expect slot details to be cleared. With the above
> _PAGE_HPTEFLAGS that is not true.

Since we don't depend on the COMBO flag, it should be fine. Let me know if
you see a downside.

>
>
> > /*
> > * we support 16 fragments per PTE page of 64K size.
> > */
> > @@ -54,24 +52,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
> > unsigned long *hidxp;
> >
> > rpte.pte = pte;
> > - rpte.hidx = 0;
> > - if (pte_val(pte) & H_PAGE_COMBO) {
> > - /*
> > - * Make sure we order the hidx load against the H_PAGE_COMBO
> > - * check. The store side ordering is done in __hash_page_4K
> > - */
> > - smp_rmb();
> > - hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> > - rpte.hidx = *hidxp;
> > - }
> > + /*
> > + * The store side ordering is done in __hash_page_4K
> > + */
> > + smp_rmb();
> > + hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> > + rpte.hidx = *hidxp;
> > return rpte;
> > }
> >
> > static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
> > {
> > - if ((pte_val(rpte.pte) & H_PAGE_COMBO))
> > - return (rpte.hidx >> (index<<2)) & 0xf;
> > - return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
> > + return ((rpte.hidx >> (index<<2)) & 0xfUL);
> > }
> >
> > #define __rpte_to_pte(r) ((r).pte)
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> > index 4e957b0..7742782 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> > @@ -8,11 +8,9 @@
> > *
> > */
> > #define H_PTE_NONE_MASK _PAGE_HPTEFLAGS
> > -#define H_PAGE_F_GIX_SHIFT 56
> > -#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> > -#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> > -#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> > -#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> > +
> > +#define INIT_HIDX (~0x0UL)
>
> But then you use it like
>
> rpte.hidx = INIT_HIDX;
>
> A better would be INIT_HIDX = 0xF ?. Also may be INIT is the wrong name
> there ?

No. We are initializing all the entries in the second part of the PTE to
0xF, which essentially boils down to ~0x0UL.

>
> > +#define HPTE_SOFT_INVALID(slot) ((slot & 0xfUL) == 0xfUL)
>
> Can you do that as a static inline ?

ok.
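
Something like this, presumably (a sketch of the static inline form of
the macro above):

static inline bool hpte_soft_invalid(unsigned long hidx)
{
        /* 0xF in the low 4 bits marks the slot as software-invalid */
        return ((hidx & 0xfUL) == 0xfUL);
}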

>
> >
> > #ifdef CONFIG_PPC_64K_PAGES
> > #include <asm/book3s/64/hash-64k.h>
> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > index 6981a52..cfb8169 100644
> > --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > @@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
> > extern int __hash_page_64K(unsigned long ea, unsigned long access,
> > unsigned long vsid, pte_t *ptep, unsigned long trap,
> > unsigned long flags, int ssize);
> > +extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> > + unsigned int subpg_index, unsigned long slot);
> > +extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> > + int ssize, real_pte_t rpte, unsigned int subpg_index);
> > +
> > struct mm_struct;
> > unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
> > extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
> > diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
> > index 44fe483..b832ed3 100644
> > --- a/arch/powerpc/mm/dump_linuxpagetables.c
> > +++ b/arch/powerpc/mm/dump_linuxpagetables.c
> > @@ -213,7 +213,7 @@ struct flag_info {
> > .val = H_PAGE_4K_PFN,
> > .set = "4K_pfn",
> > }, {
> > -#endif
> > +#else
> > .mask = H_PAGE_F_GIX,
> > .val = H_PAGE_F_GIX,
> > .set = "f_gix",
> > @@ -224,6 +224,7 @@ struct flag_info {
> > .val = H_PAGE_F_SECOND,
> > .set = "f_second",
> > }, {
> > +#endif /* CONFIG_PPC_64K_PAGES */
> > #endif
> > .mask = _PAGE_SPECIAL,
> > .val = _PAGE_SPECIAL,
> > diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
> > index 6fa450c..5b16416 100644
> > --- a/arch/powerpc/mm/hash64_4k.c
> > +++ b/arch/powerpc/mm/hash64_4k.c
> > @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > pte_t *ptep, unsigned long trap, unsigned long flags,
> > int ssize, int subpg_prot)
> > {
> > + real_pte_t rpte;
> > unsigned long hpte_group;
> > unsigned long rflags, pa;
> > unsigned long old_pte, new_pte;
> > @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > * need to add in 0x1 if it's a read-only user page
> > */
> > rflags = htab_convert_pte_flags(new_pte);
> > + rpte = __real_pte(__pte(old_pte), ptep);
> >
> > if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
> > !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> > @@ -64,12 +66,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > /*
> > * There MIGHT be an HPTE for this pte
> > */
> > - hash = hpt_hash(vpn, shift, ssize);
> > - if (old_pte & H_PAGE_F_SECOND)
> > - hash = ~hash;
> > - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> > -
> > + slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
>
>
> This confused me, the rest of the code assume slot as the 4 bit value.
> But get_hidx_slot is not exactly returning that.

get_hidx_slot() is returning a 4-bit value, no?

>
>
>
> > if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_4K,
> > MMU_PAGE_4K, ssize, flags) == -1)
> > old_pte &= ~_PAGE_HPTEFLAGS;
> > @@ -118,8 +115,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > return -1;
> > }
> > new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
> > - new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> > - (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > + new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
> > }
> > *ptep = __pte(new_pte & ~H_PAGE_BUSY);
> > return 0;
> > diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
> > index 1a68cb1..feb90f1 100644
> > --- a/arch/powerpc/mm/hash64_64k.c
> > +++ b/arch/powerpc/mm/hash64_64k.c
> > @@ -15,34 +15,13 @@
> > #include <linux/mm.h>
> > #include <asm/machdep.h>
> > #include <asm/mmu.h>
> > -/*
> > - * index from 0 - 15
> > - */
> > -bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
> > -{
> > - unsigned long g_idx;
> > - unsigned long ptev = pte_val(rpte.pte);
> >
> > - g_idx = (ptev & H_PAGE_COMBO_VALID) >> H_PAGE_F_GIX_SHIFT;
> > - index = index >> 2;
> > - if (g_idx & (0x1 << index))
> > - return true;
> > - else
> > - return false;
> > -}
> > /*
> > * index from 0 - 15
> > */
> > -static unsigned long mark_subptegroup_valid(unsigned long ptev, unsigned long index)
> > +bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
> > {
> > - unsigned long g_idx;
> > -
> > - if (!(ptev & H_PAGE_COMBO))
> > - return ptev;
> > - index = index >> 2;
> > - g_idx = 0x1 << index;
> > -
> > - return ptev | (g_idx << H_PAGE_F_GIX_SHIFT);
> > + return !(HPTE_SOFT_INVALID(rpte.hidx >> (index << 2)));
> > }
> >
> > int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > @@ -50,10 +29,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > int ssize, int subpg_prot)
> > {
> > real_pte_t rpte;
> > - unsigned long *hidxp;
> > unsigned long hpte_group;
> > unsigned int subpg_index;
> > - unsigned long rflags, pa, hidx;
> > + unsigned long rflags, pa;
> > unsigned long old_pte, new_pte, subpg_pte;
> > unsigned long vpn, hash, slot;
> > unsigned long shift = mmu_psize_defs[MMU_PAGE_4K].shift;
> > @@ -116,8 +94,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > * On hash insert failure we use old pte value and we don't
> > * want slot information there if we have a insert failure.
> > */
> > - old_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
> > - new_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
> > + old_pte &= ~(H_PAGE_HASHPTE);
> > + new_pte &= ~(H_PAGE_HASHPTE);
> > + rpte.hidx = INIT_HIDX;
> > goto htab_insert_hpte;
> > }
> > /*
> > @@ -126,18 +105,12 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > if (__rpte_sub_valid(rpte, subpg_index)) {
> > int ret;
> >
> > - hash = hpt_hash(vpn, shift, ssize);
> > - hidx = __rpte_to_hidx(rpte, subpg_index);
> > - if (hidx & _PTEIDX_SECONDARY)
> > - hash = ~hash;
> > - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > - slot += hidx & _PTEIDX_GROUP_IX;
> > -
> > + slot = get_hidx_slot(vpn, shift, ssize, rpte, subpg_index);
> > ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
> > MMU_PAGE_4K, MMU_PAGE_4K,
> > ssize, flags);
> > /*
> > - *if we failed because typically the HPTE wasn't really here
> > + * if we failed because typically the HPTE wasn't really here
> > * we try an insertion.
> > */
> > if (ret == -1)
> > @@ -177,8 +150,14 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > rflags, HPTE_V_SECONDARY,
> > MMU_PAGE_4K, MMU_PAGE_4K,
> > ssize);
> > - if (slot == -1) {
> > - if (mftb() & 0x1)
> > + if (unlikely(HPTE_SOFT_INVALID(slot)))
> > + mmu_hash_ops.hpte_invalidate(slot, vpn,
> > + MMU_PAGE_4K, MMU_PAGE_4K,
> > + ssize, (flags & HPTE_LOCAL_UPDATE));
>
> Did this work ?

Good catch. I have not been able to exercise this code; getting a 0xf
value from mmu_hash_ops.hpte_insert() has been impossible.

> invalidate first arg is the offset of hpte from the htab
> base. where as insert return 4 bit slot number within group. I guess we
> need to do a cleanup patch before to differentiate between slot number
> within a group and slot number in hash page table. May be name slot
> number within a group as gslot ?
>
>
> > +
> > + if (unlikely(slot == -1 || HPTE_SOFT_INVALID(slot))) {
> > + if (HPTE_SOFT_INVALID(slot) || (mftb() & 0x1))
> > +
>
> So we always try to remove from primary if we got 0xf slot by insert ?


yes.

>
>
> > hpte_group = ((hash & htab_hash_mask) *
> > HPTES_PER_GROUP) & ~0x7UL;
> > mmu_hash_ops.hpte_remove(hpte_group);
> > @@ -204,11 +183,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > * Since we have H_PAGE_BUSY set on ptep, we can be sure
> > * nobody is undating hidx.
> > */
> > - hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> > - rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> > - *hidxp = rpte.hidx | (slot << (subpg_index << 2));
> > - new_pte = mark_subptegroup_valid(new_pte, subpg_index);
> > - new_pte |= H_PAGE_HASHPTE;
> > + new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot);
> > + new_pte |= H_PAGE_HASHPTE;
> > +
> > /*
> > * check __real_pte for details on matching smp_rmb()
> > */
> > @@ -221,6 +198,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> > unsigned long vsid, pte_t *ptep, unsigned long trap,
> > unsigned long flags, int ssize)
> > {
> > + real_pte_t rpte;
> > unsigned long hpte_group;
> > unsigned long rflags, pa;
> > unsigned long old_pte, new_pte;
> > @@ -257,6 +235,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> > } while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
> >
> > rflags = htab_convert_pte_flags(new_pte);
> > + rpte = __real_pte(__pte(old_pte), ptep);
> >
> > if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
> > !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> > @@ -267,12 +246,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> > /*
> > * There MIGHT be an HPTE for this pte
> > */
> > - hash = hpt_hash(vpn, shift, ssize);
> > - if (old_pte & H_PAGE_F_SECOND)
> > - hash = ~hash;
> > - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> > -
> > + slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
> > if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
> > MMU_PAGE_64K, ssize,
> > flags) == -1)
> > @@ -322,9 +296,8 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> > MMU_PAGE_64K, MMU_PAGE_64K, old_pte);
> > return -1;
> > }
> > + set_hidx_slot(ptep, rpte, 0, slot);
> > new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
> > - new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> > - (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > }
> > *ptep = __pte(new_pte & ~H_PAGE_BUSY);
> > return 0;
> > diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> > index f2095ce..b405657 100644
> > --- a/arch/powerpc/mm/hash_utils_64.c
> > +++ b/arch/powerpc/mm/hash_utils_64.c
> > @@ -975,8 +975,9 @@ void __init hash__early_init_devtree(void)
> >
> > void __init hash__early_init_mmu(void)
> > {
> > +#ifndef CONFIG_PPC_64K_PAGES
> > /*
> > - * We have code in __hash_page_64K() and elsewhere, which assumes it can
> > + * We have code in __hash_page_4K() and elsewhere, which assumes it can
> > * do the following:
> > * new_pte |= (slot << H_PAGE_F_GIX_SHIFT) & (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > *
> > @@ -987,6 +988,7 @@ void __init hash__early_init_mmu(void)
> > * with a BUILD_BUG_ON().
> > */
> > BUILD_BUG_ON(H_PAGE_F_SECOND != (1ul << (H_PAGE_F_GIX_SHIFT + 3)));
> > +#endif /* CONFIG_PPC_64K_PAGES */
> >
> > htab_init_page_sizes();
> >
> > @@ -1589,6 +1591,40 @@ static inline void tm_flush_hash_page(int local)
> > }
> > #endif
> >
> > +#ifdef CONFIG_PPC_64K_PAGES
> > +unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> > + unsigned int subpg_index, unsigned long slot)
> > +{
> > + unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> > +
> > + rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> > + *hidxp = rpte.hidx | (slot << (subpg_index << 2));
> > + return 0x0UL;
> > +}
> > +#else /* CONFIG_PPC_64K_PAGES */
> > +unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> > + unsigned int subpg_index, unsigned long slot)
> > +{
> > + return (slot << H_PAGE_F_GIX_SHIFT) &
> > + (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > +}
>
> Move them to header files so that we can avoid #ifdef here. Here slot is
> 4 bit value

ok.
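
i.e. roughly something like this (a sketch of how the two variants could
sit in hash-4k.h and hash-64k.h as static inlines, so the caller keeps a
single call site with no #ifdef):

/* hash-4k.h (sketch) */
static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
                                          unsigned int subpg_index,
                                          unsigned long slot)
{
        /* the slot lives in the primary PTE for 4K linux pages */
        return (slot << H_PAGE_F_GIX_SHIFT) &
                (H_PAGE_F_SECOND | H_PAGE_F_GIX);
}

/* hash-64k.h (sketch) */
static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
                                          unsigned int subpg_index,
                                          unsigned long slot)
{
        unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);

        /* the slot lives in the second part of the PTE for 64K linux pages */
        rpte.hidx &= ~(0xfUL << (subpg_index << 2));
        *hidxp = rpte.hidx | (slot << (subpg_index << 2));
        return 0x0UL;
}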

>
>
> > +#endif /* CONFIG_PPC_64K_PAGES */
> > +
> > +unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> > + int ssize, real_pte_t rpte, unsigned int subpg_index)
> > +{
> > + unsigned long hash, slot, hidx;
> > +
> > + hash = hpt_hash(vpn, shift, ssize);
> > + hidx = __rpte_to_hidx(rpte, subpg_index);
> > + if (hidx & _PTEIDX_SECONDARY)
> > + hash = ~hash;
> > + slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > + slot += hidx & _PTEIDX_GROUP_IX;
> > + return slot;
>
> Here slot is not. Can you rename this function ?

Rename it to what? I am possibly using the wrong terminology, which is
getting reflected in the function names. I will have to get that right
to avoid confusion.


Thanks for your comments,

RP

>
> > +}
> > +
> > +
> > /* WARNING: This is called from hash_low_64.S, if you change this prototype,
> > * do not forget to update the assembly call site !
> > */
> > diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c
> > index a84bb44..25a50eb 100644
> > --- a/arch/powerpc/mm/hugetlbpage-hash64.c
> > +++ b/arch/powerpc/mm/hugetlbpage-hash64.c
> > @@ -14,7 +14,7 @@
> > #include <asm/cacheflush.h>
> > #include <asm/machdep.h>
> >
> > -extern long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
> > +long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
> > unsigned long pa, unsigned long rlags,
> > unsigned long vflags, int psize, int ssize);
> >
> > @@ -22,6 +22,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> > pte_t *ptep, unsigned long trap, unsigned long flags,
> > int ssize, unsigned int shift, unsigned int mmu_psize)
> > {
> > + real_pte_t rpte;
> > unsigned long vpn;
> > unsigned long old_pte, new_pte;
> > unsigned long rflags, pa, sz;
> > @@ -61,6 +62,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> > } while(!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
> >
> > rflags = htab_convert_pte_flags(new_pte);
> > + rpte = __real_pte(__pte(old_pte), ptep);
> >
> > sz = ((1UL) << shift);
> > if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> > @@ -71,14 +73,9 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> > /* Check if pte already has an hpte (case 2) */
> > if (unlikely(old_pte & H_PAGE_HASHPTE)) {
> > /* There MIGHT be an HPTE for this pte */
> > - unsigned long hash, slot;
> > -
> > - hash = hpt_hash(vpn, shift, ssize);
> > - if (old_pte & H_PAGE_F_SECOND)
> > - hash = ~hash;
> > - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> > + unsigned long slot;
> >
> > + slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
> > if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, mmu_psize,
> > mmu_psize, ssize, flags) == -1)
> > old_pte &= ~_PAGE_HPTEFLAGS;
> > @@ -106,8 +103,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> > return -1;
> > }
> >
> > - new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> > - (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > + new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
> > }
> >
> > /*
> > --
> > 1.8.3.1

--
Ram Pai

2017-06-13 02:02:47

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys

Ram Pai <[email protected]> writes:

> On Mon, Jun 12, 2017 at 12:27:44PM +0530, Aneesh Kumar K.V wrote:
>> Ram Pai <[email protected]> writes:
>>
>> > Rearrange PTE bits to free up bits 3, 4, 5 and 6 for
>> > memory keys. Bit 3, 4, 5, 6 and 57 shall be used for memory
>> > keys.
>> >
>> > The patch does the following change to the 64K PTE format
>> >
>> > H_PAGE_BUSY moves from bit 3 to bit 7
>> > H_PAGE_F_SECOND which occupied bit 4 moves to the second part
>> > of the pte.
>> > H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the
>> > second part of the pte.
>> >
>> > The second part of the PTE will hold
>> > a (H_PAGE_F_SECOND|H_PAGE_F_GIX) for 64K page backed pte,
>> > and sixteen (H_PAGE_F_SECOND|H_PAGE_F_GIX) for 4k backed
>> > pte.
>> >
>> > the four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
>> > is initialized to 0xF, indicating an invalid slot. If a hash page does
>> > get allocated to the 0xF slot, it is released and not used. In
>> > other words, even though 0xF is a valid slot we discard it and
>> > consider it an invalid slot (HPTE_SOFT_INVALID). This gives us an
>> > opportunity to not depend on a bit in the primary PTE in order to
>> > determine the validity of a slot.
>>
>> Do we need to do this for 64K hptes ? H_PAGE_HASHPTE indicates whether a
>> slot is valid or not. For 4K hptes, we do need this right ? ie, only
>> when H_PAGE_COMBO is set we need to consider 0xf as an invalid slot
>
> for 64k hptes, you are right: we do not use 0xF to
> track the validity of a slot. We just depend on the H_PAGE_HASHPTE flag.
>
> for 4k hptes, we need to depend on both H_PAGE_HASHPTE as well as the
> value of the slot. H_PAGE_HASHPTE tells us whether there exist any valid
> 4k HPTEs, and the 4-bit values in the second-part-of-the-pte tell
> us whether they are valid values.
>
> However in either case we do not need H_PAGE_COMBO. That flag is not
> used for ptes. But we continue to use that flag for pmd to track
> hugepages, which is why I have not entirely divorced H_PAGE_COMBO from
> the 64K pagesize case.


Really? Maybe I am missing that in the patch. H_PAGE_COMBO indicates
whether a 64K linux page is mapped via 4k hash page table entries. I
don't see you changing that in __hash_page_4k()

we still continue to do.

new_pte = old_pte | H_PAGE_BUSY | _PAGE_ACCESSED | H_PAGE_COMBO;

/*
* Check if the pte was already inserted into the hash table
* as a 64k HW page, and invalidate the 64k HPTE if so.
*/
if (!(old_pte & H_PAGE_COMBO)) {
flush_hash_page(vpn, rpte, MMU_PAGE_64K, ssize, flags);



>
>>
>>
>> >
>> > When we release a 0xF slot we also release a legitimate primary
>> > slot and unmap that entry. This is to ensure that we do get
>> > a legitimate non-0xF slot the next time we retry for a slot.
>>
>> Can you explain this more ? what is a primary slot here ?
>
> I may not be using the right terminology here. But when I say slot, I
> mean the four bits that tell the position of the hpte in the hash
> buckets. Bit 0 indicates whether it is the primary or secondary hash bucket,
> and bits 1, 2 and 3 indicate the entry within the hash bucket.
>
> So when I say primary slot, I mean an entry in the primary hash bucket.
>
> The idea is, when hpte_insert returns an hpte which is cached in the 7th slot
> of the secondary bucket, i.e. 0xF, we discard it, and also release a
> random entry from the primary bucket, so that on retry we can get that
> entry.
>
>
>>
>> >
>> > Though treating 0xF slot as invalid reduces the number of available
>> > slots and may have an effect on the performance, the probability
>> > of hitting a 0xF is extremely low.
>> >
>> > Compared to the current scheme, the above described scheme reduces
>> > the number of false hash table updates significantly and has the
>> > added advantage of releasing four valuable PTE bits for other
>> > purpose.
>> >
>> > This idea was jointly developed by Paul Mackerras, Aneesh, Michael
>> > Ellermen and myself.
>> >
>> > 4K PTE format remains unchanged currently.
>> >
>> > Signed-off-by: Ram Pai <[email protected]>
>> > ---
>> > arch/powerpc/include/asm/book3s/64/hash-4k.h | 12 +++++
>> > arch/powerpc/include/asm/book3s/64/hash-64k.h | 38 ++++++--------
>> > arch/powerpc/include/asm/book3s/64/hash.h | 8 ++-
>> > arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 ++
>> > arch/powerpc/mm/dump_linuxpagetables.c | 3 +-
>> > arch/powerpc/mm/hash64_4k.c | 12 ++---
>> > arch/powerpc/mm/hash64_64k.c | 73 +++++++++------------------
>> > arch/powerpc/mm/hash_utils_64.c | 38 +++++++++++++-
>> > arch/powerpc/mm/hugetlbpage-hash64.c | 16 +++---
>> > 9 files changed, 107 insertions(+), 98 deletions(-)
>> >
>> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>> > index b4b5e6b..223d318 100644
>> > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
>> > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>> > @@ -16,6 +16,18 @@
>> > #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
>> > #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
>> >
>> > +
>> > +/*
>> > + * Only supported by 4k linux page size
>>
>> that comment is confusing, you can drop that, it is part of hash-4k.h
>> which means these defines are local to 4k linux page size config.
>>
>> > + */
>
> right. ok.
>
>> > +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
>> > +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
>> > +#define H_PAGE_F_GIX_SHIFT 56
>> > +
>> > +#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
>> > +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
>> > +
>> > +
>> > /* PTE flags to conserve for HPTE identification */
>> > #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
>> > H_PAGE_F_SECOND | H_PAGE_F_GIX)
>> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
>> > index 9732837..87ee22d 100644
>> > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
>> > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
>> > @@ -10,23 +10,21 @@
>> > * 64k aligned address free up few of the lower bits of RPN for us
>> > * We steal that here. For more deatils look at pte_pfn/pfn_pte()
>> > */
>> > -#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
>> > -#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
>> > +#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
>> > +#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
>> > +
>> > +#define H_PAGE_BUSY _RPAGE_RPN44 /* software: PTE & hash are busy */
>> > +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
>> > +
>> > /*
>> > * We need to differentiate between explicit huge page and THP huge
>> > * page, since THP huge page also need to track real subpage details
>> > */
>> > #define H_PAGE_THP_HUGE H_PAGE_4K_PFN
>> >
>> > -/*
>> > - * Used to track subpage group valid if H_PAGE_COMBO is set
>> > - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
>> > - */
>> > -#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
>> > -
>> > /* PTE flags to conserve for HPTE identification */
>> > -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
>> > - H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
>> > +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE)
>> > +
>>
>> You drop H_PAGE_COMBO here. Is that intentional ?
>
> Yes, that is intentional. See a downside? We don't depend on
> H_PAGE_COMBO for the 64K pagesize PTE anymore.
>
>>
>> We have code which does
>>
>> if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
>> MMU_PAGE_64K, ssize,
>> flags) == -1)
>> old_pte &= ~_PAGE_HPTEFLAGS;
>>
>>
>> I guess they expect slot details to be cleared. With the above
>> _PAGE_HPTEFLAGS that is not true.
>
> since we don't depend on the COMBO flag, it should be fine. Let me know if
> you see a downside.
>
>>
>>
>> > /*
>> > * we support 16 fragments per PTE page of 64K size.
>> > */
>> > @@ -54,24 +52,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
>> > unsigned long *hidxp;
>> >
>> > rpte.pte = pte;
>> > - rpte.hidx = 0;
>> > - if (pte_val(pte) & H_PAGE_COMBO) {
>> > - /*
>> > - * Make sure we order the hidx load against the H_PAGE_COMBO
>> > - * check. The store side ordering is done in __hash_page_4K
>> > - */
>> > - smp_rmb();
>> > - hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
>> > - rpte.hidx = *hidxp;
>> > - }
>> > + /*
>> > + * The store side ordering is done in __hash_page_4K
>> > + */
>> > + smp_rmb();
>> > + hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
>> > + rpte.hidx = *hidxp;
>> > return rpte;
>> > }
>> >
>> > static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
>> > {
>> > - if ((pte_val(rpte.pte) & H_PAGE_COMBO))
>> > - return (rpte.hidx >> (index<<2)) & 0xf;
>> > - return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
>> > + return ((rpte.hidx >> (index<<2)) & 0xfUL);
>> > }
>> >
>> > #define __rpte_to_pte(r) ((r).pte)
>> > diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
>> > index 4e957b0..7742782 100644
>> > --- a/arch/powerpc/include/asm/book3s/64/hash.h
>> > +++ b/arch/powerpc/include/asm/book3s/64/hash.h
>> > @@ -8,11 +8,9 @@
>> > *
>> > */
>> > #define H_PTE_NONE_MASK _PAGE_HPTEFLAGS
>> > -#define H_PAGE_F_GIX_SHIFT 56
>> > -#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
>> > -#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
>> > -#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
>> > -#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
>> > +
>> > +#define INIT_HIDX (~0x0UL)
>>
>> But then you use it like
>>
>> rpte.hidx = INIT_HIDX;
>>
>> A better value would be INIT_HIDX = 0xF? Also, maybe INIT is the wrong name
>> there?
>
> No. We are initializing all the entries in the second-part-of-the-pte to
> 0xF, which essentially boils down to ~0x0UL.

That unsigned long casting is what I was suggesting to remove. We
may want to treat it as a 4-bit value everywhere?


>
>>
>> > +#define HPTE_SOFT_INVALID(slot) ((slot & 0xfUL) == 0xfUL)
>>
>> Can you do that as a static inline ?
>
> ok.
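
(A minimal sketch of the static inline form being asked for, derived from the
HPTE_SOFT_INVALID macro quoted above; illustrative only, not the final patch:)

static inline bool hpte_soft_invalid(unsigned long hidx)
{
	/* 0xF in any 4-bit hidx field marks a software-invalid slot */
	return ((hidx & 0xfUL) == 0xfUL);
}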
>
>>
>> >
>> > #ifdef CONFIG_PPC_64K_PAGES
>> > #include <asm/book3s/64/hash-64k.h>
>> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> > index 6981a52..cfb8169 100644
>> > --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> > +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> > @@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
>> > extern int __hash_page_64K(unsigned long ea, unsigned long access,
>> > unsigned long vsid, pte_t *ptep, unsigned long trap,
>> > unsigned long flags, int ssize);
>> > +extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
>> > + unsigned int subpg_index, unsigned long slot);
>> > +extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
>> > + int ssize, real_pte_t rpte, unsigned int subpg_index);
>> > +
>> > struct mm_struct;
>> > unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
>> > extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
>> > diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
>> > index 44fe483..b832ed3 100644
>> > --- a/arch/powerpc/mm/dump_linuxpagetables.c
>> > +++ b/arch/powerpc/mm/dump_linuxpagetables.c
>> > @@ -213,7 +213,7 @@ struct flag_info {
>> > .val = H_PAGE_4K_PFN,
>> > .set = "4K_pfn",
>> > }, {
>> > -#endif
>> > +#else
>> > .mask = H_PAGE_F_GIX,
>> > .val = H_PAGE_F_GIX,
>> > .set = "f_gix",
>> > @@ -224,6 +224,7 @@ struct flag_info {
>> > .val = H_PAGE_F_SECOND,
>> > .set = "f_second",
>> > }, {
>> > +#endif /* CONFIG_PPC_64K_PAGES */
>> > #endif
>> > .mask = _PAGE_SPECIAL,
>> > .val = _PAGE_SPECIAL,
>> > diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
>> > index 6fa450c..5b16416 100644
>> > --- a/arch/powerpc/mm/hash64_4k.c
>> > +++ b/arch/powerpc/mm/hash64_4k.c
>> > @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>> > pte_t *ptep, unsigned long trap, unsigned long flags,
>> > int ssize, int subpg_prot)
>> > {
>> > + real_pte_t rpte;
>> > unsigned long hpte_group;
>> > unsigned long rflags, pa;
>> > unsigned long old_pte, new_pte;
>> > @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>> > * need to add in 0x1 if it's a read-only user page
>> > */
>> > rflags = htab_convert_pte_flags(new_pte);
>> > + rpte = __real_pte(__pte(old_pte), ptep);
>> >
>> > if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
>> > !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
>> > @@ -64,12 +66,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>> > /*
>> > * There MIGHT be an HPTE for this pte
>> > */
>> > - hash = hpt_hash(vpn, shift, ssize);
>> > - if (old_pte & H_PAGE_F_SECOND)
>> > - hash = ~hash;
>> > - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
>> > - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
>> > -
>> > + slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
>>
>>
>> This confused me, the rest of the code assumes slot is the 4-bit value.
>> But get_hidx_slot is not exactly returning that.
>
> get_hidx_slot() is returning a 4-bit value, no?

It is not
unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
			    int ssize, real_pte_t rpte, unsigned int subpg_index)
{
	unsigned long hash, slot, hidx;

	hash = hpt_hash(vpn, shift, ssize);
	hidx = __rpte_to_hidx(rpte, subpg_index);
	if (hidx & _PTEIDX_SECONDARY)
		hash = ~hash;
	slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
	slot += hidx & _PTEIDX_GROUP_IX;
	return slot;
}
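
(To make the confusion explicit: the 4-bit hidx stored per sub-page encodes only
the bucket choice and the entry within that bucket, while get_hidx_slot() expands
it into a global offset into the hash table. A re-annotated version of the same
computation, using the constants from mmu-hash.h; commentary only, not new code:)

	hidx = __rpte_to_hidx(rpte, subpg_index);		/* 4-bit value */
	if (hidx & _PTEIDX_SECONDARY)				/* top bit: secondary bucket */
		hash = ~hash;
	slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;	/* bucket base */
	slot += hidx & _PTEIDX_GROUP_IX;			/* low 3 bits: entry in bucket */
	/* 'slot' is now a global HPTE index, not a 4-bit value */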

-aneesh

2017-06-13 04:52:56

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys

Ram Pai <[email protected]> writes:

> Rearrange PTE bits to free up bits 3, 4, 5 and 6 for
> memory keys. Bit 3, 4, 5, 6 and 57 shall be used for memory
> keys.
>
> The patch does the following change to the 64K PTE format
>
> H_PAGE_BUSY moves from bit 3 to bit 7
> H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> of the pte.
> H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the
> second part of the pte.
>
> The second part of the PTE will hold
> a (H_PAGE_F_SECOND|H_PAGE_F_GIX) for 64K page backed pte,
> and sixteen (H_PAGE_F_SECOND|H_PAGE_F_GIX) for 4k backed
> pte.
>
> the four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> is initialized to 0xF, indicating an invalid slot. If a hash page does
> get allocated to the 0xF slot, it is released and not used. In
> other words, even though 0xF is a valid slot we discard it and
> consider it an invalid slot (HPTE_SOFT_INVALID). This gives us an
> opportunity to not depend on a bit in the primary PTE in order to
> determine the validity of a slot.
>
> When we release a 0xF slot we also release a legitimate primary
> slot and unmap that entry. This is to ensure that we do get
> a legitimate non-0xF slot the next time we retry for a slot.
>
> Though treating 0xF slot as invalid reduces the number of available
> slots and may have an effect on the performance, the probability
> of hitting a 0xF is extremely low.
>
> Compared to the current scheme, the above described scheme reduces
> the number of false hash table updates significantly and has the
> added advantage of releasing four valuable PTE bits for other
> purpose.
>
> This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> Ellermen and myself.
>
> 4K PTE format remains unchanged currently.
>

Can you also split this patch into two: one which changes
__hash_page_4k(), i.e. the linux pte format w.r.t. the 4k hash pte, and a second
patch with changes w.r.t. __hash_page_64k(), i.e. the pte format w.r.t. the 64k hash pte.

-aneesh

2017-06-13 21:51:26

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys

On Tue, Jun 13, 2017 at 07:32:24AM +0530, Aneesh Kumar K.V wrote:
> Ram Pai <[email protected]> writes:
>
> > On Mon, Jun 12, 2017 at 12:27:44PM +0530, Aneesh Kumar K.V wrote:
> >> Ram Pai <[email protected]> writes:
> >>
> >> > Rearrange PTE bits to free up bits 3, 4, 5 and 6 for
> >> > memory keys. Bit 3, 4, 5, 6 and 57 shall be used for memory
> >> > keys.
> >> >
> >> > The patch does the following change to the 64K PTE format
> >> >
> >> > H_PAGE_BUSY moves from bit 3 to bit 7
> >> > H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> >> > of the pte.
> >> > H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the
> >> > second part of the pte.
> >> >
> >> > The second part of the PTE will hold
> >> > a (H_PAGE_F_SECOND|H_PAGE_F_GIX) for 64K page backed pte,
> >> > and sixteen (H_PAGE_F_SECOND|H_PAGE_F_GIX) for 4k backed
> >> > pte.
> >> >
> >> > the four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> >> > is initialized to 0xF, indicating an invalid slot. If a hash page does
> >> > get allocated to the 0xF slot, it is released and not used. In
> >> > other words, even though 0xF is a valid slot we discard it and
> >> > consider it an invalid slot (HPTE_SOFT_INVALID). This gives us an
> >> > opportunity to not depend on a bit in the primary PTE in order to
> >> > determine the validity of a slot.
> >>
> >> Do we need to do this for 64K hptes ? H_PAGE_HASHPTE indicates whether a
> >> slot is valid or not. For 4K hptes, we do need this right ? ie, only
> >> when H_PAGE_COMBO is set we need to consider 0xf as an invalid slot
> >
> > for 64k hptes, you are right: we do not use 0xF to
> > track the validity of a slot. We just depend on the H_PAGE_HASHPTE flag.
> >
> > for 4k hptes, we need to depend on both H_PAGE_HASHPTE as well as the
> > value of the slot. H_PAGE_HASHPTE tells us whether there exist any valid
> > 4k HPTEs, and the 4-bit values in the second-part-of-the-pte tell
> > us whether they are valid values.
> >
> > However in either case we do not need H_PAGE_COMBO. That flag is not
> > used for ptes. But we continue to use that flag for pmd to track
> > hugepages, which is why I have not entirely divorced H_PAGE_COMBO from
> > the 64K pagesize case.
>
>
> Really? Maybe I am missing that in the patch. H_PAGE_COMBO indicates
> whether a 64K linux page is mapped via 4k hash page table entries. I
> don't see you changing that in __hash_page_4k()
>
> we still continue to do.
>
> new_pte = old_pte | H_PAGE_BUSY | _PAGE_ACCESSED | H_PAGE_COMBO;
>
> /*
> * Check if the pte was already inserted into the hash table
> * as a 64k HW page, and invalidate the 64k HPTE if so.
> */
> if (!(old_pte & H_PAGE_COMBO)) {
> flush_hash_page(vpn, rpte, MMU_PAGE_64K, ssize, flags);
>

Ok, my memory blanked. In this patch we continue to depend on the COMBO
flag to distinguish between PTEs backed by a 4k hpte and a 64k hpte. So the
H_PAGE_COMBO flag cannot go for now in this patch, which means
_PAGE_HPTEFLAGS must continue to have the H_PAGE_COMBO flag too.

I had a patch to get rid of the COMBO bit too, which was dropped. Will
regenerate a separate patch to get rid of the COMBO bit in __hash_page_4k().
That will divorce the COMBO bit entirely from the 64K pte.

>
>
> >
> >>
> >>
> >> >
> >> > When we release a 0xF slot we also release a legitimate primary
> >> > slot and unmap that entry. This is to ensure that we do get
> >> > a legitimate non-0xF slot the next time we retry for a slot.
> >>
> >> Can you explain this more ? what is a primary slot here ?
> >
> > I may not be using the right terminology here. But when I say slot, I
> > mean the four bits that tell the position of the hpte in the hash
> > buckets. Bit 0 indicates whether it is the primary or secondary hash bucket,
> > and bits 1, 2 and 3 indicate the entry within the hash bucket.
> >
> > So when I say primary slot, I mean an entry in the primary hash bucket.
> >
> > The idea is, when hpte_insert returns an hpte which is cached in the 7th slot
> > of the secondary bucket, i.e. 0xF, we discard it, and also release a
> > random entry from the primary bucket, so that on retry we can get that
> > entry.
> >
> >
> >>
> >> >
> >> > Though treating 0xF slot as invalid reduces the number of available
> >> > slots and may have an effect on the performance, the probability
> >> > of hitting a 0xF is extremely low.
> >> >
> >> > Compared to the current scheme, the above described scheme reduces
> >> > the number of false hash table updates significantly and has the
> >> > added advantage of releasing four valuable PTE bits for other
> >> > purpose.
> >> >
> >> > This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> >> > Ellermen and myself.
> >> >
> >> > 4K PTE format remains unchanged currently.
> >> >
> >> > Signed-off-by: Ram Pai <[email protected]>
> >> > ---
> >> > arch/powerpc/include/asm/book3s/64/hash-4k.h | 12 +++++
> >> > arch/powerpc/include/asm/book3s/64/hash-64k.h | 38 ++++++--------
> >> > arch/powerpc/include/asm/book3s/64/hash.h | 8 ++-
> >> > arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 ++
> >> > arch/powerpc/mm/dump_linuxpagetables.c | 3 +-
> >> > arch/powerpc/mm/hash64_4k.c | 12 ++---
> >> > arch/powerpc/mm/hash64_64k.c | 73 +++++++++------------------
> >> > arch/powerpc/mm/hash_utils_64.c | 38 +++++++++++++-
> >> > arch/powerpc/mm/hugetlbpage-hash64.c | 16 +++---
> >> > 9 files changed, 107 insertions(+), 98 deletions(-)
> >> >
> >> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> >> > index b4b5e6b..223d318 100644
> >> > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> >> > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> >> > @@ -16,6 +16,18 @@
> >> > #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
> >> > #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
> >> >
> >> > +
> >> > +/*
> >> > + * Only supported by 4k linux page size
> >>
> >> that comment is confusing, you can drop that, it is part of hash-4k.h
> >> which means these defines are local to 4k linux page size config.
> >>
> >> > + */
> >
> > right. ok.
> >
> >> > +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> >> > +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> >> > +#define H_PAGE_F_GIX_SHIFT 56
> >> > +
> >> > +#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> >> > +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> >> > +
> >> > +
> >> > /* PTE flags to conserve for HPTE identification */
> >> > #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
> >> > H_PAGE_F_SECOND | H_PAGE_F_GIX)
> >> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> >> > index 9732837..87ee22d 100644
> >> > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> >> > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> >> > @@ -10,23 +10,21 @@
> >> > * 64k aligned address free up few of the lower bits of RPN for us
> >> > * We steal that here. For more deatils look at pte_pfn/pfn_pte()
> >> > */
> >> > -#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> >> > -#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
> >> > +#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */
> >> > +#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
> >> > +
> >> > +#define H_PAGE_BUSY _RPAGE_RPN44 /* software: PTE & hash are busy */
> >> > +#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> >> > +
> >> > /*
> >> > * We need to differentiate between explicit huge page and THP huge
> >> > * page, since THP huge page also need to track real subpage details
> >> > */
> >> > #define H_PAGE_THP_HUGE H_PAGE_4K_PFN
> >> >
> >> > -/*
> >> > - * Used to track subpage group valid if H_PAGE_COMBO is set
> >> > - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
> >> > - */
> >> > -#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
> >> > -
> >> > /* PTE flags to conserve for HPTE identification */
> >> > -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
> >> > - H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
> >> > +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE)
> >> > +
> >>
> >> You drop H_PAGE_COMBO here. Is that intentional ?
> >
> > Yes, that is intentional. See a downside? We don't depend on
> > H_PAGE_COMBO for the 64K pagesize PTE anymore.
> >
> >>
> >> We have code which does
> >>
> >> if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
> >> MMU_PAGE_64K, ssize,
> >> flags) == -1)
> >> old_pte &= ~_PAGE_HPTEFLAGS;
> >>
> >>
> >> I guess they expect slot details to be cleared. With the above
> >> _PAGE_HPTEFLAGS that is not true.
> >
> > since we don't depend on the COMBO flag, it should be fine. Let me know if
> > you see a downside.
> >
> >>
> >>
> >> > /*
> >> > * we support 16 fragments per PTE page of 64K size.
> >> > */
> >> > @@ -54,24 +52,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
> >> > unsigned long *hidxp;
> >> >
> >> > rpte.pte = pte;
> >> > - rpte.hidx = 0;
> >> > - if (pte_val(pte) & H_PAGE_COMBO) {
> >> > - /*
> >> > - * Make sure we order the hidx load against the H_PAGE_COMBO
> >> > - * check. The store side ordering is done in __hash_page_4K
> >> > - */
> >> > - smp_rmb();
> >> > - hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> >> > - rpte.hidx = *hidxp;
> >> > - }
> >> > + /*
> >> > + * The store side ordering is done in __hash_page_4K
> >> > + */
> >> > + smp_rmb();
> >> > + hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> >> > + rpte.hidx = *hidxp;
> >> > return rpte;
> >> > }
> >> >
> >> > static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
> >> > {
> >> > - if ((pte_val(rpte.pte) & H_PAGE_COMBO))
> >> > - return (rpte.hidx >> (index<<2)) & 0xf;
> >> > - return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
> >> > + return ((rpte.hidx >> (index<<2)) & 0xfUL);
> >> > }
> >> >
> >> > #define __rpte_to_pte(r) ((r).pte)
> >> > diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> >> > index 4e957b0..7742782 100644
> >> > --- a/arch/powerpc/include/asm/book3s/64/hash.h
> >> > +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> >> > @@ -8,11 +8,9 @@
> >> > *
> >> > */
> >> > #define H_PTE_NONE_MASK _PAGE_HPTEFLAGS
> >> > -#define H_PAGE_F_GIX_SHIFT 56
> >> > -#define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */
> >> > -#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> >> > -#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> >> > -#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
> >> > +
> >> > +#define INIT_HIDX (~0x0UL)
> >>
> >> But then you use it like
> >>
> >> rpte.hidx = INIT_HIDX;
> >>
> >> A better value would be INIT_HIDX = 0xF? Also, maybe INIT is the wrong name
> >> there?
> >
> > No. We are initializing all the entries in the second-part-of-the-pte to
> > 0xF, which essentially boils down to ~0x0UL.
>
> That unsigned long casting is what I was suggesting to remove. We
> may want to treat it as a 4-bit value everywhere?
>
>
> >
> >>
> >> > +#define HPTE_SOFT_INVALID(slot) ((slot & 0xfUL) == 0xfUL)
> >>
> >> Can you do that as a static inline ?
> >
> > ok.
> >
> >>
> >> >
> >> > #ifdef CONFIG_PPC_64K_PAGES
> >> > #include <asm/book3s/64/hash-64k.h>
> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> >> > index 6981a52..cfb8169 100644
> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> >> > @@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
> >> > extern int __hash_page_64K(unsigned long ea, unsigned long access,
> >> > unsigned long vsid, pte_t *ptep, unsigned long trap,
> >> > unsigned long flags, int ssize);
> >> > +extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> >> > + unsigned int subpg_index, unsigned long slot);
> >> > +extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> >> > + int ssize, real_pte_t rpte, unsigned int subpg_index);
> >> > +
> >> > struct mm_struct;
> >> > unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
> >> > extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
> >> > diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
> >> > index 44fe483..b832ed3 100644
> >> > --- a/arch/powerpc/mm/dump_linuxpagetables.c
> >> > +++ b/arch/powerpc/mm/dump_linuxpagetables.c
> >> > @@ -213,7 +213,7 @@ struct flag_info {
> >> > .val = H_PAGE_4K_PFN,
> >> > .set = "4K_pfn",
> >> > }, {
> >> > -#endif
> >> > +#else
> >> > .mask = H_PAGE_F_GIX,
> >> > .val = H_PAGE_F_GIX,
> >> > .set = "f_gix",
> >> > @@ -224,6 +224,7 @@ struct flag_info {
> >> > .val = H_PAGE_F_SECOND,
> >> > .set = "f_second",
> >> > }, {
> >> > +#endif /* CONFIG_PPC_64K_PAGES */
> >> > #endif
> >> > .mask = _PAGE_SPECIAL,
> >> > .val = _PAGE_SPECIAL,
> >> > diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
> >> > index 6fa450c..5b16416 100644
> >> > --- a/arch/powerpc/mm/hash64_4k.c
> >> > +++ b/arch/powerpc/mm/hash64_4k.c
> >> > @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >> > pte_t *ptep, unsigned long trap, unsigned long flags,
> >> > int ssize, int subpg_prot)
> >> > {
> >> > + real_pte_t rpte;
> >> > unsigned long hpte_group;
> >> > unsigned long rflags, pa;
> >> > unsigned long old_pte, new_pte;
> >> > @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >> > * need to add in 0x1 if it's a read-only user page
> >> > */
> >> > rflags = htab_convert_pte_flags(new_pte);
> >> > + rpte = __real_pte(__pte(old_pte), ptep);
> >> >
> >> > if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
> >> > !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> >> > @@ -64,12 +66,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >> > /*
> >> > * There MIGHT be an HPTE for this pte
> >> > */
> >> > - hash = hpt_hash(vpn, shift, ssize);
> >> > - if (old_pte & H_PAGE_F_SECOND)
> >> > - hash = ~hash;
> >> > - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> >> > - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> >> > -
> >> > + slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
> >>
> >>
> >> This confused me, the rest of the code assumes slot is the 4-bit value.
> >> But get_hidx_slot is not exactly returning that.
> >
> > get_hidx_slot() is returning a 4-bit value, no?
>
> It is not
> unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> 			    int ssize, real_pte_t rpte, unsigned int subpg_index)
> {
> 	unsigned long hash, slot, hidx;
>
> 	hash = hpt_hash(vpn, shift, ssize);
> 	hidx = __rpte_to_hidx(rpte, subpg_index);
> 	if (hidx & _PTEIDX_SECONDARY)
> 		hash = ~hash;
> 	slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> 	slot += hidx & _PTEIDX_GROUP_IX;
> 	return slot;
> }

Actually, the piece of redundant code has been captured into a function
that is called from multiple places. There are no logic changes; it is just a
rearrangement of code. I think your confusion derives from the
naming of the function. Will rename it to get_hidx_slot_offset(),
better?
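
(For reference, a minimal sketch of the renamed helper and a call site; the name
get_hidx_slot_offset() is only the one proposed above and may still change in v2:)

extern unsigned long get_hidx_slot_offset(unsigned long vpn, unsigned long shift,
			int ssize, real_pte_t rpte, unsigned int subpg_index);

	/* e.g. in __hash_page_4K(), for sub-page 0: */
	slot = get_hidx_slot_offset(vpn, shift, ssize, rpte, 0);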


Thanks for your comments, will be sending out a v2 version soon.

RP
>
> -aneesh

2017-06-13 21:52:35

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys

On Tue, Jun 13, 2017 at 10:22:43AM +0530, Aneesh Kumar K.V wrote:
> Ram Pai <[email protected]> writes:
>
> > Rearrange PTE bits to free up bits 3, 4, 5 and 6 for
> > memory keys. Bit 3, 4, 5, 6 and 57 shall be used for memory
> > keys.
> >
> > The patch does the following change to the 64K PTE format
> >
> > H_PAGE_BUSY moves from bit 3 to bit 7
> > H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> > of the pte.
> > H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the
> > second part of the pte.
> >
> > The second part of the PTE will hold
> > a (H_PAGE_F_SECOND|H_PAGE_F_GIX) for 64K page backed pte,
> > and sixteen (H_PAGE_F_SECOND|H_PAGE_F_GIX) for 4k backed
> > pte.
> >
> > the four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> > is initialized to 0xF, indicating an invalid slot. If a hash page does
> > get allocated to the 0xF slot, it is released and not used. In
> > other words, even though 0xF is a valid slot we discard it and
> > consider it an invalid slot (HPTE_SOFT_INVALID). This gives us an
> > opportunity to not depend on a bit in the primary PTE in order to
> > determine the validity of a slot.
> >
> > When we release a 0xF slot we also release a legitimate primary
> > slot and unmap that entry. This is to ensure that we do get
> > a legitimate non-0xF slot the next time we retry for a slot.
> >
> > Though treating 0xF slot as invalid reduces the number of available
> > slots and may have an effect on the performance, the probability
> > of hitting a 0xF is extremely low.
> >
> > Compared to the current scheme, the above described scheme reduces
> > the number of false hash table updates significantly and has the
> > added advantage of releasing four valuable PTE bits for other
> > purpose.
> >
> > This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> > Ellermen and myself.
> >
> > 4K PTE format remains unchanged currently.
> >
>
> Can you also split this patch into two: one which changes
> __hash_page_4k(), i.e. the linux pte format w.r.t. the 4k hash pte, and a second
> patch with changes w.r.t. __hash_page_64k(), i.e. the pte format w.r.t. the 64k hash pte.

ok. A v2 version of the patch series will be out in a day or two.
RP

2017-06-16 09:20:30

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.

On 06/06/2017 06:35 AM, Ram Pai wrote:
> The value of the AMR register at the time of the exception
> is made available in gp_regs[PT_AMR] of the siginfo.

But it is already available there in uctxt->uc_mcontext.regs->amr
while inside the signal delivery context in user space. The
pt_regs already got updated with the new AMR register. Then why do we
need gp_regs to also contain AMR as well?

2017-06-16 10:33:39

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.

On Fri, 2017-06-16 at 14:50 +0530, Anshuman Khandual wrote:
> On 06/06/2017 06:35 AM, Ram Pai wrote:
> > The value of the AMR register at the time of the exception
> > is made available in gp_regs[PT_AMR] of the siginfo.
>
> But it is already available there in uctxt->uc_mcontext.regs->amr
> while inside the signal delivery context in user space. The
> pt_regs already got updated with the new AMR register. Then why do we
> need gp_regs to also contain AMR as well?

Also changing gp_regs layout/size is a major ABI issue...

Ben.

2017-06-16 11:18:37

by Michael Ellerman

[permalink] [raw]
Subject: Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.

Ram Pai <[email protected]> writes:
> diff --git a/arch/powerpc/include/uapi/asm/ptrace.h b/arch/powerpc/include/uapi/asm/ptrace.h
> index 8036b38..109d0c2 100644
> --- a/arch/powerpc/include/uapi/asm/ptrace.h
> +++ b/arch/powerpc/include/uapi/asm/ptrace.h
> @@ -49,6 +49,8 @@ struct pt_regs {
> unsigned long dar; /* Fault registers */
> unsigned long dsisr; /* on 4xx/Book-E used for ESR */
> unsigned long result; /* Result of a system call */
> + unsigned long dscr; /* contents of the DSCR register */
> + unsigned long amr; /* contents of AMR register */
> };

You can't change pt_regs, it's ABI.

> @@ -109,7 +111,8 @@ struct pt_regs {
> #define PT_DSISR 42
> #define PT_RESULT 43
> #define PT_DSCR 44
> -#define PT_REGS_COUNT 44
> +#define PT_AMR 45
> +#define PT_REGS_COUNT 45

You can add PT_AMR, but it has to be synthetic like DSCR, ie. not
actually in pt_regs but available via ptrace.

But do we want to do that? How does the x86 code export the key(s) of a
process? Or doesn't it?

cheers

2017-06-16 19:10:56

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.

On Fri, Jun 16, 2017 at 02:50:13PM +0530, Anshuman Khandual wrote:
> On 06/06/2017 06:35 AM, Ram Pai wrote:
> > The value of the AMR register at the time of the exception
> > is made available in gp_regs[PT_AMR] of the siginfo.
>
> But it is already available there in uctxt->uc_mcontext.regs->amr
> while inside the signal delivery context in user space. The
> pt_regs already got updated with the new AMR register. Then why do we
> need gp_regs to also contain AMR as well?

It should not be available in uctxt->uc_mcontext.regs->amr.
In fact, that field itself should not be there.

The idea was to use one of the unused fields in gp_regs, without
extending gp_regs, to provide the contents of AMR. The
PT_AMR offset in gp_regs is currently not used, which is what I am using
in this patch.

However, this patch needs to be modified so that it does not extend pt_regs
or uctxt->uc_mcontext.regs.

Thanks for raising this concern.
RP

--
Ram Pai

2017-06-16 19:15:52

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.

On Fri, Jun 16, 2017 at 08:33:01PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2017-06-16 at 14:50 +0530, Anshuman Khandual wrote:
> > On 06/06/2017 06:35 AM, Ram Pai wrote:
> > > The value of the AMR register at the time of the exception
> > > is made available in gp_regs[PT_AMR] of the siginfo.
> >
> > But it is already available there in uctxt->uc_mcontext.regs->amr
> > while inside the signal delivery context in user space. The
> > pt_regs already got updated with the new AMR register. Then why do we
> > need gp_regs to also contain AMR as well?
>
> Also changing gp_regs layout/size is a major ABI issue...

Ben,

gp_regs size is not changed, nor is the layout. An unused field in
the gp_regs is used to fill in the AMR contents. Old binaries will not
know about this unused field, and hence should not break.

New binaries can leverage this already existing but newly defined
field to read the contents of AMR.

Is it still a concern?
RP

>
> Ben.

--
Ram Pai

2017-06-16 19:36:56

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.

On Fri, Jun 16, 2017 at 09:18:29PM +1000, Michael Ellerman wrote:
> Ram Pai <[email protected]> writes:
> > diff --git a/arch/powerpc/include/uapi/asm/ptrace.h b/arch/powerpc/include/uapi/asm/ptrace.h
> > index 8036b38..109d0c2 100644
> > --- a/arch/powerpc/include/uapi/asm/ptrace.h
> > +++ b/arch/powerpc/include/uapi/asm/ptrace.h
> > @@ -49,6 +49,8 @@ struct pt_regs {
> > unsigned long dar; /* Fault registers */
> > unsigned long dsisr; /* on 4xx/Book-E used for ESR */
> > unsigned long result; /* Result of a system call */
> > + unsigned long dscr; /* contents of the DSCR register */
> > + unsigned long amr; /* contents of AMR register */
> > };
>
> You can't change pt_regs, it's ABI.
>
> > @@ -109,7 +111,8 @@ struct pt_regs {
> > #define PT_DSISR 42
> > #define PT_RESULT 43
> > #define PT_DSCR 44
> > -#define PT_REGS_COUNT 44
> > +#define PT_AMR 45
> > +#define PT_REGS_COUNT 45
>
> You can add PT_AMR, but it has to be synthetic like DSCR, ie. not
> actually in pt_regs but available via ptrace.

ok.

>
> But do we want to do that? How does the x86 code export the key(s) of a
> process? Or doesn't it?

The semantics defined on x86 are:

the signal handler has to have a way of knowing the contents of the
PKRU (the x86 equivalent of AMR). Also, the signal handler
has to have the ability to modify the PKRU before it returns
from the signal handler. This modified information will be
used by the kernel to program the CPU's PKRU register.

If the signal handler does not have the ability to do so, then
when the signal handler returns and the user code restarts executing
where it had left off, it will continue to access the same protected
address and fault again, which will again invoke the signal handler,
and this will continue infinitely.

We have to provide the same semantics on powerpc. The way I intend to
do it is to use one of the unused fields in the gp_regs and fill that
with the contents of the AMR register. PT_AMR, at offset 45 in gp_regs,
is not used currently. Offsets 45, 46, and 47 are available AFAICT.
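
(A minimal userspace sketch of how a handler would use the field proposed here,
assuming the 64-bit gp_regs layout and the PT_AMR index added by this patch; the
interface is still under discussion, so treat the names and behaviour as
illustrative only, not a final ABI:)

#include <signal.h>
#include <ucontext.h>
#include <asm/ptrace.h>		/* PT_AMR, as proposed by this patch */

static void segv_handler(int sig, siginfo_t *info, void *uctx)
{
	ucontext_t *uc = uctx;

	/* The AMR value at the time of the fault would be available here: */
	unsigned long amr = uc->uc_mcontext.gp_regs[PT_AMR];

	/* As an example, lift all key restrictions (AMR == 0 means no
	 * access/write disables) so the faulting access can make progress
	 * once the handler returns: */
	uc->uc_mcontext.gp_regs[PT_AMR] = 0;

	(void)sig; (void)info; (void)amr;
}

The handler would be installed with sigaction() and SA_SIGINFO so that the
ucontext pointer is passed in.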


Dave: Why is it not ok to reprogram the PKRU from the signal handler,
instead of telling the kernel to do so on its behalf? Or
have I got my understanding of the semantics wrong?


>
> cheers

--
Ram Pai

2017-06-16 22:55:30

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.

On Fri, 2017-06-16 at 12:15 -0700, Ram Pai wrote:
> gp_regs size is not changed, nor is the layout. An unused field in
> the gp_regs is used to fill in the AMR contents. Old binaries will not
> know about this unused field, and hence should not break.
>
> New binaries can leverage this already existing but newly defined
> field to read the contents of AMR.
>
> Is it still a concern?

Calls to sys_swapcontext with a made-up context will end up with a crap
AMR if done by code that didn't know about that register.

Ben.

2017-06-20 07:07:55

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys

Hi!

> Memory protection keys enable applications to protect its
> address space from inadvertent access or corruption from
> itself.
>
> The overall idea:
>
> A process allocates a key and associates it with
> a address range within its address space.
> The process than can dynamically set read/write
> permissions on the key without involving the
> kernel. Any code that violates the permissions
> off the address space; as defined by its associated
> key, will receive a segmentation fault.

Do you have some documentation on how userspace should use this? Will it
be possible to hide details in libc so that it works across
architectures? Do you have some kind of library that hides them?

Where would you like it to be used? Web browsers?

How does it interact with ptrace()? With /dev/mem? With /proc/XXX/mem?
Will it enable malware to become very hard to understand?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2017-06-22 21:41:46

by Ram Pai

[permalink] [raw]
Subject: Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.

On Sat, Jun 17, 2017 at 08:54:44AM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2017-06-16 at 12:15 -0700, Ram Pai wrote:
> > gp_regs size is not changed, nor is the layout. An unused field in
> > the gp_regs is used to fill in the AMR contents. Old binaries will not
> > know about this unused field, and hence should not break.
> >
> > New binaries can leverage this already existing but newly defined
> > field to read the contents of AMR.
> >
> > Is it still a concern?
>
> Calls to sys_swapcontext with a made-up context will end up with a crap
> AMR if done by code that didn't know about that register.

Turns out x86 does not have this problem, because x86 does not implement
sys_swapcontext. However, unlike x86, powerpc lets the signal handler
program the AMR (the x86 PKRU equivalent), which can persist even after the
signal handler returns to the kernel through sys_sigreturn.

So I am inclined to deviate from the x86 protection-key semantics.

On x86 the persistent way for the signal handler
to change the key register (PKRU) is through a field in the siginfo structure.

And on powerpc the persistent way for the signal handler to change the
key register (AMR) will be to directly program the AMR register.

This should resolve your concern on powerpc, since there is no way
a crap AMR value will change the real AMR register, because the powerpc
kernel will not let that happen. Acceptable?
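
(For reference, a minimal sketch of what directly programming the AMR from a
signal handler would look like in userspace. This assumes AMR is readable and
writable in problem state as SPR 13, as on ISA 3.0 hardware; the helper names
are made up for illustration and are not part of this patch series:)

#include <stdint.h>

static inline uint64_t read_amr(void)
{
	uint64_t amr;

	/* mfspr 13 reads the user-accessible AMR */
	asm volatile("mfspr %0, 13" : "=r"(amr));
	return amr;
}

static inline void write_amr(uint64_t amr)
{
	/* isync/mtspr/isync so the new permissions take effect immediately */
	asm volatile("isync; mtspr 13, %0; isync" : : "r"(amr) : "memory");
}

A handler could then, for example, call write_amr(0) to drop all key
restrictions before returning, instead of going through gp_regs.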

RP