Multikey Total Memory Encryption (MKTME)[1] is a technology that allows
transparent memory encryption in upcoming Intel platforms. See overview
below.
Here's an updated version of my patchset that adds support for MKTME.
Please review and consider applying.
The patchset provides in-kernel infrastructure for MKTME, but doesn't yet
provide a userspace interface.
The first 4 patches are for core mm. The rest are x86-specific.
The patchset is on top of the page_ext cleanups I posted earlier[2].
v3:
- Kernel can now access encrypted pages via per-KeyID direct mappings.
- Reworked page allocation for encrypted memory to minimize overhead for
  non-encrypted pages. It comes at a cost for allocation of encrypted
  pages: we have to flush the cache every time we allocate *and* free an
  encrypted page. We will need to optimize this later.
v2:
- Store the KeyID of a page in page_ext->flags rather than in anon_vma.
  The anon_vma approach turned out to be problematic. The main problem is
  that the anon_vma of a page is no longer stable after the last mapcount
  has gone. We would like to preserve the last used KeyID even for freed
  pages, as it allows us to avoid unnecessary cache flushing on allocation
  of an encrypted page. page_ext serves this purpose well enough.
- KeyID is now propagated through the page allocator. GFP_ENCRYPT is no
  longer needed.
- The patch "Decouple dynamic __PHYSICAL_MASK from AMD SME" has been fixed
  to work with AMD SEV (to be confirmed by AMD folks).
------------------------------------------------------------------------------
MKTME is built on top of TME. TME allows encryption of the entirety of
system memory using a single key. MKTME allows multiple encryption
domains, each with its own key -- different memory pages can be encrypted
with different keys.
Key design points of Intel MKTME:
- The initial HW implementation supports up to 63 keys (plus one default
  TME key), but the number of keys may be as low as 3, depending on SKU
  and BIOS settings.
- To access encrypted memory you need to use a mapping with the proper
  KeyID in the page table entry. The KeyID is encoded in the upper bits of
  the PFN in the page table entry (see the sketch after this list).
- The CPU does not enforce coherency between mappings of the same physical
  page with different KeyIDs or encryption keys. We need to take care of
  flushing the cache on allocation of an encrypted page and on returning
  it to the free pool.
- For managing keys, there's the MKTME_KEY_PROGRAM leaf of the new PCONFIG
  (platform configuration) instruction. It allows loading and clearing
  keys associated with a KeyID. You can also ask the CPU to generate a key
  for you, or disable memory encryption when a KeyID is used.
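As an illustration only (not part of the patchset), here is a minimal
sketch of how a KeyID would be folded into the upper physical-address bits
of a page table entry. The helper name and the keyid_shift/keyid_mask
parameters are made up; real values come from hardware enumeration:

  /*
   * Illustrative sketch: encode a KeyID into the upper physical
   * address bits of a page table entry value. The shift and mask are
   * hypothetical; real values come from CPUID/MSR enumeration.
   */
  static inline u64 pte_set_keyid(u64 pteval, int keyid,
                                  int keyid_shift, u64 keyid_mask)
  {
          pteval &= ~keyid_mask;                 /* drop old KeyID bits */
          pteval |= (u64)keyid << keyid_shift;   /* encode new KeyID */
          return pteval;
  }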
[1] https://software.intel.com/sites/default/files/managed/a5/16/Multi-Key-Total-Memory-Encryption-Spec.pdf
[2] https://lkml.kernel.org/r/[email protected]
Kirill A. Shutemov (17):
mm: Do no merge VMAs with different encryption KeyIDs
mm/khugepaged: Do not collapse pages in encrypted VMAs
mm/ksm: Do not merge pages with different KeyIDs
mm/page_alloc: Handle allocation for encrypted memory
x86/mm: Mask out KeyID bits from page table entry pfn
x86/mm: Introduce variables to store number, shift and mask of KeyIDs
x86/mm: Preserve KeyID on pte_modify() and pgprot_modify()
x86/mm: Implement vma_is_encrypted() and vma_keyid()
x86/mm: Implement page_keyid() using page_ext
x86/mm: Implement prep_encrypted_page() and arch_free_page()
x86/mm: Rename CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING
x86/mm: Allow to disable MKTME after enumeration
x86/mm: Detect MKTME early
x86/mm: Introduce direct_mapping_size
x86/mm: Implement sync_direct_mapping()
x86/mm: Handle encrypted memory in page_to_virt() and __pa()
x86: Introduce CONFIG_X86_INTEL_MKTME
arch/alpha/include/asm/page.h | 2 +-
arch/x86/Kconfig | 21 +-
arch/x86/include/asm/mktme.h | 60 +++
arch/x86/include/asm/page.h | 1 +
arch/x86/include/asm/page_64.h | 3 +-
arch/x86/include/asm/pgtable_types.h | 7 +-
arch/x86/kernel/cpu/intel.c | 40 +-
arch/x86/kernel/head64.c | 2 +
arch/x86/mm/Makefile | 2 +
arch/x86/mm/init_64.c | 6 +
arch/x86/mm/kaslr.c | 21 +-
arch/x86/mm/mktme.c | 583 +++++++++++++++++++++++++++
include/linux/gfp.h | 38 +-
include/linux/migrate.h | 8 +-
include/linux/mm.h | 21 +
include/linux/page_ext.h | 11 +-
mm/compaction.c | 4 +
mm/khugepaged.c | 2 +
mm/ksm.c | 3 +
mm/mempolicy.c | 25 +-
mm/migrate.c | 4 +-
mm/mmap.c | 3 +-
mm/page_alloc.c | 63 +++
mm/page_ext.c | 3 +
24 files changed, 893 insertions(+), 40 deletions(-)
create mode 100644 arch/x86/include/asm/mktme.h
create mode 100644 arch/x86/mm/mktme.c
--
2.17.1
Add a new config option to enable/disable Multi-Key Total Memory
Encryption support.
MKTME uses MEMORY_PHYSICAL_PADDING to reserve enough space in per-KeyID
direct mappings for memory hotplug.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4fa2cf807321..d013495bb4ae 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1513,6 +1513,23 @@ config ARCH_USE_MEMREMAP_PROT
def_bool y
depends on AMD_MEM_ENCRYPT
+config X86_INTEL_MKTME
+ bool "Intel Multi-Key Total Memory Encryption"
+ select DYNAMIC_PHYSICAL_MASK
+ select PAGE_EXTENSION
+ depends on X86_64 && CPU_SUP_INTEL
+ ---help---
+ Say yes to enable support for Multi-Key Total Memory Encryption.
+ This requires an Intel processor that supports the feature.
+
+ Multikey Total Memory Encryption (MKTME) is a technology that allows
+ transparent memory encryption in upcoming Intel platforms.
+
+ MKTME is built on top of TME. TME allows encryption of the entirety
+ of system memory using a single key. MKTME allows multiple encryption
+ domains, each with its own key -- different memory pages can be
+ encrypted with different keys.
+
# Common NUMA Features
config NUMA
bool "Numa Memory Allocation and Scheduler Support"
@@ -2189,7 +2206,7 @@ config RANDOMIZE_MEMORY
config MEMORY_PHYSICAL_PADDING
hex "Physical memory mapping padding" if EXPERT
- depends on RANDOMIZE_MEMORY
+ depends on RANDOMIZE_MEMORY || X86_INTEL_MKTME
default "0xa" if MEMORY_HOTPLUG
default "0x0"
range 0x1 0x40 if MEMORY_HOTPLUG
--
2.17.1
Per-KeyID direct mappings require changes to how we find the right
virtual address for a page and to virt-to-phys address translation.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 3 +++
arch/x86/include/asm/page_64.h | 2 +-
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index efc0d4bb3b35..d6edcabacfc7 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -43,6 +43,9 @@ void mktme_disable(void);
void setup_direct_mapping_size(void);
int sync_direct_mapping(void);
+#define page_to_virt(x) \
+ (__va(PFN_PHYS(page_to_pfn(x))) + page_keyid(x) * direct_mapping_size)
+
#else
#define mktme_keyid_mask ((phys_addr_t)0)
#define mktme_nr_keyids 0
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 53c32af895ab..ffad496aadad 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -23,7 +23,7 @@ static inline unsigned long __phys_addr_nodebug(unsigned long x)
/* use the carry flag to determine if x was < __START_KERNEL_map */
x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));
- return x;
+ return x % direct_mapping_size;
}
#ifdef CONFIG_DEBUG_VIRTUAL
--
2.17.1
For encrypted memory, we need to allocate pages with a specific
encryption KeyID.
There are two cases when we need to allocate a page for encryption:
- Allocation for an encrypted VMA;
- Allocation for migration of encrypted page;
The first case can be covered within alloc_page_vma().
The second case requires a few new page allocation routines that
allocate pages for a specific KeyID.
An encrypted page has to be cleared after the KeyID is set. This is
handled by prep_encrypted_page(), which will be provided by
arch-specific code.
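Below is a hedged sketch (not from the patch itself) of how a migration
caller is expected to use the new KeyID-aware helper; the function name
new_encrypted_page() is made up for illustration:

  /*
   * Illustrative caller: allocate a migration target carrying the same
   * KeyID as the source page. __alloc_pages_node_keyid() strips
   * __GFP_ZERO, allocates, and then calls prep_encrypted_page() so the
   * page is zeroed only after the KeyID has been assigned.
   */
  static struct page *new_encrypted_page(struct page *src, int nid)
  {
          return __alloc_pages_node_keyid(nid, page_keyid(src),
                          GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0);
  }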
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/alpha/include/asm/page.h | 2 +-
include/linux/gfp.h | 38 ++++++++++++++++-----
include/linux/migrate.h | 8 +++--
mm/compaction.c | 4 +++
mm/mempolicy.c | 25 ++++++++++----
mm/migrate.c | 4 +--
mm/page_alloc.c | 63 +++++++++++++++++++++++++++++++++++
7 files changed, 122 insertions(+), 22 deletions(-)
diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
index f3fb2848470a..9a6fbb5269f3 100644
--- a/arch/alpha/include/asm/page.h
+++ b/arch/alpha/include/asm/page.h
@@ -18,7 +18,7 @@ extern void clear_page(void *page);
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
- alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vmaddr)
+ alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
extern void copy_page(void * _to, void * _from);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fc5ab85278d5..59d607d135e9 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -446,16 +446,30 @@ static inline void arch_free_page(struct page *page, int order) { }
static inline void arch_alloc_page(struct page *page, int order) { }
#endif
+#ifndef prep_encrypted_page
+static inline void prep_encrypted_page(struct page *page, int order,
+ int keyid, bool zero)
+{
+}
+#endif
+
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
+struct page *
+__alloc_pages_nodemask_keyid(gfp_t gfp_mask, unsigned int order,
+ int preferred_nid, nodemask_t *nodemask, int keyid);
+
static inline struct page *
__alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
{
return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
}
+struct page *__alloc_pages_node_keyid(int nid, int keyid,
+ gfp_t gfp_mask, unsigned int order);
+
/*
* Allocate pages, preferring the node given as nid. The node must be valid and
* online. For more general interface, see alloc_pages_node().
@@ -483,6 +497,19 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
return __alloc_pages_node(nid, gfp_mask, order);
}
+static inline struct page *alloc_pages_node_keyid(int nid, int keyid,
+ gfp_t gfp_mask, unsigned int order)
+{
+ if (nid == NUMA_NO_NODE)
+ nid = numa_mem_id();
+
+ return __alloc_pages_node_keyid(nid, keyid, gfp_mask, order);
+}
+
+extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
+ struct vm_area_struct *vma, unsigned long addr,
+ int node, bool hugepage);
+
#ifdef CONFIG_NUMA
extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
@@ -491,24 +518,17 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
{
return alloc_pages_current(gfp_mask, order);
}
-extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
- struct vm_area_struct *vma, unsigned long addr,
- int node, bool hugepage);
-#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
- alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
#else
#define alloc_pages(gfp_mask, order) \
alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
- alloc_pages(gfp_mask, order)
-#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
- alloc_pages(gfp_mask, order)
#endif
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
#define alloc_page_vma(gfp_mask, vma, addr) \
alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
#define alloc_page_vma_node(gfp_mask, vma, addr, node) \
alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
+ alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f2b4abbca55e..6da504bad841 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -38,9 +38,11 @@ static inline struct page *new_page_nodemask(struct page *page,
unsigned int order = 0;
struct page *new_page = NULL;
- if (PageHuge(page))
+ if (PageHuge(page)) {
+ WARN_ON(page_keyid(page));
return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
preferred_nid, nodemask);
+ }
if (PageTransHuge(page)) {
gfp_mask |= GFP_TRANSHUGE;
@@ -50,8 +52,8 @@ static inline struct page *new_page_nodemask(struct page *page,
if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
gfp_mask |= __GFP_HIGHMEM;
- new_page = __alloc_pages_nodemask(gfp_mask, order,
- preferred_nid, nodemask);
+ new_page = __alloc_pages_nodemask_keyid(gfp_mask, order,
+ preferred_nid, nodemask, page_keyid(page));
if (new_page && PageTransHuge(new_page))
prep_transhuge_page(new_page);
diff --git a/mm/compaction.c b/mm/compaction.c
index 29bd1df18b98..55261e634c34 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1170,6 +1170,7 @@ static struct page *compaction_alloc(struct page *migratepage,
{
struct compact_control *cc = (struct compact_control *)data;
struct page *freepage;
+ int keyid;
/*
* Isolate free pages if necessary, and if we are not aborting due to
@@ -1187,6 +1188,9 @@ static struct page *compaction_alloc(struct page *migratepage,
list_del(&freepage->lru);
cc->nr_freepages--;
+ keyid = page_keyid(migratepage);
+ if (keyid)
+ prep_encrypted_page(freepage, 0, keyid, false);
return freepage;
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9ac49ef17b4e..00bccbececea 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -920,22 +920,24 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
/* page allocation callback for NUMA node migration */
struct page *alloc_new_node_page(struct page *page, unsigned long node)
{
- if (PageHuge(page))
+ if (PageHuge(page)) {
+ WARN_ON(page_keyid(page));
return alloc_huge_page_node(page_hstate(compound_head(page)),
node);
- else if (PageTransHuge(page)) {
+ } else if (PageTransHuge(page)) {
struct page *thp;
- thp = alloc_pages_node(node,
+ thp = alloc_pages_node_keyid(node, page_keyid(page),
(GFP_TRANSHUGE | __GFP_THISNODE),
HPAGE_PMD_ORDER);
if (!thp)
return NULL;
prep_transhuge_page(thp);
return thp;
- } else
- return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
- __GFP_THISNODE, 0);
+ } else {
+ return __alloc_pages_node_keyid(node, page_keyid(page),
+ GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
+ }
}
/*
@@ -2012,9 +2014,16 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
{
struct mempolicy *pol;
struct page *page;
- int preferred_nid;
+ bool zero = false;
+ int keyid, preferred_nid;
nodemask_t *nmask;
+ keyid = vma_keyid(vma);
+ if (keyid && gfp & __GFP_ZERO) {
+ zero = true;
+ gfp &= ~__GFP_ZERO;
+ }
+
pol = get_vma_policy(vma, addr);
if (pol->mode == MPOL_INTERLEAVE) {
@@ -2057,6 +2066,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
mpol_cond_put(pol);
out:
+ if (page && keyid)
+ prep_encrypted_page(page, order, keyid, zero);
return page;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 8c0af0f7cab1..eb8dea219dcb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1847,7 +1847,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
int nid = (int) data;
struct page *newpage;
- newpage = __alloc_pages_node(nid,
+ newpage = __alloc_pages_node_keyid(nid, page_keyid(page),
(GFP_HIGHUSER_MOVABLE |
__GFP_THISNODE | __GFP_NOMEMALLOC |
__GFP_NORETRY | __GFP_NOWARN) &
@@ -2030,7 +2030,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
if (numamigrate_update_ratelimit(pgdat, HPAGE_PMD_NR))
goto out_dropref;
- new_page = alloc_pages_node(node,
+ new_page = alloc_pages_node_keyid(node, page_keyid(page),
(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
HPAGE_PMD_ORDER);
if (!new_page)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 22320ea27489..472286b0553f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3700,6 +3700,49 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
}
#endif /* CONFIG_COMPACTION */
+#ifndef CONFIG_NUMA
+struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
+ struct vm_area_struct *vma, unsigned long addr,
+ int node, bool hugepage)
+{
+ struct page *page;
+ bool zero = false;
+ int keyid = vma_keyid(vma);
+
+ if (keyid && (gfp_mask & __GFP_ZERO)) {
+ zero = true;
+ gfp_mask &= ~__GFP_ZERO;
+ }
+
+ page = alloc_pages(gfp_mask, order);
+ if (page && keyid)
+ prep_encrypted_page(page, order, keyid, zero);
+
+ return page;
+}
+#endif
+
+struct page * __alloc_pages_node_keyid(int nid, int keyid,
+ gfp_t gfp_mask, unsigned int order)
+{
+ struct page *page;
+ bool zero = false;
+
+ VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+ VM_WARN_ON(!node_online(nid));
+
+ if (keyid && (gfp_mask & __GFP_ZERO)) {
+ zero = true;
+ gfp_mask &= ~__GFP_ZERO;
+ }
+
+ page = __alloc_pages(gfp_mask, order, nid);
+ if (page && keyid)
+ prep_encrypted_page(page, order, keyid, zero);
+
+ return page;
+}
+
#ifdef CONFIG_LOCKDEP
struct lockdep_map __fs_reclaim_map =
STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
@@ -4396,6 +4439,26 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
}
EXPORT_SYMBOL(__alloc_pages_nodemask);
+struct page *
+__alloc_pages_nodemask_keyid(gfp_t gfp_mask, unsigned int order,
+ int preferred_nid, nodemask_t *nodemask, int keyid)
+{
+ struct page *page;
+ bool zero = false;
+
+ if (keyid && (gfp_mask & __GFP_ZERO)) {
+ zero = true;
+ gfp_mask &= ~__GFP_ZERO;
+ }
+
+ page = __alloc_pages_nodemask(gfp_mask, order,
+ preferred_nid, nodemask);
+ if (page && keyid)
+ prep_encrypted_page(page, order, keyid, zero);
+ return page;
+}
+EXPORT_SYMBOL(__alloc_pages_nodemask_keyid);
+
/*
* Common helper functions.
*/
--
2.17.1
The kernel needs a way to access encrypted memory. We are going to use
per-KeyID direct mappings to facilitate the access with minimal
overhead.
The direct mappings for each KeyID will be put next to each other in the
virtual address space. We need a way to find the boundaries of the direct
mapping for a particular KeyID.
The new variable direct_mapping_size specifies the size of a direct
mapping. With this value, it's trivial to find the direct mapping for
KeyID-N: PAGE_OFFSET + N * direct_mapping_size.
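For illustration, a minimal sketch of the lookup this formula implies; the
helper name is hypothetical and not part of the patch:

  /*
   * Hypothetical helper: virtual address of a physical address within
   * the direct mapping that belongs to the given KeyID.
   */
  static inline void *mktme_keyid_va(phys_addr_t paddr, int keyid)
  {
          return (void *)(PAGE_OFFSET + keyid * direct_mapping_size + paddr);
  }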
The size of the direct mapping is calculated during KASLR setup. If KASLR
is disabled, it happens during MKTME initialization.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 2 ++
arch/x86/include/asm/page_64.h | 1 +
arch/x86/kernel/head64.c | 2 ++
arch/x86/mm/kaslr.c | 21 ++++++++++++---
arch/x86/mm/mktme.c | 48 ++++++++++++++++++++++++++++++++++
5 files changed, 71 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index 9363b989a021..3bf481fe3f56 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -40,6 +40,8 @@ int page_keyid(const struct page *page);
void mktme_disable(void);
+void setup_direct_mapping_size(void);
+
#else
#define mktme_keyid_mask ((phys_addr_t)0)
#define mktme_nr_keyids 0
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 939b1cff4a7b..53c32af895ab 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -14,6 +14,7 @@ extern unsigned long phys_base;
extern unsigned long page_offset_base;
extern unsigned long vmalloc_base;
extern unsigned long vmemmap_base;
+extern unsigned long direct_mapping_size;
static inline unsigned long __phys_addr_nodebug(unsigned long x)
{
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index a21d6ace648e..b6175376b2e1 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -59,6 +59,8 @@ EXPORT_SYMBOL(vmalloc_base);
unsigned long vmemmap_base __ro_after_init = __VMEMMAP_BASE_L4;
EXPORT_SYMBOL(vmemmap_base);
#endif
+unsigned long direct_mapping_size __ro_after_init = -1UL;
+EXPORT_SYMBOL(direct_mapping_size);
#define __head __section(.head.text)
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 4408cd9a3bef..3d8ef8cb97e1 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -69,6 +69,15 @@ static inline bool kaslr_memory_enabled(void)
return kaslr_enabled() && !IS_ENABLED(CONFIG_KASAN);
}
+#ifndef CONFIG_X86_INTEL_MKTME
+static void __init setup_direct_mapping_size(void)
+{
+ direct_mapping_size = max_pfn << PAGE_SHIFT;
+ direct_mapping_size = round_up(direct_mapping_size, 1UL << TB_SHIFT);
+ direct_mapping_size += (1UL << TB_SHIFT) * CONFIG_MEMORY_PHYSICAL_PADDING;
+}
+#endif
+
/* Initialize base and padding for each memory region randomized with KASLR */
void __init kernel_randomize_memory(void)
{
@@ -93,7 +102,11 @@ void __init kernel_randomize_memory(void)
if (!kaslr_memory_enabled())
return;
- kaslr_regions[0].size_tb = 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT);
+ /*
+ * Upper limit for the direct mapping size is 1/4 of the whole
+ * virtual address space.
+ */
+ kaslr_regions[0].size_tb = 1 << (__VIRTUAL_MASK_SHIFT - 1 - TB_SHIFT);
kaslr_regions[1].size_tb = VMALLOC_SIZE_TB;
/*
@@ -101,8 +114,10 @@ void __init kernel_randomize_memory(void)
* add padding if needed (especially for memory hotplug support).
*/
BUG_ON(kaslr_regions[0].base != &page_offset_base);
- memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
- CONFIG_MEMORY_PHYSICAL_PADDING;
+
+ setup_direct_mapping_size();
+
+ memory_tb = (direct_mapping_size * (mktme_nr_keyids + 1)) >> TB_SHIFT;
/* Adapt phyiscal memory region size based on available memory */
if (memory_tb < kaslr_regions[0].size_tb)
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index 43a44f0f2a2d..3e5322bf035e 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -89,3 +89,51 @@ static bool need_page_mktme(void)
struct page_ext_operations page_mktme_ops = {
.need = need_page_mktme,
};
+
+void __init setup_direct_mapping_size(void)
+{
+ unsigned long available_va;
+
+ /* 1/4 of the virtual address space is dedicated to the direct mapping */
+ available_va = 1UL << (__VIRTUAL_MASK_SHIFT - 1);
+
+ /* How much memory does the system have? */
+ direct_mapping_size = max_pfn << PAGE_SHIFT;
+ direct_mapping_size = round_up(direct_mapping_size, 1UL << 40);
+
+ if (mktme_status != MKTME_ENUMERATED)
+ goto out;
+
+ /*
+ * Not enough virtual address space to address all physical memory with
+ * MKTME enabled. Even without padding.
+ *
+ * Disable MKTME instead.
+ */
+ if (direct_mapping_size > available_va / (mktme_nr_keyids + 1)) {
+ pr_err("x86/mktme: Disabled. Not enough virtual address space\n");
+ pr_err("x86/mktme: Consider switching to 5-level paging\n");
+ mktme_disable();
+ goto out;
+ }
+
+ /*
+ * Virtual address space is divided between per-KeyID direct mappings.
+ */
+ available_va /= mktme_nr_keyids + 1;
+out:
+ /* Add padding, if there's enough virtual address space */
+ direct_mapping_size += (1UL << 40) * CONFIG_MEMORY_PHYSICAL_PADDING;
+ if (direct_mapping_size > available_va)
+ direct_mapping_size = available_va;
+}
+
+static int __init mktme_init(void)
+{
+ /* KASLR didn't initialize it for us. */
+ if (direct_mapping_size == -1UL)
+ setup_direct_mapping_size();
+
+ return 0;
+}
+arch_initcall(mktme_init)
--
2.17.1
For MKTME we use per-KeyID direct mappings. This allows the kernel to
access encrypted memory.
sync_direct_mapping() syncs the per-KeyID direct mappings with the
canonical one -- KeyID-0.
The function tracks changes in the canonical mapping:
- creating or removing chunks of the translation tree;
- changes in mapping flags (i.e. protection bits);
- splitting huge page mapping into a page table;
- replacing page table with a huge page mapping;
The function needs to be called on every change to the direct mapping:
hotplug, hotremove, changes in permission bits, etc.
The function is a nop until MKTME is enabled.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 6 +
arch/x86/mm/init_64.c | 6 +
arch/x86/mm/mktme.c | 444 +++++++++++++++++++++++++++++++++++
3 files changed, 456 insertions(+)
diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index 3bf481fe3f56..efc0d4bb3b35 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -41,11 +41,17 @@ int page_keyid(const struct page *page);
void mktme_disable(void);
void setup_direct_mapping_size(void);
+int sync_direct_mapping(void);
#else
#define mktme_keyid_mask ((phys_addr_t)0)
#define mktme_nr_keyids 0
#define mktme_keyid_shift 0
+
+static inline int sync_direct_mapping(void)
+{
+ return 0;
+}
#endif
#endif
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 17383f9677fa..032b9a1ba8e1 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -731,6 +731,8 @@ kernel_physical_mapping_init(unsigned long paddr_start,
pgd_changed = true;
}
+ sync_direct_mapping();
+
if (pgd_changed)
sync_global_pgds(vaddr_start, vaddr_end - 1);
@@ -1142,10 +1144,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
static void __meminit
kernel_physical_mapping_remove(unsigned long start, unsigned long end)
{
+ int ret;
start = (unsigned long)__va(start);
end = (unsigned long)__va(end);
remove_pagetable(start, end, true, NULL);
+ ret = sync_direct_mapping();
+ WARN_ON(ret);
}
int __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
@@ -1290,6 +1295,7 @@ void mark_rodata_ro(void)
(unsigned long) __va(__pa_symbol(rodata_end)),
(unsigned long) __va(__pa_symbol(_sdata)));
+ sync_direct_mapping();
debug_checkwx();
/*
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index 3e5322bf035e..6dd7d0c090e8 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -1,6 +1,8 @@
#include <linux/mm.h>
#include <linux/highmem.h>
#include <asm/mktme.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
phys_addr_t mktme_keyid_mask;
int mktme_nr_keyids;
@@ -90,6 +92,440 @@ struct page_ext_operations page_mktme_ops = {
.need = need_page_mktme,
};
+static int sync_direct_mapping_pte(unsigned long keyid,
+ pmd_t *dst_pmd, pmd_t *src_pmd,
+ unsigned long addr, unsigned long end)
+{
+ pte_t *src_pte, *dst_pte;
+ pte_t *new_pte = NULL;
+ bool remove_pte;
+
+ /*
+ * We want to unmap and free the page table if the source is empty and
+ * the range covers whole page table.
+ */
+ remove_pte = !src_pmd && IS_ALIGNED(addr, PMD_SIZE) && IS_ALIGNED(end, PMD_SIZE);
+
+ /*
+ * PMD page got split into page table.
+ * Clear PMD mapping. Page table will be established instead.
+ */
+ if (pmd_large(*dst_pmd)) {
+ spin_lock(&init_mm.page_table_lock);
+ pmd_clear(dst_pmd);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ /* Allocate a new page table if needed. */
+ if (pmd_none(*dst_pmd)) {
+ new_pte = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ if (!new_pte)
+ return -ENOMEM;
+ dst_pte = new_pte + pte_index(addr + keyid * direct_mapping_size);
+ } else {
+ dst_pte = pte_offset_map(dst_pmd, addr + keyid * direct_mapping_size);
+ }
+ src_pte = src_pmd ? pte_offset_map(src_pmd, addr) : NULL;
+
+ spin_lock(&init_mm.page_table_lock);
+
+ do {
+ pteval_t val;
+
+ if (!src_pte || pte_none(*src_pte)) {
+ set_pte(dst_pte, __pte(0));
+ goto next;
+ }
+
+ if (!pte_none(*dst_pte)) {
+ /*
+ * Sanity check: PFNs must match between source
+ * and destination even if the rest doesn't.
+ */
+ BUG_ON(pte_pfn(*dst_pte) != pte_pfn(*src_pte));
+ }
+
+ /* Copy entry, but set KeyID. */
+ val = pte_val(*src_pte) | keyid << mktme_keyid_shift;
+ set_pte(dst_pte, __pte(val));
+next:
+ addr += PAGE_SIZE;
+ dst_pte++;
+ if (src_pte)
+ src_pte++;
+ } while (addr != end);
+
+ if (new_pte)
+ pmd_populate_kernel(&init_mm, dst_pmd, new_pte);
+
+ if (remove_pte) {
+ __free_page(pmd_page(*dst_pmd));
+ pmd_clear(dst_pmd);
+ }
+
+ spin_unlock(&init_mm.page_table_lock);
+
+ return 0;
+}
+
+static int sync_direct_mapping_pmd(unsigned long keyid,
+ pud_t *dst_pud, pud_t *src_pud,
+ unsigned long addr, unsigned long end)
+{
+ pmd_t *src_pmd, *dst_pmd;
+ pmd_t *new_pmd = NULL;
+ bool remove_pmd = false;
+ unsigned long next;
+ int ret = 0;
+
+ /*
+ * We want to unmap and free the page table if the source is empty and
+ * the range covers whole page table.
+ */
+ remove_pmd = !src_pud && IS_ALIGNED(addr, PUD_SIZE) && IS_ALIGNED(end, PUD_SIZE);
+
+ /*
+ * PUD page got split into page table.
+ * Clear PUD mapping. Page table will be established instead.
+ */
+ if (pud_large(*dst_pud)) {
+ spin_lock(&init_mm.page_table_lock);
+ pud_clear(dst_pud);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ /* Allocate a new page table if needed. */
+ if (pud_none(*dst_pud)) {
+ new_pmd = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ if (!new_pmd)
+ return -ENOMEM;
+ dst_pmd = new_pmd + pmd_index(addr + keyid * direct_mapping_size);
+ } else {
+ dst_pmd = pmd_offset(dst_pud, addr + keyid * direct_mapping_size);
+ }
+ src_pmd = src_pud ? pmd_offset(src_pud, addr) : NULL;
+
+ do {
+ pmd_t *__src_pmd = src_pmd;
+
+ next = pmd_addr_end(addr, end);
+ if (!__src_pmd || pmd_none(*__src_pmd)) {
+ if (pmd_none(*dst_pmd))
+ goto next;
+ if (pmd_large(*dst_pmd)) {
+ spin_lock(&init_mm.page_table_lock);
+ set_pmd(dst_pmd, __pmd(0));
+ spin_unlock(&init_mm.page_table_lock);
+ goto next;
+ }
+ __src_pmd = NULL;
+ }
+
+ if (__src_pmd && pmd_large(*__src_pmd)) {
+ pmdval_t val;
+
+ if (pmd_large(*dst_pmd)) {
+ /*
+ * Sanity check: PFNs must match between source
+ * and destination even if the rest doesn't.
+ */
+ BUG_ON(pmd_pfn(*dst_pmd) != pmd_pfn(*__src_pmd));
+ } else if (!pmd_none(*dst_pmd)) {
+ /*
+ * Page table is replaced with a PMD page.
+ * Free and unmap the page table.
+ */
+ __free_page(pmd_page(*dst_pmd));
+ spin_lock(&init_mm.page_table_lock);
+ pmd_clear(dst_pmd);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ /* Copy entry, but set KeyID. */
+ val = pmd_val(*__src_pmd) | keyid << mktme_keyid_shift;
+ spin_lock(&init_mm.page_table_lock);
+ set_pmd(dst_pmd, __pmd(val));
+ spin_unlock(&init_mm.page_table_lock);
+ goto next;
+ }
+
+ ret = sync_direct_mapping_pte(keyid, dst_pmd, __src_pmd,
+ addr, next);
+next:
+ addr = next;
+ dst_pmd++;
+ if (src_pmd)
+ src_pmd++;
+ } while (addr != end && !ret);
+
+ if (new_pmd) {
+ spin_lock(&init_mm.page_table_lock);
+ pud_populate(&init_mm, dst_pud, new_pmd);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ if (remove_pmd) {
+ spin_lock(&init_mm.page_table_lock);
+ __free_page(pud_page(*dst_pud));
+ pud_clear(dst_pud);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ return ret;
+}
+
+static int sync_direct_mapping_pud(unsigned long keyid,
+ p4d_t *dst_p4d, p4d_t *src_p4d,
+ unsigned long addr, unsigned long end)
+{
+ pud_t *src_pud, *dst_pud;
+ pud_t *new_pud = NULL;
+ bool remove_pud = false;
+ unsigned long next;
+ int ret = 0;
+
+ /*
+ * We want to unmap and free the page table if the source is empty and
+ * the range covers whole page table.
+ */
+ remove_pud = !src_p4d && IS_ALIGNED(addr, P4D_SIZE) && IS_ALIGNED(end, P4D_SIZE);
+
+ /*
+ * P4D page got split into page table.
+ * Clear P4D mapping. Page table will be established instead.
+ */
+ if (p4d_large(*dst_p4d)) {
+ spin_lock(&init_mm.page_table_lock);
+ p4d_clear(dst_p4d);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ /* Allocate a new page table if needed. */
+ if (p4d_none(*dst_p4d)) {
+ new_pud = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ if (!new_pud)
+ return -ENOMEM;
+ dst_pud = new_pud + pud_index(addr + keyid * direct_mapping_size);
+ } else {
+ dst_pud = pud_offset(dst_p4d, addr + keyid * direct_mapping_size);
+ }
+ src_pud = src_p4d ? pud_offset(src_p4d, addr) : NULL;
+
+ do {
+ pud_t *__src_pud = src_pud;
+
+ next = pud_addr_end(addr, end);
+ if (!__src_pud || pud_none(*__src_pud)) {
+ if (pud_none(*dst_pud))
+ goto next;
+ if (pud_large(*dst_pud)) {
+ spin_lock(&init_mm.page_table_lock);
+ set_pud(dst_pud, __pud(0));
+ spin_unlock(&init_mm.page_table_lock);
+ goto next;
+ }
+ __src_pud = NULL;
+ }
+
+ if (__src_pud && pud_large(*__src_pud)) {
+ pudval_t val;
+
+ if (pud_large(*dst_pud)) {
+ /*
+ * Sanity check: PFNs must match between source
+ * and destination even if the rest doesn't.
+ */
+ BUG_ON(pud_pfn(*dst_pud) != pud_pfn(*__src_pud));
+ } else if (!pud_none(*dst_pud)) {
+ /*
+ * Page table is replaced with a pud page.
+ * Free and unmap the page table.
+ */
+ __free_page(pud_page(*dst_pud));
+ spin_lock(&init_mm.page_table_lock);
+ pud_clear(dst_pud);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ /* Copy entry, but set KeyID. */
+ val = pud_val(*__src_pud) | keyid << mktme_keyid_shift;
+ spin_lock(&init_mm.page_table_lock);
+ set_pud(dst_pud, __pud(val));
+ spin_unlock(&init_mm.page_table_lock);
+ goto next;
+ }
+
+ ret = sync_direct_mapping_pmd(keyid, dst_pud, __src_pud,
+ addr, next);
+next:
+ addr = next;
+ dst_pud++;
+ if (src_pud)
+ src_pud++;
+ } while (addr != end && !ret);
+
+ if (new_pud) {
+ spin_lock(&init_mm.page_table_lock);
+ p4d_populate(&init_mm, dst_p4d, new_pud);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ if (remove_pud) {
+ spin_lock(&init_mm.page_table_lock);
+ __free_page(p4d_page(*dst_p4d));
+ p4d_clear(dst_p4d);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ return ret;
+}
+
+static int sync_direct_mapping_p4d(unsigned long keyid,
+ pgd_t *dst_pgd, pgd_t *src_pgd,
+ unsigned long addr, unsigned long end)
+{
+ p4d_t *src_p4d, *dst_p4d;
+ p4d_t *new_p4d_1 = NULL, *new_p4d_2 = NULL;
+ bool remove_p4d = false;
+ unsigned long next;
+ int ret = 0;
+
+ /*
+ * We want to unmap and free the page table if the source is empty and
+ * the range covers whole page table.
+ */
+ remove_p4d = !src_pgd && IS_ALIGNED(addr, PGDIR_SIZE) && IS_ALIGNED(end, PGDIR_SIZE);
+
+ /* Allocate a new page table if needed. */
+ if (pgd_none(*dst_pgd)) {
+ new_p4d_1 = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ if (!new_p4d_1)
+ return -ENOMEM;
+ dst_p4d = new_p4d_1 + p4d_index(addr + keyid * direct_mapping_size);
+ } else {
+ dst_p4d = p4d_offset(dst_pgd, addr + keyid * direct_mapping_size);
+ }
+ src_p4d = src_pgd ? p4d_offset(src_pgd, addr) : NULL;
+
+ do {
+ p4d_t *__src_p4d = src_p4d;
+
+ next = p4d_addr_end(addr, end);
+ if (!__src_p4d || p4d_none(*__src_p4d)) {
+ if (p4d_none(*dst_p4d))
+ goto next;
+ __src_p4d = NULL;
+ }
+
+ ret = sync_direct_mapping_pud(keyid, dst_p4d, __src_p4d,
+ addr, next);
+next:
+ addr = next;
+ dst_p4d++;
+
+ /*
+ * Direct mappings are 1TiB-aligned. With 5-level paging it
+ * means that at the PGD level there can be misalignment between
+ * source and destination.
+ *
+ * Allocate the new page table if dst_p4d crosses page table
+ * boundary.
+ */
+ if (!((unsigned long)dst_p4d & ~PAGE_MASK) && addr != end) {
+ if (pgd_none(dst_pgd[1])) {
+ new_p4d_2 = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ if (!new_p4d_2)
+ ret = -ENOMEM;
+ dst_p4d = new_p4d_2;
+ } else {
+ dst_p4d = p4d_offset(dst_pgd + 1, 0);
+ }
+ }
+ if (src_p4d)
+ src_p4d++;
+ } while (addr != end && !ret);
+
+ if (new_p4d_1 || new_p4d_2) {
+ spin_lock(&init_mm.page_table_lock);
+ if (new_p4d_1)
+ pgd_populate(&init_mm, dst_pgd, new_p4d_1);
+ if (new_p4d_2)
+ pgd_populate(&init_mm, dst_pgd + 1, new_p4d_2);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ if (remove_p4d) {
+ spin_lock(&init_mm.page_table_lock);
+ __free_page(pgd_page(*dst_pgd));
+ pgd_clear(dst_pgd);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ return ret;
+}
+
+static int sync_direct_mapping_keyid(unsigned long keyid)
+{
+ pgd_t *src_pgd, *dst_pgd;
+ unsigned long addr, end, next;
+ int ret = 0;
+
+ addr = PAGE_OFFSET;
+ end = PAGE_OFFSET + direct_mapping_size;
+
+ dst_pgd = pgd_offset_k(addr + keyid * direct_mapping_size);
+ src_pgd = pgd_offset_k(addr);
+
+ do {
+ pgd_t *__src_pgd = src_pgd;
+
+ next = pgd_addr_end(addr, end);
+ if (pgd_none(*__src_pgd)) {
+ if (pgd_none(*dst_pgd))
+ continue;
+ __src_pgd = NULL;
+ }
+
+ ret = sync_direct_mapping_p4d(keyid, dst_pgd, __src_pgd,
+ addr, next);
+ } while (dst_pgd++, src_pgd++, addr = next, addr != end && !ret);
+
+ return ret;
+}
+
+/*
+ * For MKTME we maintain per-KeyID direct mappings. This allows the
+ * kernel to access encrypted memory.
+ *
+ * sync_direct_mapping() syncs the per-KeyID direct mappings with the
+ * canonical one -- KeyID-0.
+ *
+ * The function tracks changes in the canonical mapping:
+ * - creating or removing chunks of the translation tree;
+ * - changes in mapping flags (i.e. protection bits);
+ * - splitting huge page mapping into a page table;
+ * - replacing page table with a huge page mapping;
+ *
+ * The function needs to be called on every change to the direct mapping:
+ * hotplug, hotremove, changes in permission bits, etc.
+ *
+ * The function is a nop until MKTME is enabled.
+ */
+int sync_direct_mapping(void)
+{
+ int i, ret = 0;
+
+ if (mktme_status != MKTME_ENABLED)
+ return 0;
+
+ for (i = 1; !ret && i <= mktme_nr_keyids; i++)
+ ret = sync_direct_mapping_keyid(i);
+
+ __flush_tlb_all();
+
+ return ret;
+}
+
void __init setup_direct_mapping_size(void)
{
unsigned long available_va;
@@ -134,6 +570,14 @@ static int __init mktme_init(void)
if (direct_mapping_size == -1UL)
setup_direct_mapping_size();
+ if (mktme_status == MKTME_ENUMERATED)
+ mktme_status = MKTME_ENABLED;
+
+ if (sync_direct_mapping()) {
+ pr_err("x86/mktme: sync_direct_mapping() failed. Disable MKTME\n");
+ mktme_disable();
+ }
+
return 0;
}
arch_initcall(mktme_init)
--
2.17.1
VMAs with different KeyIDs cannot be merged. Only VMAs with the same
KeyID are compatible.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 7 +++++++
mm/mmap.c | 3 ++-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 02a616e2f17d..1c3c15f37ed6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1492,6 +1492,13 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma)
return !vma->vm_ops;
}
+#ifndef vma_keyid
+static inline int vma_keyid(struct vm_area_struct *vma)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_SHMEM
/*
* The vma_is_shmem is not inline because it is used only by slow
diff --git a/mm/mmap.c b/mm/mmap.c
index d817764a9974..3ff89fc1752b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1217,7 +1217,8 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *
mpol_equal(vma_policy(a), vma_policy(b)) &&
a->vm_file == b->vm_file &&
!((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC|VM_SOFTDIRTY)) &&
- b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
+ b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT) &&
+ vma_keyid(a) == vma_keyid(b);
}
/*
--
2.17.1
An encrypted VMA will have its KeyID stored in vma->vm_page_prot. This
way we don't need to do anything special to set up encrypted page table
entries and don't need to reserve space for the KeyID in the VMA.
This patch changes _PAGE_CHG_MASK to include the KeyID bits. Otherwise
they would be stripped from vm_page_prot on the first pgprot_modify().
Define PTE_PFN_MASK_MAX similarly to PTE_PFN_MASK but based on
__PHYSICAL_MASK_SHIFT. This way we include the whole range of bits
architecturally available for the PFN without referencing the
physical_mask and mktme_keyid_mask variables.
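A worked example, with numbers assumed for illustration
(__PHYSICAL_MASK_SHIFT == 52, PAGE_SHIFT == 12):

  PTE_PFN_MASK_MAX = PAGE_MASK & ((1ULL << 52) - 1)
                   = 0x000ffffffffff000

This covers bits 51:12, so the KeyID bits (e.g. bits 51:46 with 6 KeyID
bits) stay inside _PAGE_CHG_MASK and survive pte_modify() and
pgprot_modify(), whereas PTE_PFN_MASK would lose them once physical_mask
is shrunk.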
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 1e5a40673953..e8ebe760b88d 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -121,8 +121,13 @@
* protection key is treated like _PAGE_RW, for
* instance, and is *not* included in this mask since
* pte_modify() does modify it.
+ *
+ * It includes the full range of PFN bits regardless of whether they were
+ * claimed for KeyID: we want to preserve KeyID on pte_modify() and pgprot_modify().
*/
-#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
+#define PTE_PFN_MASK_MAX \
+ (((signed long)PAGE_MASK) & ((1ULL << __PHYSICAL_MASK_SHIFT) - 1))
+#define _PAGE_CHG_MASK (PTE_PFN_MASK_MAX | _PAGE_PCD | _PAGE_PWT | \
_PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
_PAGE_SOFT_DIRTY)
#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
--
2.17.1
Rename the option to CONFIG_MEMORY_PHYSICAL_PADDING. It will be used
not only for KASLR.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 2 +-
arch/x86/mm/kaslr.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4992598ae411..4fa2cf807321 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2187,7 +2187,7 @@ config RANDOMIZE_MEMORY
If unsure, say Y.
-config RANDOMIZE_MEMORY_PHYSICAL_PADDING
+config MEMORY_PHYSICAL_PADDING
hex "Physical memory mapping padding" if EXPERT
depends on RANDOMIZE_MEMORY
default "0xa" if MEMORY_HOTPLUG
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 61db77b0eda9..4408cd9a3bef 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -102,7 +102,7 @@ void __init kernel_randomize_memory(void)
*/
BUG_ON(kaslr_regions[0].base != &page_offset_base);
memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
- CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;
+ CONFIG_MEMORY_PHYSICAL_PADDING;
/* Adapt phyiscal memory region size based on available memory */
if (memory_tb < kaslr_regions[0].size_tb)
--
2.17.1
mktme_nr_keyids holds the number of KeyIDs available for MKTME,
excluding KeyID zero, which is used by TME. MKTME KeyIDs start from 1.
mktme_keyid_shift holds the shift of the KeyID within the physical
address.
mktme_keyid_mask holds the mask to extract the KeyID from a physical
address.
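For illustration, assuming a CPU with 52 physical address bits and 6 KeyID
bits, the variables would be mktme_keyid_shift == 46, mktme_keyid_mask ==
GENMASK_ULL(51, 46) and mktme_nr_keyids == 63. A hypothetical helper (not
part of the patch) shows how they fit together:

  /* Illustration only: extract the KeyID from a physical address. */
  static inline int phys_to_keyid(phys_addr_t paddr)
  {
          return (paddr & mktme_keyid_mask) >> mktme_keyid_shift;
  }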
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 16 ++++++++++++++++
arch/x86/kernel/cpu/intel.c | 12 ++++++++----
arch/x86/mm/Makefile | 2 ++
arch/x86/mm/mktme.c | 5 +++++
4 files changed, 31 insertions(+), 4 deletions(-)
create mode 100644 arch/x86/include/asm/mktme.h
create mode 100644 arch/x86/mm/mktme.c
diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
new file mode 100644
index 000000000000..df31876ec48c
--- /dev/null
+++ b/arch/x86/include/asm/mktme.h
@@ -0,0 +1,16 @@
+#ifndef _ASM_X86_MKTME_H
+#define _ASM_X86_MKTME_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_X86_INTEL_MKTME
+extern phys_addr_t mktme_keyid_mask;
+extern int mktme_nr_keyids;
+extern int mktme_keyid_shift;
+#else
+#define mktme_keyid_mask ((phys_addr_t)0)
+#define mktme_nr_keyids 0
+#define mktme_keyid_shift 0
+#endif
+
+#endif
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index bf2caf9d52dd..efc9e9fc47d4 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -573,6 +573,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
#ifdef CONFIG_X86_INTEL_MKTME
if (mktme_status == MKTME_ENABLED && nr_keyids) {
+ mktme_nr_keyids = nr_keyids;
+ mktme_keyid_shift = c->x86_phys_bits - keyid_bits;
+
/*
* Mask out bits claimed from KeyID from physical address mask.
*
@@ -580,10 +583,8 @@ static void detect_tme(struct cpuinfo_x86 *c)
* and number of bits claimed for KeyID is 6, bits 51:46 of
* physical address is unusable.
*/
- phys_addr_t keyid_mask;
-
- keyid_mask = GENMASK_ULL(c->x86_phys_bits - 1, c->x86_phys_bits - keyid_bits);
- physical_mask &= ~keyid_mask;
+ mktme_keyid_mask = GENMASK_ULL(c->x86_phys_bits - 1, mktme_keyid_shift);
+ physical_mask &= ~mktme_keyid_mask;
} else {
/*
* Reset __PHYSICAL_MASK.
@@ -591,6 +592,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
* between CPUs.
*/
physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
+ mktme_keyid_mask = 0;
+ mktme_keyid_shift = 0;
+ mktme_nr_keyids = 0;
}
#endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 4b101dd6e52f..4ebee899c363 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -53,3 +53,5 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
+
+obj-$(CONFIG_X86_INTEL_MKTME) += mktme.o
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
new file mode 100644
index 000000000000..467f1b26c737
--- /dev/null
+++ b/arch/x86/mm/mktme.c
@@ -0,0 +1,5 @@
+#include <asm/mktme.h>
+
+phys_addr_t mktme_keyid_mask;
+int mktme_nr_keyids;
+int mktme_keyid_shift;
--
2.17.1
Store the KeyID in bits 31:16 of the extended page flags. These bits are
unused.
We don't yet set the KeyID for any page. That will come in the following
patch that implements prep_encrypted_page(). All pages have KeyID-0 for
now.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 6 ++++++
arch/x86/include/asm/page.h | 1 +
arch/x86/mm/mktme.c | 21 +++++++++++++++++++++
include/linux/page_ext.h | 11 ++++++++++-
mm/page_ext.c | 3 +++
5 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index 08f613953207..0fe0db424e48 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -2,6 +2,7 @@
#define _ASM_X86_MKTME_H
#include <linux/types.h>
+#include <linux/page_ext.h>
struct vm_area_struct;
@@ -16,6 +17,11 @@ bool vma_is_encrypted(struct vm_area_struct *vma);
#define vma_keyid vma_keyid
int vma_keyid(struct vm_area_struct *vma);
+extern struct page_ext_operations page_mktme_ops;
+
+#define page_keyid page_keyid
+int page_keyid(const struct page *page);
+
#else
#define mktme_keyid_mask ((phys_addr_t)0)
#define mktme_nr_keyids 0
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 7555b48803a8..39af59487d5f 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -19,6 +19,7 @@
struct page;
#include <linux/range.h>
+#include <asm/mktme.h>
extern struct range pfn_mapped[];
extern int nr_pfn_mapped;
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index 3b2f28a21d99..b02d5b9d4339 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -5,6 +5,15 @@ phys_addr_t mktme_keyid_mask;
int mktme_nr_keyids;
int mktme_keyid_shift;
+int page_keyid(const struct page *page)
+{
+ if (mktme_status != MKTME_ENABLED)
+ return 0;
+
+ return lookup_page_ext(page)->keyid;
+}
+EXPORT_SYMBOL(page_keyid);
+
bool vma_is_encrypted(struct vm_area_struct *vma)
{
return pgprot_val(vma->vm_page_prot) & mktme_keyid_mask;
@@ -20,3 +29,15 @@ int vma_keyid(struct vm_area_struct *vma)
prot = pgprot_val(vma->vm_page_prot);
return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
}
+
+static bool need_page_mktme(void)
+{
+ /* Make sure keyid doesn't collide with extended page flags */
+ BUILD_BUG_ON(__NR_PAGE_EXT_FLAGS > 16);
+
+ return true;
+}
+
+struct page_ext_operations page_mktme_ops = {
+ .need = need_page_mktme,
+};
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index f84f167ec04c..d9c5aae9523f 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -23,6 +23,7 @@ enum page_ext_flags {
PAGE_EXT_YOUNG,
PAGE_EXT_IDLE,
#endif
+ __NR_PAGE_EXT_FLAGS
};
/*
@@ -33,7 +34,15 @@ enum page_ext_flags {
* then the page_ext for pfn always exists.
*/
struct page_ext {
- unsigned long flags;
+ union {
+ unsigned long flags;
+#ifdef CONFIG_X86_INTEL_MKTME
+ struct {
+ unsigned short __pad;
+ unsigned short keyid;
+ };
+#endif
+ };
};
extern void pgdat_page_ext_init(struct pglist_data *pgdat);
diff --git a/mm/page_ext.c b/mm/page_ext.c
index a9826da84ccb..036658229842 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -68,6 +68,9 @@ static struct page_ext_operations *page_ext_ops[] = {
#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
&page_idle_ops,
#endif
+#ifdef CONFIG_X86_INTEL_MKTME
+ &page_mktme_ops,
+#endif
};
static unsigned long total_usage;
--
2.17.1
We store the KeyID in the upper bits of vm_page_prot, matching the
position of the KeyID in the PTE. vma_keyid() extracts the KeyID from
vm_page_prot.
A VMA is encrypted if its KeyID is non-zero. vma_is_encrypted() checks
that.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 9 +++++++++
arch/x86/mm/mktme.c | 17 +++++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index df31876ec48c..08f613953207 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -3,10 +3,19 @@
#include <linux/types.h>
+struct vm_area_struct;
+
#ifdef CONFIG_X86_INTEL_MKTME
extern phys_addr_t mktme_keyid_mask;
extern int mktme_nr_keyids;
extern int mktme_keyid_shift;
+
+#define vma_is_encrypted vma_is_encrypted
+bool vma_is_encrypted(struct vm_area_struct *vma);
+
+#define vma_keyid vma_keyid
+int vma_keyid(struct vm_area_struct *vma);
+
#else
#define mktme_keyid_mask ((phys_addr_t)0)
#define mktme_nr_keyids 0
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index 467f1b26c737..3b2f28a21d99 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -1,5 +1,22 @@
+#include <linux/mm.h>
#include <asm/mktme.h>
phys_addr_t mktme_keyid_mask;
int mktme_nr_keyids;
int mktme_keyid_shift;
+
+bool vma_is_encrypted(struct vm_area_struct *vma)
+{
+ return pgprot_val(vma->vm_page_prot) & mktme_keyid_mask;
+}
+
+int vma_keyid(struct vm_area_struct *vma)
+{
+ pgprotval_t prot;
+
+ if (!vma_is_anonymous(vma))
+ return 0;
+
+ prot = pgprot_val(vma->vm_page_prot);
+ return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
+}
--
2.17.1
The hardware/CPU does not enforce coherency between mappings of the same
physical page with different KeyIDs or encryption keys.
We are responsible for cache management.
Flush the cache when allocating an encrypted page and when returning the
page to the free pool.
prep_encrypted_page() also takes care of zeroing the page. We have to do
this after the KeyID is set for the page.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 6 ++++++
arch/x86/mm/mktme.c | 39 ++++++++++++++++++++++++++++++++++++
2 files changed, 45 insertions(+)
diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index 0fe0db424e48..ec7036abdb3f 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -11,6 +11,12 @@ extern phys_addr_t mktme_keyid_mask;
extern int mktme_nr_keyids;
extern int mktme_keyid_shift;
+#define prep_encrypted_page prep_encrypted_page
+void prep_encrypted_page(struct page *page, int order, int keyid, bool zero);
+
+#define HAVE_ARCH_FREE_PAGE
+void arch_free_page(struct page *page, int order);
+
#define vma_is_encrypted vma_is_encrypted
bool vma_is_encrypted(struct vm_area_struct *vma);
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index b02d5b9d4339..1821b87abb2f 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -1,4 +1,5 @@
#include <linux/mm.h>
+#include <linux/highmem.h>
#include <asm/mktme.h>
phys_addr_t mktme_keyid_mask;
@@ -30,6 +31,44 @@ int vma_keyid(struct vm_area_struct *vma)
return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
}
+void prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
+{
+ int i;
+
+ /*
+ * The hardware/CPU does not enforce coherency between mappings of the
+ * same physical page with different KeyIDs or encryption keys.
+ * We are responsible for cache management.
+ *
+ * We flush the cache before the page is handed out as encrypted.
+ */
+ clflush_cache_range(page_address(page), PAGE_SIZE << order);
+
+ for (i = 0; i < (1 << order); i++) {
+ WARN_ON_ONCE(lookup_page_ext(page + i)->keyid);
+ lookup_page_ext(page + i)->keyid = keyid;
+
+ /* Clear the page after the KeyID is set. */
+ if (zero)
+ clear_highpage(page + i);
+ }
+}
+
+void arch_free_page(struct page *page, int order)
+{
+ int i;
+
+ if (!page_keyid(page))
+ return;
+
+ for (i = 0; i < (1 << order); i++) {
+ WARN_ON_ONCE(lookup_page_ext(page + i)->keyid > mktme_nr_keyids);
+ lookup_page_ext(page + i)->keyid = 0;
+ }
+
+ clflush_cache_range(page_address(page), PAGE_SIZE << order);
+}
+
static bool need_page_mktme(void)
{
/* Make sure keyid doesn't collide with extended page flags */
--
2.17.1
We need to know the number of KeyIDs before KASLR is initialized. The
number of KeyIDs determines how much address space is needed for the
per-KeyID direct mappings.
KASLR initialization happens before full CPU initialization is complete.
Move the detect_tme() call to early_init_intel().
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/cpu/intel.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index fb58776513e6..3322b0125353 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -158,6 +158,8 @@ static bool bad_spectre_microcode(struct cpuinfo_x86 *c)
return false;
}
+static void detect_tme(struct cpuinfo_x86 *c);
+
static void early_init_intel(struct cpuinfo_x86 *c)
{
u64 misc_enable;
@@ -301,6 +303,9 @@ static void early_init_intel(struct cpuinfo_x86 *c)
}
check_mpx_erratum(c);
+
+ if (cpu_has(c, X86_FEATURE_TME))
+ detect_tme(c);
}
#ifdef CONFIG_X86_32
@@ -762,9 +767,6 @@ static void init_intel(struct cpuinfo_x86 *c)
if (cpu_has(c, X86_FEATURE_VMX))
detect_vmx_virtcap(c);
- if (cpu_has(c, X86_FEATURE_TME))
- detect_tme(c);
-
init_intel_energy_perf(c);
init_intel_misc_features(c);
--
2.17.1
Separate MKTME enumeration from enabling. We need to postpone enabling
until initialization is complete.
The new helper mktme_disable() allows disabling MKTME even if it has been
enumerated successfully. MKTME initialization may fail, and this
functionality allows the system to boot regardless of the failure.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 12 ++++++++++++
arch/x86/kernel/cpu/intel.c | 15 ++++-----------
arch/x86/mm/mktme.c | 9 +++++++++
3 files changed, 25 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index ec7036abdb3f..9363b989a021 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -6,11 +6,21 @@
struct vm_area_struct;
+/* Values for mktme_status */
+#define MKTME_DISABLED 0
+#define MKTME_ENUMERATED 1
+#define MKTME_ENABLED 2
+#define MKTME_UNINITIALIZED 3
+
+extern int mktme_status;
+
#ifdef CONFIG_X86_INTEL_MKTME
extern phys_addr_t mktme_keyid_mask;
extern int mktme_nr_keyids;
extern int mktme_keyid_shift;
+void mktme_disable(void);
+
#define prep_encrypted_page prep_encrypted_page
void prep_encrypted_page(struct page *page, int order, int keyid, bool zero);
@@ -28,6 +38,8 @@ extern struct page_ext_operations page_mktme_ops;
#define page_keyid page_keyid
int page_keyid(const struct page *page);
+void mktme_disable(void);
+
#else
#define mktme_keyid_mask ((phys_addr_t)0)
#define mktme_nr_keyids 0
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index efc9e9fc47d4..fb58776513e6 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -508,11 +508,7 @@ static void detect_vmx_virtcap(struct cpuinfo_x86 *c)
#define TME_ACTIVATE_CRYPTO_ALGS(x) ((x >> 48) & 0xffff) /* Bits 63:48 */
#define TME_ACTIVATE_CRYPTO_AES_XTS_128 1
-/* Values for mktme_status (SW only construct) */
-#define MKTME_ENABLED 0
-#define MKTME_DISABLED 1
-#define MKTME_UNINITIALIZED 2
-static int mktme_status = MKTME_UNINITIALIZED;
+int mktme_status __ro_after_init = MKTME_UNINITIALIZED;
static void detect_tme(struct cpuinfo_x86 *c)
{
@@ -568,11 +564,11 @@ static void detect_tme(struct cpuinfo_x86 *c)
if (mktme_status == MKTME_UNINITIALIZED) {
/* MKTME is usable */
- mktme_status = MKTME_ENABLED;
+ mktme_status = MKTME_ENUMERATED;
}
#ifdef CONFIG_X86_INTEL_MKTME
- if (mktme_status == MKTME_ENABLED && nr_keyids) {
+ if (mktme_status == MKTME_ENUMERATED && nr_keyids) {
mktme_nr_keyids = nr_keyids;
mktme_keyid_shift = c->x86_phys_bits - keyid_bits;
@@ -591,10 +587,7 @@ static void detect_tme(struct cpuinfo_x86 *c)
* Maybe needed if there's inconsistent configuration
* between CPUs.
*/
- physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
- mktme_keyid_mask = 0;
- mktme_keyid_shift = 0;
- mktme_nr_keyids = 0;
+ mktme_disable();
}
#endif
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index 1821b87abb2f..43a44f0f2a2d 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -6,6 +6,15 @@ phys_addr_t mktme_keyid_mask;
int mktme_nr_keyids;
int mktme_keyid_shift;
+void mktme_disable(void)
+{
+ physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
+ mktme_keyid_mask = 0;
+ mktme_keyid_shift = 0;
+ mktme_nr_keyids = 0;
+ mktme_status = MKTME_DISABLED;
+}
+
int page_keyid(const struct page *page)
{
if (mktme_status != MKTME_ENABLED)
--
2.17.1
Pages for encrypted VMAs have to be allocated in a special way: we would
need to propagate down not only the desired NUMA node but also whether
the page is encrypted.
It complicates the not-so-trivial routine of huge page allocation in
khugepaged even more. It also puts more pressure on the page allocator:
we cannot reuse pages allocated for an encrypted VMA to collapse a page
in an unencrypted one, or vice versa.
I think for now it's worth skipping encrypted VMAs. We can return to this
topic later.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 7 +++++++
mm/khugepaged.c | 2 ++
2 files changed, 9 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1c3c15f37ed6..435b053c457c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1492,6 +1492,13 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma)
return !vma->vm_ops;
}
+#ifndef vma_is_encrypted
+static inline bool vma_is_encrypted(struct vm_area_struct *vma)
+{
+ return false;
+}
+#endif
+
#ifndef vma_keyid
static inline int vma_keyid(struct vm_area_struct *vma)
{
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d7b2a4bf8671..a03b40bef033 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -835,6 +835,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
return false;
if (is_vma_temporary_stack(vma))
return false;
+ if (vma_is_encrypted(vma))
+ return false;
return !(vma->vm_flags & VM_NO_KHUGEPAGED);
}
--
2.17.1
MKTME claims several upper bits of the physical address in a page table
entry to encode the KeyID. It effectively shrinks the number of bits
available for the physical address, so we should exclude the KeyID bits
from physical addresses.
For instance, if the CPU enumerates 52 physical address bits and 6 bits
are claimed for the KeyID, bits 51:46 must not be treated as part of the
physical address.
This patch adjusts __PHYSICAL_MASK during MKTME enumeration.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/cpu/intel.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index eb75564f2d25..bf2caf9d52dd 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -571,6 +571,29 @@ static void detect_tme(struct cpuinfo_x86 *c)
mktme_status = MKTME_ENABLED;
}
+#ifdef CONFIG_X86_INTEL_MKTME
+ if (mktme_status == MKTME_ENABLED && nr_keyids) {
+ /*
+ * Mask out bits claimed from KeyID from physical address mask.
+ *
+ * For instance, if a CPU enumerates 52 physical address bits
+ * and number of bits claimed for KeyID is 6, bits 51:46 of
+ * physical address is unusable.
+ */
+ phys_addr_t keyid_mask;
+
+ keyid_mask = GENMASK_ULL(c->x86_phys_bits - 1, c->x86_phys_bits - keyid_bits);
+ physical_mask &= ~keyid_mask;
+ } else {
+ /*
+ * Reset __PHYSICAL_MASK.
+ * Maybe needed if there's inconsistent configuation
+ * between CPUs.
+ */
+ physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
+ }
+#endif
+
/*
* KeyID bits effectively lower the number of physical address
* bits. Update cpuinfo_x86::x86_phys_bits accordingly.
--
2.17.1
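For illustration, a standalone C sketch of the mask arithmetic described in
the commit message above, assuming 52 physical address bits and 6 KeyID bits
(the GENMASK_ULL definition mirrors the kernel macro; not part of the patch):

#include <stdio.h>
#include <inttypes.h>

#define GENMASK_ULL(h, l) \
	((~0ULL << (l)) & (~0ULL >> (63 - (h))))

int main(void)
{
	int phys_bits = 52, keyid_bits = 6;
	uint64_t keyid_mask = GENMASK_ULL(phys_bits - 1, phys_bits - keyid_bits);
	uint64_t physical_mask = (1ULL << phys_bits) - 1;

	physical_mask &= ~keyid_mask;

	/* keyid_mask    = 0x000fc00000000000 -- bits 51:46 */
	/* physical_mask = 0x00003fffffffffff -- bits 45:0  */
	printf("keyid_mask    = 0x%016" PRIx64 "\n", keyid_mask);
	printf("physical_mask = 0x%016" PRIx64 "\n", physical_mask);
	return 0;
}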
Pages encrypted with different encryption keys are not subject to KSM
merging. Otherwise it would cross a security boundary.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 7 +++++++
mm/ksm.c | 3 +++
2 files changed, 10 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 435b053c457c..ac1a8480284d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1506,6 +1506,13 @@ static inline int vma_keyid(struct vm_area_struct *vma)
}
#endif
+#ifndef page_keyid
+static inline int page_keyid(struct page *page)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_SHMEM
/*
* The vma_is_shmem is not inline because it is used only by slow
diff --git a/mm/ksm.c b/mm/ksm.c
index 7d6558f3bac9..db94bd45fe66 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1201,6 +1201,9 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
if (!PageAnon(page))
goto out;
+ if (page_keyid(page) != page_keyid(kpage))
+ goto out;
+
/*
* We need the page lock to read a stable PageSwapCache in
* write_protect_page(). We use trylock_page() instead of
--
2.17.1
On 12.06.2018 17:39, Kirill A. Shutemov wrote:
> Kernel need to have a way to access encrypted memory. We are going to
> use per-KeyID direct mapping to facilitate the access with minimal
> overhead.
>
> Direct mapping for each KeyID will be put next to each other in the
> virtual address space. We need to have a way to find boundaries of
> direct mapping for particular KeyID.
>
> The new variable direct_mapping_size specifies the size of direct
> mapping. With the value, it's trivial to find direct mapping for
> KeyID-N: PAGE_OFFSET + N * direct_mapping_size.
>
> Size of direct mapping is calculated during KASLR setup. If KALSR is
> disable it happens during MKTME initialization.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/include/asm/mktme.h | 2 ++
> arch/x86/include/asm/page_64.h | 1 +
> arch/x86/kernel/head64.c | 2 ++
> arch/x86/mm/kaslr.c | 21 ++++++++++++---
> arch/x86/mm/mktme.c | 48 ++++++++++++++++++++++++++++++++++
> 5 files changed, 71 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
> index 9363b989a021..3bf481fe3f56 100644
> --- a/arch/x86/include/asm/mktme.h
> +++ b/arch/x86/include/asm/mktme.h
> @@ -40,6 +40,8 @@ int page_keyid(const struct page *page);
>
> void mktme_disable(void);
>
> +void setup_direct_mapping_size(void);
> +
> #else
> #define mktme_keyid_mask ((phys_addr_t)0)
> #define mktme_nr_keyids 0
> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> index 939b1cff4a7b..53c32af895ab 100644
> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -14,6 +14,7 @@ extern unsigned long phys_base;
> extern unsigned long page_offset_base;
> extern unsigned long vmalloc_base;
> extern unsigned long vmemmap_base;
> +extern unsigned long direct_mapping_size;
>
> static inline unsigned long __phys_addr_nodebug(unsigned long x)
> {
> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index a21d6ace648e..b6175376b2e1 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -59,6 +59,8 @@ EXPORT_SYMBOL(vmalloc_base);
> unsigned long vmemmap_base __ro_after_init = __VMEMMAP_BASE_L4;
> EXPORT_SYMBOL(vmemmap_base);
> #endif
> +unsigned long direct_mapping_size __ro_after_init = -1UL;
> +EXPORT_SYMBOL(direct_mapping_size);
>
> #define __head __section(.head.text)
>
> diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
> index 4408cd9a3bef..3d8ef8cb97e1 100644
> --- a/arch/x86/mm/kaslr.c
> +++ b/arch/x86/mm/kaslr.c
> @@ -69,6 +69,15 @@ static inline bool kaslr_memory_enabled(void)
> return kaslr_enabled() && !IS_ENABLED(CONFIG_KASAN);
> }
>
> +#ifndef CONFIG_X86_INTEL_MKTME
> +static void __init setup_direct_mapping_size(void)
> +{
> + direct_mapping_size = max_pfn << PAGE_SHIFT;
> + direct_mapping_size = round_up(direct_mapping_size, 1UL << TB_SHIFT);
> + direct_mapping_size += (1UL << TB_SHIFT) * CONFIG_MEMORY_PHYSICAL_PADDING;
> +}
> +#endif
> +
> /* Initialize base and padding for each memory region randomized with KASLR */
> void __init kernel_randomize_memory(void)
> {
> @@ -93,7 +102,11 @@ void __init kernel_randomize_memory(void)
> if (!kaslr_memory_enabled())
> return;
>
> - kaslr_regions[0].size_tb = 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT);
> + /*
> + * Upper limit for direct mapping size is 1/4 of whole virtual
> + * address space
or 1/2?
> + */
> + kaslr_regions[0].size_tb = 1 << (__VIRTUAL_MASK_SHIFT - 1 - TB_SHIFT);
> kaslr_regions[1].size_tb = VMALLOC_SIZE_TB;
>
> /*
> @@ -101,8 +114,10 @@ void __init kernel_randomize_memory(void)
> * add padding if needed (especially for memory hotplug support).
> */
> BUG_ON(kaslr_regions[0].base != &page_offset_base);
> - memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
> - CONFIG_MEMORY_PHYSICAL_PADDING;
> +
> + setup_direct_mapping_size();
> +
> + memory_tb = direct_mapping_size * mktme_nr_keyids + 1;
parenthesis ?
memory_tb = direct_mapping_size * (mktme_nr_keyids + 1);
>
> /* Adapt phyiscal memory region size based on available memory */
> if (memory_tb < kaslr_regions[0].size_tb)
> diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
> index 43a44f0f2a2d..3e5322bf035e 100644
> --- a/arch/x86/mm/mktme.c
> +++ b/arch/x86/mm/mktme.c
> @@ -89,3 +89,51 @@ static bool need_page_mktme(void)
> struct page_ext_operations page_mktme_ops = {
> .need = need_page_mktme,
> };
> +
> +void __init setup_direct_mapping_size(void)
> +{
> + unsigned long available_va;
> +
> + /* 1/4 of virtual address space is didicated for direct mapping */
> + available_va = 1UL << (__VIRTUAL_MASK_SHIFT - 1);
> +
> + /* How much memory the systrem has? */
> + direct_mapping_size = max_pfn << PAGE_SHIFT;
> + direct_mapping_size = round_up(direct_mapping_size, 1UL << 40);
> +
> + if (mktme_status != MKTME_ENUMERATED)
> + goto out;
> +
> + /*
> + * Not enough virtual address space to address all physical memory with
> + * MKTME enabled. Even without padding.
> + *
> + * Disable MKTME instead.
> + */
> + if (direct_mapping_size > available_va / mktme_nr_keyids + 1) {
parenthesis again?
if (direct_mapping_size > available_va / (mktme_nr_keyids + 1)) {
> + pr_err("x86/mktme: Disabled. Not enough virtual address space\n");
> + pr_err("x86/mktme: Consider switching to 5-level paging\n");
> + mktme_disable();
> + goto out;
> + }
> +
> + /*
> + * Virtual address space is divided between per-KeyID direct mappings.
> + */
> +	available_va /= mktme_nr_keyids + 1;
> +out:
> + /* Add padding, if there's enough virtual address space */
> + direct_mapping_size += (1UL << 40) * CONFIG_MEMORY_PHYSICAL_PADDING;
> + if (direct_mapping_size > available_va)
> + direct_mapping_size = available_va;
> +}
> +
> +static int __init mktme_init(void)
> +{
> + /* KASLR didn't initialized it for us. */
> + if (direct_mapping_size == -1UL)
> + setup_direct_mapping_size();
> +
> + return 0;
> +}
> +arch_initcall(mktme_init)
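The per-KeyID lookup described in the quoted commit message boils down to
something like the following sketch (the helper name is made up; PAGE_OFFSET
and direct_mapping_size are as defined by the patch):

static inline void *mktme_keyid_direct_map(unsigned long pfn, int keyid)
{
	/* Direct mapping for KeyID-N starts at PAGE_OFFSET + N * direct_mapping_size */
	return (void *)(PAGE_OFFSET + keyid * direct_mapping_size +
			(pfn << PAGE_SHIFT));
}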
On Tue, Jun 12, 2018 at 02:58:38PM +0000, Mika Penttilä wrote:
>
>
> On 12.06.2018 17:39, Kirill A. Shutemov wrote:
> > Kernel need to have a way to access encrypted memory. We are going to
> > use per-KeyID direct mapping to facilitate the access with minimal
> > overhead.
> >
> > Direct mapping for each KeyID will be put next to each other in the
> > virtual address space. We need to have a way to find boundaries of
> > direct mapping for particular KeyID.
> >
> > The new variable direct_mapping_size specifies the size of direct
> > mapping. With the value, it's trivial to find direct mapping for
> > KeyID-N: PAGE_OFFSET + N * direct_mapping_size.
> >
> > Size of direct mapping is calculated during KASLR setup. If KALSR is
> > disable it happens during MKTME initialization.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > arch/x86/include/asm/mktme.h | 2 ++
> > arch/x86/include/asm/page_64.h | 1 +
> > arch/x86/kernel/head64.c | 2 ++
> > arch/x86/mm/kaslr.c | 21 ++++++++++++---
> > arch/x86/mm/mktme.c | 48 ++++++++++++++++++++++++++++++++++
> > 5 files changed, 71 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
> > index 9363b989a021..3bf481fe3f56 100644
> > --- a/arch/x86/include/asm/mktme.h
> > +++ b/arch/x86/include/asm/mktme.h
> > @@ -40,6 +40,8 @@ int page_keyid(const struct page *page);
> >
> > void mktme_disable(void);
> >
> > +void setup_direct_mapping_size(void);
> > +
> > #else
> > #define mktme_keyid_mask ((phys_addr_t)0)
> > #define mktme_nr_keyids 0
> > diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> > index 939b1cff4a7b..53c32af895ab 100644
> > --- a/arch/x86/include/asm/page_64.h
> > +++ b/arch/x86/include/asm/page_64.h
> > @@ -14,6 +14,7 @@ extern unsigned long phys_base;
> > extern unsigned long page_offset_base;
> > extern unsigned long vmalloc_base;
> > extern unsigned long vmemmap_base;
> > +extern unsigned long direct_mapping_size;
> >
> > static inline unsigned long __phys_addr_nodebug(unsigned long x)
> > {
> > diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> > index a21d6ace648e..b6175376b2e1 100644
> > --- a/arch/x86/kernel/head64.c
> > +++ b/arch/x86/kernel/head64.c
> > @@ -59,6 +59,8 @@ EXPORT_SYMBOL(vmalloc_base);
> > unsigned long vmemmap_base __ro_after_init = __VMEMMAP_BASE_L4;
> > EXPORT_SYMBOL(vmemmap_base);
> > #endif
> > +unsigned long direct_mapping_size __ro_after_init = -1UL;
> > +EXPORT_SYMBOL(direct_mapping_size);
> >
> > #define __head __section(.head.text)
> >
> > diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
> > index 4408cd9a3bef..3d8ef8cb97e1 100644
> > --- a/arch/x86/mm/kaslr.c
> > +++ b/arch/x86/mm/kaslr.c
> > @@ -69,6 +69,15 @@ static inline bool kaslr_memory_enabled(void)
> > return kaslr_enabled() && !IS_ENABLED(CONFIG_KASAN);
> > }
> >
> > +#ifndef CONFIG_X86_INTEL_MKTME
> > +static void __init setup_direct_mapping_size(void)
> > +{
> > + direct_mapping_size = max_pfn << PAGE_SHIFT;
> > + direct_mapping_size = round_up(direct_mapping_size, 1UL << TB_SHIFT);
> > + direct_mapping_size += (1UL << TB_SHIFT) * CONFIG_MEMORY_PHYSICAL_PADDING;
> > +}
> > +#endif
> > +
> > /* Initialize base and padding for each memory region randomized with KASLR */
> > void __init kernel_randomize_memory(void)
> > {
> > @@ -93,7 +102,11 @@ void __init kernel_randomize_memory(void)
> > if (!kaslr_memory_enabled())
> > return;
> >
> > - kaslr_regions[0].size_tb = 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT);
> > + /*
> > + * Upper limit for direct mapping size is 1/4 of whole virtual
> > + * address space
>
> or 1/2?
1/2 of the kernel address space, or 1/4 of the whole address space.
>
> > + */
> > + kaslr_regions[0].size_tb = 1 << (__VIRTUAL_MASK_SHIFT - 1 - TB_SHIFT);
>
>
>
> > kaslr_regions[1].size_tb = VMALLOC_SIZE_TB;
> >
> > /*
> > @@ -101,8 +114,10 @@ void __init kernel_randomize_memory(void)
> > * add padding if needed (especially for memory hotplug support).
> > */
> > BUG_ON(kaslr_regions[0].base != &page_offset_base);
> > - memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
> > - CONFIG_MEMORY_PHYSICAL_PADDING;
> > +
> > + setup_direct_mapping_size();
> > +
> > + memory_tb = direct_mapping_size * mktme_nr_keyids + 1;
>
> parenthesis ?
>
> memory_tb = direct_mapping_size * (mktme_nr_keyids + 1);
Ouch. Thanks for noticing this.
> >
> > /* Adapt phyiscal memory region size based on available memory */
> > if (memory_tb < kaslr_regions[0].size_tb)
> > diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
> > index 43a44f0f2a2d..3e5322bf035e 100644
> > --- a/arch/x86/mm/mktme.c
> > +++ b/arch/x86/mm/mktme.c
> > @@ -89,3 +89,51 @@ static bool need_page_mktme(void)
> > struct page_ext_operations page_mktme_ops = {
> > .need = need_page_mktme,
> > };
> > +
> > +void __init setup_direct_mapping_size(void)
> > +{
> > + unsigned long available_va;
> > +
> > + /* 1/4 of virtual address space is didicated for direct mapping */
> > + available_va = 1UL << (__VIRTUAL_MASK_SHIFT - 1);
> > +
> > + /* How much memory the systrem has? */
> > + direct_mapping_size = max_pfn << PAGE_SHIFT;
> > + direct_mapping_size = round_up(direct_mapping_size, 1UL << 40);
> > +
> > + if (mktme_status != MKTME_ENUMERATED)
> > + goto out;
> > +
> > + /*
> > + * Not enough virtual address space to address all physical memory with
> > + * MKTME enabled. Even without padding.
> > + *
> > + * Disable MKTME instead.
> > + */
> > + if (direct_mapping_size > available_va / mktme_nr_keyids + 1) {
>
> parenthesis again?
>
> if (direct_mapping_size > available_va / (mktme_nr_keyids + 1)) {
Yeah... :/
Here's the fixup.
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 3d8ef8cb97e1..ef437c8d5f34 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -103,8 +103,8 @@ void __init kernel_randomize_memory(void)
return;
/*
- * Upper limit for direct mapping size is 1/4 of whole virtual
- * address space
+ * Upper limit for direct mapping size is half of kernel address
+ * space.
*/
kaslr_regions[0].size_tb = 1 << (__VIRTUAL_MASK_SHIFT - 1 - TB_SHIFT);
kaslr_regions[1].size_tb = VMALLOC_SIZE_TB;
@@ -117,7 +117,7 @@ void __init kernel_randomize_memory(void)
setup_direct_mapping_size();
- memory_tb = direct_mapping_size * mktme_nr_keyids + 1;
+ memory_tb = direct_mapping_size * (mktme_nr_keyids + 1);
/* Adapt phyiscal memory region size based on available memory */
if (memory_tb < kaslr_regions[0].size_tb)
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index 3e5322bf035e..70f6eff093d8 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -110,7 +110,7 @@ void __init setup_direct_mapping_size(void)
*
* Disable MKTME instead.
*/
- if (direct_mapping_size > available_va / mktme_nr_keyids + 1) {
+ if (direct_mapping_size > available_va / (mktme_nr_keyids + 1)) {
pr_err("x86/mktme: Disabled. Not enough virtual address space\n");
pr_err("x86/mktme: Consider switching to 5-level paging\n");
mktme_disable();
--
Kirill A. Shutemov
On 06/12/2018 07:38 AM, Kirill A. Shutemov wrote:
> VMAs with different KeyID do not mix together. Only VMAs with the same
> KeyID are compatible.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/mm.h | 7 +++++++
> mm/mmap.c | 3 ++-
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 02a616e2f17d..1c3c15f37ed6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1492,6 +1492,13 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma)
> return !vma->vm_ops;
> }
>
> +#ifndef vma_keyid
> +static inline int vma_keyid(struct vm_area_struct *vma)
> +{
> + return 0;
> +}
> +#endif
I'm generally not a fan of this #ifdef'ing method. It makes it hard to
figure out who is supposed to define it, and it's also substantially
more fragile in the face of #include ordering.
I'd much rather see some Kconfig involvement, like
CONFIG_ARCH_HAS_MEM_ENCRYPTION or something.
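A sketch of what the Kconfig-based alternative could look like, with
CONFIG_ARCH_HAS_MEM_ENCRYPTION as the suggested (hypothetical) symbol; this
is not what the series currently does:

/* include/linux/mm.h: generic fallback only when the architecture does
 * not select the symbol and provide its own definition.
 */
#ifndef CONFIG_ARCH_HAS_MEM_ENCRYPTION
static inline int vma_keyid(struct vm_area_struct *vma)
{
	return 0;
}
#endif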
On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> Pages for encrypted VMAs have to be allocated in a special way:
> we would need to propagate down not only desired NUMA node but also
> whether the page is encrypted.
>
> It complicates not-so-trivial routine of huge page allocation in
> khugepaged even more. It also puts more pressure on page allocator:
> we cannot re-use pages allocated for encrypted VMA to collapse
> page in unencrypted one or vice versa.
>
> I think for now it worth skipping encrypted VMAs. We can return
> to this topic later.
You're asking for this to be included, but without a major piece of THP
support. Is THP support unimportant for this feature?
Are we really asking the x86 maintainers to merge this feature with this
restriction in place?
On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> Pages encrypted with different encryption keys are not subject to KSM
> merge. Otherwise it would cross security boundary.
This needs a much stronger explanation. Which KeyID would be used for
access in the new direct mappings? What actually happens without this
patch in place?
On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> For encrypted memory, we need to allocated pages for a specific
> encryption KeyID.
"allocate" ^
> There are two cases when we need to allocate a page for encryption:
>
> - Allocation for an encrypted VMA;
>
> - Allocation for migration of encrypted page;
>
> The first case can be covered within alloc_page_vma().
... because we know the KeyID from the VMA?
> The second case requires few new page allocation routines that would
> allocate the page for a specific KeyID.
>
> Encrypted page has to be cleared after KeyID set. This is handled by
"An encrypted page has ... "
This description lacks a description of the performance impact of the
approach in this patch both when allocating encrypted and normal pages.
> --- a/arch/alpha/include/asm/page.h
> +++ b/arch/alpha/include/asm/page.h
> @@ -18,7 +18,7 @@ extern void clear_page(void *page);
> #define clear_user_page(page, vaddr, pg) clear_page(page)
>
> #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
> - alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vmaddr)
> + alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
> #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
Does this compile? Wouldn't "vmaddr" be undefined?
> +#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
> + alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
The argument addition should be broken out into a preparatory patch.
> extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
> extern unsigned long get_zeroed_page(gfp_t gfp_mask);
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index f2b4abbca55e..6da504bad841 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -38,9 +38,11 @@ static inline struct page *new_page_nodemask(struct page *page,
> unsigned int order = 0;
> struct page *new_page = NULL;
>
> - if (PageHuge(page))
> + if (PageHuge(page)) {
> + WARN_ON(page_keyid(page));
> return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
> preferred_nid, nodemask);
> + }
Comment on the warning, please.
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 9ac49ef17b4e..00bccbececea 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -920,22 +920,24 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
> /* page allocation callback for NUMA node migration */
> struct page *alloc_new_node_page(struct page *page, unsigned long node)
> {
> - if (PageHuge(page))
> + if (PageHuge(page)) {
> + WARN_ON(page_keyid(page));
> return alloc_huge_page_node(page_hstate(compound_head(page)),
> node);
Comments, please.
> @@ -2012,9 +2014,16 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> {
> struct mempolicy *pol;
> struct page *page;
> - int preferred_nid;
> + bool zero = false;
> + int keyid, preferred_nid;
> nodemask_t *nmask;
>
> + keyid = vma_keyid(vma);
> + if (keyid && gfp & __GFP_ZERO) {
> + zero = true;
> + gfp &= ~__GFP_ZERO;
> + }
I totally read that wrong.
"zero" needs to be named: "page_need_zeroing".
It also badly needs a comment.
> pol = get_vma_policy(vma, addr);
>
> if (pol->mode == MPOL_INTERLEAVE) {
> @@ -2057,6 +2066,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
> mpol_cond_put(pol);
> out:
> + if (page && keyid)
> + prep_encrypted_page(page, order, keyid, zero);
> return page;
> }
I'd just have prep_encrypted_page() do the keyid-0 opt-out of the prep
work. It'll be less to patch when you
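A sketch of the suggested opt-out (illustration only, not the posted patch):
make prep_encrypted_page() a no-op for KeyID-0 so callers can drop their
"if (page && keyid)" checks.

void prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
{
	/* KeyID-0 pages need no cache flushing or page_ext setup */
	if (!keyid)
		return;

	/* ... cache flush and per-page KeyID setup as in the patch ... */
}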
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8c0af0f7cab1..eb8dea219dcb 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1847,7 +1847,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
> int nid = (int) data;
> struct page *newpage;
>
> - newpage = __alloc_pages_node(nid,
> + newpage = __alloc_pages_node_keyid(nid, page_keyid(page),
> (GFP_HIGHUSER_MOVABLE |
> __GFP_THISNODE | __GFP_NOMEMALLOC |
> __GFP_NORETRY | __GFP_NOWARN) &
I thought folks asked you not to change all of the calling conventions
across the page allocator. It seems like you're still doing that,
though. A reviewer might think you've ignored their earlier feedback.
Did you?
> +#ifndef CONFIG_NUMA
> +struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
> + struct vm_area_struct *vma, unsigned long addr,
> + int node, bool hugepage)
> +{
> + struct page *page;
> + bool zero = false;
> + int keyid = vma_keyid(vma);
> +
> + if (keyid && (gfp_mask & __GFP_ZERO)) {
Please at least do your parenthesis consistently. :)
> + zero = true;
> + gfp_mask &= ~__GFP_ZERO;
> + }
> +
> + page = alloc_pages(gfp_mask, order);
> + if (page && keyid)
> + prep_encrypted_page(page, order, keyid, zero);
> +
> + return page;
> +}
> +#endif
I'm also confused by the #ifdef. What is it for?
> +struct page * __alloc_pages_node_keyid(int nid, int keyid,
> + gfp_t gfp_mask, unsigned int order)
> +{
> + struct page *page;
> + bool zero = false;
> +
> + VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> + VM_WARN_ON(!node_online(nid));
> +
> + if (keyid && (gfp_mask & __GFP_ZERO)) {
> + zero = true;
> + gfp_mask &= ~__GFP_ZERO;
> + }
OK, so this is the third time I've seen that pattern. Are you *sure*
you don't want to consolidate the sites?
> + page = __alloc_pages(gfp_mask, order, nid);
> + if (page && keyid)
> + prep_encrypted_page(page, order, keyid, zero);
> +
> + return page;
> +}
> +
> #ifdef CONFIG_LOCKDEP
> struct lockdep_map __fs_reclaim_map =
> STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
> @@ -4396,6 +4439,26 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> }
> EXPORT_SYMBOL(__alloc_pages_nodemask);
>
> +struct page *
> +__alloc_pages_nodemask_keyid(gfp_t gfp_mask, unsigned int order,
> + int preferred_nid, nodemask_t *nodemask, int keyid)
> +{
> + struct page *page;
> + bool zero = false;
> +
> + if (keyid && (gfp_mask & __GFP_ZERO)) {
> + zero = true;
> + gfp_mask &= ~__GFP_ZERO;
> + }
Fourth one. :)
On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> Encrypted VMA will have KeyID stored in vma->vm_page_prot. This way we
"An encrypted VMA..."
> don't need to do anything special to setup encrypted page table entries
> and don't need to reserve space for KeyID in a VMA.
>
> This patch changes _PAGE_CHG_MASK to include KeyID bits. Otherwise they
> are going to be stripped from vm_page_prot on the first pgprot_modify().
>
> Define PTE_PFN_MASK_MAX similar to PTE_PFN_MASK but based on
> __PHYSICAL_MASK_SHIFT. This way we include whole range of bits
> architecturally available for PFN without referencing physical_mask and
> mktme_keyid_mask variables.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/include/asm/pgtable_types.h | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 1e5a40673953..e8ebe760b88d 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -121,8 +121,13 @@
> * protection key is treated like _PAGE_RW, for
> * instance, and is *not* included in this mask since
> * pte_modify() does modify it.
> + *
> + * It includes full range of PFN bits regardless if they were claimed for KeyID
> + * or not: we want to preserve KeyID on pte_modify() and pgprot_modify().
> */
> -#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
> +#define PTE_PFN_MASK_MAX \
> + (((signed long)PAGE_MASK) & ((1ULL << __PHYSICAL_MASK_SHIFT) - 1))
"signed long" is really unusual to see. Was that intentional?
> +#define _PAGE_CHG_MASK (PTE_PFN_MASK_MAX | _PAGE_PCD | _PAGE_PWT | \
> _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
> _PAGE_SOFT_DIRTY)
> #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
This makes me a bit nervous. We have some places (here) where we
pretend that the KeyID is part of the paddr and then other places like
pte_pfn() where it's not.
Seems like something that will come back to bite us.
On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> +bool vma_is_encrypted(struct vm_area_struct *vma)
> +{
> + return pgprot_val(vma->vm_page_prot) & mktme_keyid_mask;
> +}
> +
> +int vma_keyid(struct vm_area_struct *vma)
> +{
> + pgprotval_t prot;
> +
> + if (!vma_is_anonymous(vma))
> + return 0;
> +
> + prot = pgprot_val(vma->vm_page_prot);
> + return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
> +}
Why do we have a vma_is_anonymous() in one of these but not the other?
While this reuse of ->vm_page_prot is cute, is there any downside? It's
the first place I know of that we can't derive ->vm_page_prot from
->vm_flags on non-VM_IO/PFNMAP VMAs. Is that a problem?
> +int page_keyid(const struct page *page)
> +{
> + if (mktme_status != MKTME_ENABLED)
> + return 0;
> +
> + return lookup_page_ext(page)->keyid;
> +}
> +EXPORT_SYMBOL(page_keyid);
Please start using a proper X86_FEATURE_* flag for this. It will give
you all the fancy static patching that you are missing by doing it this way.
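A sketch of what the X86_FEATURE_* gating could look like; X86_FEATURE_MKTME
is assumed here, the series does not define such a flag:

int page_keyid(const struct page *page)
{
	/* static_cpu_has() compiles down to a static branch */
	if (!static_cpu_has(X86_FEATURE_MKTME))
		return 0;

	return lookup_page_ext(page)->keyid;
}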
On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> prep_encrypted_page() also takes care about zeroing the page. We have to
> do this after KeyID is set for the page.
This is an implementation detail that has gone unmentioned until now but
has impacted at least half a dozen locations in previous patches. Can
you rectify that, please?
> +void prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
> +{
> + int i;
> +
> + /*
> + * The hardware/CPU does not enforce coherency between mappings of the
> + * same physical page with different KeyIDs or encrypt ion keys.
What are "encrypt ion"s? :)
> + * We are responsible for cache management.
> + *
> + * We flush cache before allocating encrypted page
> + */
> + clflush_cache_range(page_address(page), PAGE_SIZE << order);
> +
> + for (i = 0; i < (1 << order); i++) {
> + WARN_ON_ONCE(lookup_page_ext(page)->keyid);
/* All pages coming out of the allocator should have KeyID 0 */
> + lookup_page_ext(page)->keyid = keyid;
> + /* Clear the page after the KeyID is set. */
> + if (zero)
> + clear_highpage(page);
> + }
> +}
How expensive is this?
> +void arch_free_page(struct page *page, int order)
> +{
> + int i;
>
/* KeyId-0 pages were not used for MKTME and need no work */
... or something
> + if (!page_keyid(page))
> + return;
Is page_keyid() optimized so that all this goes away automatically when
MKTME is compiled out or unsupported?
> + for (i = 0; i < (1 << order); i++) {
> + WARN_ON_ONCE(lookup_page_ext(page)->keyid > mktme_nr_keyids);
> + lookup_page_ext(page)->keyid = 0;
> + }
> +
> + clflush_cache_range(page_address(page), PAGE_SIZE << order);
> +}
On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> Separate MKTME enumaration from enabling. We need to postpone enabling
> until initialization is complete.
^ enumeration
> The new helper mktme_disable() allows to disable MKTME even if it's
s/to disable/disabling/
> enumerated successfully. MKTME initialization may fail and this
> functionallity allows system to boot regardless of the failure.
What can make it fail?
On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> Kernel need to have a way to access encrypted memory. We are going to
"The kernel needs"...
> use per-KeyID direct mapping to facilitate the access with minimal
> overhead.
What are the security implications of this approach?
> Direct mapping for each KeyID will be put next to each other in the
That needs to be "a direct mapping" or "the direct mapping". It's
missing an article to start the sentence.
> virtual address space. We need to have a way to find boundaries of
> direct mapping for particular KeyID.
>
> The new variable direct_mapping_size specifies the size of direct
> mapping. With the value, it's trivial to find direct mapping for
> KeyID-N: PAGE_OFFSET + N * direct_mapping_size.
I think this deserves an update to Documentation/x86/x86_64/mm.txt, no?
> Size of direct mapping is calculated during KASLR setup. If KALSR is
> disable it happens during MKTME initialization.
"disabled"
> diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
> index 4408cd9a3bef..3d8ef8cb97e1 100644
> --- a/arch/x86/mm/kaslr.c
> +++ b/arch/x86/mm/kaslr.c
> @@ -69,6 +69,15 @@ static inline bool kaslr_memory_enabled(void)
> return kaslr_enabled() && !IS_ENABLED(CONFIG_KASAN);
> }
>
> +#ifndef CONFIG_X86_INTEL_MKTME
> +static void __init setup_direct_mapping_size(void)
> +{
> + direct_mapping_size = max_pfn << PAGE_SHIFT;
> + direct_mapping_size = round_up(direct_mapping_size, 1UL << TB_SHIFT);
> + direct_mapping_size += (1UL << TB_SHIFT) * CONFIG_MEMORY_PHYSICAL_PADDING;
> +}
> +#endif
Comments, please.
> /* Initialize base and padding for each memory region randomized with KASLR */
> void __init kernel_randomize_memory(void)
> {
> @@ -93,7 +102,11 @@ void __init kernel_randomize_memory(void)
> if (!kaslr_memory_enabled())
> return;
>
> - kaslr_regions[0].size_tb = 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT);
> + /*
> + * Upper limit for direct mapping size is 1/4 of whole virtual
> + * address space
> + */
> + kaslr_regions[0].size_tb = 1 << (__VIRTUAL_MASK_SHIFT - 1 - TB_SHIFT);
Is this a cleanup that can be separate?
> kaslr_regions[1].size_tb = VMALLOC_SIZE_TB;
>
> /*
> @@ -101,8 +114,10 @@ void __init kernel_randomize_memory(void)
> * add padding if needed (especially for memory hotplug support).
> */
> BUG_ON(kaslr_regions[0].base != &page_offset_base);
> - memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
> - CONFIG_MEMORY_PHYSICAL_PADDING;
> +
> + setup_direct_mapping_size();
> +
> + memory_tb = direct_mapping_size * mktme_nr_keyids + 1;
What's the +1 for? Is "mktme_nr_keyids" 0 for "MKTME unsupported"?
That needs to be called out, I think.
> /* Adapt phyiscal memory region size based on available memory */
> if (memory_tb < kaslr_regions[0].size_tb)
> diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
> index 43a44f0f2a2d..3e5322bf035e 100644
> --- a/arch/x86/mm/mktme.c
> +++ b/arch/x86/mm/mktme.c
> @@ -89,3 +89,51 @@ static bool need_page_mktme(void)
> struct page_ext_operations page_mktme_ops = {
> .need = need_page_mktme,
> };
> +
> +void __init setup_direct_mapping_size(void)
> +{
> + unsigned long available_va;
> +
> + /* 1/4 of virtual address space is didicated for direct mapping */
> + available_va = 1UL << (__VIRTUAL_MASK_SHIFT - 1);
> +
> + /* How much memory the systrem has? */
> + direct_mapping_size = max_pfn << PAGE_SHIFT;
> + direct_mapping_size = round_up(direct_mapping_size, 1UL << 40);
> +
> + if (mktme_status != MKTME_ENUMERATED)
> + goto out;
> +
> + /*
> + * Not enough virtual address space to address all physical memory with
> + * MKTME enabled. Even without padding.
> + *
> + * Disable MKTME instead.
> + */
> + if (direct_mapping_size > available_va / mktme_nr_keyids + 1) {
> + pr_err("x86/mktme: Disabled. Not enough virtual address space\n");
> + pr_err("x86/mktme: Consider switching to 5-level paging\n");
> + mktme_disable();
> + goto out;
> + }
> +
> + /*
> + * Virtual address space is divided between per-KeyID direct mappings.
> + */
> + available_va /= mktme_nr_keyids + 1;
> +out:
> + /* Add padding, if there's enough virtual address space */
> + direct_mapping_size += (1UL << 40) * CONFIG_MEMORY_PHYSICAL_PADDING;
> + if (direct_mapping_size > available_va)
> + direct_mapping_size = available_va;
> +}
Do you really need two copies of this function? Shouldn't it see
mktme_status!=MKTME_ENUMERATED and just jump out? How is the code
before that "goto out" different from the CONFIG_MKTME=n case?
On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> arch/x86/include/asm/mktme.h | 6 +
> arch/x86/mm/init_64.c | 6 +
> arch/x86/mm/mktme.c | 444 +++++++++++++++++++++++++++++++++++
> 3 files changed, 456 insertions(+)
Can we not do any better than 400 lines of new open-coded pagetable
hacking?
> diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
> index efc0d4bb3b35..d6edcabacfc7 100644
> --- a/arch/x86/include/asm/mktme.h
> +++ b/arch/x86/include/asm/mktme.h
> @@ -43,6 +43,9 @@ void mktme_disable(void);
> void setup_direct_mapping_size(void);
> int sync_direct_mapping(void);
>
> +#define page_to_virt(x) \
> + (__va(PFN_PHYS(page_to_pfn(x))) + page_keyid(x) * direct_mapping_size)
This looks like a super important memory management function being
defined in some obscure Intel-specific feature header. How does that work?
> #else
> #define mktme_keyid_mask ((phys_addr_t)0)
> #define mktme_nr_keyids 0
> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> index 53c32af895ab..ffad496aadad 100644
> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -23,7 +23,7 @@ static inline unsigned long __phys_addr_nodebug(unsigned long x)
> /* use the carry flag to determine if x was < __START_KERNEL_map */
> x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));
>
> - return x;
> + return x % direct_mapping_size;
> }
What are the performance implications of this patch?
On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> Add new config option to enabled/disable Multi-Key Total Memory
> Encryption support.
>
> MKTME uses MEMORY_PHYSICAL_PADDING to reserve enough space in per-KeyID
> direct mappings for memory hotplug.
Isn't it really *the* direct mapping primarily? We make all of them
larger, but the direct mapping is impacted too. This makes it sound
like it applies only to the MKTME mappings.
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 4fa2cf807321..d013495bb4ae 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1513,6 +1513,23 @@ config ARCH_USE_MEMREMAP_PROT
> def_bool y
> depends on AMD_MEM_ENCRYPT
>
> +config X86_INTEL_MKTME
> + bool "Intel Multi-Key Total Memory Encryption"
> + select DYNAMIC_PHYSICAL_MASK
> + select PAGE_EXTENSION
> + depends on X86_64 && CPU_SUP_INTEL
> + ---help---
> + Say yes to enable support for Multi-Key Total Memory Encryption.
> + This requires Intel processor that has support of the feature.
"requires an Intel processor"...
> + Multikey Total Memory Encryption (MKTME) is a technology that allows
> + transparent memory encryption in upcoming Intel platforms.
"in an upcoming"
> + MKTME is built on top of TME. TME allows encryption of the entirety
> + of system memory using a single key. MKTME allows to have multiple
"allows having multiple"...
> + encryption domains, each having own key -- different memory pages can
> + be encrypted with different keys.
> +
> # Common NUMA Features
> config NUMA
> bool "Numa Memory Allocation and Scheduler Support"
> @@ -2189,7 +2206,7 @@ config RANDOMIZE_MEMORY
>
> config MEMORY_PHYSICAL_PADDING
> hex "Physical memory mapping padding" if EXPERT
> - depends on RANDOMIZE_MEMORY
> + depends on RANDOMIZE_MEMORY || X86_INTEL_MKTME
> default "0xa" if MEMORY_HOTPLUG
> default "0x0"
> range 0x1 0x40 if MEMORY_HOTPLUG
>
On Wed, Jun 13, 2018 at 05:45:24PM +0000, Dave Hansen wrote:
> On 06/12/2018 07:38 AM, Kirill A. Shutemov wrote:
> > VMAs with different KeyID do not mix together. Only VMAs with the same
> > KeyID are compatible.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > include/linux/mm.h | 7 +++++++
> > mm/mmap.c | 3 ++-
> > 2 files changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 02a616e2f17d..1c3c15f37ed6 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1492,6 +1492,13 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma)
> > return !vma->vm_ops;
> > }
> >
> > +#ifndef vma_keyid
> > +static inline int vma_keyid(struct vm_area_struct *vma)
> > +{
> > + return 0;
> > +}
> > +#endif
>
> I'm generally not a fan of this #ifdef'ing method. It makes it hard to
> figure out who is supposed to define it, and it's also substantially
> more fragile in the face of #include ordering.
>
> I'd much rather see some Kconfig involvement, like
> CONFIG_ARCH_HAS_MEM_ENCRYPTION or something.
Well, it's a matter of taste, I guess. I do prefer per-function #ifdef'ing.
It seems more flexible to me.
I can rework it if maintainers prefer to see a config option instead.
--
Kirill A. Shutemov
On Wed, Jun 13, 2018 at 05:50:24PM +0000, Dave Hansen wrote:
> On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> > Pages for encrypted VMAs have to be allocated in a special way:
> > we would need to propagate down not only desired NUMA node but also
> > whether the page is encrypted.
> >
> > It complicates not-so-trivial routine of huge page allocation in
> > khugepaged even more. It also puts more pressure on page allocator:
> > we cannot re-use pages allocated for encrypted VMA to collapse
> > page in unencrypted one or vice versa.
> >
> > I think for now it worth skipping encrypted VMAs. We can return
> > to this topic later.
>
> You're asking for this to be included, but without a major piece of THP
> support. Is THP support unimportant for this feature?
>
> Are we really asking the x86 maintainers to merge this feature with this
> restriction in place?
I gave it more thought after your comment and I think I see a way to get
khugepaged to work with memory encryption.
Let me check it tomorrow.
--
Kirill A. Shutemov
On 06/13/2018 01:18 PM, Kirill A. Shutemov wrote:
>> Are we really asking the x86 maintainers to merge this feature with this
>> restriction in place?
> I gave it more thought after your comment and I think I see a way to get
> khugepaged work with memory encryption.
So should folks be reviewing this set, or skip it and wait on your new set?
On Wed, Jun 13, 2018 at 05:51:50PM +0000, Dave Hansen wrote:
> On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> > Pages encrypted with different encryption keys are not subject to KSM
> > merge. Otherwise it would cross security boundary.
>
> This needs a much stronger explanation.
Okay, fair enough.
> Which KeyID would be used for access in the new direct mappings?
New direct mapping?
Pages would be compared using the direct mappings relevant for their KeyID.
They will be treated as identical if their plaintext is identical.
> What actually happens without this patch in place?
One of the processes would get the page mapped with the wrong KeyID and
see garbage. We set up the mapping according to the KeyID in
vma->vm_page_prot.
--
Kirill A. Shutemov
On 06/13/2018 01:31 PM, Kirill A. Shutemov wrote:
>> What actually happens without this patch in place?
>
> One of processes would get the page mapped with wrong KeyID and see
> garbage.
OK, but what about two pages with the same KeyID? It's actually totally
possible for KSM to determine that two pages have the same plaintext and
merge them. Why don't we do that?
> We setup mapping according to KeyID in vma->vm_page_prot.
Then why do we bother with page_keyid() and the page_ext stuff?
On Wed, Jun 13, 2018 at 08:20:28PM +0000, Dave Hansen wrote:
> On 06/13/2018 01:18 PM, Kirill A. Shutemov wrote:
> >> Are we really asking the x86 maintainers to merge this feature with this
> >> restriction in place?
> > I gave it more thought after your comment and I think I see a way to get
> > khugepaged work with memory encryption.
>
> So should folks be reviewing this set, or skip it an wait on your new set?
You gave me a fair bit of feedback to work on, but AFAICS it doesn't change
anything fundamentally.
More feedback is welcome.
--
Kirill A. Shutemov
On 06/13/2018 01:35 PM, Dave Hansen wrote:
> On 06/13/2018 01:31 PM, Kirill A. Shutemov wrote:
>>> What actually happens without this patch in place?
>> One of processes would get the page mapped with wrong KeyID and see
>> garbage.
> OK, but what about two pages with the same KeyID? It's actually totally
> possible for KSM to determine that two pages have the same plaintext and
> merge them. Why don't we do that?
/me goes back to look at the patch... which is doing exactly that.
But this still needs a stronger explanation of what goes wrong if this
patch is not in place.
On Wed, Jun 13, 2018 at 08:35:46PM +0000, Dave Hansen wrote:
> On 06/13/2018 01:31 PM, Kirill A. Shutemov wrote:
> >> What actually happens without this patch in place?
> >
> > One of processes would get the page mapped with wrong KeyID and see
> > garbage.
>
> OK, but what about two pages with the same KeyID? It's actually totally
> possible for KSM to determine that two pages have the same plaintext and
> merge them. Why don't we do that?
That's exactly what we do :)
> > We setup mapping according to KeyID in vma->vm_page_prot.
>
> Then why do we bother with page_keyid() and the page_ext stuff?
The VMA is not always around.
By keeping the KeyID in vma->vm_page_prot we don't need to change anything
in the PTE setup functions. It just works.
--
Kirill A. Shutemov
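For reference, the reason the vm_page_prot approach needs no PTE-setup
changes: the fault path already builds PTEs from vma->vm_page_prot, so a
KeyID stored there flows into the upper PFN bits of the entry. A simplified
sketch (the helper name is made up):

static pte_t mk_encrypted_pte(struct page *page, struct vm_area_struct *vma)
{
	/* KeyID bits come along with the rest of the protection bits */
	return mk_pte(page, vma->vm_page_prot);
}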
On Wed, Jun 13, 2018 at 06:07:40PM +0000, Dave Hansen wrote:
> On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> > For encrypted memory, we need to allocated pages for a specific
> > encryption KeyID.
>
> "allocate" ^
>
> > There are two cases when we need to allocate a page for encryption:
> >
> > - Allocation for an encrypted VMA;
> >
> > - Allocation for migration of encrypted page;
> >
> > The first case can be covered within alloc_page_vma().
>
> ... because we know the KeyID from the VMA?
Right. I'll update the commit message.
> > The second case requires few new page allocation routines that would
> > allocate the page for a specific KeyID.
> >
> > Encrypted page has to be cleared after KeyID set. This is handled by
>
> "An encrypted page has ... "
>
> This description lacks a description of the performance impact of the
> approach in this patch both when allocating encrypted and normal pages.
You are right. I'll measure for the next iteration.
> > --- a/arch/alpha/include/asm/page.h
> > +++ b/arch/alpha/include/asm/page.h
> > @@ -18,7 +18,7 @@ extern void clear_page(void *page);
> > #define clear_user_page(page, vaddr, pg) clear_page(page)
> >
> > #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
> > - alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vmaddr)
> > + alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
> > #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
>
> Does this compile? Wouldn't "vmaddr" be undefined?
Yes, it compiles.
Before I reorganized the macros around alloc_page_vma(), the argument was
ignored on non-NUMA systems. NUMA on Alpha is marked BROKEN and has never
been enabled.
> > +#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
> > + alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
>
> The argument addition should be broken out into a preparatory patch.
There's no new argument. I've just unified the alloc_hugepage_vma()
codepath for NUMA and non-NUMA.
But sure, I'll split it into a separate patch.
> > extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
> > extern unsigned long get_zeroed_page(gfp_t gfp_mask);
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index f2b4abbca55e..6da504bad841 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -38,9 +38,11 @@ static inline struct page *new_page_nodemask(struct page *page,
> > unsigned int order = 0;
> > struct page *new_page = NULL;
> >
> > - if (PageHuge(page))
> > + if (PageHuge(page)) {
> > + WARN_ON(page_keyid(page));
> > return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
> > preferred_nid, nodemask);
> > + }
>
> Comment on the warning, please.
Sure.
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 9ac49ef17b4e..00bccbececea 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -920,22 +920,24 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
> > /* page allocation callback for NUMA node migration */
> > struct page *alloc_new_node_page(struct page *page, unsigned long node)
> > {
> > - if (PageHuge(page))
> > + if (PageHuge(page)) {
> > + WARN_ON(page_keyid(page));
> > return alloc_huge_page_node(page_hstate(compound_head(page)),
> > node);
>
> Comments, please.
>
> > @@ -2012,9 +2014,16 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> > {
> > struct mempolicy *pol;
> > struct page *page;
> > - int preferred_nid;
> > + bool zero = false;
> > + int keyid, preferred_nid;
> > nodemask_t *nmask;
> >
> > + keyid = vma_keyid(vma);
> > + if (keyid && gfp & __GFP_ZERO) {
> > + zero = true;
> > + gfp &= ~__GFP_ZERO;
> > + }
>
> I totally read that wrong.
>
> "zero" needs to be named: "page_need_zeroing".
>
> It also badly needs a comment.
Got it.
> > pol = get_vma_policy(vma, addr);
> >
> > if (pol->mode == MPOL_INTERLEAVE) {
> > @@ -2057,6 +2066,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> > page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
> > mpol_cond_put(pol);
> > out:
> > + if (page && keyid)
> > + prep_encrypted_page(page, order, keyid, zero);
> > return page;
> > }
>
> I'd just have prep_encrypted_page() do the keyid-0 opt-out of the prep
> work. It'll be less to patch when you
Makes sense.
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 8c0af0f7cab1..eb8dea219dcb 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1847,7 +1847,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
> > int nid = (int) data;
> > struct page *newpage;
> >
> > - newpage = __alloc_pages_node(nid,
> > + newpage = __alloc_pages_node_keyid(nid, page_keyid(page),
> > (GFP_HIGHUSER_MOVABLE |
> > __GFP_THISNODE | __GFP_NOMEMALLOC |
> > __GFP_NORETRY | __GFP_NOWARN) &
>
> I thought folks asked you not to change all of the calling conventions
> across the page allocator. It seems like you're still doing that,
> though. A reviewer might think you've ignored their earlier feedback.
> Did you?
No. I was asked to implement encrypted page allocation as wrappers on top
of existing routines.
> > +#ifndef CONFIG_NUMA
> > +struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
> > + struct vm_area_struct *vma, unsigned long addr,
> > + int node, bool hugepage)
> > +{
> > + struct page *page;
> > + bool zero = false;
> > + int keyid = vma_keyid(vma);
> > +
> > + if (keyid && (gfp_mask & __GFP_ZERO)) {
>
> Please at least do your parenthesis consistently. :)
Okay.
> > + zero = true;
> > + gfp_mask &= ~__GFP_ZERO;
> > + }
> > +
> > + page = alloc_pages(gfp_mask, order);
> > + if (page && keyid)
> > + prep_encrypted_page(page, order, keyid, zero);
> > +
> > + return page;
> > +}
> > +#endif
>
> I'm also confused by the #ifdef. What is it for?
We already have alloc_pages_vma() for NUMA. See mm/mempolicy.c.
> > +struct page * __alloc_pages_node_keyid(int nid, int keyid,
> > + gfp_t gfp_mask, unsigned int order)
> > +{
> > + struct page *page;
> > + bool zero = false;
> > +
> > + VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> > + VM_WARN_ON(!node_online(nid));
> > +
> > + if (keyid && (gfp_mask & __GFP_ZERO)) {
> > + zero = true;
> > + gfp_mask &= ~__GFP_ZERO;
> > + }
>
> OK, so this is the third time I've seen that pattern. Are you *sure*
> you don't want to consolidate the sites?
I'll see what I can do here.
I'm not sure a wrapper will be cleaner for a reader: we need to return two
values, the new gfp_mask and page_need_zeroing.
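One possible shape for such a helper, returning the second value through a
pointer (a sketch; the name is made up):

static inline bool strip_gfp_zero_for_keyid(gfp_t *gfp_mask, int keyid)
{
	/*
	 * Encrypted pages must be zeroed only after the KeyID is set,
	 * so strip __GFP_ZERO here and let the caller zero the page later.
	 */
	if (keyid && (*gfp_mask & __GFP_ZERO)) {
		*gfp_mask &= ~__GFP_ZERO;
		return true;
	}
	return false;
}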
> > + page = __alloc_pages(gfp_mask, order, nid);
> > + if (page && keyid)
> > + prep_encrypted_page(page, order, keyid, zero);
> > +
> > + return page;
> > +}
> > +
> > #ifdef CONFIG_LOCKDEP
> > struct lockdep_map __fs_reclaim_map =
> > STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
> > @@ -4396,6 +4439,26 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> > }
> > EXPORT_SYMBOL(__alloc_pages_nodemask);
> >
> > +struct page *
> > +__alloc_pages_nodemask_keyid(gfp_t gfp_mask, unsigned int order,
> > + int preferred_nid, nodemask_t *nodemask, int keyid)
> > +{
> > + struct page *page;
> > + bool zero = false;
> > +
> > + if (keyid && (gfp_mask & __GFP_ZERO)) {
> > + zero = true;
> > + gfp_mask &= ~__GFP_ZERO;
> > + }
>
> Fourth one. :)
>
--
Kirill A. Shutemov
On Wed, Jun 13, 2018 at 06:13:03PM +0000, Dave Hansen wrote:
> On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> > Encrypted VMA will have KeyID stored in vma->vm_page_prot. This way we
>
> "An encrypted VMA..."
>
> > don't need to do anything special to setup encrypted page table entries
> > and don't need to reserve space for KeyID in a VMA.
> >
> > This patch changes _PAGE_CHG_MASK to include KeyID bits. Otherwise they
> > are going to be stripped from vm_page_prot on the first pgprot_modify().
> >
> > Define PTE_PFN_MASK_MAX similar to PTE_PFN_MASK but based on
> > __PHYSICAL_MASK_SHIFT. This way we include whole range of bits
> > architecturally available for PFN without referencing physical_mask and
> > mktme_keyid_mask variables.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > arch/x86/include/asm/pgtable_types.h | 7 ++++++-
> > 1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> > index 1e5a40673953..e8ebe760b88d 100644
> > --- a/arch/x86/include/asm/pgtable_types.h
> > +++ b/arch/x86/include/asm/pgtable_types.h
> > @@ -121,8 +121,13 @@
> > * protection key is treated like _PAGE_RW, for
> > * instance, and is *not* included in this mask since
> > * pte_modify() does modify it.
> > + *
> > + * It includes full range of PFN bits regardless if they were claimed for KeyID
> > + * or not: we want to preserve KeyID on pte_modify() and pgprot_modify().
> > */
> > -#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
> > +#define PTE_PFN_MASK_MAX \
> > + (((signed long)PAGE_MASK) & ((1ULL << __PHYSICAL_MASK_SHIFT) - 1))
>
> "signed long" is really unusual to see. Was that intentional?
Yes. That's a sign-extension trick, borrowed from the PHYSICAL_PAGE_MASK
definition. On 32-bit with PAE it properly expands PAGE_MASK to 64 bits.
I'll add a comment.
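Spelled out for the 32-bit PAE case the trick is aimed at:

/*
 * On 32-bit PAE, PAGE_MASK is a 32-bit value but PTEs are 64-bit wide:
 *
 *   (u64)(signed long)PAGE_MASK   = 0xfffffffffffff000  (sign-extended, wanted)
 *   (u64)(unsigned long)PAGE_MASK = 0x00000000fffff000  (zero-extended, wrong)
 */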
> > +#define _PAGE_CHG_MASK (PTE_PFN_MASK_MAX | _PAGE_PCD | _PAGE_PWT | \
> > _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
> > _PAGE_SOFT_DIRTY)
> > #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
>
> This makes me a bit nervous. We have some places (here) where we
> pretend that the KeyID is part of the paddr and then other places like
> pte_pfn() where it's not.
The other option is to include the KeyID mask into _PAGE_CHG_MASK. But that
means _PAGE_CHG_MASK would need to reference *two* variables: physical_mask
and mktme_keyid_mask. I mentioned this in the commit message.
This is a more efficient way to achieve the same thing at compile time,
without referencing any variables.
> Seems like something that will come back to bite us.
Any suggestions?
--
Kirill A. Shutemov
On Wed, Jun 13, 2018 at 06:18:05PM +0000, Dave Hansen wrote:
> On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> > +bool vma_is_encrypted(struct vm_area_struct *vma)
> > +{
> > + return pgprot_val(vma->vm_page_prot) & mktme_keyid_mask;
> > +}
> > +
> > +int vma_keyid(struct vm_area_struct *vma)
> > +{
> > + pgprotval_t prot;
> > +
> > + if (!vma_is_anonymous(vma))
> > + return 0;
> > +
> > + prot = pgprot_val(vma->vm_page_prot);
> > + return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
> > +}
>
> Why do we have a vma_is_anonymous() in one of these but not the other?
It shouldn't be there. It's left over from an earlier approach to the
function. I'll fix this.
And I'll drop vma_is_encrypted(). It is not very useful.
> While this reuse of ->vm_page_prot is cute, is there any downside? It's
> the first place I know of that we can't derive ->vm_page_prot from
> ->vm_flags on non-VM_IO/PFNMAP VMAs. Is that a problem?
I don't think so.
It needs to be covered in pte_modify() and such, but that's about it.
It's a relatively isolated change, and we can move the KeyID into a
standalone field if this approach proves to be problematic.
--
Kirill A. Shutemov
On 06/15/2018 05:57 AM, Kirill A. Shutemov wrote:
>>> +#define _PAGE_CHG_MASK (PTE_PFN_MASK_MAX | _PAGE_PCD | _PAGE_PWT | \
>>> _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
>>> _PAGE_SOFT_DIRTY)
>>> #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
>> This makes me a bit nervous. We have some places (here) where we
>> pretend that the KeyID is part of the paddr and then other places like
>> pte_pfn() where it's not.
> Other option is to include KeyID mask into _PAGE_CHG_MASK. But it means
> _PAGE_CHG_MASK would need to reference *two* variables: physical_mask and
> mktme_keyid_mask. I mentioned this in the commit message.
Why can't it be one variable with a different name that's populated by
OR'ing physical_mask and mktme_keyid_mask together?
My issue here is that this approach adds confusion around the logical
separation between the physical address and the bits immediately above the
physical address in the PTE that are stolen for the KeyID.
Whatever you come up with will probably be fine, as long as things that are
called "PFN" or physical address don't also get used for KeyID bits.
On Fri, Jun 15, 2018 at 01:43:03PM +0000, Dave Hansen wrote:
> On 06/15/2018 05:57 AM, Kirill A. Shutemov wrote:
> >>> +#define _PAGE_CHG_MASK (PTE_PFN_MASK_MAX | _PAGE_PCD | _PAGE_PWT | \
> >>> _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
> >>> _PAGE_SOFT_DIRTY)
> >>> #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
> >> This makes me a bit nervous. We have some places (here) where we
> >> pretend that the KeyID is part of the paddr and then other places like
> >> pte_pfn() where it's not.
> > Other option is to include KeyID mask into _PAGE_CHG_MASK. But it means
> > _PAGE_CHG_MASK would need to reference *two* variables: physical_mask and
> > mktme_keyid_mask. I mentioned this in the commit message.
>
> Why can't it be one variable with a different name that's populated by
> OR'ing physical_mask and mktme_keyid_mask together?
My point is that we don't need variables at all here.
The architecture defines the range of PTE bits used for the PFN. MKTME
reduces the number of bits available for the PFN. PTE_PFN_MASK_MAX
represents the original architectural range, before MKTME stole those bits.
PTE_PFN_MASK_MAX is a constant -- bits 51:12 on x86-64 -- regardless of
MKTME support.
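Worked out for x86-64, where __PHYSICAL_MASK_SHIFT is 52:

/*
 *   (signed long)PAGE_MASK                = 0xfffffffffffff000
 *   (1ULL << __PHYSICAL_MASK_SHIFT) - 1   = 0x000fffffffffffff
 *   PTE_PFN_MASK_MAX (AND of the two)     = 0x000ffffffffff000  -- bits 51:12
 */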
> My issue here is that it this approach adds confusion around the logical
> separation between physical address and the bits immediately above the
> physical address in the PTE that are stolen for the keyID.
Well, yes, with MKTME the meaning of PFN/physical address is not that
clear-cut. This comes from the hardware design. I don't think we will be
able to get rid of the ambiguity completely.
If you have suggestions on how to make it clearer, I would be glad to
rework it.
> Whatever you come up with will probably fine, as long as things that are
> called "PFN" or physical address don't also get used for keyID bits.
We are arguing about macros used exactly once. Is it really so confusing?
--
Kirill A. Shutemov
On 06/15/2018 08:27 AM, Kirill A. Shutemov wrote:
> On Fri, Jun 15, 2018 at 01:43:03PM +0000, Dave Hansen wrote:
>> On 06/15/2018 05:57 AM, Kirill A. Shutemov wrote:
>>>>> +#define _PAGE_CHG_MASK (PTE_PFN_MASK_MAX | _PAGE_PCD | _PAGE_PWT | \
>>>>> _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
>>>>> _PAGE_SOFT_DIRTY)
>>>>> #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
>>>> This makes me a bit nervous. We have some places (here) where we
>>>> pretend that the KeyID is part of the paddr and then other places like
>>>> pte_pfn() where it's not.
>>> Other option is to include KeyID mask into _PAGE_CHG_MASK. But it means
>>> _PAGE_CHG_MASK would need to reference *two* variables: physical_mask and
>>> mktme_keyid_mask. I mentioned this in the commit message.
>>
>> Why can't it be one variable with a different name that's populated by
>> OR'ing physical_mask and mktme_keyid_mask together?
>
> My point is that we don't need variables at all here.
>
> Architecture defines range of bits in PTE used for PFN. MKTME reduces the
> number of bits for PFN. PTE_PFN_MASK_MAX represents the original
> architectural range, before MKTME stole these bits.
>
> PTE_PFN_MASK_MAX is constant -- on x86-64 bits 51:12 -- regardless of
> MKTME support.
Then please just rename the mask to PTE_<SOMETHING>_MASK where <SOMETHING>
includes both the concept of a physical address and an MKTME keyID. Just
don't call it a pfn if it is not used in physical addressing.
>> Whatever you come up with will probably fine, as long as things that are
>> called "PFN" or physical address don't also get used for keyID bits.
>
> We are arguing about macros used exactly once. Is it really so confusing?
Yes.
On Fri, Jun 15, 2018 at 03:31:57PM +0000, Dave Hansen wrote:
> On 06/15/2018 08:27 AM, Kirill A. Shutemov wrote:
> > On Fri, Jun 15, 2018 at 01:43:03PM +0000, Dave Hansen wrote:
> >> On 06/15/2018 05:57 AM, Kirill A. Shutemov wrote:
> >>>>> +#define _PAGE_CHG_MASK (PTE_PFN_MASK_MAX | _PAGE_PCD | _PAGE_PWT | \
> >>>>> _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
> >>>>> _PAGE_SOFT_DIRTY)
> >>>>> #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
> >>>> This makes me a bit nervous. We have some places (here) where we
> >>>> pretend that the KeyID is part of the paddr and then other places like
> >>>> pte_pfn() where it's not.
> >>> Other option is to include KeyID mask into _PAGE_CHG_MASK. But it means
> >>> _PAGE_CHG_MASK would need to reference *two* variables: physical_mask and
> >>> mktme_keyid_mask. I mentioned this in the commit message.
> >>
> >> Why can't it be one variable with a different name that's populated by
> >> OR'ing physical_mask and mktme_keyid_mask together?
> >
> > My point is that we don't need variables at all here.
> >
> > Architecture defines range of bits in PTE used for PFN. MKTME reduces the
> > number of bits for PFN. PTE_PFN_MASK_MAX represents the original
> > architectural range, before MKTME stole these bits.
> >
> > PTE_PFN_MASK_MAX is constant -- on x86-64 bits 51:12 -- regardless of
> > MKTME support.
>
> Then please just rename the mask to PTE_<SOMETHING>_MASK where <SOMETHING>
> includes both the concept of a physical address and an MKTME keyID. Just
> don't call it a pfn if it is not used in physical addressing.
I have no idea what such a concept should be called. I cannot come up with
anything better than PTE_PFN_MASK_MAX. Do you?
--
Kirill A. Shutemov
On 06/15/2018 09:06 AM, Kirill A. Shutemov wrote:
> I have no idea what such concept should be called. I cannot come with
> anything better than PTE_PFN_MASK_MAX. Do you?
PTE_PRESERVE_MASK
Or maybe:
PTE_MODIFY_PRESERVE_MASK
Maybe a comment to go along with it:
/*
* These are the bits that must be preserved during when doing a
* PTE permission modification operation, like taking a PTE from
* R/W->R/O. They include the physical address and the memory
* encryption keyID. The paddr and the keyID never occupy the same
* bits at the same time. But, a given bit might be used for the keyID
* on one system and used for the physical address on another. As an
* optimization, we manage them in one unit here since their combination
* always occupies the same hardware bits.
*/
On Fri, Jun 15, 2018 at 04:58:24PM +0000, Dave Hansen wrote:
> On 06/15/2018 09:06 AM, Kirill A. Shutemov wrote:
> > I have no idea what such concept should be called. I cannot come with
> > anything better than PTE_PFN_MASK_MAX. Do you?
>
> PTE_PRESERVE_MASK
>
> Or maybe:
>
> PTE_MODIFY_PRESERVE_MASK
It just replaces one confusion with another. Preserve what? How does it
differ from _PAGE_CHG_MASK?
I frankly think my name proposal conveys more meaning.
> Maybe a comment to go along with it:
>
> /*
> * These are the bits that must be preserved during when doing a
> * PTE permission modification operation, like taking a PTE from
> * R/W->R/O. They include the physical address and the memory
> * encryption keyID. The paddr and the keyID never occupy the same
> * bits at the same time. But, a given bit might be used for the keyID
> * on one system and used for the physical address on another. As an
> * optimization, we manage them in one unit here since their combination
> * always occupies the same hardware bits.
> */
Thanks, this is useful.
--
Kirill A. Shutemov
On 06/15/2018 01:45 PM, Kirill A. Shutemov wrote:
> On Fri, Jun 15, 2018 at 04:58:24PM +0000, Dave Hansen wrote:
>> On 06/15/2018 09:06 AM, Kirill A. Shutemov wrote:
>>> I have no idea what such concept should be called. I cannot come with
>>> anything better than PTE_PFN_MASK_MAX. Do you?
>> PTE_PRESERVE_MASK
>>
>> Or maybe:
>>
>> PTE_MODIFY_PRESERVE_MASK
> It just replacing one confusion with another. Preserve what?
"Preserve this mask when modifying pte permission bits"
On Fri, Jun 15, 2018 at 08:45:21PM +0000, Dave Hansen wrote:
> On 06/15/2018 01:45 PM, Kirill A. Shutemov wrote:
> > On Fri, Jun 15, 2018 at 04:58:24PM +0000, Dave Hansen wrote:
> >> On 06/15/2018 09:06 AM, Kirill A. Shutemov wrote:
> >>> I have no idea what such concept should be called. I cannot come with
> >>> anything better than PTE_PFN_MASK_MAX. Do you?
> >> PTE_PRESERVE_MASK
> >>
> >> Or maybe:
> >>
> >> PTE_MODIFY_PRESERVE_MASK
> > It just replacing one confusion with another. Preserve what?
>
> "Preserve this mask when modifying pte permission bits"
We preserve more than these bits. See _PAGE_CHG_MASK.
--
Kirill A. Shutemov
On Wed, Jun 13, 2018 at 06:26:10PM +0000, Dave Hansen wrote:
> On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> > prep_encrypted_page() also takes care about zeroing the page. We have to
> > do this after KeyID is set for the page.
>
> This is an implementation detail that has gone unmentioned until now but
> has impacted at least half a dozen locations in previous patches. Can
> you rectify that, please?
It was mentioned in the commit message of patch 04/17.
> > +void prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
> > +{
> > + int i;
> > +
> > + /*
> > + * The hardware/CPU does not enforce coherency between mappings of the
> > + * same physical page with different KeyIDs or encrypt ion keys.
>
> What are "encrypt ion"s? :)
:P
> > + * We are responsible for cache management.
> > + *
> > + * We flush cache before allocating encrypted page
> > + */
> > + clflush_cache_range(page_address(page), PAGE_SIZE << order);
> > +
> > + for (i = 0; i < (1 << order); i++) {
> > + WARN_ON_ONCE(lookup_page_ext(page)->keyid);
>
> /* All pages coming out of the allocator should have KeyID 0 */
>
Okay.
> > + lookup_page_ext(page)->keyid = keyid;
> > + /* Clear the page after the KeyID is set. */
> > + if (zero)
> > + clear_highpage(page);
> > + }
> > +}
>
> How expensive is this?
It just shifts the cost of zeroing from the page allocator to here. It
should not have a huge effect.
> > +void arch_free_page(struct page *page, int order)
> > +{
> > + int i;
> >
>
> /* KeyId-0 pages were not used for MKTME and need no work */
>
> ... or something
Okay.
> > + if (!page_keyid(page))
> > + return;
>
> Is page_keyid() optimized so that all this goes away automatically when
> MKTME is compiled out or unsupported?
If MKTME is not enabled at compile time, this translation unit is not
compiled at all.
I have not yet optimized for the case where MKTME is unsupported at run
time. I'll optimize it based on performance measurements.
> > + for (i = 0; i < (1 << order); i++) {
> > + WARN_ON_ONCE(lookup_page_ext(page)->keyid > mktme_nr_keyids);
> > + lookup_page_ext(page)->keyid = 0;
> > + }
> > +
> > + clflush_cache_range(page_address(page), PAGE_SIZE << order);
> > +}
>
>
>
>
--
Kirill A. Shutemov
On Wed, Jun 13, 2018 at 06:30:02PM +0000, Dave Hansen wrote:
> On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> > Separate MKTME enumaration from enabling. We need to postpone enabling
> > until initialization is complete.
>
> ^ enumeration
Nope.
I want to differentiate between enumeration in detect_tme() and the point
where MKTME becomes usable: after mktme_init().
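To illustrate the distinction, a hypothetical sketch of the states involved
(MKTME_ENUMERATED and MKTME_ENABLED appear elsewhere in this thread; the
other names and the exact set are my guess):

enum mktme_state {
	MKTME_NONE,		/* not enumerated on this system */
	MKTME_ENUMERATED,	/* detect_tme() found MKTME */
	MKTME_ENABLED,		/* mktme_init() finished; MKTME is usable */
	MKTME_DISABLED,		/* enumerated, but disabled (e.g. init failed) */
};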
> > The new helper mktme_disable() allows to disable MKTME even if it's
>
> s/to disable/disabling/
> > enumerated successfully. MKTME initialization may fail and this
> > functionallity allows system to boot regardless of the failure.
>
> What can make it fail?
I'll add this to the commit message:
MKTME needs per-KeyID direct mappings. They require a lot more virtual
address space, which may be a problem in 4-level paging mode if the
system has more physical memory than we can handle with MKTME enabled.
The helper allows MKTME initialization to fail while still booting the
system successfully.
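A minimal sketch of what such a helper might do, assuming the mktme_*
variables introduced earlier in the series and the hypothetical
MKTME_DISABLED state from the sketch above (illustration only):

void __init mktme_disable(void)
{
	/* Give the stolen KeyID bits back to the physical address mask */
	physical_mask |= mktme_keyid_mask;
	mktme_keyid_mask = 0;
	mktme_keyid_shift = 0;
	mktme_nr_keyids = 0;
	mktme_status = MKTME_DISABLED;
}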
--
Kirill A. Shutemov
On Wed, Jun 13, 2018 at 06:20:10PM +0000, Dave Hansen wrote:
> > +int page_keyid(const struct page *page)
> > +{
> > + if (mktme_status != MKTME_ENABLED)
> > + return 0;
> > +
> > + return lookup_page_ext(page)->keyid;
> > +}
> > +EXPORT_SYMBOL(page_keyid);
>
> Please start using a proper X86_FEATURE_* flag for this. It will give
> you all the fancy static patching that you are missing by doing it this way.
There's no MKTME CPU feature.
Well, I guess we can invent syntactic one or just use static key directly.
Let's see how it behaves performance-wise before optimizing this.
--
Kirill A. Shutemov
On 06/18/2018 03:07 AM, Kirill A. Shutemov wrote:
> On Wed, Jun 13, 2018 at 06:20:10PM +0000, Dave Hansen wrote:
>>> +int page_keyid(const struct page *page)
>>> +{
>>> + if (mktme_status != MKTME_ENABLED)
>>> + return 0;
>>> +
>>> + return lookup_page_ext(page)->keyid;
>>> +}
>>> +EXPORT_SYMBOL(page_keyid);
>> Please start using a proper X86_FEATURE_* flag for this. It will give
>> you all the fancy static patching that you are missing by doing it this way.
> There's no MKTME CPU feature.
Right. We have tons of synthetic features that have no basis in the
hardware CPUID feature.
> Well, I guess we can invent syntactic one or just use static key directly.
Did you mean synthetic?
> Let's see how it behaves performance-wise before optimizing this.
It's not an optimization, it's how we do things in arch/x86, and it has
a *ton* of optimization infrastructure behind it that you get for free
if you use it.
I'm just trying to save Thomas's tired fingers from having to say the
same thing in a week or two when he looks at this.
On Wed, Jun 13, 2018 at 06:37:07PM +0000, Dave Hansen wrote:
> On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> > Kernel need to have a way to access encrypted memory. We are going to
> "The kernel needs"...
>
> > use per-KeyID direct mapping to facilitate the access with minimal
> > overhead.
>
> What are the security implications of this approach?
I'll add this to the message:
Per-KeyID mappings require a lot more virtual address space. On a 4-level
machine with 64 KeyIDs we max out the 46-bit virtual address space dedicated
to the direct mapping with just 1 TiB of RAM. Given that we round any
calculation of the direct mapping size up to 1 TiB, we effectively claim the
whole 46-bit address space for the direct mapping on such a machine
regardless of RAM size.
Increased usage of virtual address space has implications for KASLR: we have
less space for randomization. With 64 TiB claimed for the direct mapping in
4-level paging mode, we are left with 27 TiB of entropy to place
page_offset_base, vmalloc_base and vmemmap_base.
5-level paging provides a much wider virtual address space, and KASLR
doesn't suffer significantly from per-KeyID direct mappings.
It's preferable to run MKTME with 5-level paging.
> > Direct mapping for each KeyID will be put next to each other in the
>
> That needs to be "a direct mapping" or "the direct mapping". It's
> missing an article to start the sentence.
Okay.
> > virtual address space. We need to have a way to find boundaries of
> > direct mapping for particular KeyID.
> >
> > The new variable direct_mapping_size specifies the size of direct
> > mapping. With the value, it's trivial to find direct mapping for
> > KeyID-N: PAGE_OFFSET + N * direct_mapping_size.
>
> I think this deserves an update to Documentation/x86/x86_64/mm.txt, no?
Right, I'll update it.
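As a small illustration of the layout described above (the KeyID-N direct
mapping starts at PAGE_OFFSET + N * direct_mapping_size), a hypothetical
helper could look like:

/* Hypothetical helper: start of the direct mapping for a given KeyID */
static inline unsigned long keyid_direct_mapping_start(int keyid)
{
	return PAGE_OFFSET + (unsigned long)keyid * direct_mapping_size;
}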
> > Size of direct mapping is calculated during KASLR setup. If KALSR is
> > disable it happens during MKTME initialization.
>
> "disabled"
>
> > diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
> > index 4408cd9a3bef..3d8ef8cb97e1 100644
> > --- a/arch/x86/mm/kaslr.c
> > +++ b/arch/x86/mm/kaslr.c
> > @@ -69,6 +69,15 @@ static inline bool kaslr_memory_enabled(void)
> > return kaslr_enabled() && !IS_ENABLED(CONFIG_KASAN);
> > }
> >
> > +#ifndef CONFIG_X86_INTEL_MKTME
> > +static void __init setup_direct_mapping_size(void)
> > +{
> > + direct_mapping_size = max_pfn << PAGE_SHIFT;
> > + direct_mapping_size = round_up(direct_mapping_size, 1UL << TB_SHIFT);
> > + direct_mapping_size += (1UL << TB_SHIFT) * CONFIG_MEMORY_PHYSICAL_PADDING;
> > +}
> > +#endif
>
> Comments, please.
Okay.
> > /* Initialize base and padding for each memory region randomized with KASLR */
> > void __init kernel_randomize_memory(void)
> > {
> > @@ -93,7 +102,11 @@ void __init kernel_randomize_memory(void)
> > if (!kaslr_memory_enabled())
> > return;
> >
> > - kaslr_regions[0].size_tb = 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT);
> > + /*
> > + * Upper limit for direct mapping size is 1/4 of whole virtual
> > + * address space
> > + */
> > + kaslr_regions[0].size_tb = 1 << (__VIRTUAL_MASK_SHIFT - 1 - TB_SHIFT);
>
> Is this a cleanup that can be separate?
Right. I'll split it up.
> > kaslr_regions[1].size_tb = VMALLOC_SIZE_TB;
> >
> > /*
> > @@ -101,8 +114,10 @@ void __init kernel_randomize_memory(void)
> > * add padding if needed (especially for memory hotplug support).
> > */
> > BUG_ON(kaslr_regions[0].base != &page_offset_base);
> > - memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
> > - CONFIG_MEMORY_PHYSICAL_PADDING;
> > +
> > + setup_direct_mapping_size();
> > +
> > + memory_tb = direct_mapping_size * mktme_nr_keyids + 1;
>
> What's the +1 for? Is "mktme_nr_keyids" 0 for "MKTME unsupported"?
> That needs to be called out, I think.
I'll add a comment.
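Something along these lines, perhaps -- my guess at the intent being one
direct mapping per KeyID plus one more for KeyID-0:

	/*
	 * Reserve one copy of the direct mapping per KeyID, plus one for
	 * KeyID-0. mktme_nr_keyids is 0 when MKTME is unsupported, which
	 * leaves just the normal single direct mapping.
	 */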
> > /* Adapt phyiscal memory region size based on available memory */
> > if (memory_tb < kaslr_regions[0].size_tb)
> > diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
> > index 43a44f0f2a2d..3e5322bf035e 100644
> > --- a/arch/x86/mm/mktme.c
> > +++ b/arch/x86/mm/mktme.c
> > @@ -89,3 +89,51 @@ static bool need_page_mktme(void)
> > struct page_ext_operations page_mktme_ops = {
> > .need = need_page_mktme,
> > };
> > +
> > +void __init setup_direct_mapping_size(void)
> > +{
> > + unsigned long available_va;
> > +
> > + /* 1/4 of virtual address space is didicated for direct mapping */
> > + available_va = 1UL << (__VIRTUAL_MASK_SHIFT - 1);
> > +
> > + /* How much memory the systrem has? */
> > + direct_mapping_size = max_pfn << PAGE_SHIFT;
> > + direct_mapping_size = round_up(direct_mapping_size, 1UL << 40);
> > +
> > + if (mktme_status != MKTME_ENUMERATED)
> > + goto out;
> > +
> > + /*
> > + * Not enough virtual address space to address all physical memory with
> > + * MKTME enabled. Even without padding.
> > + *
> > + * Disable MKTME instead.
> > + */
> > + if (direct_mapping_size > available_va / mktme_nr_keyids + 1) {
> > + pr_err("x86/mktme: Disabled. Not enough virtual address space\n");
> > + pr_err("x86/mktme: Consider switching to 5-level paging\n");
> > + mktme_disable();
> > + goto out;
> > + }
> > +
> > + /*
> > + * Virtual address space is divided between per-KeyID direct mappings.
> > + */
> > + available_va /= mktme_nr_keyids + 1;
> > +out:
> > + /* Add padding, if there's enough virtual address space */
> > + direct_mapping_size += (1UL << 40) * CONFIG_MEMORY_PHYSICAL_PADDING;
> > + if (direct_mapping_size > available_va)
> > + direct_mapping_size = available_va;
> > +}
>
> Do you really need two copies of this function? Shouldn't it see
> mktme_status!=MKTME_ENUMERATED and just jump out? How is the code
> before that "goto out" different from the CONFIG_MKTME=n case?
mktme.c is not compiled for CONFIG_MKTME=n.
--
Kirill A. Shutemov
On Mon, Jun 18, 2018 at 12:54:29PM +0000, Dave Hansen wrote:
> On 06/18/2018 03:07 AM, Kirill A. Shutemov wrote:
> > On Wed, Jun 13, 2018 at 06:20:10PM +0000, Dave Hansen wrote:
> >>> +int page_keyid(const struct page *page)
> >>> +{
> >>> + if (mktme_status != MKTME_ENABLED)
> >>> + return 0;
> >>> +
> >>> + return lookup_page_ext(page)->keyid;
> >>> +}
> >>> +EXPORT_SYMBOL(page_keyid);
> >> Please start using a proper X86_FEATURE_* flag for this. It will give
> >> you all the fancy static patching that you are missing by doing it this way.
> > There's no MKTME CPU feature.
>
> Right. We have tons of synthetic features that have no basis in the
> hardware CPUID feature.
>
> > Well, I guess we can invent syntactic one or just use static key directly.
>
> Did you mean synthetic?
Right.
> > Let's see how it behaves performance-wise before optimizing this.
>
> It's not an optimization, it's how we do things in arch/x86, and it has
> a *ton* of optimization infrastructure behind it that you get for free
> if you use it.
>
> I'm just trying to save Thomas's tired fingers from having to say the
> same thing in a week or two when he looks at this.
Okay, I'll look into this.
--
Kirill A. Shutemov
On 06/18/2018 06:12 AM, Kirill A. Shutemov wrote:
> On Wed, Jun 13, 2018 at 06:37:07PM +0000, Dave Hansen wrote:
>> On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
>>> Kernel need to have a way to access encrypted memory. We are going to
>> "The kernel needs"...
>>
>>> use per-KeyID direct mapping to facilitate the access with minimal
>>> overhead.
>>
>> What are the security implications of this approach?
>
> I'll add this to the message:
>
> Per-KeyID mappings require a lot more virtual address space. On 4-level
> machine with 64 KeyIDs we max out 46-bit virtual address space dedicated
> for direct mapping with 1TiB of RAM. Given that we round up any
> calculation on direct mapping size to 1TiB, we effectively claim all
> 46-bit address space for direct mapping on such machine regardless of
> RAM size.
...
I was thinking more in terms of the exposure of keeping the plaintext
mapped all the time.
Imagine Meltdown if the decrypted page is not mapped into the kernel:
this feature could actually have protected user data.
But, with this scheme, it exposes the data... all the data... with all
possible keys... all the time. That's one heck of an attack surface.
Can we do better?
>>> struct page_ext_operations page_mktme_ops = {
>>> .need = need_page_mktme,
>>> };
>>> +
>>> +void __init setup_direct_mapping_size(void)
>>> +{
...
>>> +}
>>
>> Do you really need two copies of this function? Shouldn't it see
>> mktme_status!=MKTME_ENUMERATED and just jump out? How is the code
>> before that "goto out" different from the CONFIG_MKTME=n case?
>
> mktme.c is not compiled for CONFIG_MKTME=n.
I'd rather have one copy in shared code which is mostly optimized away
when CONFIG_MKTME=n than two copies.
On Wed, Jun 13, 2018 at 06:41:21PM +0000, Dave Hansen wrote:
> On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> > arch/x86/include/asm/mktme.h | 6 +
> > arch/x86/mm/init_64.c | 6 +
> > arch/x86/mm/mktme.c | 444 +++++++++++++++++++++++++++++++++++
> > 3 files changed, 456 insertions(+)
>
> Can we not do any better than 400 lines of new open-coded pagetable
> hacking?
It's not pretty, but I don't see many options.
I first tried to modify the routines that initialize/modify/remove parts of
the direct mapping to keep all per-KeyID direct mappings in sync from the
start, but it didn't really fly. We need to initialize the direct mapping
very early, when we don't have a way to allocate pages in the usual way. We
have a very limited pool of pre-allocated pages to allocate page tables
from, and it's not able to satisfy the demand for multiple direct mappings.
So I had to go with syncing them later on, once we have a working page
allocator.
Regarding the open-coding, we need to walk two subtrees in lock step.
I don't see how to get mm/pagewalk.c to work for such a use case. (And I
don't really like the callback-based pagewalker.)
--
Kirill A. Shutemov
On Wed, Jun 13, 2018 at 06:43:08PM +0000, Dave Hansen wrote:
> > diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
> > index efc0d4bb3b35..d6edcabacfc7 100644
> > --- a/arch/x86/include/asm/mktme.h
> > +++ b/arch/x86/include/asm/mktme.h
> > @@ -43,6 +43,9 @@ void mktme_disable(void);
> > void setup_direct_mapping_size(void);
> > int sync_direct_mapping(void);
> >
> > +#define page_to_virt(x) \
> > + (__va(PFN_PHYS(page_to_pfn(x))) + page_keyid(x) * direct_mapping_size)
>
> This looks like a super important memory management function being
> defined in some obscure Intel-specific feature header. How does that work?
No magic. It overrides the define in <linux/mm.h>.
> > #else
> > #define mktme_keyid_mask ((phys_addr_t)0)
> > #define mktme_nr_keyids 0
> > diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> > index 53c32af895ab..ffad496aadad 100644
> > --- a/arch/x86/include/asm/page_64.h
> > +++ b/arch/x86/include/asm/page_64.h
> > @@ -23,7 +23,7 @@ static inline unsigned long __phys_addr_nodebug(unsigned long x)
> > /* use the carry flag to determine if x was < __START_KERNEL_map */
> > x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));
> >
> > - return x;
> > + return x % direct_mapping_size;
> > }
>
> What are the performance implications of this patch?
Let me collect the numbers.
--
Kirill A. Shutemov
On Wed, Jun 13, 2018 at 06:46:34PM +0000, Dave Hansen wrote:
> On 06/12/2018 07:39 AM, Kirill A. Shutemov wrote:
> > Add new config option to enabled/disable Multi-Key Total Memory
> > Encryption support.
> >
> > MKTME uses MEMORY_PHYSICAL_PADDING to reserve enough space in per-KeyID
> > direct mappings for memory hotplug.
>
> Isn't it really *the* direct mapping primarily? We make all of them
> larger, but the direct mapping is impacted too. This makes it sound
> like it applies only to the MKTME mappings.
We only care about the padding in two cases: MKTME and KASLR.
If neither of them is enabled, the padding has no meaning: PAGE_OFFSET is at
a fixed address and the size is also fixed.
--
Kirill A. Shutemov
On 06/18/2018 06:34 AM, Kirill A. Shutemov wrote:
> On Wed, Jun 13, 2018 at 06:43:08PM +0000, Dave Hansen wrote:
>>> diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
>>> index efc0d4bb3b35..d6edcabacfc7 100644
>>> --- a/arch/x86/include/asm/mktme.h
>>> +++ b/arch/x86/include/asm/mktme.h
>>> @@ -43,6 +43,9 @@ void mktme_disable(void);
>>> void setup_direct_mapping_size(void);
>>> int sync_direct_mapping(void);
>>>
>>> +#define page_to_virt(x) \
>>> + (__va(PFN_PHYS(page_to_pfn(x))) + page_keyid(x) * direct_mapping_size)
>>
>> This looks like a super important memory management function being
>> defined in some obscure Intel-specific feature header. How does that work?
>
> No magic. It overwrites define in <linux/mm.h>.
It frankly looks like magic to me. How can this possibly work without
ensuring that asm/mktme.h is #included everywhere on every file compiled
for the entire architecture?
If we look at every definition of page_to_virt() on every architecture
in the kernel, we see it uniquely defined in headers that look rather
generic. I don't see any precedent for feature-specific definitions.
> arch/arm64/include/asm/memory.h:#define page_to_virt(page) ((void *)((__page_to_voff(page)) | PAGE_OFFSET))
> arch/hexagon/include/asm/page.h:#define page_to_virt(page) __va(page_to_phys(page))
> arch/m68k/include/asm/page_mm.h:#define page_to_virt(page) ({ \
> arch/m68k/include/asm/page_no.h:#define page_to_virt(page) __va(((((page) - mem_map) << PAGE_SHIFT) + PAGE_OFFSET))
> arch/microblaze/include/asm/page.h:# define page_to_virt(page) __va(page_to_pfn(page) << PAGE_SHIFT)
> arch/microblaze/include/asm/page.h:# define page_to_virt(page) (pfn_to_virt(page_to_pfn(page)))
> arch/nios2/include/asm/page.h:#define page_to_virt(page) \
> arch/riscv/include/asm/page.h:#define page_to_virt(page) (pfn_to_virt(page_to_pfn(page)))
> arch/s390/include/asm/page.h:#define page_to_virt(page) pfn_to_virt(page_to_pfn(page))
> arch/xtensa/include/asm/page.h:#define page_to_virt(page) __va(page_to_pfn(page) << PAGE_SHIFT)
*If* you do this, I think it 100% *HAS* to be done in a central header,
like x86's page.h. We need a single x86 macro for this, not something
which can and will change based on #include ordering and Kconfig.
On Mon, Jun 18, 2018 at 01:59:18PM +0000, Dave Hansen wrote:
> On 06/18/2018 06:34 AM, Kirill A. Shutemov wrote:
> > On Wed, Jun 13, 2018 at 06:43:08PM +0000, Dave Hansen wrote:
> >>> diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
> >>> index efc0d4bb3b35..d6edcabacfc7 100644
> >>> --- a/arch/x86/include/asm/mktme.h
> >>> +++ b/arch/x86/include/asm/mktme.h
> >>> @@ -43,6 +43,9 @@ void mktme_disable(void);
> >>> void setup_direct_mapping_size(void);
> >>> int sync_direct_mapping(void);
> >>>
> >>> +#define page_to_virt(x) \
> >>> + (__va(PFN_PHYS(page_to_pfn(x))) + page_keyid(x) * direct_mapping_size)
> >>
> >> This looks like a super important memory management function being
> >> defined in some obscure Intel-specific feature header. How does that work?
> >
> > No magic. It overwrites define in <linux/mm.h>.
>
> It frankly looks like magic to me. How can this possibly work without
> ensuring that asm/mktme.h is #included everywhere on every file compiled
> for the entire architecture?
asm/mktme.h is included from asm/page.h. It is functionally identical to
what other architectures do.
> If we look at every definition of page_to_virt() on every architecture
> in the kernel, we see it uniquely defined in headers that look rather
> generic. I don't see any precedent for feature-specific definitions.
I do.
m68k and microblaze have different definitions of the macro depending
on CONFIG_MMU.
On arm64 it depends on CONFIG_SPARSEMEM_VMEMMAP.
> > arch/arm64/include/asm/memory.h:#define page_to_virt(page) ((void *)((__page_to_voff(page)) | PAGE_OFFSET))
> > arch/hexagon/include/asm/page.h:#define page_to_virt(page) __va(page_to_phys(page))
> > arch/m68k/include/asm/page_mm.h:#define page_to_virt(page) ({ \
> > arch/m68k/include/asm/page_no.h:#define page_to_virt(page) __va(((((page) - mem_map) << PAGE_SHIFT) + PAGE_OFFSET))
> > arch/microblaze/include/asm/page.h:# define page_to_virt(page) __va(page_to_pfn(page) << PAGE_SHIFT)
> > arch/microblaze/include/asm/page.h:# define page_to_virt(page) (pfn_to_virt(page_to_pfn(page)))
> > arch/nios2/include/asm/page.h:#define page_to_virt(page) \
> > arch/riscv/include/asm/page.h:#define page_to_virt(page) (pfn_to_virt(page_to_pfn(page)))
> > arch/s390/include/asm/page.h:#define page_to_virt(page) pfn_to_virt(page_to_pfn(page))
> > arch/xtensa/include/asm/page.h:#define page_to_virt(page) __va(page_to_pfn(page) << PAGE_SHIFT)
>
> *If* you do this, I think it 100% *HAS* to be done in a central header,
> like x86's page.h. We need a single x86 macro for this, not something
> which can and will change based on #include ordering and Kconfig.
I don't agree.
asm/mktme.h is included from a single header -- asm/page.h -- and has a
clear path to linux/mm.h where the default page_to_virt() is defined.
I don't see a reason to move it out of the feature-specific header. The
default page_to_virt() is perfectly fine without MKTME, and it will be
obvious on grep.
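For what it's worth, a worked example of how the override resolves
(following the macro quoted above):

	/*
	 * For a page with page_keyid(page) == 3:
	 *
	 *   page_to_virt(page) == __va(PFN_PHYS(page_to_pfn(page)))
	 *                         + 3 * direct_mapping_size
	 *
	 * i.e. the address of the page inside the KeyID-3 copy of the
	 * direct mapping rather than the KeyID-0 one.
	 */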
--
Kirill A. Shutemov
> index 17383f9677fa..032b9a1ba8e1 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -731,6 +731,8 @@ kernel_physical_mapping_init(unsigned long paddr_start,
> pgd_changed = true;
> }
>
> + sync_direct_mapping();
> +
> if (pgd_changed)
> sync_global_pgds(vaddr_start, vaddr_end - 1);
>
> @@ -1142,10 +1144,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
> static void __meminit
> kernel_physical_mapping_remove(unsigned long start, unsigned long end)
> {
> + int ret;
> start = (unsigned long)__va(start);
> end = (unsigned long)__va(end);
>
> remove_pagetable(start, end, true, NULL);
> + ret = sync_direct_mapping();
> + WARN_ON(ret);
> }
I understand why you implemented it this way, I really do. It's
certainly the quickest way to hack something together and make a
standalone piece of code. But, I don't think it's maintainable.
For instance, this call to sync_direct_mapping() could be entirely
replaced by a call to:
for_each_keyid(k)...
remove_pagetable(start + offset_per_keyid * k,
end + offset_per_keyid * k,
true, NULL);
No?
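Spelled out a bit more concretely (a hypothetical loop that reuses the
remove_pagetable() call from the hunk above and assumes the per-KeyID
offset equals direct_mapping_size):

	int k;

	for (k = 1; k <= mktme_nr_keyids; k++)
		remove_pagetable(start + k * direct_mapping_size,
				 end + k * direct_mapping_size,
				 true, NULL);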
> int __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
> @@ -1290,6 +1295,7 @@ void mark_rodata_ro(void)
> (unsigned long) __va(__pa_symbol(rodata_end)),
> (unsigned long) __va(__pa_symbol(_sdata)));
>
> + sync_direct_mapping();
> debug_checkwx();
Huh, checking the return code in some cases and not others. Curious.
Why is it that way?
On Mon, Jun 18, 2018 at 12:54:29PM +0000, Dave Hansen wrote:
> On 06/18/2018 03:07 AM, Kirill A. Shutemov wrote:
> > On Wed, Jun 13, 2018 at 06:20:10PM +0000, Dave Hansen wrote:
> >>> +int page_keyid(const struct page *page)
> >>> +{
> >>> + if (mktme_status != MKTME_ENABLED)
> >>> + return 0;
> >>> +
> >>> + return lookup_page_ext(page)->keyid;
> >>> +}
> >>> +EXPORT_SYMBOL(page_keyid);
> >> Please start using a proper X86_FEATURE_* flag for this. It will give
> >> you all the fancy static patching that you are missing by doing it this way.
> > There's no MKTME CPU feature.
>
> Right. We have tons of synthetic features that have no basis in the
> hardware CPUID feature.
I've tried that approach, but it doesn't fit here.
We enable MKTME relatively late in the boot process -- after page_ext, as
page_keyid() depends on it. Enabling it earlier would make page_keyid()
return garbage.
By the time page_ext is initialized, CPU features have already been handled
and setup_force_cpu_cap() doesn't do anything.
I've implemented the enabling with a static key instead.
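Roughly along these lines -- a sketch with a hypothetical key name,
lookup_page_ext() as in the posted code:

DEFINE_STATIC_KEY_FALSE(mktme_enabled_key);

int page_keyid(const struct page *page)
{
	if (!static_branch_unlikely(&mktme_enabled_key))
		return 0;

	return lookup_page_ext(page)->keyid;
}
EXPORT_SYMBOL(page_keyid);

The key gets flipped with static_branch_enable(&mktme_enabled_key) once
page_ext initialization has completed.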
--
Kirill A. Shutemov
On Mon, Jun 18, 2018 at 04:28:27PM +0000, Dave Hansen wrote:
> > index 17383f9677fa..032b9a1ba8e1 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -731,6 +731,8 @@ kernel_physical_mapping_init(unsigned long paddr_start,
> > pgd_changed = true;
> > }
> >
> > + sync_direct_mapping();
> > +
> > if (pgd_changed)
> > sync_global_pgds(vaddr_start, vaddr_end - 1);
> >
> > @@ -1142,10 +1144,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
> > static void __meminit
> > kernel_physical_mapping_remove(unsigned long start, unsigned long end)
> > {
> > + int ret;
> > start = (unsigned long)__va(start);
> > end = (unsigned long)__va(end);
> >
> > remove_pagetable(start, end, true, NULL);
> > + ret = sync_direct_mapping();
> > + WARN_ON(ret);
> > }
>
> I understand why you implemented it this way, I really do. It's
> certainly the quickest way to hack something together and make a
> standalone piece of code. But, I don't think it's maintainable.
>
> For instance, this call to sync_direct_mapping() could be entirely
> replaced by a call to:
>
> for_each_keyid(k)...
> remove_pagetable(start + offset_per_keyid * k,
> end + offset_per_keyid * k,
> true, NULL);
>
> No?
Yes. But what's the point if we need to have the sync routine anyway for
the add path?
> > int __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
> > @@ -1290,6 +1295,7 @@ void mark_rodata_ro(void)
> > (unsigned long) __va(__pa_symbol(rodata_end)),
> > (unsigned long) __va(__pa_symbol(_sdata)));
> >
> > + sync_direct_mapping();
> > debug_checkwx();
>
> Huh, checking the return code in some cases and not others. Curious.
> Why is it that way?
There's no sensible way to handle failure in any of these paths. But in the
remove path we don't expect failure -- no allocation is required.
It can only happen if we missed a sync_direct_mapping() somewhere else.
--
Kirill A. Shutemov
On 06/25/2018 02:29 AM, Kirill A. Shutemov wrote:
> On Mon, Jun 18, 2018 at 04:28:27PM +0000, Dave Hansen wrote:
>>>
>>> remove_pagetable(start, end, true, NULL);
>>> + ret = sync_direct_mapping();
>>> + WARN_ON(ret);
>>> }
>>
>> I understand why you implemented it this way, I really do. It's
>> certainly the quickest way to hack something together and make a
>> standalone piece of code. But, I don't think it's maintainable.
>>
>> For instance, this call to sync_direct_mapping() could be entirely
>> replaced by a call to:
>>
>> for_each_keyid(k)...
>> remove_pagetable(start + offset_per_keyid * k,
>> end + offset_per_keyid * k,
>> true, NULL);
>>
>> No?
>
> Yes. But what's the point if we need to have the sync routine anyway for
> the add path?
Because you are working to remove the sync routine and make an effort to
share more code with the regular direct map manipulation. Right?
My point is that this patch did not even make an _effort_ to reuse code
where it would have been quite trivial to do so. I think such an effort
needs to be put forth before we add 400 more lines of page table
manipulation.
>>> int __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
>>> @@ -1290,6 +1295,7 @@ void mark_rodata_ro(void)
>>> (unsigned long) __va(__pa_symbol(rodata_end)),
>>> (unsigned long) __va(__pa_symbol(_sdata)));
>>>
>>> + sync_direct_mapping();
>>> debug_checkwx();
>>
>> Huh, checking the return code in some cases and not others. Curious.
>> Why is it that way?
>
> There's no sensible way to handle failure in any of these path. But in
> remove path we don't expect the failure -- no allocation required.
> It can only happen if we missed sync_direct_mapping() somewhere else.
So, should we just silently drop the error? Or, would it be sensible to
make this a WARN_ON_ONCE()?
On Mon, Jun 25, 2018 at 04:36:43PM +0000, Dave Hansen wrote:
> On 06/25/2018 02:29 AM, Kirill A. Shutemov wrote:
> > On Mon, Jun 18, 2018 at 04:28:27PM +0000, Dave Hansen wrote:
> >>>
> >>> remove_pagetable(start, end, true, NULL);
> >>> + ret = sync_direct_mapping();
> >>> + WARN_ON(ret);
> >>> }
> >>
> >> I understand why you implemented it this way, I really do. It's
> >> certainly the quickest way to hack something together and make a
> >> standalone piece of code. But, I don't think it's maintainable.
> >>
> >> For instance, this call to sync_direct_mapping() could be entirely
> >> replaced by a call to:
> >>
> >> for_each_keyid(k)...
> >> remove_pagetable(start + offset_per_keyid * k,
> >> end + offset_per_keyid * k,
> >> true, NULL);
> >>
> >> No?
> >
> > Yes. But what's the point if we need to have the sync routine anyway for
> > the add path?
>
> Because you are working to remove the sync routine and make an effort to
> share more code with the regular direct map manipulation. Right?
We need the sync operation for the reason I've described before: we cannot
keep the mappings in sync from the very start due to the limited pool of
memory to allocate page tables from.
If the sync operation covers removal too, why do we need to handle it in a
special way?
> My point is that this patch did not even make an _effort_ to reuse code
> where it would have been quite trivial to do so. I think such an effort
> needs to be put forth before we add 400 more lines of page table
> manipulation.
The fact that I didn't reuse code here doesn't mean I haven't tried.
I hope I've explained my reasoning clearly enough.
> >>> int __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
> >>> @@ -1290,6 +1295,7 @@ void mark_rodata_ro(void)
> >>> (unsigned long) __va(__pa_symbol(rodata_end)),
> >>> (unsigned long) __va(__pa_symbol(_sdata)));
> >>>
> >>> + sync_direct_mapping();
> >>> debug_checkwx();
> >>
> >> Huh, checking the return code in some cases and not others. Curious.
> >> Why is it that way?
> >
> > There's no sensible way to handle failure in any of these path. But in
> > remove path we don't expect the failure -- no allocation required.
> > It can only happen if we missed sync_direct_mapping() somewhere else.
>
> So, should we just silently drop the error? Or, would it be sensible to
> make this a WARN_ON_ONCE()?
Ignoring errors is in style for this code :P
I'll add WARN_ON_ONCE() there.
--
Kirill A. Shutemov
Hi!
> MKTME is built on top of TME. TME allows encryption of the entirety of
> system memory using a single key. MKTME allows to have multiple encryption
> domains, each having own key -- different memory pages can be encrypted
> with different keys.
>
> Key design points of Intel MKTME:
>
> - Initial HW implementation would support upto 63 keys (plus one
> default
"up to"
> TME key). But the number of keys may be as low as 3, depending to SKU
> and BIOS settings
>
> - To access encrypted memory you need to use mapping with proper KeyID
> int the page table entry. KeyID is encoded in upper bits of PFN in page
"in the"
> table entry.
>
> - CPU does not enforce coherency between mappings of the same physical
> page with different KeyIDs or encryption keys. We wound need to take
"would need"
> care about flushing cache on allocation of encrypted page and on
> returning it back to free pool.
>
> - For managing keys, there's MKTME_KEY_PROGRAM leaf of the new PCONFIG
> (platform configuration) instruction. It allows load and clear keys
> associated with a KeyID. You can also ask CPU to generate a key for
> you or disable memory encryption when a KeyID is used.
Should this go to Documentation somewhere?
And the next question is -- what is it good for? Does it prevent attacks
where DRAM is frozen with liquid nitrogen and moved to another system to
extract encryption keys? Does it prevent any attacks that don't involve
manipulating hardware?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html