2018-07-17 11:24:51

by Kirill A. Shutemov

Subject: [PATCHv5 00/19] MKTME enabling

Multikey Total Memory Encryption (MKTME)[1] is a technology that allows
transparent memory encryption in upcoming Intel platforms. See overview
below.

Here's an updated version of my patchset that brings support for MKTME.
Please review and consider applying.

The patchset provides in-kernel infrastructure for MKTME, but doesn't yet
have a userspace interface.

The first 8 patches are for core-mm. The rest are x86-specific.

The patchset is on top of the tip tree plus the page_ext cleanups I've posted
earlier[2]. The page_ext cleanups are in the -mm tree now.

Below are performance numbers for a kernel build. Enabling MKTME doesn't
affect performance of non-encrypted memory allocation.

Allocating encrypted memory requires a cache flush both when an encrypted
page is allocated and when it is freed. For a kernel build this results in
~20% performance degradation if all anonymous memory is allocated as
encrypted.

We would need to maintain a per-KeyID pool of free pages to minimize cache
flushing. I'm going to work on that optimization on top of this patchset.

The patchset can also be found here:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git mktme/wip

v5:
- Do not merge VMAs with different KeyIDs (for real).

- Do not use zero page in encrypted VMAs.

- Avoid division in __pa(). The division is replaced with masking, which
makes it near-free: I was not able to measure a difference compared to the
baseline. direct_mapping_size now has to be a power of 2 if MKTME is
enabled; only in that case can we use masking there.

v4:
- Address Dave's feedback.

- Add performance numbers.

v3:
- The kernel can now access encrypted pages via per-KeyID direct mappings.

- Rework page allocation for encrypted memory to minimize overhead on
non-encrypted pages. It comes with a cost for allocation of encrypted
pages: we have to flush the cache every time we allocate *and* free an
encrypted page. We will need to optimize this later.

v2:
- Store the KeyID of a page in page_ext->flags rather than in anon_vma.
The anon_vma approach turned out to be problematic. The main problem is
that the anon_vma of a page is no longer stable after the last mapcount
has gone. We would like to preserve the last used KeyID even for freed
pages, as it allows us to avoid unnecessary cache flushing on allocation
of an encrypted page. page_ext serves this well enough.

- KeyID is now propagated through the page allocator. No need for
GFP_ENCRYPT anymore.

- Patch "Decouple dynamic __PHYSICAL_MASK from AMD SME" has been fix to
work with AMD SEV (need to be confirmed by AMD folks).

------------------------------------------------------------------------------

MKTME is built on top of TME. TME allows encryption of the entirety of
system memory using a single key. MKTME allows having multiple encryption
domains, each with its own key -- different memory pages can be encrypted
with different keys.

Key design points of Intel MKTME:

- The initial HW implementation supports up to 63 keys (plus one default
TME key). But the number of keys may be as low as 3, depending on SKU
and BIOS settings.

- To access encrypted memory you need to use a mapping with the proper
KeyID in the page table entry. The KeyID is encoded in the upper bits of
the PFN in the page table entry (see the sketch after this list).

- The CPU does not enforce coherency between mappings of the same physical
page with different KeyIDs or encryption keys. We need to take care of
flushing the cache on allocation of an encrypted page and on returning
it back to the free pool.

- For managing keys, there's the MKTME_KEY_PROGRAM leaf of the new PCONFIG
(platform configuration) instruction. It allows loading and clearing keys
associated with a KeyID. You can also ask the CPU to generate a key for
you or to disable memory encryption when a KeyID is used.
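
To make the first two points above concrete, here is a minimal, hypothetical
sketch (not part of the patchset) of how a KeyID can be folded into the PFN
bits of a page table entry and how the per-KeyID direct mapping address of a
page is derived. It assumes the mktme_keyid_shift, mktme_keyid_mask and
direct_mapping_size variables that the series introduces:

	/* Hypothetical: encode a KeyID into the PFN bits of a PTE value. */
	static pteval_t mktme_set_keyid(pteval_t val, int keyid)
	{
		val &= ~mktme_keyid_mask;			/* drop old KeyID */
		val |= (pteval_t)keyid << mktme_keyid_shift;	/* encode new one */
		return val;
	}

	/* Hypothetical: address of a page in the KeyID-N direct mapping. */
	static void *mktme_page_address(struct page *page, int keyid)
	{
		return __va(PFN_PHYS(page_to_pfn(page))) +
			keyid * direct_mapping_size;
	}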

Performance numbers for a kernel build:

Base (tip tree):

Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):

5664711.936917 task-clock (msec) # 34.815 CPUs utilized ( +- 0.02% )
1,033,886 context-switches # 0.183 K/sec ( +- 0.37% )
189,308 cpu-migrations # 0.033 K/sec ( +- 0.39% )
104,951,554 page-faults # 0.019 M/sec ( +- 0.01% )
16,907,670,543,945 cycles # 2.985 GHz ( +- 0.01% )
12,662,345,427,578 stalled-cycles-frontend # 74.89% frontend cycles idle ( +- 0.02% )
9,936,469,878,830 instructions # 0.59 insn per cycle
# 1.27 stalled cycles per insn ( +- 0.00% )
2,179,100,082,611 branches # 384.680 M/sec ( +- 0.00% )
91,235,200,652 branch-misses # 4.19% of all branches ( +- 0.01% )

162.706797586 seconds time elapsed ( +- 0.04% )

CONFIG_X86_INTEL_MKTME=y, no encrypted memory:

Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):

5668508.245004 task-clock (msec) # 34.872 CPUs utilized ( +- 0.02% )
1,032,034 context-switches # 0.182 K/sec ( +- 0.90% )
188,098 cpu-migrations # 0.033 K/sec ( +- 1.15% )
104,964,084 page-faults # 0.019 M/sec ( +- 0.01% )
16,919,270,913,026 cycles # 2.985 GHz ( +- 0.02% )
12,672,067,815,805 stalled-cycles-frontend # 74.90% frontend cycles idle ( +- 0.02% )
9,942,560,135,477 instructions # 0.59 insn per cycle
# 1.27 stalled cycles per insn ( +- 0.00% )
2,180,800,745,687 branches # 384.722 M/sec ( +- 0.00% )
91,167,857,700 branch-misses # 4.18% of all branches ( +- 0.02% )

162.552503629 seconds time elapsed ( +- 0.10% )

CONFIG_X86_INTEL_MKTME=y, all anonymous memory encrypted with KeyID-1, pay
cache flush overhead on allocation and free:

Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):

7041851.999259 task-clock (msec) # 35.915 CPUs utilized ( +- 0.01% )
1,118,938 context-switches # 0.159 K/sec ( +- 0.49% )
197,039 cpu-migrations # 0.028 K/sec ( +- 0.80% )
104,970,021 page-faults # 0.015 M/sec ( +- 0.00% )
21,025,639,251,627 cycles # 2.986 GHz ( +- 0.01% )
16,729,451,765,492 stalled-cycles-frontend # 79.57% frontend cycles idle ( +- 0.02% )
10,010,727,735,588 instructions # 0.48 insn per cycle
# 1.67 stalled cycles per insn ( +- 0.00% )
2,197,110,181,421 branches # 312.007 M/sec ( +- 0.00% )
91,119,463,513 branch-misses # 4.15% of all branches ( +- 0.01% )

196.072361087 seconds time elapsed ( +- 0.14% )

[1] https://software.intel.com/sites/default/files/managed/a5/16/Multi-Key-Total-Memory-Encryption-Spec.pdf
[2] https://lkml.kernel.org/r/[email protected]

Kirill A. Shutemov (19):
mm: Do not merge VMAs with different encryption KeyIDs
mm: Do not use zero page in encrypted pages
mm/ksm: Do not merge pages with different KeyIDs
mm/page_alloc: Unify alloc_hugepage_vma()
mm/page_alloc: Handle allocation for encrypted memory
mm/khugepaged: Handle encrypted pages
x86/mm: Mask out KeyID bits from page table entry pfn
x86/mm: Introduce variables to store number, shift and mask of KeyIDs
x86/mm: Preserve KeyID on pte_modify() and pgprot_modify()
x86/mm: Implement page_keyid() using page_ext
x86/mm: Implement vma_keyid()
x86/mm: Implement prep_encrypted_page() and arch_free_page()
x86/mm: Rename CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING
x86/mm: Allow to disable MKTME after enumeration
x86/mm: Detect MKTME early
x86/mm: Calculate direct mapping size
x86/mm: Implement sync_direct_mapping()
x86/mm: Handle encrypted memory in page_to_virt() and __pa()
x86: Introduce CONFIG_X86_INTEL_MKTME

Documentation/x86/x86_64/mm.txt | 4 +
arch/alpha/include/asm/page.h | 2 +-
arch/s390/include/asm/pgtable.h | 2 +-
arch/x86/Kconfig | 21 +-
arch/x86/include/asm/mktme.h | 47 +++
arch/x86/include/asm/page.h | 1 +
arch/x86/include/asm/page_64.h | 4 +-
arch/x86/include/asm/pgtable_types.h | 15 +-
arch/x86/include/asm/setup.h | 6 +
arch/x86/kernel/cpu/intel.c | 32 +-
arch/x86/kernel/head64.c | 4 +
arch/x86/kernel/setup.c | 3 +
arch/x86/mm/Makefile | 2 +
arch/x86/mm/init_64.c | 68 ++++
arch/x86/mm/kaslr.c | 11 +-
arch/x86/mm/mktme.c | 546 +++++++++++++++++++++++++++
fs/userfaultfd.c | 7 +-
include/linux/gfp.h | 54 ++-
include/linux/migrate.h | 12 +-
include/linux/mm.h | 20 +-
include/linux/page_ext.h | 11 +-
mm/compaction.c | 1 +
mm/huge_memory.c | 3 +-
mm/khugepaged.c | 10 +
mm/ksm.c | 3 +
mm/madvise.c | 2 +-
mm/memory.c | 3 +-
mm/mempolicy.c | 31 +-
mm/migrate.c | 4 +-
mm/mlock.c | 2 +-
mm/mmap.c | 31 +-
mm/mprotect.c | 2 +-
mm/page_alloc.c | 47 +++
mm/page_ext.c | 3 +
34 files changed, 954 insertions(+), 60 deletions(-)
create mode 100644 arch/x86/include/asm/mktme.h
create mode 100644 arch/x86/mm/mktme.c

--
2.18.0



2018-07-17 11:22:45

by Kirill A. Shutemov

Subject: [PATCHv5 07/19] x86/mm: Mask out KeyID bits from page table entry pfn

MKTME claims several upper bits of the physical address in a page table
entry to encode the KeyID. It effectively shrinks the number of bits
available for the physical address. We should exclude the KeyID bits from
physical addresses.

For instance, if the CPU enumerates 52 physical address bits and the number
of bits claimed for KeyID is 6, bits 51:46 must not be treated as part of
the physical address.

This patch adjusts __PHYSICAL_MASK during MKTME enumeration.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/cpu/intel.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index eb75564f2d25..bf2caf9d52dd 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -571,6 +571,29 @@ static void detect_tme(struct cpuinfo_x86 *c)
mktme_status = MKTME_ENABLED;
}

+#ifdef CONFIG_X86_INTEL_MKTME
+ if (mktme_status == MKTME_ENABLED && nr_keyids) {
+ /*
+ * Mask out bits claimed from KeyID from physical address mask.
+ *
+ * For instance, if a CPU enumerates 52 physical address bits
+ * and number of bits claimed for KeyID is 6, bits 51:46 of
+ * physical address is unusable.
+ */
+ phys_addr_t keyid_mask;
+
+ keyid_mask = GENMASK_ULL(c->x86_phys_bits - 1, c->x86_phys_bits - keyid_bits);
+ physical_mask &= ~keyid_mask;
+ } else {
+ /*
+ * Reset __PHYSICAL_MASK.
+ * Maybe needed if there's an inconsistent configuration
+ * between CPUs.
+ */
+ physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
+ }
+#endif
+
/*
* KeyID bits effectively lower the number of physical address
* bits. Update cpuinfo_x86::x86_phys_bits accordingly.
--
2.18.0


2018-07-17 11:22:45

by Kirill A. Shutemov

Subject: [PATCHv5 16/19] x86/mm: Calculate direct mapping size

The kernel needs to have a way to access encrypted memory. We have two
options for how to approach it:

- Create temporary mappings every time the kernel needs access to encrypted
memory. That basically brings highmem and its overhead back.

- Create multiple direct mappings, one per KeyID. In this setup we
don't need to create temporary mappings on the fly -- encrypted
memory is permanently available in the kernel address space.

We take the second approach as it has lower overhead.

It's worth noting that with per-KeyID direct mappings a compromised kernel
would give access to decrypted data right away, without additional tricks
to get memory mapped with the correct KeyID.

Per-KeyID mappings require a lot more virtual address space. On a 4-level
machine with 64 KeyIDs we max out the 46-bit virtual address space dedicated
to the direct mapping with 1TiB of RAM. Given that we round up any
calculation of the direct mapping size to 1TiB, we effectively claim the
whole 46-bit address space for the direct mapping on such a machine,
regardless of RAM size.

Increased usage of virtual address space has implications for KASLR:
we have less space for randomization. With 64 TiB claimed for the direct
mapping with 4-level paging, we are left with 27 TiB of entropy to place
page_offset_base, vmalloc_base and vmemmap_base.

5-level paging provides a much wider virtual address space and KASLR
doesn't suffer significantly from per-KeyID direct mappings.

It's preferable to run MKTME with 5-level paging.

The direct mappings for each KeyID are placed next to each other in the
virtual address space. We need a way to find the boundaries of the direct
mapping for a particular KeyID.

The new variable direct_mapping_size specifies the size of the direct
mapping. With this value, it's trivial to find the direct mapping for
KeyID-N: PAGE_OFFSET + N * direct_mapping_size (see the sketch below).
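
For illustration, here is a minimal sketch (hypothetical helper, not part of
the patch) of how an address in the KeyID-N direct mapping can be derived
from the corresponding KeyID-0 address, assuming only the direct_mapping_size
variable added below:

	/* Hypothetical: KeyID-N alias of a KeyID-0 direct mapping address. */
	static unsigned long keyid_direct_map_addr(unsigned long kaddr, int keyid)
	{
		unsigned long offset = kaddr - PAGE_OFFSET;

		return PAGE_OFFSET + keyid * direct_mapping_size + offset;
	}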

The size of the direct mapping is calculated during KASLR setup. If KASLR
is disabled, it happens during MKTME initialization.

With MKTME, the size of the direct mapping has to be a power of 2. This
makes the implementation of __pa() efficient.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/x86/x86_64/mm.txt | 4 +++
arch/x86/include/asm/page_64.h | 2 ++
arch/x86/include/asm/setup.h | 6 ++++
arch/x86/kernel/head64.c | 4 +++
arch/x86/kernel/setup.c | 3 ++
arch/x86/mm/init_64.c | 58 +++++++++++++++++++++++++++++++++
arch/x86/mm/kaslr.c | 11 +++++--
7 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 5432a96d31ff..c5b92904090f 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -61,6 +61,10 @@ The direct mapping covers all memory in the system up to the highest
memory address (this means in some cases it can also include PCI memory
holes).

+With MKTME, we have multiple direct mappings, one per KeyID. They are put
+next to each other. PAGE_OFFSET + N * direct_mapping_size can be used to
+find the direct mapping for KeyID-N.
+
vmalloc space is lazily synchronized into the different PML4/PML5 pages of
the processes using the page fault handler, with init_top_pgt as
reference.
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 939b1cff4a7b..f57fc3cc2246 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -14,6 +14,8 @@ extern unsigned long phys_base;
extern unsigned long page_offset_base;
extern unsigned long vmalloc_base;
extern unsigned long vmemmap_base;
+extern unsigned long direct_mapping_size;
+extern unsigned long direct_mapping_mask;

static inline unsigned long __phys_addr_nodebug(unsigned long x)
{
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index ae13bc974416..bcac5080cca5 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -59,6 +59,12 @@ extern void x86_ce4100_early_setup(void);
static inline void x86_ce4100_early_setup(void) { }
#endif

+#ifdef CONFIG_MEMORY_PHYSICAL_PADDING
+void calculate_direct_mapping_size(void);
+#else
+static inline void calculate_direct_mapping_size(void) { }
+#endif
+
#ifndef _SETUP

#include <asm/espfix.h>
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 8047379e575a..79b92518ceee 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -59,6 +59,10 @@ EXPORT_SYMBOL(vmalloc_base);
unsigned long vmemmap_base __ro_after_init = __VMEMMAP_BASE_L4;
EXPORT_SYMBOL(vmemmap_base);
#endif
+unsigned long direct_mapping_size __ro_after_init = -1UL;
+EXPORT_SYMBOL(direct_mapping_size);
+unsigned long direct_mapping_mask __ro_after_init = -1UL;
+EXPORT_SYMBOL(direct_mapping_mask);

#define __head __section(.head.text)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2f86d883dd95..09ddbd142e3c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1053,6 +1053,9 @@ void __init setup_arch(char **cmdline_p)
*/
init_cache_modes();

+ /* direct_mapping_size has to be initialized before KASLR and MKTME */
+ calculate_direct_mapping_size();
+
/*
* Define random base addresses for memory sections after max_pfn is
* defined and before each memory section base is used.
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a688617c727e..d3a93ca69c67 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1399,6 +1399,64 @@ unsigned long memory_block_size_bytes(void)
return memory_block_size_probed;
}

+#ifdef CONFIG_MEMORY_PHYSICAL_PADDING
+void __init calculate_direct_mapping_size(void)
+{
+ unsigned long available_va;
+
+ /* 1/4 of the virtual address space is dedicated to the direct mapping */
+ available_va = 1UL << (__VIRTUAL_MASK_SHIFT - 1);
+
+ /* How much memory does the system have? */
+ direct_mapping_size = max_pfn << PAGE_SHIFT;
+ direct_mapping_size = round_up(direct_mapping_size, 1UL << 40);
+
+ if (!mktme_nr_keyids)
+ goto out;
+
+ /*
+ * For MKTME we need direct_mapping_size to be power-of-2.
+ * It makes __pa() implementation efficient.
+ */
+ direct_mapping_size = roundup_pow_of_two(direct_mapping_size);
+
+ /*
+ * Not enough virtual address space to address all physical memory with
+ * MKTME enabled. Even without padding.
+ *
+ * Disable MKTME instead.
+ */
+ if (direct_mapping_size > available_va / (mktme_nr_keyids + 1)) {
+ pr_err("x86/mktme: Disabled. Not enough virtual address space\n");
+ pr_err("x86/mktme: Consider switching to 5-level paging\n");
+ mktme_disable();
+ goto out;
+ }
+
+ /*
+ * Virtual address space is divided between per-KeyID direct mappings.
+ */
+ available_va /= mktme_nr_keyids + 1;
+out:
+ /* Add padding, if there's enough virtual address space */
+ direct_mapping_size += (1UL << 40) * CONFIG_MEMORY_PHYSICAL_PADDING;
+ if (mktme_nr_keyids)
+ direct_mapping_size = roundup_pow_of_two(direct_mapping_size);
+
+ if (direct_mapping_size > available_va)
+ direct_mapping_size = available_va;
+
+ /*
+ * For MKTME, make sure direct_mapping_size is still power-of-2
+ * after adding padding and calculate mask that is used in __pa().
+ */
+ if (mktme_nr_keyids) {
+ direct_mapping_size = rounddown_pow_of_two(direct_mapping_size);
+ direct_mapping_mask = direct_mapping_size - 1;
+ }
+}
+#endif
+
#ifdef CONFIG_SPARSEMEM_VMEMMAP
/*
* Initialise the sparsemem vmemmap using huge-pages at the PMD level.
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 4408cd9a3bef..bf044ff50ec0 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -101,10 +101,15 @@ void __init kernel_randomize_memory(void)
* add padding if needed (especially for memory hotplug support).
*/
BUG_ON(kaslr_regions[0].base != &page_offset_base);
- memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
- CONFIG_MEMORY_PHYSICAL_PADDING;

- /* Adapt phyiscal memory region size based on available memory */
+ /*
+ * Calculate space required to map all physical memory.
+ * In case of MKTME, we map physical memory multiple times, one for
+ * each KeyID. If MKTME is disabled mktme_nr_keyids is 0.
+ */
+ memory_tb = (direct_mapping_size * (mktme_nr_keyids + 1)) >> TB_SHIFT;
+
+ /* Adapt physical memory region size based on available memory */
if (memory_tb < kaslr_regions[0].size_tb)
kaslr_regions[0].size_tb = memory_tb;

--
2.18.0


2018-07-17 11:22:50

by Kirill A. Shutemov

Subject: [PATCHv5 06/19] mm/khugepaged: Handle encrypted pages

khugepaged allocates a page in advance, before we have found a VMA for the
collapse. At that point we don't yet know which KeyID to use for the
allocation.

The page is allocated with KeyID-0. Once we know that the VMA is
suitable for collapsing, we prepare the page for the KeyID we need, based
on vma_keyid().

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/khugepaged.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5ae34097aed1..d116f4ebb622 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1056,6 +1056,16 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
anon_vma_unlock_write(vma->anon_vma);

+ /*
+ * At this point new_page is allocated as non-encrypted.
+ * If VMA's KeyID is non-zero, we need to prepare it to be encrypted
+ * before copying data.
+ */
+ if (vma_keyid(vma)) {
+ prep_encrypted_page(new_page, HPAGE_PMD_ORDER,
+ vma_keyid(vma), false);
+ }
+
__collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl);
pte_unmap(pte);
__SetPageUptodate(new_page);
--
2.18.0


2018-07-17 11:23:02

by Kirill A. Shutemov

Subject: [PATCHv5 15/19] x86/mm: Detect MKTME early

We need to know the number of KeyIDs before KASLR is initialized. The
number of KeyIDs determines how much address space is needed for the
per-KeyID direct mappings.

KASLR initialization happens before full CPU initialization is complete.
Move the detect_tme() call to early_init_intel().

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/cpu/intel.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 75e3b2602b4a..39830806dd42 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -158,6 +158,8 @@ static bool bad_spectre_microcode(struct cpuinfo_x86 *c)
return false;
}

+static void detect_tme(struct cpuinfo_x86 *c);
+
static void early_init_intel(struct cpuinfo_x86 *c)
{
u64 misc_enable;
@@ -301,6 +303,9 @@ static void early_init_intel(struct cpuinfo_x86 *c)
}

check_mpx_erratum(c);
+
+ if (cpu_has(c, X86_FEATURE_TME))
+ detect_tme(c);
}

#ifdef CONFIG_X86_32
@@ -766,9 +771,6 @@ static void init_intel(struct cpuinfo_x86 *c)
if (cpu_has(c, X86_FEATURE_VMX))
detect_vmx_virtcap(c);

- if (cpu_has(c, X86_FEATURE_TME))
- detect_tme(c);
-
init_intel_energy_perf(c);

init_intel_misc_features(c);
--
2.18.0


2018-07-17 11:23:04

by Kirill A. Shutemov

Subject: [PATCHv5 09/19] x86/mm: Preserve KeyID on pte_modify() and pgprot_modify()

An encrypted VMA will have its KeyID stored in vma->vm_page_prot. This way
we don't need to do anything special to set up encrypted page table
entries and don't need to reserve space for the KeyID in the VMA.

This patch changes _PAGE_CHG_MASK to include the KeyID bits. Otherwise they
are going to be stripped from vm_page_prot on the first pgprot_modify()
(see the sketch below).

Define PTE_PFN_MASK_MAX similarly to PTE_PFN_MASK, but based on
__PHYSICAL_MASK_SHIFT. This way we include the whole range of bits
architecturally available for the PFN without referencing the physical_mask
and mktme_keyid_mask variables.
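
A minimal sketch of the intended effect (hypothetical helper, not part of the
patch): because the KeyID bits are now covered by _PAGE_CHG_MASK, a protection
change via pte_modify() leaves the KeyID encoded in the entry untouched:

	/*
	 * Hypothetical: build a PTE from a prot that carries a KeyID
	 * (e.g. vma->vm_page_prot) and change only the protection bits.
	 */
	static pte_t change_prot_keeps_keyid(unsigned long pfn, pgprot_t keyid_prot)
	{
		pte_t pte = pfn_pte(pfn, keyid_prot);

		/*
		 * pte_modify() preserves everything in _PAGE_CHG_MASK, which
		 * now includes the KeyID bits, so only the protection changes.
		 */
		return pte_modify(pte, PAGE_READONLY);
	}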

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 99fff853c944..3731f7e08757 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -120,8 +120,21 @@
* protection key is treated like _PAGE_RW, for
* instance, and is *not* included in this mask since
* pte_modify() does modify it.
+ *
+ * They include the physical address and the memory encryption keyID.
+ * The paddr and the keyID never occupy the same bits at the same time.
+ * But, a given bit might be used for the keyID on one system and used for
+ * the physical address on another. As an optimization, we manage them in
+ * one unit here since their combination always occupies the same hardware
+ * bits. PTE_PFN_MASK_MAX stores combined mask.
+ *
+ * Cast PAGE_MASK to a signed type so that it is sign-extended if
+ * virtual addresses are 32-bits but physical addresses are larger
+ * (ie, 32-bit PAE).
*/
-#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
+#define PTE_PFN_MASK_MAX \
+ (((signed long)PAGE_MASK) & ((1ULL << __PHYSICAL_MASK_SHIFT) - 1))
+#define _PAGE_CHG_MASK (PTE_PFN_MASK_MAX | _PAGE_PCD | _PAGE_PWT | \
_PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
_PAGE_SOFT_DIRTY)
#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
--
2.18.0


2018-07-17 11:23:16

by Kirill A. Shutemov

Subject: [PATCHv5 17/19] x86/mm: Implement sync_direct_mapping()

For MKTME we use per-KeyID direct mappings. This allows the kernel to have
access to encrypted memory.

sync_direct_mapping() syncs the per-KeyID direct mappings with the
canonical one -- KeyID-0.

The function tracks changes in the canonical mapping:
- creating or removing chunks of the translation tree;
- changes in mapping flags (i.e. protection bits);
- splitting a huge page mapping into a page table;
- replacing a page table with a huge page mapping.

The function needs to be called on every change to the direct mapping:
hotplug, hotremove, changes in permission bits, etc.

The function is a nop until MKTME is enabled.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 8 +
arch/x86/mm/init_64.c | 10 +
arch/x86/mm/mktme.c | 437 +++++++++++++++++++++++++++++++++++
3 files changed, 455 insertions(+)

diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index ebbee6a0c495..ba83fba4f9b3 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -27,10 +27,18 @@ void prep_encrypted_page(struct page *page, int order, int keyid, bool zero);
#define HAVE_ARCH_FREE_PAGE
void arch_free_page(struct page *page, int order);

+int sync_direct_mapping(void);
+
#else
#define mktme_keyid_mask ((phys_addr_t)0)
#define mktme_nr_keyids 0
#define mktme_keyid_shift 0
+
+static inline int sync_direct_mapping(void)
+{
+ return 0;
+}
+
#endif

#endif
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index d3a93ca69c67..04a5e08f01b3 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -698,6 +698,7 @@ kernel_physical_mapping_init(unsigned long paddr_start,
{
bool pgd_changed = false;
unsigned long vaddr, vaddr_start, vaddr_end, vaddr_next, paddr_last;
+ int ret;

paddr_last = paddr_end;
vaddr = (unsigned long)__va(paddr_start);
@@ -731,6 +732,9 @@ kernel_physical_mapping_init(unsigned long paddr_start,
pgd_changed = true;
}

+ ret = sync_direct_mapping();
+ WARN_ON(ret);
+
if (pgd_changed)
sync_global_pgds(vaddr_start, vaddr_end - 1);

@@ -1142,10 +1146,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
static void __meminit
kernel_physical_mapping_remove(unsigned long start, unsigned long end)
{
+ int ret;
start = (unsigned long)__va(start);
end = (unsigned long)__va(end);

remove_pagetable(start, end, true, NULL);
+ ret = sync_direct_mapping();
+ WARN_ON(ret);
}

int __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
@@ -1253,6 +1260,7 @@ void mark_rodata_ro(void)
unsigned long text_end = PFN_ALIGN(&__stop___ex_table);
unsigned long rodata_end = PFN_ALIGN(&__end_rodata);
unsigned long all_end;
+ int ret;

printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n",
(end - start) >> 10);
@@ -1290,6 +1298,8 @@ void mark_rodata_ro(void)
(unsigned long) __va(__pa_symbol(rodata_end)),
(unsigned long) __va(__pa_symbol(_sdata)));

+ ret = sync_direct_mapping();
+ WARN_ON(ret);
debug_checkwx();

/*
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index bb6210dbcf0e..660caf6a5ce1 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -1,6 +1,8 @@
#include <linux/mm.h>
#include <linux/highmem.h>
#include <asm/mktme.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>

phys_addr_t mktme_keyid_mask;
int mktme_nr_keyids;
@@ -42,6 +44,7 @@ static bool need_page_mktme(void)
static void init_page_mktme(void)
{
static_branch_enable(&mktme_enabled_key);
+ sync_direct_mapping();
}

struct page_ext_operations page_mktme_ops = {
@@ -107,3 +110,437 @@ void arch_free_page(struct page *page, int order)
page++;
}
}
+
+static int sync_direct_mapping_pte(unsigned long keyid,
+ pmd_t *dst_pmd, pmd_t *src_pmd,
+ unsigned long addr, unsigned long end)
+{
+ pte_t *src_pte, *dst_pte;
+ pte_t *new_pte = NULL;
+ bool remove_pte;
+
+ /*
+ * We want to unmap and free the page table if the source is empty and
+ * the range covers whole page table.
+ */
+ remove_pte = !src_pmd && PAGE_ALIGNED(addr) && PAGE_ALIGNED(end);
+
+ /*
+ * PMD page got split into page table.
+ * Clear PMD mapping. Page table will be established instead.
+ */
+ if (pmd_large(*dst_pmd)) {
+ spin_lock(&init_mm.page_table_lock);
+ pmd_clear(dst_pmd);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ /* Allocate a new page table if needed. */
+ if (pmd_none(*dst_pmd)) {
+ new_pte = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ if (!new_pte)
+ return -ENOMEM;
+ dst_pte = new_pte + pte_index(addr + keyid * direct_mapping_size);
+ } else {
+ dst_pte = pte_offset_map(dst_pmd, addr + keyid * direct_mapping_size);
+ }
+ src_pte = src_pmd ? pte_offset_map(src_pmd, addr) : NULL;
+
+ spin_lock(&init_mm.page_table_lock);
+
+ do {
+ pteval_t val;
+
+ if (!src_pte || pte_none(*src_pte)) {
+ set_pte(dst_pte, __pte(0));
+ goto next;
+ }
+
+ if (!pte_none(*dst_pte)) {
+ /*
+ * Sanity check: PFNs must match between source
+ * and destination even if the rest doesn't.
+ */
+ BUG_ON(pte_pfn(*dst_pte) != pte_pfn(*src_pte));
+ }
+
+ /* Copy entry, but set KeyID. */
+ val = pte_val(*src_pte) | keyid << mktme_keyid_shift;
+ set_pte(dst_pte, __pte(val));
+next:
+ addr += PAGE_SIZE;
+ dst_pte++;
+ if (src_pte)
+ src_pte++;
+ } while (addr != end);
+
+ if (new_pte)
+ pmd_populate_kernel(&init_mm, dst_pmd, new_pte);
+
+ if (remove_pte) {
+ __free_page(pmd_page(*dst_pmd));
+ pmd_clear(dst_pmd);
+ }
+
+ spin_unlock(&init_mm.page_table_lock);
+
+ return 0;
+}
+
+static int sync_direct_mapping_pmd(unsigned long keyid,
+ pud_t *dst_pud, pud_t *src_pud,
+ unsigned long addr, unsigned long end)
+{
+ pmd_t *src_pmd, *dst_pmd;
+ pmd_t *new_pmd = NULL;
+ bool remove_pmd = false;
+ unsigned long next;
+ int ret;
+
+ /*
+ * We want to unmap and free the page table if the source is empty and
+ * the range covers whole page table.
+ */
+ remove_pmd = !src_pud && IS_ALIGNED(addr, PUD_SIZE) && IS_ALIGNED(end, PUD_SIZE);
+
+ /*
+ * PUD page got split into page table.
+ * Clear PUD mapping. Page table will be established instead.
+ */
+ if (pud_large(*dst_pud)) {
+ spin_lock(&init_mm.page_table_lock);
+ pud_clear(dst_pud);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ /* Allocate a new page table if needed. */
+ if (pud_none(*dst_pud)) {
+ new_pmd = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ if (!new_pmd)
+ return -ENOMEM;
+ dst_pmd = new_pmd + pmd_index(addr + keyid * direct_mapping_size);
+ } else {
+ dst_pmd = pmd_offset(dst_pud, addr + keyid * direct_mapping_size);
+ }
+ src_pmd = src_pud ? pmd_offset(src_pud, addr) : NULL;
+
+ do {
+ pmd_t *__src_pmd = src_pmd;
+
+ next = pmd_addr_end(addr, end);
+ if (!__src_pmd || pmd_none(*__src_pmd)) {
+ if (pmd_none(*dst_pmd))
+ goto next;
+ if (pmd_large(*dst_pmd)) {
+ spin_lock(&init_mm.page_table_lock);
+ set_pmd(dst_pmd, __pmd(0));
+ spin_unlock(&init_mm.page_table_lock);
+ goto next;
+ }
+ __src_pmd = NULL;
+ }
+
+ if (__src_pmd && pmd_large(*__src_pmd)) {
+ pmdval_t val;
+
+ if (pmd_large(*dst_pmd)) {
+ /*
+ * Sanity check: PFNs must match between source
+ * and destination even if the rest doesn't.
+ */
+ BUG_ON(pmd_pfn(*dst_pmd) != pmd_pfn(*__src_pmd));
+ } else if (!pmd_none(*dst_pmd)) {
+ /*
+ * Page table is replaced with a PMD page.
+ * Free and unmap the page table.
+ */
+ __free_page(pmd_page(*dst_pmd));
+ spin_lock(&init_mm.page_table_lock);
+ pmd_clear(dst_pmd);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ /* Copy entry, but set KeyID. */
+ val = pmd_val(*__src_pmd) | keyid << mktme_keyid_shift;
+ spin_lock(&init_mm.page_table_lock);
+ set_pmd(dst_pmd, __pmd(val));
+ spin_unlock(&init_mm.page_table_lock);
+ goto next;
+ }
+
+ ret = sync_direct_mapping_pte(keyid, dst_pmd, __src_pmd,
+ addr, next);
+next:
+ addr = next;
+ dst_pmd++;
+ if (src_pmd)
+ src_pmd++;
+ } while (addr != end && !ret);
+
+ if (new_pmd) {
+ spin_lock(&init_mm.page_table_lock);
+ pud_populate(&init_mm, dst_pud, new_pmd);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ if (remove_pmd) {
+ spin_lock(&init_mm.page_table_lock);
+ __free_page(pud_page(*dst_pud));
+ pud_clear(dst_pud);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ return ret;
+}
+
+static int sync_direct_mapping_pud(unsigned long keyid,
+ p4d_t *dst_p4d, p4d_t *src_p4d,
+ unsigned long addr, unsigned long end)
+{
+ pud_t *src_pud, *dst_pud;
+ pud_t *new_pud = NULL;
+ bool remove_pud = false;
+ unsigned long next;
+ int ret;
+
+ /*
+ * We want to unmap and free the page table if the source is empty and
+ * the range covers whole page table.
+ */
+ remove_pud = !src_p4d && IS_ALIGNED(addr, P4D_SIZE) && IS_ALIGNED(end, P4D_SIZE);
+
+ /*
+ * P4D page got split into page table.
+ * Clear P4D mapping. Page table will be established instead.
+ */
+ if (p4d_large(*dst_p4d)) {
+ spin_lock(&init_mm.page_table_lock);
+ p4d_clear(dst_p4d);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ /* Allocate a new page table if needed. */
+ if (p4d_none(*dst_p4d)) {
+ new_pud = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ if (!new_pud)
+ return -ENOMEM;
+ dst_pud = new_pud + pud_index(addr + keyid * direct_mapping_size);
+ } else {
+ dst_pud = pud_offset(dst_p4d, addr + keyid * direct_mapping_size);
+ }
+ src_pud = src_p4d ? pud_offset(src_p4d, addr) : NULL;
+
+ do {
+ pud_t *__src_pud = src_pud;
+
+ next = pud_addr_end(addr, end);
+ if (!__src_pud || pud_none(*__src_pud)) {
+ if (pud_none(*dst_pud))
+ goto next;
+ if (pud_large(*dst_pud)) {
+ spin_lock(&init_mm.page_table_lock);
+ set_pud(dst_pud, __pud(0));
+ spin_unlock(&init_mm.page_table_lock);
+ goto next;
+ }
+ __src_pud = NULL;
+ }
+
+ if (__src_pud && pud_large(*__src_pud)) {
+ pudval_t val;
+
+ if (pud_large(*dst_pud)) {
+ /*
+ * Sanity check: PFNs must match between source
+ * and destination even if the rest doesn't.
+ */
+ BUG_ON(pud_pfn(*dst_pud) != pud_pfn(*__src_pud));
+ } else if (!pud_none(*dst_pud)) {
+ /*
+ * Page table is replaced with a pud page.
+ * Free and unmap the page table.
+ */
+ __free_page(pud_page(*dst_pud));
+ spin_lock(&init_mm.page_table_lock);
+ pud_clear(dst_pud);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ /* Copy entry, but set KeyID. */
+ val = pud_val(*__src_pud) | keyid << mktme_keyid_shift;
+ spin_lock(&init_mm.page_table_lock);
+ set_pud(dst_pud, __pud(val));
+ spin_unlock(&init_mm.page_table_lock);
+ goto next;
+ }
+
+ ret = sync_direct_mapping_pmd(keyid, dst_pud, __src_pud,
+ addr, next);
+next:
+ addr = next;
+ dst_pud++;
+ if (src_pud)
+ src_pud++;
+ } while (addr != end && !ret);
+
+ if (new_pud) {
+ spin_lock(&init_mm.page_table_lock);
+ p4d_populate(&init_mm, dst_p4d, new_pud);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ if (remove_pud) {
+ spin_lock(&init_mm.page_table_lock);
+ __free_page(p4d_page(*dst_p4d));
+ p4d_clear(dst_p4d);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ return ret;
+}
+
+static int sync_direct_mapping_p4d(unsigned long keyid,
+ pgd_t *dst_pgd, pgd_t *src_pgd,
+ unsigned long addr, unsigned long end)
+{
+ p4d_t *src_p4d, *dst_p4d;
+ p4d_t *new_p4d_1 = NULL, *new_p4d_2 = NULL;
+ bool remove_p4d = false;
+ unsigned long next;
+ int ret = 0;
+
+ /*
+ * We want to unmap and free the page table if the source is empty and
+ * the range covers whole page table.
+ */
+ remove_p4d = !src_pgd && IS_ALIGNED(addr, PGDIR_SIZE) && IS_ALIGNED(end, PGDIR_SIZE);
+
+ /* Allocate a new page table if needed. */
+ if (pgd_none(*dst_pgd)) {
+ new_p4d_1 = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ if (!new_p4d_1)
+ return -ENOMEM;
+ dst_p4d = new_p4d_1 + p4d_index(addr + keyid * direct_mapping_size);
+ } else {
+ dst_p4d = p4d_offset(dst_pgd, addr + keyid * direct_mapping_size);
+ }
+ src_p4d = src_pgd ? p4d_offset(src_pgd, addr) : NULL;
+
+ do {
+ p4d_t *__src_p4d = src_p4d;
+
+ next = p4d_addr_end(addr, end);
+ if (!__src_p4d || p4d_none(*__src_p4d)) {
+ if (p4d_none(*dst_p4d))
+ goto next;
+ __src_p4d = NULL;
+ }
+
+ ret = sync_direct_mapping_pud(keyid, dst_p4d, __src_p4d,
+ addr, next);
+next:
+ addr = next;
+ dst_p4d++;
+
+ /*
+ * Direct mappings are 1TiB-aligned. With 5-level paging it
+ * means that on PGD level there can be misalignment between
+ * source and destination.
+ *
+ * Allocate the new page table if dst_p4d crosses page table
+ * boundary.
+ */
+ if (!((unsigned long)dst_p4d & ~PAGE_MASK) && addr != end) {
+ if (pgd_none(dst_pgd[1])) {
+ new_p4d_2 = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ if (!new_p4d_2)
+ ret = -ENOMEM;
+ dst_p4d = new_p4d_2;
+ } else {
+ dst_p4d = p4d_offset(dst_pgd + 1, 0);
+ }
+ }
+ if (src_p4d)
+ src_p4d++;
+ } while (addr != end && !ret);
+
+ if (new_p4d_1 || new_p4d_2) {
+ spin_lock(&init_mm.page_table_lock);
+ if (new_p4d_1)
+ pgd_populate(&init_mm, dst_pgd, new_p4d_1);
+ if (new_p4d_2)
+ pgd_populate(&init_mm, dst_pgd + 1, new_p4d_2);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ if (remove_p4d) {
+ spin_lock(&init_mm.page_table_lock);
+ __free_page(pgd_page(*dst_pgd));
+ pgd_clear(dst_pgd);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ return ret;
+}
+
+static int sync_direct_mapping_keyid(unsigned long keyid)
+{
+ pgd_t *src_pgd, *dst_pgd;
+ unsigned long addr, end, next;
+ int ret;
+
+ addr = PAGE_OFFSET;
+ end = PAGE_OFFSET + direct_mapping_size;
+
+ dst_pgd = pgd_offset_k(addr + keyid * direct_mapping_size);
+ src_pgd = pgd_offset_k(addr);
+
+ do {
+ pgd_t *__src_pgd = src_pgd;
+
+ next = pgd_addr_end(addr, end);
+ if (pgd_none(*__src_pgd)) {
+ if (pgd_none(*dst_pgd))
+ continue;
+ __src_pgd = NULL;
+ }
+
+ ret = sync_direct_mapping_p4d(keyid, dst_pgd, __src_pgd,
+ addr, next);
+ } while (dst_pgd++, src_pgd++, addr = next, addr != end && !ret);
+
+ return ret;
+}
+
+/*
+ * For MKTME we maintain per-KeyID direct mappings. This allows the kernel to have
+ * access to encrypted memory.
+ *
+ * sync_direct_mapping() syncs the per-KeyID direct mappings with the canonical
+ * one -- KeyID-0.
+ *
+ * The function tracks changes in the canonical mapping:
+ * - creating or removing chunks of the translation tree;
+ * - changes in mapping flags (i.e. protection bits);
+ * - splitting huge page mapping into a page table;
+ * - replacing page table with a huge page mapping;
+ *
+ * The function needs to be called on every change to the direct mapping:
+ * hotplug, hotremove, changes in permissions bits, etc.
+ *
+ * The function is a nop until MKTME is enabled.
+ */
+int sync_direct_mapping(void)
+{
+ int i, ret = 0;
+
+ if (!mktme_enabled())
+ return 0;
+
+ for (i = 1; !ret && i <= mktme_nr_keyids; i++)
+ ret = sync_direct_mapping_keyid(i);
+
+ __flush_tlb_all();
+
+ return ret;
+}
--
2.18.0


2018-07-17 11:23:18

by Kirill A. Shutemov

Subject: [PATCHv5 05/19] mm/page_alloc: Handle allocation for encrypted memory

For encrypted memory, we need to allocate pages for a specific
encryption KeyID.

There are two cases when we need to allocate a page for encryption:

- Allocation for an encrypted VMA;

- Allocation for migration of an encrypted page;

The first case can be covered within alloc_page_vma(). We know the KeyID
from the VMA.

The second case requires a few new page allocation routines that
allocate a page for a specific KeyID.

An encrypted page has to be cleared after the KeyID is set. This is
handled in prep_encrypted_page(), which will be provided by arch-specific
code.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/gfp.h | 48 ++++++++++++++++++++++++++++++++++++-----
include/linux/migrate.h | 12 ++++++++---
mm/compaction.c | 1 +
mm/mempolicy.c | 28 ++++++++++++++++++------
mm/migrate.c | 4 ++--
mm/page_alloc.c | 47 ++++++++++++++++++++++++++++++++++++++++
6 files changed, 123 insertions(+), 17 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 66f395737990..347a40558cfc 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -446,16 +446,46 @@ static inline void arch_free_page(struct page *page, int order) { }
static inline void arch_alloc_page(struct page *page, int order) { }
#endif

+#ifndef prep_encrypted_page
+static inline void prep_encrypted_page(struct page *page, int order,
+ int keyid, bool zero)
+{
+}
+#endif
+
+/*
+ * Encrypted page has to be cleared once keyid is set, not on allocation.
+ */
+static inline bool encrypted_page_needs_zero(int keyid, gfp_t *gfp_mask)
+{
+ if (!keyid)
+ return false;
+
+ if (*gfp_mask & __GFP_ZERO) {
+ *gfp_mask &= ~__GFP_ZERO;
+ return true;
+ }
+
+ return false;
+}
+
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
nodemask_t *nodemask);

+struct page *
+__alloc_pages_nodemask_keyid(gfp_t gfp_mask, unsigned int order,
+ int preferred_nid, nodemask_t *nodemask, int keyid);
+
static inline struct page *
__alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
{
return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
}

+struct page *__alloc_pages_node_keyid(int nid, int keyid,
+ gfp_t gfp_mask, unsigned int order);
+
/*
* Allocate pages, preferring the node given as nid. The node must be valid and
* online. For more general interface, see alloc_pages_node().
@@ -483,6 +513,19 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
return __alloc_pages_node(nid, gfp_mask, order);
}

+static inline struct page *alloc_pages_node_keyid(int nid, int keyid,
+ gfp_t gfp_mask, unsigned int order)
+{
+ if (nid == NUMA_NO_NODE)
+ nid = numa_mem_id();
+
+ return __alloc_pages_node_keyid(nid, keyid, gfp_mask, order);
+}
+
+extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
+ struct vm_area_struct *vma, unsigned long addr,
+ int node, bool hugepage);
+
#ifdef CONFIG_NUMA
extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);

@@ -491,14 +534,9 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
{
return alloc_pages_current(gfp_mask, order);
}
-extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
- struct vm_area_struct *vma, unsigned long addr,
- int node, bool hugepage);
#else
#define alloc_pages(gfp_mask, order) \
alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
- alloc_pages(gfp_mask, order)
#endif
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
#define alloc_page_vma(gfp_mask, vma, addr) \
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f2b4abbca55e..fede9bfa89d9 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -38,9 +38,15 @@ static inline struct page *new_page_nodemask(struct page *page,
unsigned int order = 0;
struct page *new_page = NULL;

- if (PageHuge(page))
+ if (PageHuge(page)) {
+ /*
+ * HugeTLB doesn't support encryption. We shouldn't see
+ * such pages.
+ */
+ WARN_ON(page_keyid(page));
return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
preferred_nid, nodemask);
+ }

if (PageTransHuge(page)) {
gfp_mask |= GFP_TRANSHUGE;
@@ -50,8 +56,8 @@ static inline struct page *new_page_nodemask(struct page *page,
if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
gfp_mask |= __GFP_HIGHMEM;

- new_page = __alloc_pages_nodemask(gfp_mask, order,
- preferred_nid, nodemask);
+ new_page = __alloc_pages_nodemask_keyid(gfp_mask, order,
+ preferred_nid, nodemask, page_keyid(page));

if (new_page && PageTransHuge(new_page))
prep_transhuge_page(new_page);
diff --git a/mm/compaction.c b/mm/compaction.c
index faca45ebe62d..fd51aa32ad96 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1187,6 +1187,7 @@ static struct page *compaction_alloc(struct page *migratepage,
list_del(&freepage->lru);
cc->nr_freepages--;

+ prep_encrypted_page(freepage, 0, page_keyid(migratepage), false);
return freepage;
}

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 581b729e05a0..ce7b436444b5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -921,22 +921,28 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
/* page allocation callback for NUMA node migration */
struct page *alloc_new_node_page(struct page *page, unsigned long node)
{
- if (PageHuge(page))
+ if (PageHuge(page)) {
+ /*
+ * HugeTLB doesn't support encryption. We shouldn't see
+ * such pages.
+ */
+ WARN_ON(page_keyid(page));
return alloc_huge_page_node(page_hstate(compound_head(page)),
node);
- else if (PageTransHuge(page)) {
+ } else if (PageTransHuge(page)) {
struct page *thp;

- thp = alloc_pages_node(node,
+ thp = alloc_pages_node_keyid(node, page_keyid(page),
(GFP_TRANSHUGE | __GFP_THISNODE),
HPAGE_PMD_ORDER);
if (!thp)
return NULL;
prep_transhuge_page(thp);
return thp;
- } else
- return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
- __GFP_THISNODE, 0);
+ } else {
+ return __alloc_pages_node_keyid(node, page_keyid(page),
+ GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
+ }
}

/*
@@ -2013,9 +2019,16 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
{
struct mempolicy *pol;
struct page *page;
- int preferred_nid;
+ bool zero = false;
+ int keyid, preferred_nid;
nodemask_t *nmask;

+ keyid = vma_keyid(vma);
+ if (keyid && (gfp & __GFP_ZERO)) {
+ zero = true;
+ gfp &= ~__GFP_ZERO;
+ }
+
pol = get_vma_policy(vma, addr);

if (pol->mode == MPOL_INTERLEAVE) {
@@ -2058,6 +2071,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
mpol_cond_put(pol);
out:
+ prep_encrypted_page(page, order, keyid, zero);
return page;
}

diff --git a/mm/migrate.c b/mm/migrate.c
index 8c0af0f7cab1..eb8dea219dcb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1847,7 +1847,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
int nid = (int) data;
struct page *newpage;

- newpage = __alloc_pages_node(nid,
+ newpage = __alloc_pages_node_keyid(nid, page_keyid(page),
(GFP_HIGHUSER_MOVABLE |
__GFP_THISNODE | __GFP_NOMEMALLOC |
__GFP_NORETRY | __GFP_NOWARN) &
@@ -2030,7 +2030,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
if (numamigrate_update_ratelimit(pgdat, HPAGE_PMD_NR))
goto out_dropref;

- new_page = alloc_pages_node(node,
+ new_page = alloc_pages_node_keyid(node, page_keyid(page),
(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
HPAGE_PMD_ORDER);
if (!new_page)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5d800d61ddb7..d7dc54b75f5d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3697,6 +3697,39 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
}
#endif /* CONFIG_COMPACTION */

+#ifndef CONFIG_NUMA
+struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
+ struct vm_area_struct *vma, unsigned long addr,
+ int node, bool hugepage)
+{
+ struct page *page;
+ bool need_zero;
+ int keyid = vma_keyid(vma);
+
+ need_zero = encrypted_page_needs_zero(keyid, &gfp_mask);
+ page = alloc_pages(gfp_mask, order);
+ prep_encrypted_page(page, order, keyid, need_zero);
+
+ return page;
+}
+#endif
+
+struct page * __alloc_pages_node_keyid(int nid, int keyid,
+ gfp_t gfp_mask, unsigned int order)
+{
+ struct page *page;
+ bool need_zero;
+
+ VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+ VM_WARN_ON(!node_online(nid));
+
+ need_zero = encrypted_page_needs_zero(keyid, &gfp_mask);
+ page = __alloc_pages(gfp_mask, order, nid);
+ prep_encrypted_page(page, order, keyid, need_zero);
+
+ return page;
+}
+
#ifdef CONFIG_LOCKDEP
static struct lockdep_map __fs_reclaim_map =
STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
@@ -4401,6 +4434,20 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
}
EXPORT_SYMBOL(__alloc_pages_nodemask);

+struct page *
+__alloc_pages_nodemask_keyid(gfp_t gfp_mask, unsigned int order,
+ int preferred_nid, nodemask_t *nodemask, int keyid)
+{
+ struct page *page;
+ bool need_zero;
+
+ need_zero = encrypted_page_needs_zero(keyid, &gfp_mask);
+ page = __alloc_pages_nodemask(gfp_mask, order, preferred_nid, nodemask);
+ prep_encrypted_page(page, order, keyid, need_zero);
+ return page;
+}
+EXPORT_SYMBOL(__alloc_pages_nodemask_keyid);
+
/*
* Common helper functions.
*/
--
2.18.0


2018-07-17 11:23:31

by Kirill A. Shutemov

Subject: [PATCHv5 19/19] x86: Introduce CONFIG_X86_INTEL_MKTME

Add a new config option to enable/disable Multi-Key Total Memory
Encryption support.

MKTME uses MEMORY_PHYSICAL_PADDING to reserve enough space in the
per-KeyID direct mappings for memory hotplug.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b6f1785c2176..023a22568c06 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1523,6 +1523,23 @@ config ARCH_USE_MEMREMAP_PROT
def_bool y
depends on AMD_MEM_ENCRYPT

+config X86_INTEL_MKTME
+ bool "Intel Multi-Key Total Memory Encryption"
+ select DYNAMIC_PHYSICAL_MASK
+ select PAGE_EXTENSION
+ depends on X86_64 && CPU_SUP_INTEL
+ ---help---
+ Say yes to enable support for Multi-Key Total Memory Encryption.
+ This requires an Intel processor that has support for the feature.
+
+ Multikey Total Memory Encryption (MKTME) is a technology that allows
+ transparent memory encryption in upcoming Intel platforms.
+
+ MKTME is built on top of TME. TME allows encryption of the entirety
+ of system memory using a single key. MKTME allows having multiple
+ encryption domains, each having own key -- different memory pages can
+ be encrypted with different keys.
+
# Common NUMA Features
config NUMA
bool "Numa Memory Allocation and Scheduler Support"
@@ -2199,7 +2216,7 @@ config RANDOMIZE_MEMORY

config MEMORY_PHYSICAL_PADDING
hex "Physical memory mapping padding" if EXPERT
- depends on RANDOMIZE_MEMORY
+ depends on RANDOMIZE_MEMORY || X86_INTEL_MKTME
default "0xa" if MEMORY_HOTPLUG
default "0x0"
range 0x1 0x40 if MEMORY_HOTPLUG
--
2.18.0


2018-07-17 11:23:41

by Kirill A. Shutemov

Subject: [PATCHv5 18/19] x86/mm: Handle encrypted memory in page_to_virt() and __pa()

Per-KeyID direct mappings require changes to how we find the right
virtual address for a page and how we do virt-to-phys address translation.

The page_to_virt() definition overrides the default macro provided by
<linux/mm.h>. We only override the macro if MKTME is enabled at
compile time.
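
As a rough sketch (hypothetical helper, not part of the patch), the masking in
__pa() works because direct_mapping_size is a power of 2, so any per-KeyID
alias folds back to the physical address of the KeyID-0 mapping:

	/* Hypothetical: per-KeyID aliases resolve to the same physical address. */
	static phys_addr_t keyid_virt_to_phys(unsigned long vaddr)
	{
		/*
		 * keyid * direct_mapping_size is a multiple of the mapping
		 * size, so masking with direct_mapping_mask removes it and
		 * leaves the offset within the KeyID-0 direct mapping.
		 */
		return (vaddr - PAGE_OFFSET) & direct_mapping_mask;
	}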

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 3 +++
arch/x86/include/asm/page_64.h | 2 +-
2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index ba83fba4f9b3..dbfbd955da98 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -29,6 +29,9 @@ void arch_free_page(struct page *page, int order);

int sync_direct_mapping(void);

+#define page_to_virt(x) \
+ (__va(PFN_PHYS(page_to_pfn(x))) + page_keyid(x) * direct_mapping_size)
+
#else
#define mktme_keyid_mask ((phys_addr_t)0)
#define mktme_nr_keyids 0
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index f57fc3cc2246..a4f394e3471d 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -24,7 +24,7 @@ static inline unsigned long __phys_addr_nodebug(unsigned long x)
/* use the carry flag to determine if x was < __START_KERNEL_map */
x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));

- return x;
+ return x & direct_mapping_mask;
}

#ifdef CONFIG_DEBUG_VIRTUAL
--
2.18.0


2018-07-17 11:24:16

by Kirill A. Shutemov

Subject: [PATCHv5 12/19] x86/mm: Implement prep_encrypted_page() and arch_free_page()

The hardware/CPU does not enforce coherency between mappings of the same
physical page with different KeyIDs or encryption keys.
We are responsible for cache management.

Flush the cache when allocating an encrypted page and when returning the
page to the free pool.

prep_encrypted_page() also takes care of zeroing the page. We have to
do this after the KeyID is set for the page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 6 +++++
arch/x86/mm/mktme.c | 49 ++++++++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+)

diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index f0b7844e36a4..44409b8bbaca 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -19,6 +19,12 @@ int page_keyid(const struct page *page);
#define vma_keyid vma_keyid
int vma_keyid(struct vm_area_struct *vma);

+#define prep_encrypted_page prep_encrypted_page
+void prep_encrypted_page(struct page *page, int order, int keyid, bool zero);
+
+#define HAVE_ARCH_FREE_PAGE
+void arch_free_page(struct page *page, int order);
+
#else
#define mktme_keyid_mask ((phys_addr_t)0)
#define mktme_nr_keyids 0
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index a1f40ee61b25..1194496633ce 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -1,4 +1,5 @@
#include <linux/mm.h>
+#include <linux/highmem.h>
#include <asm/mktme.h>

phys_addr_t mktme_keyid_mask;
@@ -49,3 +50,51 @@ int vma_keyid(struct vm_area_struct *vma)
prot = pgprot_val(vma->vm_page_prot);
return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
}
+
+void prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
+{
+ int i;
+
+ /* It's not encrypted page: nothing to do */
+ if (!keyid)
+ return;
+
+ /*
+ * The hardware/CPU does not enforce coherency between mappings of the
+ * same physical page with different KeyIDs or encryption keys.
+ * We are responsible for cache management.
+ *
+ * We flush cache before allocating encrypted page
+ */
+ clflush_cache_range(page_address(page), PAGE_SIZE << order);
+
+ for (i = 0; i < (1 << order); i++) {
+ /* All pages coming out of the allocator should have KeyID 0 */
+ WARN_ON_ONCE(lookup_page_ext(page)->keyid);
+ lookup_page_ext(page)->keyid = keyid;
+
+ /* Clear the page after the KeyID is set. */
+ if (zero)
+ clear_highpage(page);
+
+ page++;
+ }
+}
+
+void arch_free_page(struct page *page, int order)
+{
+ int i;
+
+ /* It's not encrypted page: nothing to do */
+ if (!page_keyid(page))
+ return;
+
+ clflush_cache_range(page_address(page), PAGE_SIZE << order);
+
+ for (i = 0; i < (1 << order); i++) {
+ /* Check if the page has reasonable KeyID */
+ WARN_ON_ONCE(lookup_page_ext(page)->keyid > mktme_nr_keyids);
+ lookup_page_ext(page)->keyid = 0;
+ page++;
+ }
+}
--
2.18.0


2018-07-17 11:24:30

by Kirill A. Shutemov

Subject: [PATCHv5 10/19] x86/mm: Implement page_keyid() using page_ext

Store the KeyID in bits 31:16 of the extended page flags. These bits are
unused.

page_keyid() returns zero until page_ext is ready. The page_ext initializer
enables a static branch to indicate that page_keyid() can use page_ext.
The same static branch will gate MKTME readiness in general.

We don't yet set the KeyID for any page. That will come in the following
patch, which implements prep_encrypted_page(). All pages have KeyID-0 for
now.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 7 +++++++
arch/x86/include/asm/page.h | 1 +
arch/x86/mm/mktme.c | 34 ++++++++++++++++++++++++++++++++++
include/linux/page_ext.h | 11 ++++++++++-
mm/page_ext.c | 3 +++
5 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index df31876ec48c..7266494b4f0a 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -2,11 +2,18 @@
#define _ASM_X86_MKTME_H

#include <linux/types.h>
+#include <linux/page_ext.h>

#ifdef CONFIG_X86_INTEL_MKTME
extern phys_addr_t mktme_keyid_mask;
extern int mktme_nr_keyids;
extern int mktme_keyid_shift;
+
+extern struct page_ext_operations page_mktme_ops;
+
+#define page_keyid page_keyid
+int page_keyid(const struct page *page);
+
#else
#define mktme_keyid_mask ((phys_addr_t)0)
#define mktme_nr_keyids 0
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 7555b48803a8..39af59487d5f 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -19,6 +19,7 @@
struct page;

#include <linux/range.h>
+#include <asm/mktme.h>
extern struct range pfn_mapped[];
extern int nr_pfn_mapped;

diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index 467f1b26c737..09cbff678b9f 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -3,3 +3,37 @@
phys_addr_t mktme_keyid_mask;
int mktme_nr_keyids;
int mktme_keyid_shift;
+
+static DEFINE_STATIC_KEY_FALSE(mktme_enabled_key);
+
+static inline bool mktme_enabled(void)
+{
+ return static_branch_unlikely(&mktme_enabled_key);
+}
+
+int page_keyid(const struct page *page)
+{
+ if (!mktme_enabled())
+ return 0;
+
+ return lookup_page_ext(page)->keyid;
+}
+EXPORT_SYMBOL(page_keyid);
+
+static bool need_page_mktme(void)
+{
+ /* Make sure keyid doesn't collide with extended page flags */
+ BUILD_BUG_ON(__NR_PAGE_EXT_FLAGS > 16);
+
+ return !!mktme_nr_keyids;
+}
+
+static void init_page_mktme(void)
+{
+ static_branch_enable(&mktme_enabled_key);
+}
+
+struct page_ext_operations page_mktme_ops = {
+ .need = need_page_mktme,
+ .init = init_page_mktme,
+};
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index f84f167ec04c..d9c5aae9523f 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -23,6 +23,7 @@ enum page_ext_flags {
PAGE_EXT_YOUNG,
PAGE_EXT_IDLE,
#endif
+ __NR_PAGE_EXT_FLAGS
};

/*
@@ -33,7 +34,15 @@ enum page_ext_flags {
* then the page_ext for pfn always exists.
*/
struct page_ext {
- unsigned long flags;
+ union {
+ unsigned long flags;
+#ifdef CONFIG_X86_INTEL_MKTME
+ struct {
+ unsigned short __pad;
+ unsigned short keyid;
+ };
+#endif
+ };
};

extern void pgdat_page_ext_init(struct pglist_data *pgdat);
diff --git a/mm/page_ext.c b/mm/page_ext.c
index a9826da84ccb..036658229842 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -68,6 +68,9 @@ static struct page_ext_operations *page_ext_ops[] = {
#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
&page_idle_ops,
#endif
+#ifdef CONFIG_X86_INTEL_MKTME
+ &page_mktme_ops,
+#endif
};

static unsigned long total_usage;
--
2.18.0


2018-07-17 11:24:34

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv5 01/19] mm: Do no merge VMAs with different encryption KeyIDs

VMAs with different KeyIDs do not mix together. Only VMAs with the same
KeyID are compatible.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/userfaultfd.c | 7 ++++---
include/linux/mm.h | 9 ++++++++-
mm/madvise.c | 2 +-
mm/mempolicy.c | 3 ++-
mm/mlock.c | 2 +-
mm/mmap.c | 31 +++++++++++++++++++------------
mm/mprotect.c | 2 +-
7 files changed, 36 insertions(+), 20 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 594d192b2331..bb0db9f9d958 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -890,7 +890,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
new_flags, vma->anon_vma,
vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
- NULL_VM_UFFD_CTX);
+ NULL_VM_UFFD_CTX, vma_keyid(vma));
if (prev)
vma = prev;
else
@@ -1423,7 +1423,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
prev = vma_merge(mm, prev, start, vma_end, new_flags,
vma->anon_vma, vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
- ((struct vm_userfaultfd_ctx){ ctx }));
+ ((struct vm_userfaultfd_ctx){ ctx }),
+ vma_keyid(vma));
if (prev) {
vma = prev;
goto next;
@@ -1581,7 +1582,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
prev = vma_merge(mm, prev, start, vma_end, new_flags,
vma->anon_vma, vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
- NULL_VM_UFFD_CTX);
+ NULL_VM_UFFD_CTX, vma_keyid(vma));
if (prev) {
vma = prev;
goto next;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9a35362bbc92..c8780c5835ad 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1544,6 +1544,13 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma)
return vma->vm_ops == &anon_vm_ops;
}

+#ifndef vma_keyid
+static inline int vma_keyid(struct vm_area_struct *vma)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_SHMEM
/*
* The vma_is_shmem is not inline because it is used only by slow
@@ -2219,7 +2226,7 @@ static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
extern struct vm_area_struct *vma_merge(struct mm_struct *,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
- struct mempolicy *, struct vm_userfaultfd_ctx);
+ struct mempolicy *, struct vm_userfaultfd_ctx, int keyid);
extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
unsigned long addr, int new_below);
diff --git a/mm/madvise.c b/mm/madvise.c
index 4d3c922ea1a1..c88fb12be6e5 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -138,7 +138,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
vma->vm_file, pgoff, vma_policy(vma),
- vma->vm_userfaultfd_ctx);
+ vma->vm_userfaultfd_ctx, vma_keyid(vma));
if (*prev) {
vma = *prev;
goto success;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f0fcf70bcec7..581b729e05a0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -705,7 +705,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
((vmstart - vma->vm_start) >> PAGE_SHIFT);
prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
vma->anon_vma, vma->vm_file, pgoff,
- new_pol, vma->vm_userfaultfd_ctx);
+ new_pol, vma->vm_userfaultfd_ctx,
+ vma_keyid(vma));
if (prev) {
vma = prev;
next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index 74e5a6547c3d..3c96321b66bb 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -534,7 +534,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
vma->vm_file, pgoff, vma_policy(vma),
- vma->vm_userfaultfd_ctx);
+ vma->vm_userfaultfd_ctx, vma_keyid(vma));
if (*prev) {
vma = *prev;
goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index 180f19dfb83f..4c604eb644b4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -984,7 +984,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
*/
static inline int is_mergeable_vma(struct vm_area_struct *vma,
struct file *file, unsigned long vm_flags,
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+ int keyid)
{
/*
* VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -998,6 +999,8 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma,
return 0;
if (vma->vm_file != file)
return 0;
+ if (vma_keyid(vma) != keyid)
+ return 0;
if (vma->vm_ops->close)
return 0;
if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
@@ -1034,9 +1037,10 @@ static int
can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
struct anon_vma *anon_vma, struct file *file,
pgoff_t vm_pgoff,
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+ int keyid)
{
- if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
+ if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, keyid) &&
is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
if (vma->vm_pgoff == vm_pgoff)
return 1;
@@ -1055,9 +1059,10 @@ static int
can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
struct anon_vma *anon_vma, struct file *file,
pgoff_t vm_pgoff,
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+ int keyid)
{
- if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
+ if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, keyid) &&
is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
pgoff_t vm_pglen;
vm_pglen = vma_pages(vma);
@@ -1112,7 +1117,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
unsigned long end, unsigned long vm_flags,
struct anon_vma *anon_vma, struct file *file,
pgoff_t pgoff, struct mempolicy *policy,
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+ int keyid)
{
pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
struct vm_area_struct *area, *next;
@@ -1145,7 +1151,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
mpol_equal(vma_policy(prev), policy) &&
can_vma_merge_after(prev, vm_flags,
anon_vma, file, pgoff,
- vm_userfaultfd_ctx)) {
+ vm_userfaultfd_ctx, keyid)) {
/*
* OK, it can. Can we now merge in the successor as well?
*/
@@ -1154,7 +1160,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
can_vma_merge_before(next, vm_flags,
anon_vma, file,
pgoff+pglen,
- vm_userfaultfd_ctx) &&
+ vm_userfaultfd_ctx,
+ keyid) &&
is_mergeable_anon_vma(prev->anon_vma,
next->anon_vma, NULL)) {
/* cases 1, 6 */
@@ -1177,7 +1184,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
mpol_equal(policy, vma_policy(next)) &&
can_vma_merge_before(next, vm_flags,
anon_vma, file, pgoff+pglen,
- vm_userfaultfd_ctx)) {
+ vm_userfaultfd_ctx, keyid)) {
if (prev && addr < prev->vm_end) /* case 4 */
err = __vma_adjust(prev, prev->vm_start,
addr, prev->vm_pgoff, NULL, next);
@@ -1722,7 +1729,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* Can we just expand an old mapping?
*/
vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
- NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
+ NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX, 0);
if (vma)
goto out;

@@ -2987,7 +2994,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla

/* Can we just expand an old private anonymous mapping? */
vma = vma_merge(mm, prev, addr, addr + len, flags,
- NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);
+ NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX, 0);
if (vma)
goto out;

@@ -3189,7 +3196,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
return NULL; /* should never get here */
new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
- vma->vm_userfaultfd_ctx);
+ vma->vm_userfaultfd_ctx, vma_keyid(vma));
if (new_vma) {
/*
* Source vma may have been merged into new_vma
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 625608bc8962..68dc476310c0 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -349,7 +349,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*pprev = vma_merge(mm, *pprev, start, end, newflags,
vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
- vma->vm_userfaultfd_ctx);
+ vma->vm_userfaultfd_ctx, vma_keyid(vma));
if (*pprev) {
vma = *pprev;
VM_WARN_ON((vma->vm_flags ^ newflags) & ~VM_SOFTDIRTY);
--
2.18.0


2018-07-17 11:24:41

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv5 04/19] mm/page_alloc: Unify alloc_hugepage_vma()

We don't need to have separate implementations of alloc_hugepage_vma()
for NUMA and non-NUMA. Using variant based on alloc_pages_vma() we would
cover both cases.

This is preparation patch for allocation encrypted pages.

alloc_pages_vma() will handle allocation of encrypted pages. With this
change we don' t need to cover alloc_hugepage_vma() separately.

The change makes typo in Alpha's implementation of
__alloc_zeroed_user_highpage() visible. Fix it too.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/alpha/include/asm/page.h | 2 +-
include/linux/gfp.h | 6 ++----
2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
index f3fb2848470a..9a6fbb5269f3 100644
--- a/arch/alpha/include/asm/page.h
+++ b/arch/alpha/include/asm/page.h
@@ -18,7 +18,7 @@ extern void clear_page(void *page);
#define clear_user_page(page, vaddr, pg) clear_page(page)

#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
- alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vmaddr)
+ alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

extern void copy_page(void * _to, void * _from);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index a6afcec53795..66f395737990 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -494,21 +494,19 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
struct vm_area_struct *vma, unsigned long addr,
int node, bool hugepage);
-#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
- alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
#else
#define alloc_pages(gfp_mask, order) \
alloc_pages_node(numa_node_id(), gfp_mask, order)
#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
alloc_pages(gfp_mask, order)
-#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
- alloc_pages(gfp_mask, order)
#endif
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
#define alloc_page_vma(gfp_mask, vma, addr) \
alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
#define alloc_page_vma_node(gfp_mask, vma, addr, node) \
alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
+ alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)

extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
extern unsigned long get_zeroed_page(gfp_t gfp_mask);
--
2.18.0


2018-07-17 11:24:46

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv5 13/19] x86/mm: Rename CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING

Rename the option to CONFIG_MEMORY_PHYSICAL_PADDING. It will be used
not only for KASLR.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 2 +-
arch/x86/mm/kaslr.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6d4774f203d0..b6f1785c2176 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2197,7 +2197,7 @@ config RANDOMIZE_MEMORY

If unsure, say Y.

-config RANDOMIZE_MEMORY_PHYSICAL_PADDING
+config MEMORY_PHYSICAL_PADDING
hex "Physical memory mapping padding" if EXPERT
depends on RANDOMIZE_MEMORY
default "0xa" if MEMORY_HOTPLUG
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 61db77b0eda9..4408cd9a3bef 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -102,7 +102,7 @@ void __init kernel_randomize_memory(void)
*/
BUG_ON(kaslr_regions[0].base != &page_offset_base);
memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
- CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;
+ CONFIG_MEMORY_PHYSICAL_PADDING;

/* Adapt phyiscal memory region size based on available memory */
if (memory_tb < kaslr_regions[0].size_tb)
--
2.18.0


2018-07-17 11:24:46

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv5 11/19] x86/mm: Implement vma_keyid()

We store the KeyID in the upper bits of vm_page_prot, matching the position
of the KeyID in the PTE. vma_keyid() extracts the KeyID from vm_page_prot.

With the KeyID in vm_page_prot we don't need to modify any page table helper
to propagate the KeyID to page table entries.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 5 +++++
arch/x86/mm/mktme.c | 12 ++++++++++++
2 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index 7266494b4f0a..f0b7844e36a4 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -4,6 +4,8 @@
#include <linux/types.h>
#include <linux/page_ext.h>

+struct vm_area_struct;
+
#ifdef CONFIG_X86_INTEL_MKTME
extern phys_addr_t mktme_keyid_mask;
extern int mktme_nr_keyids;
@@ -14,6 +16,9 @@ extern struct page_ext_operations page_mktme_ops;
#define page_keyid page_keyid
int page_keyid(const struct page *page);

+#define vma_keyid vma_keyid
+int vma_keyid(struct vm_area_struct *vma);
+
#else
#define mktme_keyid_mask ((phys_addr_t)0)
#define mktme_nr_keyids 0
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index 09cbff678b9f..a1f40ee61b25 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -1,3 +1,4 @@
+#include <linux/mm.h>
#include <asm/mktme.h>

phys_addr_t mktme_keyid_mask;
@@ -37,3 +38,14 @@ struct page_ext_operations page_mktme_ops = {
.need = need_page_mktme,
.init = init_page_mktme,
};
+
+int vma_keyid(struct vm_area_struct *vma)
+{
+ pgprotval_t prot;
+
+ if (!mktme_enabled())
+ return 0;
+
+ prot = pgprot_val(vma->vm_page_prot);
+ return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
+}
--
2.18.0


2018-07-17 11:25:02

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv5 14/19] x86/mm: Allow to disable MKTME after enumeration

The new helper mktme_disable() allows disabling MKTME even if it has been
enumerated successfully. MKTME initialization may fail and this
functionality allows the system to boot regardless of the failure.

MKTME needs per-KeyID direct mappings. They require a lot more virtual
address space, which may be a problem in 4-level paging mode. If the
system has more physical memory than we can handle with MKTME, this
helper allows us to fail MKTME but still boot the system successfully.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 2 ++
arch/x86/kernel/cpu/intel.c | 5 +----
arch/x86/mm/mktme.c | 9 +++++++++
3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index 44409b8bbaca..ebbee6a0c495 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -6,6 +6,8 @@

struct vm_area_struct;

+void mktme_disable(void);
+
#ifdef CONFIG_X86_INTEL_MKTME
extern phys_addr_t mktme_keyid_mask;
extern int mktme_nr_keyids;
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index efc9e9fc47d4..75e3b2602b4a 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -591,10 +591,7 @@ static void detect_tme(struct cpuinfo_x86 *c)
* Maybe needed if there's inconsistent configuation
* between CPUs.
*/
- physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
- mktme_keyid_mask = 0;
- mktme_keyid_shift = 0;
- mktme_nr_keyids = 0;
+ mktme_disable();
}
#endif

diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index 1194496633ce..bb6210dbcf0e 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -13,6 +13,15 @@ static inline bool mktme_enabled(void)
return static_branch_unlikely(&mktme_enabled_key);
}

+void mktme_disable(void)
+{
+ physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
+ mktme_keyid_mask = 0;
+ mktme_keyid_shift = 0;
+ mktme_nr_keyids = 0;
+ static_branch_disable(&mktme_enabled_key);
+}
+
int page_keyid(const struct page *page)
{
if (!mktme_enabled())
--
2.18.0


2018-07-17 11:25:07

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv5 03/19] mm/ksm: Do not merge pages with different KeyIDs

Pages encrypted with different encryption keys are not allowed to be
merged by KSM. Otherwise it would cross security boundary.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 7 +++++++
mm/ksm.c | 3 +++
2 files changed, 10 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 151d6e6b16e5..a4ce26aa0b65 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1551,6 +1551,13 @@ static inline int vma_keyid(struct vm_area_struct *vma)
}
#endif

+#ifndef page_keyid
+static inline int page_keyid(struct page *page)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_SHMEM
/*
* The vma_is_shmem is not inline because it is used only by slow
diff --git a/mm/ksm.c b/mm/ksm.c
index a6d43cf9a982..1bd7b9710e29 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1214,6 +1214,9 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
if (!PageAnon(page))
goto out;

+ if (page_keyid(page) != page_keyid(kpage))
+ goto out;
+
/*
* We need the page lock to read a stable PageSwapCache in
* write_protect_page(). We use trylock_page() instead of
--
2.18.0


2018-07-17 11:25:07

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv5 08/19] x86/mm: Introduce variables to store number, shift and mask of KeyIDs

mktme_nr_keyids holds number of KeyIDs available for MKTME, excluding
KeyID zero which used by TME. MKTME KeyIDs start from 1.

mktme_keyid_shift holds shift of KeyID within physical address.

mktme_keyid_mask holds mask to extract KeyID from physical address.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 16 ++++++++++++++++
arch/x86/kernel/cpu/intel.c | 12 ++++++++----
arch/x86/mm/Makefile | 2 ++
arch/x86/mm/mktme.c | 5 +++++
4 files changed, 31 insertions(+), 4 deletions(-)
create mode 100644 arch/x86/include/asm/mktme.h
create mode 100644 arch/x86/mm/mktme.c

diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
new file mode 100644
index 000000000000..df31876ec48c
--- /dev/null
+++ b/arch/x86/include/asm/mktme.h
@@ -0,0 +1,16 @@
+#ifndef _ASM_X86_MKTME_H
+#define _ASM_X86_MKTME_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_X86_INTEL_MKTME
+extern phys_addr_t mktme_keyid_mask;
+extern int mktme_nr_keyids;
+extern int mktme_keyid_shift;
+#else
+#define mktme_keyid_mask ((phys_addr_t)0)
+#define mktme_nr_keyids 0
+#define mktme_keyid_shift 0
+#endif
+
+#endif
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index bf2caf9d52dd..efc9e9fc47d4 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -573,6 +573,9 @@ static void detect_tme(struct cpuinfo_x86 *c)

#ifdef CONFIG_X86_INTEL_MKTME
if (mktme_status == MKTME_ENABLED && nr_keyids) {
+ mktme_nr_keyids = nr_keyids;
+ mktme_keyid_shift = c->x86_phys_bits - keyid_bits;
+
/*
* Mask out bits claimed from KeyID from physical address mask.
*
@@ -580,10 +583,8 @@ static void detect_tme(struct cpuinfo_x86 *c)
* and number of bits claimed for KeyID is 6, bits 51:46 of
* physical address is unusable.
*/
- phys_addr_t keyid_mask;
-
- keyid_mask = GENMASK_ULL(c->x86_phys_bits - 1, c->x86_phys_bits - keyid_bits);
- physical_mask &= ~keyid_mask;
+ mktme_keyid_mask = GENMASK_ULL(c->x86_phys_bits - 1, mktme_keyid_shift);
+ physical_mask &= ~mktme_keyid_mask;
} else {
/*
* Reset __PHYSICAL_MASK.
@@ -591,6 +592,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
* between CPUs.
*/
physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
+ mktme_keyid_mask = 0;
+ mktme_keyid_shift = 0;
+ mktme_nr_keyids = 0;
}
#endif

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 4b101dd6e52f..4ebee899c363 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -53,3 +53,5 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
+
+obj-$(CONFIG_X86_INTEL_MKTME) += mktme.o
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
new file mode 100644
index 000000000000..467f1b26c737
--- /dev/null
+++ b/arch/x86/mm/mktme.c
@@ -0,0 +1,5 @@
+#include <asm/mktme.h>
+
+phys_addr_t mktme_keyid_mask;
+int mktme_nr_keyids;
+int mktme_keyid_shift;
--
2.18.0


2018-07-17 11:25:20

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv5 02/19] mm: Do not use zero page in encrypted pages

Zero page is not encrypted and putting it into encrypted VMA produces
garbage.

We can map zero page with KeyID-0 into an encrypted VMA, but this would
be violation security boundary between encryption domains.

Forbid zero pages in encrypted VMAs.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/s390/include/asm/pgtable.h | 2 +-
include/linux/mm.h | 4 ++--
mm/huge_memory.c | 3 +--
mm/memory.c | 3 +--
4 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 5ab636089c60..2e8658962aae 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -505,7 +505,7 @@ static inline int mm_alloc_pgste(struct mm_struct *mm)
* In the case that a guest uses storage keys
* faults should no longer be backed by zero pages
*/
-#define mm_forbids_zeropage mm_has_pgste
+#define vma_forbids_zeropage(vma) mm_has_pgste(vma->vm_mm)
static inline int mm_uses_skeys(struct mm_struct *mm)
{
#ifdef CONFIG_PGSTE
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c8780c5835ad..151d6e6b16e5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -92,8 +92,8 @@ extern int mmap_rnd_compat_bits __read_mostly;
* s390 does this to prevent multiplexing of hardware bits
* related to the physical page in case of virtualization.
*/
-#ifndef mm_forbids_zeropage
-#define mm_forbids_zeropage(X) (0)
+#ifndef vma_forbids_zeropage
+#define vma_forbids_zeropage(vma) vma_keyid(vma)
#endif

/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cd7c1a57a14..83f096c7299b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -676,8 +676,7 @@ int do_huge_pmd_anonymous_page(struct vm_fault *vmf)
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
return VM_FAULT_OOM;
- if (!(vmf->flags & FAULT_FLAG_WRITE) &&
- !mm_forbids_zeropage(vma->vm_mm) &&
+ if (!(vmf->flags & FAULT_FLAG_WRITE) && !vma_forbids_zeropage(vma) &&
transparent_hugepage_use_zero_page()) {
pgtable_t pgtable;
struct page *zero_page;
diff --git a/mm/memory.c b/mm/memory.c
index 02fbef2bd024..a705637d2ded 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3139,8 +3139,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
return 0;

/* Use the zero-page for reads */
- if (!(vmf->flags & FAULT_FLAG_WRITE) &&
- !mm_forbids_zeropage(vma->vm_mm)) {
+ if (!(vmf->flags & FAULT_FLAG_WRITE) && !vma_forbids_zeropage(vma)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
vma->vm_page_prot));
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
--
2.18.0


2018-07-18 17:39:07

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 02/19] mm: Do not use zero page in encrypted pages

On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> Zero page is not encrypted and putting it into encrypted VMA produces
> garbage.
>
> We can map zero page with KeyID-0 into an encrypted VMA, but this would
> be violation security boundary between encryption domains.

Why? How is it a violation?

It only matters if they write secrets. They can't write secrets to the
zero page.

Is this only because you accidentally inherited ->vm_page_prot on the
zero page PTE?

2018-07-18 17:40:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 03/19] mm/ksm: Do not merge pages with different KeyIDs

On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> Pages encrypted with different encryption keys are not allowed to be
> merged by KSM. Otherwise it would cross security boundary.

Let's say I'm using plain AES (not AES-XTS). I use the same key in two
keyid slots. I map a page with the first keyid and another with the
other keyid.

Won't they have the same cipertext? Why shouldn't we KSM them?

2018-07-18 17:44:02

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 04/19] mm/page_alloc: Unify alloc_hugepage_vma()

A grammar error or two is probably OK in these descriptions, but these
are just riddled with them in a way that makes them hard to read.
Suggestions below.

On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> We don't need to have separate implementations of alloc_hugepage_vma()
> for NUMA and non-NUMA. Using variant based on alloc_pages_vma() we would
> cover both cases.

"Using the"

> This is preparation patch for allocation encrypted pages.

"a preparation"

"allocation encrypted pages" -> "allocation of encrypted pages" or
"allocation encrypted pages" -> "allocating encrypted pages" or

> alloc_pages_vma() will handle allocation of encrypted pages. With this
> change we don' t need to cover alloc_hugepage_vma() separately.

"don' t" -> "don't"

> The change makes typo in Alpha's implementation of

"a typo"

2018-07-18 22:24:00

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCHv5 18/19] x86/mm: Handle encrypted memory in page_to_virt() and __pa()

On Tue, 17 Jul 2018, Kirill A. Shutemov wrote:

> Per-KeyID direct mappings require changes into how we find the right
> virtual address for a page and virt-to-phys address translations.
>
> page_to_virt() definition overwrites default macros provided by
> <linux/mm.h>. We only overwrite the macros if MTKME is enabled
> compile-time.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/include/asm/mktme.h | 3 +++
> arch/x86/include/asm/page_64.h | 2 +-
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
> index ba83fba4f9b3..dbfbd955da98 100644
> --- a/arch/x86/include/asm/mktme.h
> +++ b/arch/x86/include/asm/mktme.h
> @@ -29,6 +29,9 @@ void arch_free_page(struct page *page, int order);
>
> int sync_direct_mapping(void);
>
> +#define page_to_virt(x) \
> + (__va(PFN_PHYS(page_to_pfn(x))) + page_keyid(x) * direct_mapping_size)

This really does not belong into the mktme header.

Please make this the unconditional x86 page_to_virt() implementation in
asm/page.h, which is the canonical and obvious place for it.

The page_keyid() name is quite generic as well. Can this please have some
kind of reference to the underlying mechanism, i.e. mktme?

Please hide the multiplication with direct_mapping_size in the mktme header
as well. It's not interesting for the !MKTME case. Something like:

#define page_to_virt(x) \
(__va(PFN_PHYS(page_to_pfn(x))) + mktme_page_to_virt_offset(x))

makes it immediately clear where to look and also makes it clear that the
offset will be 0 for a !MKTME enabled kernel and (hopefully) for all !MKTME
enabled processors as well.

And then have a proper implementation of mktme_page_to_virt_offset() with a
proper comment on what on earth this is doing. It might be all obvious to you
now, but it's completely non-obvious for the casual reader and you will
have to twist your brain around it 6 months from now as well.
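
For illustration, a minimal sketch of what such a helper could look like
(mktme_page_to_virt_offset() is only the name suggested above, not an
existing function; direct_mapping_size and page_keyid() come from this
patchset):

#ifdef CONFIG_X86_INTEL_MKTME
/*
 * Each KeyID has its own alias of the direct mapping. KeyID-0 pages
 * live in the regular direct mapping (offset 0), KeyID-N pages live
 * N * direct_mapping_size above it.
 */
static inline unsigned long mktme_page_to_virt_offset(struct page *page)
{
	return page_keyid(page) * direct_mapping_size;
}
#else
static inline unsigned long mktme_page_to_virt_offset(struct page *page)
{
	return 0;
}
#endif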

> #else
> #define mktme_keyid_mask ((phys_addr_t)0)
> #define mktme_nr_keyids 0
> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> index f57fc3cc2246..a4f394e3471d 100644
> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -24,7 +24,7 @@ static inline unsigned long __phys_addr_nodebug(unsigned long x)
> /* use the carry flag to determine if x was < __START_KERNEL_map */
> x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));
>
> - return x;
> + return x & direct_mapping_mask;

This hunk also lacks any explanation both in the changelog and in form of a
comment.

> Per-KeyID direct mappings require changes into how we find the right
> virtual address for a page and virt-to-phys address translations.

That's pretty useless as it does just tell about 'changes', but not at all
about what kind of changes and why these changes are required. It's really
not helpful to assume that everyone stumbling over this will know the whole
story, especially not 6 months after this has been merged and then someone
ends up with a bisect on that change.

While at it, please get rid of the 'we'. We are neither CPUs nor code.

Thanks,

tglx

2018-07-18 23:05:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 05/19] mm/page_alloc: Handle allocation for encrypted memory

I asked about this before and it still isn't covered in the description:
You were specifically asked (maybe in person at LSF/MM?) not to modify
allocator to pass the keyid around. Please specifically mention how
this design addresses that feedback in the patch description.

You were told, "don't change the core allocator", so I think you just
added new functions that wrap the core allocator and called them from
the majority of sites that call into the core allocator. Personally, I
think that misses the point of the original request.

Do I have a better way? Nope, not really.

> +/*
> + * Encrypted page has to be cleared once keyid is set, not on allocation.
> + */
> +static inline bool encrypted_page_needs_zero(int keyid, gfp_t *gfp_mask)
> +{
> + if (!keyid)
> + return false;
> +
> + if (*gfp_mask & __GFP_ZERO) {
> + *gfp_mask &= ~__GFP_ZERO;
> + return true;
> + }
> +
> + return false;
> +}

Shouldn't this be zero_page_at_alloc()?

Otherwise, it gets confusing about whether the page needs zeroing at
*all*, vs at alloc vs. free.

> +static inline struct page *alloc_pages_node_keyid(int nid, int keyid,
> + gfp_t gfp_mask, unsigned int order)
> +{
> + if (nid == NUMA_NO_NODE)
> + nid = numa_mem_id();
> +
> + return __alloc_pages_node_keyid(nid, keyid, gfp_mask, order);
> +}

We have an innumerable number of (__)?alloc_pages* functions. This adds
two more. I'm not a big fan of making this worse.

Do I have a better idea? Not really. The best I have is to start being
more careful about all of the arguments and actually formalize the list
of things that we need to succeed in an allocation in a struct
alloc_args or something.
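
A rough sketch of the kind of thing hinted at here (the name and fields
are purely illustrative):

struct alloc_args {
	gfp_t gfp_mask;
	unsigned int order;
	int nid;
	nodemask_t *nodemask;
	int keyid;
};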

> #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
> #define alloc_page_vma(gfp_mask, vma, addr) \
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index f2b4abbca55e..fede9bfa89d9 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -38,9 +38,15 @@ static inline struct page *new_page_nodemask(struct page *page,
> unsigned int order = 0;
> struct page *new_page = NULL;
>
> - if (PageHuge(page))
> + if (PageHuge(page)) {
> + /*
> + * HugeTLB doesn't support encryption. We shouldn't see
> + * such pages.
> + */
> + WARN_ON(page_keyid(page));
> return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
> preferred_nid, nodemask);
> + }

Shouldn't we be returning NULL? Seems like failing the allocation is
much less likely to result in bad things happening.

> if (PageTransHuge(page)) {
> gfp_mask |= GFP_TRANSHUGE;
> @@ -50,8 +56,8 @@ static inline struct page *new_page_nodemask(struct page *page,
> if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
> gfp_mask |= __GFP_HIGHMEM;
>
> - new_page = __alloc_pages_nodemask(gfp_mask, order,
> - preferred_nid, nodemask);
> + new_page = __alloc_pages_nodemask_keyid(gfp_mask, order,
> + preferred_nid, nodemask, page_keyid(page));

Needs a comment please. It's totally non-obvious that this is the
migration case from the context, new_page_nodemask()'s name, or the name
of 'page'.

/* Allocate a page with the same KeyID as the source page */

> diff --git a/mm/compaction.c b/mm/compaction.c
> index faca45ebe62d..fd51aa32ad96 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1187,6 +1187,7 @@ static struct page *compaction_alloc(struct page *migratepage,
> list_del(&freepage->lru);
> cc->nr_freepages--;
>
> + prep_encrypted_page(freepage, 0, page_keyid(migratepage), false);
> return freepage;
> }

Comments, please.

Why is this here? What other code might need prep_encrypted_page()?

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 581b729e05a0..ce7b436444b5 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -921,22 +921,28 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
> /* page allocation callback for NUMA node migration */
> struct page *alloc_new_node_page(struct page *page, unsigned long node)
> {
> - if (PageHuge(page))
> + if (PageHuge(page)) {
> + /*
> + * HugeTLB doesn't support encryption. We shouldn't see
> + * such pages.
> + */
> + WARN_ON(page_keyid(page));
> return alloc_huge_page_node(page_hstate(compound_head(page)),
> node);
> - else if (PageTransHuge(page)) {
> + } else if (PageTransHuge(page)) {
> struct page *thp;
>
> - thp = alloc_pages_node(node,
> + thp = alloc_pages_node_keyid(node, page_keyid(page),
> (GFP_TRANSHUGE | __GFP_THISNODE),
> HPAGE_PMD_ORDER);
> if (!thp)
> return NULL;
> prep_transhuge_page(thp);
> return thp;
> - } else
> - return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
> - __GFP_THISNODE, 0);
> + } else {
> + return __alloc_pages_node_keyid(node, page_keyid(page),
> + GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
> + }
> }
>
> /*
> @@ -2013,9 +2019,16 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> {
> struct mempolicy *pol;
> struct page *page;
> - int preferred_nid;
> + bool zero = false;
> + int keyid, preferred_nid;
> nodemask_t *nmask;
>
> + keyid = vma_keyid(vma);
> + if (keyid && (gfp & __GFP_ZERO)) {
> + zero = true;
> + gfp &= ~__GFP_ZERO;
> + }

Comments, please. 'zero' should be 'deferred_zero', at least.

Also, can't we hide this a _bit_ better?

if (deferred_page_zero(vma))
gfp &= ~__GFP_ZERO;

Then, later:

deferred_page_prep(vma, page, order);

and hide everything in deferred_page_zero() and deferred_page_prep().
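
A minimal sketch of those two helpers, under the assumption that they wrap
vma_keyid() and prep_encrypted_page() from this series (names and exact
signatures are illustrative):

/* Clear __GFP_ZERO and report that zeroing is deferred for encrypted VMAs. */
static inline bool deferred_page_zero(struct vm_area_struct *vma, gfp_t *gfp)
{
	if (vma_keyid(vma) && (*gfp & __GFP_ZERO)) {
		*gfp &= ~__GFP_ZERO;
		return true;
	}
	return false;
}

/* Set the KeyID and do the zeroing that was deferred at allocation time. */
static inline void deferred_page_prep(struct vm_area_struct *vma,
		struct page *page, int order, bool deferred_zero)
{
	prep_encrypted_page(page, order, vma_keyid(vma), deferred_zero);
}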


> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3697,6 +3697,39 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
> }
> #endif /* CONFIG_COMPACTION */
>
> +#ifndef CONFIG_NUMA
> +struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
> + struct vm_area_struct *vma, unsigned long addr,
> + int node, bool hugepage)
> +{
> + struct page *page;
> + bool need_zero;
> + int keyid = vma_keyid(vma);
> +
> + need_zero = encrypted_page_needs_zero(keyid, &gfp_mask);
> + page = alloc_pages(gfp_mask, order);
> + prep_encrypted_page(page, order, keyid, need_zero);
> +
> + return page;
> +}
> +#endif

Is there *ever* a VMA-based allocation that doesn't need zeroing?

> +struct page * __alloc_pages_node_keyid(int nid, int keyid,
> + gfp_t gfp_mask, unsigned int order)
> +{
> + struct page *page;
> + bool need_zero;
> +
> + VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> + VM_WARN_ON(!node_online(nid));
> +
> + need_zero = encrypted_page_needs_zero(keyid, &gfp_mask);
> + page = __alloc_pages(gfp_mask, order, nid);
> + prep_encrypted_page(page, order, keyid, need_zero);
> +
> + return page;
> +}
> +
> #ifdef CONFIG_LOCKDEP
> static struct lockdep_map __fs_reclaim_map =
> STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
> @@ -4401,6 +4434,20 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> }
> EXPORT_SYMBOL(__alloc_pages_nodemask);
>
> +struct page *
> +__alloc_pages_nodemask_keyid(gfp_t gfp_mask, unsigned int order,
> + int preferred_nid, nodemask_t *nodemask, int keyid)
> +{
> + struct page *page;
> + bool need_zero;
> +
> + need_zero = encrypted_page_needs_zero(keyid, &gfp_mask);
> + page = __alloc_pages_nodemask(gfp_mask, order, preferred_nid, nodemask);
> + prep_encrypted_page(page, order, keyid, need_zero);
> + return page;
> +}
> +EXPORT_SYMBOL(__alloc_pages_nodemask_keyid);

That looks like three duplicates of the same code, wrapping three more
allocator variants. Do we really have no other alternatives? Can you
please go ask the folks that gave you the feedback about the allocator
modifications and ask them if this is OK explicitly?

2018-07-18 23:14:19

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 06/19] mm/khugepaged: Handle encrypted pages

On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> khugepaged allocates page in advance, before we found a VMA for
> collapse. We don't yet know which KeyID to use for the allocation.

That's not really true. We have the VMA and the address in the caller
(khugepaged_scan_pmd()), but we drop the lock and have to revalidate the
VMA.


> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5ae34097aed1..d116f4ebb622 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1056,6 +1056,16 @@ static void collapse_huge_page(struct mm_struct *mm,
> */
> anon_vma_unlock_write(vma->anon_vma);
>
> + /*
> + * At this point new_page is allocated as non-encrypted.
> + * If VMA's KeyID is non-zero, we need to prepare it to be encrypted
> + * before coping data.
> + */
> + if (vma_keyid(vma)) {
> + prep_encrypted_page(new_page, HPAGE_PMD_ORDER,
> + vma_keyid(vma), false);
> + }

I guess this isn't horribly problematic now, but if we ever keep pools
of preassigned-keyids, this won't work any more.

2018-07-18 23:15:06

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 07/19] x86/mm: Mask out KeyID bits from page table entry pfn

On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> + } else {
> + /*
> + * Reset __PHYSICAL_MASK.
> + * Maybe needed if there's inconsistent configuation
> + * between CPUs.
> + */
> + physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
> + }

This seems like an appropriate place for a WARN_ON(). Either that, or
axe this code.
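
A sketch of the WARN_ON() variant (the check is illustrative; it fires only
if a previous CPU had already claimed KeyID bits):

} else {
	/* Inconsistent MKTME configuration between CPUs? */
	WARN_ON_ONCE(physical_mask != (1ULL << __PHYSICAL_MASK_SHIFT) - 1);
	physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
}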

2018-07-18 23:20:02

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 08/19] x86/mm: Introduce variables to store number, shift and mask of KeyIDs

On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> mktme_nr_keyids holds number of KeyIDs available for MKTME, excluding
> KeyID zero which used by TME. MKTME KeyIDs start from 1.
>
> mktme_keyid_shift holds shift of KeyID within physical address.

I know what all these words mean, but the combination of them makes no
sense to me. I still don't know what the variable does after reading this.

Is this the lowest bit in the physical address which is used for the
KeyID? How many bits you must shift up a KeyID to get to the location
at which it can be masked into the physical address?

> mktme_keyid_mask holds mask to extract KeyID from physical address.

Good descriptions, wrong place. Please put these in the code.

Also, the grammar constantly needs some work. "holds mask" needs to be
"holds the mask".

> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/include/asm/mktme.h | 16 ++++++++++++++++
> arch/x86/kernel/cpu/intel.c | 12 ++++++++----
> arch/x86/mm/Makefile | 2 ++
> arch/x86/mm/mktme.c | 5 +++++
> 4 files changed, 31 insertions(+), 4 deletions(-)
> create mode 100644 arch/x86/include/asm/mktme.h
> create mode 100644 arch/x86/mm/mktme.c
>
> diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
> new file mode 100644
> index 000000000000..df31876ec48c
> --- /dev/null
> +++ b/arch/x86/include/asm/mktme.h
> @@ -0,0 +1,16 @@
> +#ifndef _ASM_X86_MKTME_H
> +#define _ASM_X86_MKTME_H
> +
> +#include <linux/types.h>
> +
> +#ifdef CONFIG_X86_INTEL_MKTME
> +extern phys_addr_t mktme_keyid_mask;
> +extern int mktme_nr_keyids;
> +extern int mktme_keyid_shift;
> +#else
> +#define mktme_keyid_mask ((phys_addr_t)0)
> +#define mktme_nr_keyids 0
> +#define mktme_keyid_shift 0
> +#endif
> +
> +#endif
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index bf2caf9d52dd..efc9e9fc47d4 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -573,6 +573,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
>
> #ifdef CONFIG_X86_INTEL_MKTME
> if (mktme_status == MKTME_ENABLED && nr_keyids) {
> + mktme_nr_keyids = nr_keyids;
> + mktme_keyid_shift = c->x86_phys_bits - keyid_bits;
> +
> /*
> * Mask out bits claimed from KeyID from physical address mask.
> *
> @@ -580,10 +583,8 @@ static void detect_tme(struct cpuinfo_x86 *c)
> * and number of bits claimed for KeyID is 6, bits 51:46 of
> * physical address is unusable.
> */
> - phys_addr_t keyid_mask;
> -
> - keyid_mask = GENMASK_ULL(c->x86_phys_bits - 1, c->x86_phys_bits - keyid_bits);
> - physical_mask &= ~keyid_mask;
> + mktme_keyid_mask = GENMASK_ULL(c->x86_phys_bits - 1, mktme_keyid_shift);
> + physical_mask &= ~mktme_keyid_mask;

Seems a bit silly that we introduce keyid_mask only to make it global a
few patches later.

> } else {
> /*
> * Reset __PHYSICAL_MASK.
> @@ -591,6 +592,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
> * between CPUs.
> */
> physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
> + mktme_keyid_mask = 0;
> + mktme_keyid_shift = 0;
> + mktme_nr_keyids = 0;
> }

Should be unnecessary. These are zeroed by the compiler.

> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index 4b101dd6e52f..4ebee899c363 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -53,3 +53,5 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
> obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
> obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
> obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
> +
> +obj-$(CONFIG_X86_INTEL_MKTME) += mktme.o
> diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
> new file mode 100644
> index 000000000000..467f1b26c737
> --- /dev/null
> +++ b/arch/x86/mm/mktme.c
> @@ -0,0 +1,5 @@
> +#include <asm/mktme.h>
> +
> +phys_addr_t mktme_keyid_mask;
> +int mktme_nr_keyids;
> +int mktme_keyid_shift;

Descriptions should be here, please, not in the changelog.


2018-07-18 23:31:37

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 09/19] x86/mm: Preserve KeyID on pte_modify() and pgprot_modify()

On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> An encrypted VMA will have KeyID stored in vma->vm_page_prot. This way
> we don't need to do anything special to setup encrypted page table
> entries

We don't do anything special for protection keys, either. They just
work too.

> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 99fff853c944..3731f7e08757 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -120,8 +120,21 @@
> * protection key is treated like _PAGE_RW, for
> * instance, and is *not* included in this mask since
> * pte_modify() does modify it.
> + *
> + * They include the physical address and the memory encryption keyID.
> + * The paddr and the keyID never occupy the same bits at the same time.
> + * But, a given bit might be used for the keyID on one system and used for
> + * the physical address on another. As an optimization, we manage them in
> + * one unit here since their combination always occupies the same hardware
> + * bits. PTE_PFN_MASK_MAX stores combined mask.
> + *
> + * Cast PAGE_MASK to a signed type so that it is sign-extended if
> + * virtual addresses are 32-bits but physical addresses are larger
> + * (ie, 32-bit PAE).
> */

Could you please make the comment block consistent? You're a lot wider
than the comment above.

> -#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
> +#define PTE_PFN_MASK_MAX \
> + (((signed long)PAGE_MASK) & ((1ULL << __PHYSICAL_MASK_SHIFT) - 1))
> +#define _PAGE_CHG_MASK (PTE_PFN_MASK_MAX | _PAGE_PCD | _PAGE_PWT | \
> _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
> _PAGE_SOFT_DIRTY)

Man, I'm not a fan of this. This saves us from consuming 6 VM_HIGH bits
(which we are not short on). But, at the cost of complexity.

Protection keys eat up PTE space and have an interface called
pkey_mprotect(). MKTME KeyIDs take up PTE space and will probably have
an interface called something_mprotect(). Yet, the implementations are
going to be _very_ different with pkeys being excluded from
_PAGE_CHG_MASK and KeyIDs being included.

I think you're saved here because we don't _actually_ do pte_modify() on
an existing PTE: we blow the old one away upon encrypted_mprotect() and
replace the PTE with a new one.

But, this is incompatible with any case where we want to change the
KeyID and keep the old PTE target. With AES-XTS, I guess this is a safe
assumption, but it's worrying.

Are there scenarios where we want to keep PTE contents, but change the
KeyID?

2018-07-18 23:39:00

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 10/19] x86/mm: Implement page_keyid() using page_ext

On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> Store KeyID in bits 31:16 of extended page flags. These bits are unused.

I'd love a two sentence remind of what page_ext is and why you chose to
use it. Yes, you need this. No, not everybody that you want to review
this patch set knows what it is or why you chose it.

> page_keyid() returns zero until page_ext is ready.

Is there any implication of this? Or does it not matter because we
don't run userspace until after page_ext initialization is done?

> page_ext initializer enables static branch to indicate that

"enables a static branch"

> page_keyid() can use page_ext. The same static branch will gate MKTME
> readiness in general.

Can you elaborate on this a bit? It would also be a nice place to hint
to the folks working hard on the APIs to ensure she checks this.

> We don't yet set KeyID for the page. It will come in the following
> patch that implements prep_encrypted_page(). All pages have KeyID-0 for
> now.

It also wouldn't hurt to mention why you don't use an X86_FEATURE_* for
this rather than an explicit static branch. I'm sure the x86
maintainers will be curious.

2018-07-18 23:41:22

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 11/19] x86/mm: Implement vma_keyid()

> --- a/arch/x86/mm/mktme.c
> +++ b/arch/x86/mm/mktme.c
> @@ -1,3 +1,4 @@
> +#include <linux/mm.h>
> #include <asm/mktme.h>
>
> phys_addr_t mktme_keyid_mask;
> @@ -37,3 +38,14 @@ struct page_ext_operations page_mktme_ops = {
> .need = need_page_mktme,
> .init = init_page_mktme,
> };
> +
> +int vma_keyid(struct vm_area_struct *vma)
> +{
> + pgprotval_t prot;
> +
> + if (!mktme_enabled())
> + return 0;
> +
> + prot = pgprot_val(vma->vm_page_prot);
> + return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
> +}

I'm a bit surprised this isn't inlined. Not that function calls are
expensive, but we *could* entirely avoid them using the normal pattern of:

// In the header:
static inline vma_keyid(...)
{
if (!mktme_enabled())
return 0;

return __vma_keyid(...); // <- the .c file version
}
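
Under that pattern, the out-of-line half would just do the extraction
(a sketch; __vma_keyid() is the hypothetical .c-file name used above):

/* mktme.c: only reached when MKTME is enabled. */
int __vma_keyid(struct vm_area_struct *vma)
{
	pgprotval_t prot = pgprot_val(vma->vm_page_prot);

	return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
}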

2018-07-18 23:55:10

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 12/19] x86/mm: Implement prep_encrypted_page() and arch_free_page()

The description doesn't mention the potential performance implications
of this patch. That's criminal at this point.

> --- a/arch/x86/mm/mktme.c
> +++ b/arch/x86/mm/mktme.c
> @@ -1,4 +1,5 @@
> #include <linux/mm.h>
> +#include <linux/highmem.h>
> #include <asm/mktme.h>
>
> phys_addr_t mktme_keyid_mask;
> @@ -49,3 +50,51 @@ int vma_keyid(struct vm_area_struct *vma)
> prot = pgprot_val(vma->vm_page_prot);
> return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
> }
> +
> +void prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
> +{
> + int i;
> +
> + /* It's not encrypted page: nothing to do */
> + if (!keyid)
> + return;

prep_encrypted_page() is called in the fast path in the page allocator.
This out-of-line copy costs a function call for all users and this is
also out of the reach of the compiler to understand that keyid!=0 is
unlikely.

I think this needs to be treated to the inline-in-the-header treatment.
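
One possible shape of that treatment, as a sketch (__prep_encrypted_page()
is a hypothetical out-of-line helper that would keep the cache flush and
the page_ext loop):

/* In the header: KeyID-0 pages take no function call at all. */
static inline void prep_encrypted_page(struct page *page, int order,
		int keyid, bool zero)
{
	if (keyid)
		__prep_encrypted_page(page, order, keyid, zero);
}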

> + /*
> + * The hardware/CPU does not enforce coherency between mappings of the
> + * same physical page with different KeyIDs or encryption keys.
> + * We are responsible for cache management.
> + *
> + * We flush cache before allocating encrypted page
> + */
> + clflush_cache_range(page_address(page), PAGE_SIZE << order);

It's also worth pointing out that this must be done on the keyid alias
direct map, not the normal one.

Wait a sec... How do we know which direct map to use?

> + for (i = 0; i < (1 << order); i++) {
> + /* All pages coming out of the allocator should have KeyID 0 */
> + WARN_ON_ONCE(lookup_page_ext(page)->keyid);
> + lookup_page_ext(page)->keyid = keyid;
> +
> + /* Clear the page after the KeyID is set. */
> + if (zero)
> + clear_highpage(page);
> +
> + page++;
> + }
> +}
> +
> +void arch_free_page(struct page *page, int order)
> +{
> + int i;
> +
> + /* It's not encrypted page: nothing to do */
> + if (!page_keyid(page))
> + return;

Ditto on pushing this to a header.

> + clflush_cache_range(page_address(page), PAGE_SIZE << order);

OK, how do we know which copy of the direct map to use, here?

> + for (i = 0; i < (1 << order); i++) {
> + /* Check if the page has reasonable KeyID */
> + WARN_ON_ONCE(lookup_page_ext(page)->keyid > mktme_nr_keyids);
> + lookup_page_ext(page)->keyid = 0;
> + page++;
> + }
> +}
>


2018-07-19 00:02:30

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 17/19] x86/mm: Implement sync_direct_mapping()

On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> arch/x86/include/asm/mktme.h | 8 +
> arch/x86/mm/init_64.c | 10 +
> arch/x86/mm/mktme.c | 437 +++++++++++++++++++++++++++++++++++
> 3 files changed, 455 insertions(+)

I'm not the maintainer. But, NAK from me on this on the diffstat alone.

There is simply too much technical debt here. There is no way this code
is not riddled with bugs and I would bet lots of beer on the fact that
this has received little to no testing with all the combinations that
matter, like memory hotplug. I'd love to be proven wrong, so I eagerly
await to be dazzled with the test results that have so far escaped
mention in the changelog.

Please make an effort to refactor this to reuse the code that we already
have to manage the direct mapping. We can't afford 455 new lines of
page table manipulation that nobody tests or runs.

How _was_ this tested?

2018-07-19 07:23:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 02/19] mm: Do not use zero page in encrypted pages

On Wed, Jul 18, 2018 at 10:36:24AM -0700, Dave Hansen wrote:
> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> > Zero page is not encrypted and putting it into encrypted VMA produces
> > garbage.
> >
> > We can map zero page with KeyID-0 into an encrypted VMA, but this would
> > be violation security boundary between encryption domains.
>
> Why? How is it a violation?
>
> It only matters if they write secrets. They can't write secrets to the
> zero page.

I believe usage of the zero page is wrong here. It would indirectly reveal
the content of a supposedly encrypted memory region.

I can see the argument why it should be okay and I don't have a very strong
opinion on this.

If folks see it's okay to use the zero page in encrypted VMAs, I can
certainly make it work.

> Is this only because you accidentally inherited ->vm_page_prot on the
> zero page PTE?

Yes, in the previous patchset I mapped the zero page with the wrong KeyID.
This is one of the possible fixes for this.

--
Kirill A. Shutemov

2018-07-19 07:33:35

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 03/19] mm/ksm: Do not merge pages with different KeyIDs

On Wed, Jul 18, 2018 at 10:38:27AM -0700, Dave Hansen wrote:
> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> > Pages encrypted with different encryption keys are not allowed to be
> > merged by KSM. Otherwise it would cross security boundary.
>
> Let's say I'm using plain AES (not AES-XTS). I use the same key in two
> keyid slots. I map a page with the first keyid and another with the
> other keyid.
>
> Won't they have the same cipertext? Why shouldn't we KSM them?

We compare plain text, not ciphertext. And for good reason.

Comparing ciphertext would only make KSM successful for AES-ECB, which
doesn't depend on the physical address of the page.

MKTME only supports AES-XTS (no plans to support AES-ECB). Comparing
ciphertext would effectively disable KSM.

--
Kirill A. Shutemov

2018-07-19 08:28:27

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 05/19] mm/page_alloc: Handle allocation for encrypted memory

On Wed, Jul 18, 2018 at 04:03:53PM -0700, Dave Hansen wrote:
> I asked about this before and it still isn't covered in the description:
> You were specifically asked (maybe in person at LSF/MM?) not to modify
> allocator to pass the keyid around. Please specifically mention how
> this design addresses that feedback in the patch description.
>
> You were told, "don't change the core allocator", so I think you just
> added new functions that wrap the core allocator and called them from
> the majority of sites that call into the core allocator. Personally, I
> think that misses the point of the original request.
>
> Do I have a better way? Nope, not really.

+Michal.

IIRC, Michal was not happy that I propagated the KeyID into the very core
allocator and we've talked about wrappers around existing APIs as a better
solution.

Michal, is it correct?

> > +/*
> > + * Encrypted page has to be cleared once keyid is set, not on allocation.
> > + */
> > +static inline bool encrypted_page_needs_zero(int keyid, gfp_t *gfp_mask)
> > +{
> > + if (!keyid)
> > + return false;
> > +
> > + if (*gfp_mask & __GFP_ZERO) {
> > + *gfp_mask &= ~__GFP_ZERO;
> > + return true;
> > + }
> > +
> > + return false;
> > +}
>
> Shouldn't this be zero_page_at_alloc()?
>
> Otherwise, it gets confusing about whether the page needs zeroing at
> *all*, vs at alloc vs. free.

I like your idea with deferred_page_zero() below. I'll go with it.

> > +static inline struct page *alloc_pages_node_keyid(int nid, int keyid,
> > + gfp_t gfp_mask, unsigned int order)
> > +{
> > + if (nid == NUMA_NO_NODE)
> > + nid = numa_mem_id();
> > +
> > + return __alloc_pages_node_keyid(nid, keyid, gfp_mask, order);
> > +}
>
> We have an innumerable number of (__)?alloc_pages* functions. This adds
> two more. I'm not a big fan of making this worse.
>
> Do I have a better idea? Not really. The best I have is to start being
> more careful about all of the arguments and actually formalize the list
> of things that we need to succeed in an allocation in a struct
> alloc_args or something.

Sounds like a separate project to me :)

> > #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
> > #define alloc_page_vma(gfp_mask, vma, addr) \
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index f2b4abbca55e..fede9bfa89d9 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -38,9 +38,15 @@ static inline struct page *new_page_nodemask(struct page *page,
> > unsigned int order = 0;
> > struct page *new_page = NULL;
> >
> > - if (PageHuge(page))
> > + if (PageHuge(page)) {
> > + /*
> > + * HugeTLB doesn't support encryption. We shouldn't see
> > + * such pages.
> > + */
> > + WARN_ON(page_keyid(page));
> > return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
> > preferred_nid, nodemask);
> > + }
>
> Shouldn't we be returning NULL? Seems like failing the allocation is
> much less likely to result in bad things happening.

Okay.

> > if (PageTransHuge(page)) {
> > gfp_mask |= GFP_TRANSHUGE;
> > @@ -50,8 +56,8 @@ static inline struct page *new_page_nodemask(struct page *page,
> > if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
> > gfp_mask |= __GFP_HIGHMEM;
> >
> > - new_page = __alloc_pages_nodemask(gfp_mask, order,
> > - preferred_nid, nodemask);
> > + new_page = __alloc_pages_nodemask_keyid(gfp_mask, order,
> > + preferred_nid, nodemask, page_keyid(page));
>
> Needs a comment please. It's totally non-obvious that this is the
> migration case from the context, new_page_nodemask()'s name, or the name
> of 'page'.
>
> /* Allocate a page with the same KeyID as the source page */

Sure.

>
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index faca45ebe62d..fd51aa32ad96 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -1187,6 +1187,7 @@ static struct page *compaction_alloc(struct page *migratepage,
> > list_del(&freepage->lru);
> > cc->nr_freepages--;
> >
> > + prep_encrypted_page(freepage, 0, page_keyid(migratepage), false);
> > return freepage;
> > }
>
> Comments, please.
>
> Why is this here?

/* Prepare the page using the same KeyID as the source page */

> What other code might need prep_encrypted_page()?

Custom page allocators, if their pages can end up in encrypted VMAs.

In this case, compaction creates its own pool of pages to be used for
allocation during page migration.
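
With the comment folded in, the hook would read roughly (sketch, not the
posted diff):

	list_del(&freepage->lru);
	cc->nr_freepages--;

	/*
	 * Compaction maintains its own pool of free pages to be used as
	 * migration targets. Prepare the target page with the same KeyID
	 * as the source page so the copy ends up in the right encryption
	 * domain.
	 */
	prep_encrypted_page(freepage, 0, page_keyid(migratepage), false);
	return freepage;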

> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 581b729e05a0..ce7b436444b5 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -921,22 +921,28 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
> > /* page allocation callback for NUMA node migration */
> > struct page *alloc_new_node_page(struct page *page, unsigned long node)
> > {
> > - if (PageHuge(page))
> > + if (PageHuge(page)) {
> > + /*
> > + * HugeTLB doesn't support encryption. We shouldn't see
> > + * such pages.
> > + */
> > + WARN_ON(page_keyid(page));
> > return alloc_huge_page_node(page_hstate(compound_head(page)),
> > node);
> > - else if (PageTransHuge(page)) {
> > + } else if (PageTransHuge(page)) {
> > struct page *thp;
> >
> > - thp = alloc_pages_node(node,
> > + thp = alloc_pages_node_keyid(node, page_keyid(page),
> > (GFP_TRANSHUGE | __GFP_THISNODE),
> > HPAGE_PMD_ORDER);
> > if (!thp)
> > return NULL;
> > prep_transhuge_page(thp);
> > return thp;
> > - } else
> > - return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
> > - __GFP_THISNODE, 0);
> > + } else {
> > + return __alloc_pages_node_keyid(node, page_keyid(page),
> > + GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
> > + }
> > }
> >
> > /*
> > @@ -2013,9 +2019,16 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> > {
> > struct mempolicy *pol;
> > struct page *page;
> > - int preferred_nid;
> > + bool zero = false;
> > + int keyid, preferred_nid;
> > nodemask_t *nmask;
> >
> > + keyid = vma_keyid(vma);
> > + if (keyid && (gfp & __GFP_ZERO)) {
> > + zero = true;
> > + gfp &= ~__GFP_ZERO;
> > + }
>
> Comments, please. 'zero' should be 'deferred_zero', at least.
>
> Also, can't we hide this a _bit_ better?
>
> if (deferred_page_zero(vma))
> gfp &= ~__GFP_ZERO;
>
> Then, later:
>
> deferred_page_prep(vma, page, order);
>
> and hide everything in deferred_page_zero() and deferred_page_prep().
>
>
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3697,6 +3697,39 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
> > }
> > #endif /* CONFIG_COMPACTION */
> >
> > +#ifndef CONFIG_NUMA
> > +struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
> > + struct vm_area_struct *vma, unsigned long addr,
> > + int node, bool hugepage)
> > +{
> > + struct page *page;
> > + bool need_zero;
> > + int keyid = vma_keyid(vma);
> > +
> > + need_zero = encrypted_page_needs_zero(keyid, &gfp_mask);
> > + page = alloc_pages(gfp_mask, order);
> > + prep_encrypted_page(page, order, keyid, need_zero);
> > +
> > + return page;
> > +}
> > +#endif
>
> Is there *ever* a VMA-based allocation that doesn't need zeroing?

Sure. Any allocations for CoW.

> > +struct page * __alloc_pages_node_keyid(int nid, int keyid,
> > + gfp_t gfp_mask, unsigned int order)
> > +{
> > + struct page *page;
> > + bool need_zero;
> > +
> > + VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> > + VM_WARN_ON(!node_online(nid));
> > +
> > + need_zero = encrypted_page_needs_zero(keyid, &gfp_mask);
> > + page = __alloc_pages(gfp_mask, order, nid);
> > + prep_encrypted_page(page, order, keyid, need_zero);
> > +
> > + return page;
> > +}
> > +
> > #ifdef CONFIG_LOCKDEP
> > static struct lockdep_map __fs_reclaim_map =
> > STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
> > @@ -4401,6 +4434,20 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> > }
> > EXPORT_SYMBOL(__alloc_pages_nodemask);
> >
> > +struct page *
> > +__alloc_pages_nodemask_keyid(gfp_t gfp_mask, unsigned int order,
> > + int preferred_nid, nodemask_t *nodemask, int keyid)
> > +{
> > + struct page *page;
> > + bool need_zero;
> > +
> > + need_zero = encrypted_page_needs_zero(keyid, &gfp_mask);
> > + page = __alloc_pages_nodemask(gfp_mask, order, preferred_nid, nodemask);
> > + prep_encrypted_page(page, order, keyid, need_zero);
> > + return page;
> > +}
> > +EXPORT_SYMBOL(__alloc_pages_nodemask_keyid);
>
> That looks like three duplicates of the same code, wrapping three more
> allocator variants. Do we really have no other alternatives? Can you
> please go ask the folks that gave you the feedback about the allocator
> modifications and ask them if this is OK explicitly?

Michal, any feedback for the patch?

--
Kirill A. Shutemov

2018-07-19 09:01:12

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 06/19] mm/khugepaged: Handle encrypted pages

On Wed, Jul 18, 2018 at 04:11:57PM -0700, Dave Hansen wrote:
> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> > khugepaged allocates page in advance, before we found a VMA for
> > collapse. We don't yet know which KeyID to use for the allocation.
>
> That's not really true. We have the VMA and the address in the caller
> (khugepaged_scan_pmd()), but we drop the lock and have to revalidate the
> VMA.

For !NUMA we allocate the page in khugepaged_do_scan(), well before we
know the VMA.

>
>
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 5ae34097aed1..d116f4ebb622 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1056,6 +1056,16 @@ static void collapse_huge_page(struct mm_struct *mm,
> > */
> > anon_vma_unlock_write(vma->anon_vma);
> >
> > + /*
> > + * At this point new_page is allocated as non-encrypted.
> > + * If VMA's KeyID is non-zero, we need to prepare it to be encrypted
> > + * before coping data.
> > + */
> > + if (vma_keyid(vma)) {
> > + prep_encrypted_page(new_page, HPAGE_PMD_ORDER,
> > + vma_keyid(vma), false);
> > + }
>
> I guess this isn't horribly problematic now, but if we ever keep pools
> of preassigned-keyids, this won't work any more.

I don't get this. What pools of preassigned-keyids are you talking about?

--
Kirill A. Shutemov

2018-07-19 09:55:43

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 07/19] x86/mm: Mask out KeyID bits from page table entry pfn

On Wed, Jul 18, 2018 at 04:13:20PM -0700, Dave Hansen wrote:
> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> > + } else {
> > + /*
> > + * Reset __PHYSICAL_MASK.
> > + * Maybe needed if there's inconsistent configuation
> > + * between CPUs.
> > + */
> > + physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
> > + }
>
> This seems like an appropriate place for a WARN_ON(). Either that, or
> axe this code.

There's a pr_err_once() above in the function.

--
Kirill A. Shutemov

2018-07-19 10:22:38

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 08/19] x86/mm: Introduce variables to store number, shift and mask of KeyIDs

On Wed, Jul 18, 2018 at 04:19:10PM -0700, Dave Hansen wrote:
> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> > mktme_nr_keyids holds number of KeyIDs available for MKTME, excluding
> > KeyID zero which used by TME. MKTME KeyIDs start from 1.
> >
> > mktme_keyid_shift holds shift of KeyID within physical address.
>
> I know what all these words mean, but the combination of them makes no
> sense to me. I still don't know what the variable does after reading this.
>
> Is this the lowest bit in the physical address which is used for the
> KeyID? How many bits you must shift up a KeyID to get to the location
> at which it can be masked into the physical address?

Right.

I'm not sure what is not clear from the description. It looks fine to me.

> > mktme_keyid_mask holds mask to extract KeyID from physical address.
>
> Good descriptions, wrong place. Please put these in the code.

Okay.

> Also, the grammar constantly needs some work. "holds mask" needs to be
> "holds the mask".

Right. Thanks

> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > arch/x86/include/asm/mktme.h | 16 ++++++++++++++++
> > arch/x86/kernel/cpu/intel.c | 12 ++++++++----
> > arch/x86/mm/Makefile | 2 ++
> > arch/x86/mm/mktme.c | 5 +++++
> > 4 files changed, 31 insertions(+), 4 deletions(-)
> > create mode 100644 arch/x86/include/asm/mktme.h
> > create mode 100644 arch/x86/mm/mktme.c
> >
> > diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
> > new file mode 100644
> > index 000000000000..df31876ec48c
> > --- /dev/null
> > +++ b/arch/x86/include/asm/mktme.h
> > @@ -0,0 +1,16 @@
> > +#ifndef _ASM_X86_MKTME_H
> > +#define _ASM_X86_MKTME_H
> > +
> > +#include <linux/types.h>
> > +
> > +#ifdef CONFIG_X86_INTEL_MKTME
> > +extern phys_addr_t mktme_keyid_mask;
> > +extern int mktme_nr_keyids;
> > +extern int mktme_keyid_shift;
> > +#else
> > +#define mktme_keyid_mask ((phys_addr_t)0)
> > +#define mktme_nr_keyids 0
> > +#define mktme_keyid_shift 0
> > +#endif
> > +
> > +#endif
> > diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> > index bf2caf9d52dd..efc9e9fc47d4 100644
> > --- a/arch/x86/kernel/cpu/intel.c
> > +++ b/arch/x86/kernel/cpu/intel.c
> > @@ -573,6 +573,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
> >
> > #ifdef CONFIG_X86_INTEL_MKTME
> > if (mktme_status == MKTME_ENABLED && nr_keyids) {
> > + mktme_nr_keyids = nr_keyids;
> > + mktme_keyid_shift = c->x86_phys_bits - keyid_bits;
> > +
> > /*
> > * Mask out bits claimed from KeyID from physical address mask.
> > *
> > @@ -580,10 +583,8 @@ static void detect_tme(struct cpuinfo_x86 *c)
> > * and number of bits claimed for KeyID is 6, bits 51:46 of
> > * physical address is unusable.
> > */
> > - phys_addr_t keyid_mask;
> > -
> > - keyid_mask = GENMASK_ULL(c->x86_phys_bits - 1, c->x86_phys_bits - keyid_bits);
> > - physical_mask &= ~keyid_mask;
> > + mktme_keyid_mask = GENMASK_ULL(c->x86_phys_bits - 1, mktme_keyid_shift);
> > + physical_mask &= ~mktme_keyid_mask;
>
> Seems a bit silly that we introduce keyid_mask only to make it global a
> few patches later.

Is it a big deal?

I found it easier to split changes into logical pieces this way.

> > } else {
> > /*
> > * Reset __PHYSICAL_MASK.
> > @@ -591,6 +592,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
> > * between CPUs.
> > */
> > physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
> > + mktme_keyid_mask = 0;
> > + mktme_keyid_shift = 0;
> > + mktme_nr_keyids = 0;
> > }
>
> Should be unnecessary. These are zeroed by the compiler.

No. detect_tme() is called for each CPU in the system.

> > diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> > index 4b101dd6e52f..4ebee899c363 100644
> > --- a/arch/x86/mm/Makefile
> > +++ b/arch/x86/mm/Makefile
> > @@ -53,3 +53,5 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
> > obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
> > obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
> > obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
> > +
> > +obj-$(CONFIG_X86_INTEL_MKTME) += mktme.o
> > diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
> > new file mode 100644
> > index 000000000000..467f1b26c737
> > --- /dev/null
> > +++ b/arch/x86/mm/mktme.c
> > @@ -0,0 +1,5 @@
> > +#include <asm/mktme.h>
> > +
> > +phys_addr_t mktme_keyid_mask;
> > +int mktme_nr_keyids;
> > +int mktme_keyid_shift;
>
> Descriptions should be here, please, not in the changelog.

Okay.
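
For example, something like this in arch/x86/mm/mktme.c (exact wording to
be adjusted):

	#include <asm/mktme.h>

	/* Mask to extract the KeyID bits from a physical address / PTE. */
	phys_addr_t mktme_keyid_mask;

	/*
	 * Number of KeyIDs available for MKTME, excluding KeyID-0 which is
	 * used by TME. MKTME KeyIDs start from 1.
	 */
	int mktme_nr_keyids;

	/* Position in the physical address (and PTE) of the lowest KeyID bit. */
	int mktme_keyid_shift;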

--
Kirill A. Shutemov

2018-07-19 12:38:46

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCHv5 08/19] x86/mm: Introduce variables to store number, shift and mask of KeyIDs

On Thu, 19 Jul 2018, Kirill A. Shutemov wrote:
> On Wed, Jul 18, 2018 at 04:19:10PM -0700, Dave Hansen wrote:
> > > } else {
> > > /*
> > > * Reset __PHYSICAL_MASK.
> > > @@ -591,6 +592,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
> > > * between CPUs.
> > > */
> > > physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
> > > + mktme_keyid_mask = 0;
> > > + mktme_keyid_shift = 0;
> > > + mktme_nr_keyids = 0;
> > > }
> >
> > Should be unnecessary. These are zeroed by the compiler.
>
> No. detect_tme() called for each CPU in the system.

And then the variables are cleared out while other CPUs can access them?
How is that supposed to work?

Thanks,

tglx



2018-07-19 13:14:22

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 08/19] x86/mm: Introduce variables to store number, shift and mask of KeyIDs

On Thu, Jul 19, 2018 at 02:37:35PM +0200, Thomas Gleixner wrote:
> On Thu, 19 Jul 2018, Kirill A. Shutemov wrote:
> > On Wed, Jul 18, 2018 at 04:19:10PM -0700, Dave Hansen wrote:
> > > > } else {
> > > > /*
> > > > * Reset __PHYSICAL_MASK.
> > > > @@ -591,6 +592,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
> > > > * between CPUs.
> > > > */
> > > > physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
> > > > + mktme_keyid_mask = 0;
> > > > + mktme_keyid_shift = 0;
> > > > + mktme_nr_keyids = 0;
> > > > }
> > >
> > > Should be unnecessary. These are zeroed by the compiler.
> >
> > No. detect_tme() called for each CPU in the system.
>
> And then the variables are cleared out while other CPUs can access them?
> How is that supposed to work?

This code path only matters in a pathological case: when the MKTME
configuration is inconsistent between CPUs. Basically, if the BIOS screwed
things up, we disable MKTME.

--
Kirill A. Shutemov

2018-07-19 13:19:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCHv5 08/19] x86/mm: Introduce variables to store number, shift and mask of KeyIDs

On Thu, 19 Jul 2018, Kirill A. Shutemov wrote:
> On Thu, Jul 19, 2018 at 02:37:35PM +0200, Thomas Gleixner wrote:
> > On Thu, 19 Jul 2018, Kirill A. Shutemov wrote:
> > > On Wed, Jul 18, 2018 at 04:19:10PM -0700, Dave Hansen wrote:
> > > > > } else {
> > > > > /*
> > > > > * Reset __PHYSICAL_MASK.
> > > > > @@ -591,6 +592,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
> > > > > * between CPUs.
> > > > > */
> > > > > physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
> > > > > + mktme_keyid_mask = 0;
> > > > > + mktme_keyid_shift = 0;
> > > > > + mktme_nr_keyids = 0;
> > > > > }
> > > >
> > > > Should be unnecessary. These are zeroed by the compiler.
> > >
> > > No. detect_tme() called for each CPU in the system.
> >
> > And then the variables are cleared out while other CPUs can access them?
> > How is that supposed to work?
>
> This code path only matter in patalogical case: when MKTME configuation is
> inconsitent between CPUs. Basically if BIOS screwed things up we disable
> MKTME.

I still don't see how that's supposed to work.

When the inconsistent CPU is brought up _AFTER_ MKTME is enabled, then how
does clearing the variables help? It does not magically make all the other
stuff go away.

Thanks,

tglx



2018-07-19 13:25:13

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 08/19] x86/mm: Introduce variables to store number, shift and mask of KeyIDs

On Thu, Jul 19, 2018 at 03:18:03PM +0200, Thomas Gleixner wrote:
> On Thu, 19 Jul 2018, Kirill A. Shutemov wrote:
> > On Thu, Jul 19, 2018 at 02:37:35PM +0200, Thomas Gleixner wrote:
> > > On Thu, 19 Jul 2018, Kirill A. Shutemov wrote:
> > > > On Wed, Jul 18, 2018 at 04:19:10PM -0700, Dave Hansen wrote:
> > > > > > } else {
> > > > > > /*
> > > > > > * Reset __PHYSICAL_MASK.
> > > > > > @@ -591,6 +592,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
> > > > > > * between CPUs.
> > > > > > */
> > > > > > physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
> > > > > > + mktme_keyid_mask = 0;
> > > > > > + mktme_keyid_shift = 0;
> > > > > > + mktme_nr_keyids = 0;
> > > > > > }
> > > > >
> > > > > Should be unnecessary. These are zeroed by the compiler.
> > > >
> > > > No. detect_tme() called for each CPU in the system.
> > >
> > > And then the variables are cleared out while other CPUs can access them?
> > > How is that supposed to work?
> >
> > This code path only matter in patalogical case: when MKTME configuation is
> > inconsitent between CPUs. Basically if BIOS screwed things up we disable
> > MKTME.
>
> I still don't see how that's supposed to work.
>
> When the inconsistent CPU is brought up _AFTER_ MKTME is enabled, then how
> does clearing the variables help? It does not magically make all the other
> stuff go away.

We don't actually enable MKTME in the kernel; the BIOS does. The kernel
chooses whether to use it or not. The current design is targeted at use by
userspace. So until init there isn't any other stuff to go away. We can
just pretend that MKTME was never there.

--
Kirill A. Shutemov

2018-07-19 13:43:07

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCHv5 08/19] x86/mm: Introduce variables to store number, shift and mask of KeyIDs

On Thu, 19 Jul 2018, Kirill A. Shutemov wrote:
> On Thu, Jul 19, 2018 at 03:18:03PM +0200, Thomas Gleixner wrote:
> > On Thu, 19 Jul 2018, Kirill A. Shutemov wrote:
> > > On Thu, Jul 19, 2018 at 02:37:35PM +0200, Thomas Gleixner wrote:
> > > > On Thu, 19 Jul 2018, Kirill A. Shutemov wrote:
> > > > > On Wed, Jul 18, 2018 at 04:19:10PM -0700, Dave Hansen wrote:
> > > > > > > } else {
> > > > > > > /*
> > > > > > > * Reset __PHYSICAL_MASK.
> > > > > > > @@ -591,6 +592,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
> > > > > > > * between CPUs.
> > > > > > > */
> > > > > > > physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
> > > > > > > + mktme_keyid_mask = 0;
> > > > > > > + mktme_keyid_shift = 0;
> > > > > > > + mktme_nr_keyids = 0;
> > > > > > > }
> > > > > >
> > > > > > Should be unnecessary. These are zeroed by the compiler.
> > > > >
> > > > > No. detect_tme() called for each CPU in the system.
> > > >
> > > > And then the variables are cleared out while other CPUs can access them?
> > > > How is that supposed to work?
> > >
> > > This code path only matter in patalogical case: when MKTME configuation is
> > > inconsitent between CPUs. Basically if BIOS screwed things up we disable
> > > MKTME.
> >
> > I still don't see how that's supposed to work.
> >
> > When the inconsistent CPU is brought up _AFTER_ MKTME is enabled, then how
> > does clearing the variables help? It does not magically make all the other
> > stuff go away.
>
> We don't actually enable MKTME in kernel. BIOS does. Kernel makes choose
> to use it or not. Current design targeted to be used by userspace.
> So until init we don't have any other stuff to go away. We can just
> pretend that MKTME was never there.

Hotplug is not guaranteed to happen _BEFORE_ init. Think about physical
hotplug.

Thanks,

tglx

2018-07-19 13:59:52

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 02/19] mm: Do not use zero page in encrypted pages

On 07/19/2018 12:16 AM, Kirill A. Shutemov wrote:
> On Wed, Jul 18, 2018 at 10:36:24AM -0700, Dave Hansen wrote:
>> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
>>> Zero page is not encrypted and putting it into encrypted VMA produces
>>> garbage.
>>>
>>> We can map zero page with KeyID-0 into an encrypted VMA, but this would
>>> be violation security boundary between encryption domains.
>> Why? How is it a violation?
>>
>> It only matters if they write secrets. They can't write secrets to the
>> zero page.
> I believe usage of zero page is wrong here. It would indirectly reveal
> content of supposedly encrypted memory region.
>
> I can see argument why it should be okay and I don't have very strong
> opinion on this.

I think we should make the zero page work. If folks are
security-sensitive, they need to write to guarantee it isn't being
shared. That's a pretty low bar.

I'm struggling to think of a case where an attacker has access to the
encrypted data, the virt->phys mapping, *and* can glean something
valuable from the presence of the zero page.

Please spend some time and focus on your patch descriptions. Use facts
that are backed up and are *precise* or tell the story of how your patch
was developed. In this case, citing the "security boundary" is not
precise enough without explaining what the boundary is and how it is
violated.

2018-07-19 14:03:58

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 03/19] mm/ksm: Do not merge pages with different KeyIDs

On 07/19/2018 12:32 AM, Kirill A. Shutemov wrote:
> On Wed, Jul 18, 2018 at 10:38:27AM -0700, Dave Hansen wrote:
>> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
>>> Pages encrypted with different encryption keys are not allowed to be
>>> merged by KSM. Otherwise it would cross security boundary.
>> Let's say I'm using plain AES (not AES-XTS). I use the same key in two
>> keyid slots. I map a page with the first keyid and another with the
>> other keyid.
>>
>> Won't they have the same cipertext? Why shouldn't we KSM them?
> We compare plain text, not ciphertext. And for good reason.

What's the reason? Probably good to talk about it for those playing
along at home.

> Comparing ciphertext would only make KSM successful for AES-ECB that
> doesn't dependent on physical address of the page.
>
> MKTME only supports AES-XTS (no plans to support AES-ECB). It effectively
> disables KSM if we go with comparing ciphertext.

But what's the security boundary that is violated? You are talking
about some practical concerns (KSM scanning inefficiency) which is a far
cry from being any kind of security issue.


2018-07-19 14:08:46

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 05/19] mm/page_alloc: Handle allocation for encrypted memory

On 07/19/2018 01:27 AM, Kirill A. Shutemov wrote:
>> What other code might need prep_encrypted_page()?
>
> Custom pages allocators if these pages can end up in encrypted VMAs.
>
> It this case compaction creates own pool of pages to be used for
> allocation during page migration.

OK, that makes sense. It also sounds like some great information to add
near prep_encrypted_page().

Do we have any ability to catch cases like this if we get them wrong, or
will we just silently corrupt data?

2018-07-19 14:14:49

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 06/19] mm/khugepaged: Handle encrypted pages

On 07/19/2018 01:59 AM, Kirill A. Shutemov wrote:
> On Wed, Jul 18, 2018 at 04:11:57PM -0700, Dave Hansen wrote:
>> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
>>> khugepaged allocates page in advance, before we found a VMA for
>>> collapse. We don't yet know which KeyID to use for the allocation.
>>
>> That's not really true. We have the VMA and the address in the caller
>> (khugepaged_scan_pmd()), but we drop the lock and have to revalidate the
>> VMA.
>
> For !NUMA we allocate the page in khugepaged_do_scan(), well before we
> know VMA.

Ahh, thanks for clarifying. That's some more very good information
about the design and progression of your patch that belongs in the
changelog.

>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 5ae34097aed1..d116f4ebb622 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1056,6 +1056,16 @@ static void collapse_huge_page(struct mm_struct *mm,
>>> */
>>> anon_vma_unlock_write(vma->anon_vma);
>>>
>>> + /*
>>> + * At this point new_page is allocated as non-encrypted.
>>> + * If VMA's KeyID is non-zero, we need to prepare it to be encrypted
>>> + * before coping data.
>>> + */
>>> + if (vma_keyid(vma)) {
>>> + prep_encrypted_page(new_page, HPAGE_PMD_ORDER,
>>> + vma_keyid(vma), false);
>>> + }
>>
>> I guess this isn't horribly problematic now, but if we ever keep pools
>> of preassigned-keyids, this won't work any more.
>
> I don't get this. What pools of preassigned-keyids are you talking about?

My point was that if we ever teach the allocator or something _near_ the
allocator to keep pools of pre-zeroed and/or pre-cache-cleared pages,
this approach will need to get changed otherwise we will double-prep pages.

My overall concern with prep_encrypted_page() in this patch set is that
it's inserted pretty ad-hoc. It seems easy to miss spots where it
should be. I'm also unsure of the failure mode and anything we've done
to ensure that if we get this wrong, we scream clearly and loudly about
what happened. Do we do something like that?

2018-07-19 14:20:30

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 07/19] x86/mm: Mask out KeyID bits from page table entry pfn

On 07/19/2018 02:54 AM, Kirill A. Shutemov wrote:
> On Wed, Jul 18, 2018 at 04:13:20PM -0700, Dave Hansen wrote:
>> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
>>> + } else {
>>> + /*
>>> + * Reset __PHYSICAL_MASK.
>>> + * Maybe needed if there's inconsistent configuation
>>> + * between CPUs.
>>> + */
>>> + physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
>>> + }
>> This seems like an appropriate place for a WARN_ON(). Either that, or
>> axe this code.
> There's pr_err_once() above in the function.

Do you mean for the (tme_activate != tme_activate_cpu0) check?

But that's about double-activating this feature. This check is about an
inconsistent configuration between two CPUs which seems totally different.

Could you explain?

2018-07-19 14:25:02

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 08/19] x86/mm: Introduce variables to store number, shift and mask of KeyIDs

On 07/19/2018 03:21 AM, Kirill A. Shutemov wrote:
> On Wed, Jul 18, 2018 at 04:19:10PM -0700, Dave Hansen wrote:
>> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
>>> mktme_nr_keyids holds number of KeyIDs available for MKTME, excluding
>>> KeyID zero which used by TME. MKTME KeyIDs start from 1.
>>>
>>> mktme_keyid_shift holds shift of KeyID within physical address.
>> I know what all these words mean, but the combination of them makes no
>> sense to me. I still don't know what the variable does after reading this.
>>
>> Is this the lowest bit in the physical address which is used for the
>> KeyID? How many bits you must shift up a KeyID to get to the location
>> at which it can be masked into the physical address?
> Right.
>
> I'm not sure what is not clear from the description. It look fine to me.

Well, OK, I guess I can write a better one for you.

"Position in the PTE of the lowest bit of the KeyID"

It's also a name that could use some love (now that I know what it
does). mktme_keyid_pte_shift would be much better. Or
mktme_keyid_low_pte_bit.

2018-07-20 12:17:11

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 02/19] mm: Do not use zero page in encrypted pages

On Thu, Jul 19, 2018 at 06:58:14AM -0700, Dave Hansen wrote:
> On 07/19/2018 12:16 AM, Kirill A. Shutemov wrote:
> > On Wed, Jul 18, 2018 at 10:36:24AM -0700, Dave Hansen wrote:
> >> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> >>> Zero page is not encrypted and putting it into encrypted VMA produces
> >>> garbage.
> >>>
> >>> We can map zero page with KeyID-0 into an encrypted VMA, but this would
> >>> be violation security boundary between encryption domains.
> >> Why? How is it a violation?
> >>
> >> It only matters if they write secrets. They can't write secrets to the
> >> zero page.
> > I believe usage of zero page is wrong here. It would indirectly reveal
> > content of supposedly encrypted memory region.
> >
> > I can see argument why it should be okay and I don't have very strong
> > opinion on this.
>
> I think we should make the zero page work. If folks are
> security-sensitive, they need to write to guarantee it isn't being
> shared. That's a pretty low bar.
>
> I'm struggling to think of a case where an attacker has access to the
> encrypted data, the virt->phys mapping, *and* can glean something
> valuable from the presence of the zero page.

Okay.

> Please spend some time and focus on your patch descriptions. Use facts
> that are backed up and are *precise* or tell the story of how your patch
> was developed. In this case, citing the "security boundary" is not
> precise enough without explaining what the boundary is and how it is
> violated.

Fair enough. I'll go through all commit messages once again. Sorry.

--
Kirill A. Shutemov

2018-07-20 12:24:08

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 03/19] mm/ksm: Do not merge pages with different KeyIDs

On Thu, Jul 19, 2018 at 07:02:34AM -0700, Dave Hansen wrote:
> On 07/19/2018 12:32 AM, Kirill A. Shutemov wrote:
> > On Wed, Jul 18, 2018 at 10:38:27AM -0700, Dave Hansen wrote:
> >> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> >>> Pages encrypted with different encryption keys are not allowed to be
> >>> merged by KSM. Otherwise it would cross security boundary.
> >> Let's say I'm using plain AES (not AES-XTS). I use the same key in two
> >> keyid slots. I map a page with the first keyid and another with the
> >> other keyid.
> >>
> >> Won't they have the same cipertext? Why shouldn't we KSM them?
> > We compare plain text, not ciphertext. And for good reason.
>
> What's the reason? Probably good to talk about it for those playing
> along at home.

I'll update the commit message.

> > Comparing ciphertext would only make KSM successful for AES-ECB that
> > doesn't dependent on physical address of the page.
> >
> > MKTME only supports AES-XTS (no plans to support AES-ECB). It effectively
> > disables KSM if we go with comparing ciphertext.
>
> But what's the security boundary that is violated? You are talking
> about some practical concerns (KSM scanning inefficiency) which is a far
> cry from being any kind of security issue.

As with the zero page, my initial reasoning was that mixing pages from
different security domains is a bad idea.

--
Kirill A. Shutemov

2018-07-20 12:31:09

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 06/19] mm/khugepaged: Handle encrypted pages

On Thu, Jul 19, 2018 at 07:13:39AM -0700, Dave Hansen wrote:
> On 07/19/2018 01:59 AM, Kirill A. Shutemov wrote:
> > On Wed, Jul 18, 2018 at 04:11:57PM -0700, Dave Hansen wrote:
> >> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> >>> khugepaged allocates page in advance, before we found a VMA for
> >>> collapse. We don't yet know which KeyID to use for the allocation.
> >>
> >> That's not really true. We have the VMA and the address in the caller
> >> (khugepaged_scan_pmd()), but we drop the lock and have to revalidate the
> >> VMA.
> >
> > For !NUMA we allocate the page in khugepaged_do_scan(), well before we
> > know VMA.
>
> Ahh, thanks for clarifying. That's some more very good information
> about the design and progression of your patch that belongs in the
> changelog.

Okay.

> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index 5ae34097aed1..d116f4ebb622 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -1056,6 +1056,16 @@ static void collapse_huge_page(struct mm_struct *mm,
> >>> */
> >>> anon_vma_unlock_write(vma->anon_vma);
> >>>
> >>> + /*
> >>> + * At this point new_page is allocated as non-encrypted.
> >>> + * If VMA's KeyID is non-zero, we need to prepare it to be encrypted
> >>> + * before coping data.
> >>> + */
> >>> + if (vma_keyid(vma)) {
> >>> + prep_encrypted_page(new_page, HPAGE_PMD_ORDER,
> >>> + vma_keyid(vma), false);
> >>> + }
> >>
> >> I guess this isn't horribly problematic now, but if we ever keep pools
> >> of preassigned-keyids, this won't work any more.
> >
> > I don't get this. What pools of preassigned-keyids are you talking about?
>
> My point was that if we ever teach the allocator or something _near_ the
> allocator to keep pools of pre-zeroed and/or pre-cache-cleared pages,
> this approach will need to get changed otherwise we will double-prep pages.

It shouldn't be a problem here. It's a pretty slow path: we often wait
for memory to be compacted before the page for khugepaged gets allocated.
Double-prep shouldn't have a visible impact.

> My overall concern with prep_encrypted_page() in this patch set is that
> it's inserted pretty ad-hoc. It seems easy to miss spots where it
> should be. I'm also unsure of the failure mode and anything we've done
> to ensure that if we get this wrong, we scream clearly and loudly about
> what happened. Do we do something like that?

I have a debugging patch that puts BUG_ONs around set_pte_at() to check
whether the page's KeyID matches the VMA's KeyID. But that's not very
systematic. We would need something better than this.
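
For reference, the debug check is essentially this, dropped in front of the
relevant set_pte_at() call sites (not meant for merging as-is):

	/* The KeyID of the page must match the VMA it is mapped into. */
	BUG_ON(page_keyid(pfn_to_page(pte_pfn(entry))) != vma_keyid(vma));
	set_pte_at(mm, addr, ptep, entry);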

--
Kirill A. Shutemov

2018-07-20 12:33:14

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 07/19] x86/mm: Mask out KeyID bits from page table entry pfn

On Thu, Jul 19, 2018 at 07:19:01AM -0700, Dave Hansen wrote:
> On 07/19/2018 02:54 AM, Kirill A. Shutemov wrote:
> > On Wed, Jul 18, 2018 at 04:13:20PM -0700, Dave Hansen wrote:
> >> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> >>> + } else {
> >>> + /*
> >>> + * Reset __PHYSICAL_MASK.
> >>> + * Maybe needed if there's inconsistent configuation
> >>> + * between CPUs.
> >>> + */
> >>> + physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
> >>> + }
> >> This seems like an appropriate place for a WARN_ON(). Either that, or
> >> axe this code.
> > There's pr_err_once() above in the function.
>
> Do you mean for the (tme_activate != tme_activate_cpu0) check?
>
> But that's about double-activating this feature. This check is about an
> inconsistent configuration between two CPUs which seems totally different.
>
> Could you explain?

The (tme_activate != tme_activate_cpu0) check is about inconsistent
configuration. It checks whether the MSR's content on the given CPU
matches the MSR on CPU0.

--
Kirill A. Shutemov

2018-07-20 12:36:33

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 08/19] x86/mm: Introduce variables to store number, shift and mask of KeyIDs

On Thu, Jul 19, 2018 at 07:23:27AM -0700, Dave Hansen wrote:
> On 07/19/2018 03:21 AM, Kirill A. Shutemov wrote:
> > On Wed, Jul 18, 2018 at 04:19:10PM -0700, Dave Hansen wrote:
> >> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> >>> mktme_nr_keyids holds number of KeyIDs available for MKTME, excluding
> >>> KeyID zero which used by TME. MKTME KeyIDs start from 1.
> >>>
> >>> mktme_keyid_shift holds shift of KeyID within physical address.
> >> I know what all these words mean, but the combination of them makes no
> >> sense to me. I still don't know what the variable does after reading this.
> >>
> >> Is this the lowest bit in the physical address which is used for the
> >> KeyID? How many bits you must shift up a KeyID to get to the location
> >> at which it can be masked into the physical address?
> > Right.
> >
> > I'm not sure what is not clear from the description. It look fine to me.
>
> Well, OK, I guess I can write a better one for you.
>
> "Position in the PTE of the lowest bit of the KeyID"
>
> It's also a name that could use some love (now that I know what it
> does). mktme_keyid_pte_shift would be much better. Or
> mktme_keyid_low_pte_bit.

Okay, thanks.

--
Kirill A. Shutemov

2018-07-20 12:51:09

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 09/19] x86/mm: Preserve KeyID on pte_modify() and pgprot_modify()

On Wed, Jul 18, 2018 at 04:30:35PM -0700, Dave Hansen wrote:
> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> > An encrypted VMA will have KeyID stored in vma->vm_page_prot. This way
> > we don't need to do anything special to setup encrypted page table
> > entries
>
> We don't do anything special for protection keys, either. They just
> work too.
>
> > diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> > index 99fff853c944..3731f7e08757 100644
> > --- a/arch/x86/include/asm/pgtable_types.h
> > +++ b/arch/x86/include/asm/pgtable_types.h
> > @@ -120,8 +120,21 @@
> > * protection key is treated like _PAGE_RW, for
> > * instance, and is *not* included in this mask since
> > * pte_modify() does modify it.
> > + *
> > + * They include the physical address and the memory encryption keyID.
> > + * The paddr and the keyID never occupy the same bits at the same time.
> > + * But, a given bit might be used for the keyID on one system and used for
> > + * the physical address on another. As an optimization, we manage them in
> > + * one unit here since their combination always occupies the same hardware
> > + * bits. PTE_PFN_MASK_MAX stores combined mask.
> > + *
> > + * Cast PAGE_MASK to a signed type so that it is sign-extended if
> > + * virtual addresses are 32-bits but physical addresses are larger
> > + * (ie, 32-bit PAE).
> > */
>
> Could you please make the comment block consistent? You're a lot wider
> than the comment above.

Okay.

> > -#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
> > +#define PTE_PFN_MASK_MAX \
> > + (((signed long)PAGE_MASK) & ((1ULL << __PHYSICAL_MASK_SHIFT) - 1))
> > +#define _PAGE_CHG_MASK (PTE_PFN_MASK_MAX | _PAGE_PCD | _PAGE_PWT | \
> > _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
> > _PAGE_SOFT_DIRTY)
>
> Man, I'm not a fan of this. This saves us from consuming 6 VM_HIGH bits
> (which we are not short on). But, at the cost of complexity.

15, not 6. We have up to 15 KeyID bits architecturally.

We can just have a separate field in vm_area_struct if we must.
But vm_page_prot works fine so far. I don't see a big reason to change
it.

> Protection keys eat up PTE space and have an interface called
> pkey_mprotect(). MKTME KeyIDs take up PTE space and will probably have
> an interface called something_mprotect(). Yet, the implementations are
> going to be _very_ different with pkeys being excluded from
> _PAGE_CHG_MASK and KeyIDs being included.
>
> I think you're saved here because we don't _actually_ do pte_modify() on
> an existing PTE: we blow the old one away upon encrypted_mprotect() and
> replace the PTE with a new one.
>
> But, this is incompatible with any case where we want to change the
> KeyID and keep the old PTE target. With AES-XTS, I guess this is a safe
> assumption, but it's worrying.
>
> Are there scenarios where we want to keep PTE contents, but change the
> KeyID?

I don't see such a scenario.

If for some reason we needed to map the same memory with a different
KeyID, it could be done from scratch, without modifying the existing
mapping.

--
Kirill A. Shutemov

2018-07-20 12:56:41

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 08/19] x86/mm: Introduce variables to store number, shift and mask of KeyIDs

On Thu, Jul 19, 2018 at 03:40:41PM +0200, Thomas Gleixner wrote:
> On Thu, 19 Jul 2018, Kirill A. Shutemov wrote:
> > On Thu, Jul 19, 2018 at 03:18:03PM +0200, Thomas Gleixner wrote:
> > > On Thu, 19 Jul 2018, Kirill A. Shutemov wrote:
> > > > On Thu, Jul 19, 2018 at 02:37:35PM +0200, Thomas Gleixner wrote:
> > > > > On Thu, 19 Jul 2018, Kirill A. Shutemov wrote:
> > > > > > On Wed, Jul 18, 2018 at 04:19:10PM -0700, Dave Hansen wrote:
> > > > > > > > } else {
> > > > > > > > /*
> > > > > > > > * Reset __PHYSICAL_MASK.
> > > > > > > > @@ -591,6 +592,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
> > > > > > > > * between CPUs.
> > > > > > > > */
> > > > > > > > physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
> > > > > > > > + mktme_keyid_mask = 0;
> > > > > > > > + mktme_keyid_shift = 0;
> > > > > > > > + mktme_nr_keyids = 0;
> > > > > > > > }
> > > > > > >
> > > > > > > Should be unnecessary. These are zeroed by the compiler.
> > > > > >
> > > > > > No. detect_tme() called for each CPU in the system.
> > > > >
> > > > > And then the variables are cleared out while other CPUs can access them?
> > > > > How is that supposed to work?
> > > >
> > > > This code path only matter in patalogical case: when MKTME configuation is
> > > > inconsitent between CPUs. Basically if BIOS screwed things up we disable
> > > > MKTME.
> > >
> > > I still don't see how that's supposed to work.
> > >
> > > When the inconsistent CPU is brought up _AFTER_ MKTME is enabled, then how
> > > does clearing the variables help? It does not magically make all the other
> > > stuff go away.
> >
> > We don't actually enable MKTME in kernel. BIOS does. Kernel makes choose
> > to use it or not. Current design targeted to be used by userspace.
> > So until init we don't have any other stuff to go away. We can just
> > pretend that MKTME was never there.
>
> Hotplug is not guaranteed to happen _BEFORE_ init. Think about physical
> hotplug.

Ouch. I didn't think about this. :/

In this case I don't see how to handle the situation properly.
Is it okay to WARN() && pray()?

--
Kirill A. Shutemov

2018-07-20 13:02:24

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 05/19] mm/page_alloc: Handle allocation for encrypted memory

On Thu, Jul 19, 2018 at 07:05:36AM -0700, Dave Hansen wrote:
> On 07/19/2018 01:27 AM, Kirill A. Shutemov wrote:
> >> What other code might need prep_encrypted_page()?
> >
> > Custom pages allocators if these pages can end up in encrypted VMAs.
> >
> > It this case compaction creates own pool of pages to be used for
> > allocation during page migration.
>
> OK, that makes sense. It also sounds like some great information to add
> near prep_encrypted_page().

Okay.

> Do we have any ability to catch cases like this if we get them wrong, or
> will we just silently corrupt data?

I cannot come up with any reasonable way to detect this immediately.
I'll think about this more.

--
Kirill A. Shutemov

2018-07-20 13:20:23

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCHv5 08/19] x86/mm: Introduce variables to store number, shift and mask of KeyIDs

On Fri, 20 Jul 2018, Kirill A. Shutemov wrote:
> On Thu, Jul 19, 2018 at 03:40:41PM +0200, Thomas Gleixner wrote:
> > > > I still don't see how that's supposed to work.
> > > >
> > > > When the inconsistent CPU is brought up _AFTER_ MKTME is enabled, then how
> > > > does clearing the variables help? It does not magically make all the other
> > > > stuff go away.
> > >
> > > We don't actually enable MKTME in kernel. BIOS does. Kernel makes choose
> > > to use it or not. Current design targeted to be used by userspace.
> > > So until init we don't have any other stuff to go away. We can just
> > > pretend that MKTME was never there.
> >
> > Hotplug is not guaranteed to happen _BEFORE_ init. Think about physical
> > hotplug.
>
> Ouch. I didn't think about this. :/
>
> In this case I don't see how to handle the situation properly.
> Is it okay to WARN() && pray()?

Not really. First of all, you want to do the initial checking on the boot
CPU and then when secondary CPUs are brought up, verify that they have
matching parameters. If they do not, then we should just shut them down
right away before they can touch anything which is TME related and mark
them as 'don't online again'. That needs some extra logic in the hotplug
code, but I already have played with that for different reasons. Stick a
fat comment into that 'not matching' code path for now and I'll give you
the magic for preventing full bringup after polishing it a bit.

Thanks,

tglx

2018-07-20 13:41:41

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 08/19] x86/mm: Introduce variables to store number, shift and mask of KeyIDs

On Fri, Jul 20, 2018 at 03:17:54PM +0200, Thomas Gleixner wrote:
> On Fri, 20 Jul 2018, Kirill A. Shutemov wrote:
> > On Thu, Jul 19, 2018 at 03:40:41PM +0200, Thomas Gleixner wrote:
> > > > > I still don't see how that's supposed to work.
> > > > >
> > > > > When the inconsistent CPU is brought up _AFTER_ MKTME is enabled, then how
> > > > > does clearing the variables help? It does not magically make all the other
> > > > > stuff go away.
> > > >
> > > > We don't actually enable MKTME in kernel. BIOS does. Kernel makes choose
> > > > to use it or not. Current design targeted to be used by userspace.
> > > > So until init we don't have any other stuff to go away. We can just
> > > > pretend that MKTME was never there.
> > >
> > > Hotplug is not guaranteed to happen _BEFORE_ init. Think about physical
> > > hotplug.
> >
> > Ouch. I didn't think about this. :/
> >
> > In this case I don't see how to handle the situation properly.
> > Is it okay to WARN() && pray()?
>
> Not really. First of all, you want to do the initial checking on the boot
> CPU and then when secondary CPUs are brought up, verify that they have
> matching parameters. If they do not, then we should just shut them down
> right away before they can touch anything which is TME related and mark
> them as 'don't online again'. That needs some extra logic in the hotplug
> code, but I already have played with that for different reasons. Stick a
> fat comment into that 'not matching' code path for now and I'll give you
> the magic for preventing full bringup after polishing it a bit.

Got it. Thanks!

--
Kirill A. Shutemov

2018-07-23 09:46:34

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 10/19] x86/mm: Implement page_keyid() using page_ext

On Wed, Jul 18, 2018 at 04:38:02PM -0700, Dave Hansen wrote:
> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> > Store KeyID in bits 31:16 of extended page flags. These bits are unused.
>
> I'd love a two sentence remind of what page_ext is and why you chose to
> use it. Yes, you need this. No, not everybody that you want to review
> this patch set knows what it is or why you chose it.

Okay.

> > page_keyid() returns zero until page_ext is ready.
>
> Is there any implication of this? Or does it not matter because we
> don't run userspace until after page_ext initialization is done?

It matters in the sense that we shouldn't reference page_ext before it's
initialized; otherwise we will get garbage and crash.

> > page_ext initializer enables static branch to indicate that
>
> "enables a static branch"
>
> > page_keyid() can use page_ext. The same static branch will gate MKTME
> > readiness in general.
>
> Can you elaborate on this a bit? It would also be a nice place to hint
> to the folks working hard on the APIs to ensure she checks this.

Okay.

> > We don't yet set KeyID for the page. It will come in the following
> > patch that implements prep_encrypted_page(). All pages have KeyID-0 for
> > now.
>
> It also wouldn't hurt to mention why you don't use an X86_FEATURE_* for
> this rather than an explicit static branch. I'm sure the x86
> maintainers will be curious.

Sure.

--
Kirill A. Shutemov

2018-07-23 09:49:14

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 11/19] x86/mm: Implement vma_keyid()

On Wed, Jul 18, 2018 at 04:40:14PM -0700, Dave Hansen wrote:
> > --- a/arch/x86/mm/mktme.c
> > +++ b/arch/x86/mm/mktme.c
> > @@ -1,3 +1,4 @@
> > +#include <linux/mm.h>
> > #include <asm/mktme.h>
> >
> > phys_addr_t mktme_keyid_mask;
> > @@ -37,3 +38,14 @@ struct page_ext_operations page_mktme_ops = {
> > .need = need_page_mktme,
> > .init = init_page_mktme,
> > };
> > +
> > +int vma_keyid(struct vm_area_struct *vma)
> > +{
> > + pgprotval_t prot;
> > +
> > + if (!mktme_enabled())
> > + return 0;
> > +
> > + prot = pgprot_val(vma->vm_page_prot);
> > + return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
> > +}
>
> I'm a bit surprised this isn't inlined. Not that function calls are
> expensive, but we *could* entirely avoid them using the normal pattern of:
>
> // In the header:
> static inline vma_keyid(...)
> {
> if (!mktme_enabled())
> return 0;
>
> return __vma_keyid(...); // <- the .c file version
> }

Okay, I'll do this. But it would be a macro: <asm/mktme.h> gets included
very early, so we cannot really use jump label code there directly.
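
Roughly (sketch; __vma_keyid() being the out-of-line body that currently
lives in mktme.c):

	/*
	 * A macro rather than a static inline: <asm/mktme.h> is pulled in
	 * too early to use the jump label machinery directly, so
	 * mktme_enabled() only gets evaluated at the call sites.
	 */
	#define vma_keyid(vma)						\
		(mktme_enabled() ? __vma_keyid(vma) : 0)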

--
Kirill A. Shutemov

2018-07-23 09:52:32

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 12/19] x86/mm: Implement prep_encrypted_page() and arch_free_page()

On Wed, Jul 18, 2018 at 04:53:27PM -0700, Dave Hansen wrote:
> The description doesn't mention the potential performance implications
> of this patch. That's criminal at this point.
>
> > --- a/arch/x86/mm/mktme.c
> > +++ b/arch/x86/mm/mktme.c
> > @@ -1,4 +1,5 @@
> > #include <linux/mm.h>
> > +#include <linux/highmem.h>
> > #include <asm/mktme.h>
> >
> > phys_addr_t mktme_keyid_mask;
> > @@ -49,3 +50,51 @@ int vma_keyid(struct vm_area_struct *vma)
> > prot = pgprot_val(vma->vm_page_prot);
> > return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
> > }
> > +
> > +void prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
> > +{
> > + int i;
> > +
> > + /* It's not encrypted page: nothing to do */
> > + if (!keyid)
> > + return;
>
> prep_encrypted_page() is called in the fast path in the page allocator.
> This out-of-line copy costs a function call for all users and this is
> also out of the reach of the compiler to understand that keyid!=0 is
> unlikely.
>
> I think this needs to be treated to the inline-in-the-header treatment.

Okay. Again, as a macro.
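
Same pattern, roughly (sketch; __prep_encrypted_page() would be the
out-of-line version):

	/*
	 * Keep the keyid != 0 check at the call site so non-encrypted
	 * allocations only pay for a compare and a not-taken branch.
	 */
	#define prep_encrypted_page(page, order, keyid, zero)			\
	do {									\
		if (keyid)							\
			__prep_encrypted_page(page, order, keyid, zero);	\
	} while (0)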

> > + /*
> > + * The hardware/CPU does not enforce coherency between mappings of the
> > + * same physical page with different KeyIDs or encryption keys.
> > + * We are responsible for cache management.
> > + *
> > + * We flush cache before allocating encrypted page
> > + */
> > + clflush_cache_range(page_address(page), PAGE_SIZE << order);
>
> It's also worth pointing out that this must be done on the keyid alias
> direct map, not the normal one.
>
> Wait a sec... How do we know which direct map to use?

page_address() -> lowmem_page_address() -> page_to_virt()

page_to_virt() returns the virtual address from the right direct mapping.
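
That is, with the definition patch 18 of the series adds:

	#define page_to_virt(x) \
		(__va(PFN_PHYS(page_to_pfn(x))) + page_keyid(x) * direct_mapping_size)

clflush_cache_range(page_address(page), ...) ends up operating on the
KeyID-specific alias of the page, not on the KeyID-0 direct mapping.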

> > + for (i = 0; i < (1 << order); i++) {
> > + /* All pages coming out of the allocator should have KeyID 0 */
> > + WARN_ON_ONCE(lookup_page_ext(page)->keyid);
> > + lookup_page_ext(page)->keyid = keyid;
> > +
> > + /* Clear the page after the KeyID is set. */
> > + if (zero)
> > + clear_highpage(page);
> > +
> > + page++;
> > + }
> > +}
> > +
> > +void arch_free_page(struct page *page, int order)
> > +{
> > + int i;
> > +
> > + /* It's not encrypted page: nothing to do */
> > + if (!page_keyid(page))
> > + return;
>
> Ditto on pushing this to a header.
>
> > + clflush_cache_range(page_address(page), PAGE_SIZE << order);
>
> OK, how do we know which copy of the direct map to use, here?

The same way as above.

> > + for (i = 0; i < (1 << order); i++) {
> > + /* Check if the page has reasonable KeyID */
> > + WARN_ON_ONCE(lookup_page_ext(page)->keyid > mktme_nr_keyids);
> > + lookup_page_ext(page)->keyid = 0;
> > + page++;
> > + }
> > +}
> >
>

--
Kirill A. Shutemov

2018-07-23 10:06:06

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 17/19] x86/mm: Implement sync_direct_mapping()

On Wed, Jul 18, 2018 at 05:01:37PM -0700, Dave Hansen wrote:
> On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> > arch/x86/include/asm/mktme.h | 8 +
> > arch/x86/mm/init_64.c | 10 +
> > arch/x86/mm/mktme.c | 437 +++++++++++++++++++++++++++++++++++
> > 3 files changed, 455 insertions(+)
>
> I'm not the maintainer. But, NAK from me on this on the diffstat alone.
>
> There is simply too much technical debt here. There is no way this code
> is not riddled with bugs and I would bet lots of beer on the fact that
> this has received little to no testing with all the combinations that
> matter, like memory hotplug. I'd love to be proven wrong, so I eagerly
> await to be dazzled with the test results that have so far escaped
> mention in the changelog.
>
> Please make an effort to refactor this to reuse the code that we already
> have to manage the direct mapping. We can't afford 455 new lines of
> page table manipulation that nobody tests or runs.

I'll look into this once again. But I'm not sure that there's any better
solution.

The problem boils down to a page allocation issue. We are not able to
allocate enough page tables in early boot for all the direct mappings. At
that stage we have a very limited pool of pages that can be used for page
tables. The pool is allocated at compile time and it's not enough to
handle MKTME.

The syncing approach appeared to be the simplest to me.

The other possibility I see is to write down a journal of operations on
the direct mappings, to be replayed once we have a proper page allocator
around.

> How _was_ this tested?

Besides a normal boot with MKTME enabled and accessing pages via the new
direct mappings, I also tested memory hotplug and hotremove with QEMU.

Ideally we would need some self-test for this. But I don't see a way to
simulate hotplug and hotremove. Soft offlining doesn't cut it: we
actually need to see the ACPI event to trigger the code.

--
Kirill A. Shutemov

2018-07-23 10:13:54

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 18/19] x86/mm: Handle encrypted memory in page_to_virt() and __pa()

On Thu, Jul 19, 2018 at 12:21:44AM +0200, Thomas Gleixner wrote:
> On Tue, 17 Jul 2018, Kirill A. Shutemov wrote:
>
> > Per-KeyID direct mappings require changes into how we find the right
> > virtual address for a page and virt-to-phys address translations.
> >
> > page_to_virt() definition overwrites default macros provided by
> > <linux/mm.h>. We only overwrite the macros if MTKME is enabled
> > compile-time.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > arch/x86/include/asm/mktme.h | 3 +++
> > arch/x86/include/asm/page_64.h | 2 +-
> > 2 files changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
> > index ba83fba4f9b3..dbfbd955da98 100644
> > --- a/arch/x86/include/asm/mktme.h
> > +++ b/arch/x86/include/asm/mktme.h
> > @@ -29,6 +29,9 @@ void arch_free_page(struct page *page, int order);
> >
> > int sync_direct_mapping(void);
> >
> > +#define page_to_virt(x) \
> > + (__va(PFN_PHYS(page_to_pfn(x))) + page_keyid(x) * direct_mapping_size)
>
> This really does not belong into the mktme header.
>
> Please make this the unconditional x86 page_to_virt() implementation in
> asm/page.h, which is the canonical and obvious place for it.

Okay. (and I owe Dave a beer on this :P)

> The page_keyid() name is quite generic as well. Can this please have some
> kind of reference to the underlying mechanism, i.e. mktme?

Hm. I intentionally kept the name generic. It is used outside arch/x86.

We can have an alias (mktme_page_keyid()?) to be used in arch/x86 to
indicate the underlying mechanism.

Is that what you want to see?

> Please hide the multiplication with direct_mapping_size in the mktme header
> as well. It's non interesting for the !MKTME case. Something like:
>
> #define page_to_virt(x) \
> (__va(PFN_PHYS(page_to_pfn(x))) + mktme_page_to_virt_offset(x))
>
> makes it immediately clear where to look and also makes it clear that the
> offset will be 0 for a !MKTME enabled kernel and (hopefully) for all !MKTME
> enabled processors as well.
>
> And then have a proper implementation of mktme_page_to_virt_offset() with a
> proper comment what on earth this is doing. It might be all obvious to you
> now, but it's completely non obvious for the casual reader and you will
> have to twist your brain around it 6 month from now as well.

Sure.
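
Something like this, then (sketch; the helper name is yours, and it
collapses to 0 for !MKTME builds and for KeyID-0 pages):

	/* Offset of the page's per-KeyID direct mapping from the KeyID-0 one. */
	static inline unsigned long mktme_page_to_virt_offset(struct page *page)
	{
		return page_keyid(page) * direct_mapping_size;
	}

and in asm/page.h:

	#define page_to_virt(x) \
		(__va(PFN_PHYS(page_to_pfn(x))) + mktme_page_to_virt_offset(x))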

> > #else
> > #define mktme_keyid_mask ((phys_addr_t)0)
> > #define mktme_nr_keyids 0
> > diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> > index f57fc3cc2246..a4f394e3471d 100644
> > --- a/arch/x86/include/asm/page_64.h
> > +++ b/arch/x86/include/asm/page_64.h
> > @@ -24,7 +24,7 @@ static inline unsigned long __phys_addr_nodebug(unsigned long x)
> > /* use the carry flag to determine if x was < __START_KERNEL_map */
> > x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));
> >
> > - return x;
> > + return x & direct_mapping_mask;
>
> This hunk also lacks any explanation both in the changelog and in form of a
> comment.

I'll fix it.

> > Per-KeyID direct mappings require changes into how we find the right
> > virtual address for a page and virt-to-phys address translations.
>
> That's pretty useless as it does just tell about 'changes', but not at all
> about what kind of changes and why these changes are required. It's really
> not helpful to assume that everyone stumbling over this will know the whole
> story especially not 6 month after this has been merged and then someone
> ends up with a bisect on that change.
>
> While at it, please get rid of the 'we'. We are neither CPUs nor code.

Okay. I'll rewrite this.

--
Kirill A. Shutemov

2018-07-23 12:56:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 17/19] x86/mm: Implement sync_direct_mapping()

On 07/23/2018 03:04 AM, Kirill A. Shutemov wrote:
> On Wed, Jul 18, 2018 at 05:01:37PM -0700, Dave Hansen wrote:
>> Please make an effort to refactor this to reuse the code that we already
>> have to manage the direct mapping. We can't afford 455 new lines of
>> page table manipulation that nobody tests or runs.
>
> I'll look in this once again. But I'm not sure that there's any better
> solution.
>
> The problem boils down to a page allocation issue. We are not able to
> allocate enough page tables in early boot for all the direct mappings. At
> that stage we have a very limited pool of pages that can be used for page
> tables. The pool is sized at compile time and it's not enough to handle MKTME.
>
> The syncing approach appeared to be the simplest to me.

If that is, indeed, the primary motivation for this design, then please
call that out in the changelog. It's exceedingly difficult to review
without this information.

We also need data and facts, please.

Which pool are we talking about? How large is it now? How large would
it need to be to accommodate MKTME? How much memory do we need to map
before we run into issues?

>> How _was_ this tested?
>
> Besides a normal boot with MKTME enabled and accessing pages via the new
> direct mappings, I also tested memory hotplug and hotremove with QEMU.

... also great changelog fodder.

> Ideally we would need some self-test for this. But I don't see a way to
> simulate hotplug and hotremove. Soft offlining doesn't cut it. We
> actually need to see the ACPI event to trigger the code.

That's something that we have to go fix. For the online side, we always
have the "probe" file. I guess nobody ever bothered to make an
equivalent for the remove side. But, that doesn't seem like an
insurmountable problem to me.

2018-07-23 17:26:21

by Alison Schofield

[permalink] [raw]
Subject: Re: [PATCHv5 10/19] x86/mm: Implement page_keyid() using page_ext

On Mon, Jul 23, 2018 at 12:45:17PM +0300, Kirill A. Shutemov wrote:
> On Wed, Jul 18, 2018 at 04:38:02PM -0700, Dave Hansen wrote:
> > On 07/17/2018 04:20 AM, Kirill A. Shutemov wrote:
> > > Store KeyID in bits 31:16 of extended page flags. These bits are unused.
> >
> > I'd love a two-sentence reminder of what page_ext is and why you chose to
> > use it. Yes, you need this. No, not everybody that you want to review
> > this patch set knows what it is or why you chose it.
>
> Okay.
>
> > > page_keyid() returns zero until page_ext is ready.
> >
> > Is there any implication of this? Or does it not matter because we
> > don't run userspace until after page_ext initialization is done?
>
> It matters in the sense that we shouldn't reference page_ext before it's
> initialized, otherwise we will get garbage and crash.
>
> > > page_ext initializer enables static branch to indicate that
> >
> > "enables a static branch"
> >
> > > page_keyid() can use page_ext. The same static branch will gate MKTME
> > > readiness in general.
> >
> > Can you elaborate on this a bit? It would also be a nice place to hint
> > to the folks working hard on the APIs to ensure she checks this.
>
> Okay.

At API init time we can check if (MKTME_ENABLED && mktme_nr_keyids > 0).
Sounds like this is another dependency we need to check and 'wait' on?
Does it happen after MKTME_ENABLED is set? Let me know.

>
> > > We don't yet set KeyID for the page. It will come in the following
> > > patch that implements prep_encrypted_page(). All pages have KeyID-0 for
> > > now.
> >
> > It also wouldn't hurt to mention why you don't use an X86_FEATURE_* for
> > this rather than an explicit static branch. I'm sure the x86
> > maintainers will be curious.
>
> Sure.
>
> --
> Kirill A. Shutemov
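
To make the mechanism discussed above concrete, here is a rough sketch of how
page_keyid() could look on top of page_ext. The bit positions come from the
changelog; the static key name (mktme_enabled_key) and the exact code are
assumptions, not the posted implementation:

#include <linux/jump_label.h>
#include <linux/page_ext.h>

#define PAGE_EXT_KEYID_SHIFT	16
#define PAGE_EXT_KEYID_MASK	0xffffUL

DEFINE_STATIC_KEY_FALSE(mktme_enabled_key);

static inline int page_keyid(struct page *page)
{
	/*
	 * Until the page_ext initializer flips the static branch,
	 * page_ext may not be set up yet, so report KeyID-0 and never
	 * dereference page_ext.  The same branch gates MKTME readiness
	 * in general.
	 */
	if (!static_branch_unlikely(&mktme_enabled_key))
		return 0;

	/* KeyID lives in bits 31:16 of the extended page flags. */
	return (lookup_page_ext(page)->flags >> PAGE_EXT_KEYID_SHIFT) &
		PAGE_EXT_KEYID_MASK;
}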

2018-07-26 14:26:26

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCHv5 05/19] mm/page_alloc: Handle allocation for encrypted memory

On Thu 19-07-18 11:27:24, Kirill A. Shutemov wrote:
> On Wed, Jul 18, 2018 at 04:03:53PM -0700, Dave Hansen wrote:
> > I asked about this before and it still isn't covered in the description:
> > You were specifically asked (maybe in person at LSF/MM?) not to modify
> > allocator to pass the keyid around. Please specifically mention how
> > this design addresses that feedback in the patch description.
> >
> > You were told, "don't change the core allocator", so I think you just
> > added new functions that wrap the core allocator and called them from
> > the majority of sites that call into the core allocator. Personally, I
> > think that misses the point of the original request.
> >
> > Do I have a better way? Nope, not really.
>
> +Michal.
>
> IIRC, Michal was not happy that I propagated the KeyID into the very core of
> the allocator, and we talked about wrappers around existing APIs as a better
> solution.
>
> Michal, is that correct?

Yes, that is the case. I haven't seen this series and am unlikely to get to
it in the upcoming days, though, so unfortunately I cannot comment much more.
--
Michal Hocko
SUSE Labs

2018-07-26 17:54:29

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv5 18/19] x86/mm: Handle encrypted memory in page_to_virt() and __pa()

On 07/23/2018 03:12 AM, Kirill A. Shutemov wrote:
> The page_to_virt() definition overrides the default macro provided by
> <linux/mm.h>. We only override the macro if MKTME is enabled at
> compile time.

Can you remind me why we need this in page_to_virt() as opposed to in
the kmap() code? Is it because we have lots of 64-bit code that doesn't
use kmap() or something?

2018-07-27 13:50:49

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 18/19] x86/mm: Handle encrypted memory in page_to_virt() and __pa()

On Thu, Jul 26, 2018 at 10:26:23AM -0700, Dave Hansen wrote:
> On 07/23/2018 03:12 AM, Kirill A. Shutemov wrote:
> > The page_to_virt() definition overrides the default macro provided by
> > <linux/mm.h>. We only override the macro if MKTME is enabled at
> > compile time.
>
> Can you remind me why we need this in page_to_virt() as opposed to in
> the kmap() code? Is it because we have lots of 64-bit code that doesn't
> use kmap() or something?

I just found it the most suitable place. It should cover all cases, even if
kmap() is not used.

--
Kirill A. Shutemov

2018-07-31 00:09:09

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCHv5 08/19] x86/mm: Introduce variables to store number, shift and mask of KeyIDs

On Tue, 2018-07-17 at 14:20 +0300, Kirill A. Shutemov wrote:
> mktme_nr_keyids holds the number of KeyIDs available for MKTME, excluding
> KeyID zero, which is used by TME. MKTME KeyIDs start from 1.
>
> mktme_keyid_shift holds the shift of the KeyID within the physical address.
>
> mktme_keyid_mask holds the mask to extract the KeyID from the physical address.

Sorry to bring this up, but AMD SME already introduced sme_me_mask and
__sme_{set/clr} in linux/mem_encrypt.h. Should we try to merge MKTME
and SME to have common variables and reuse mem_encrypt.h? IMHO
sme_me_mask is sort of equivalent to 'keyID=1'. And with different names
for SME and MKTME, other components which want to use memory
encryption (e.g. the DMA API) have to have explicit code to distinguish
MKTME and SME IMO, which is not good.

Thanks,
-Kai
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/include/asm/mktme.h | 16 ++++++++++++++++
> arch/x86/kernel/cpu/intel.c | 12 ++++++++----
> arch/x86/mm/Makefile | 2 ++
> arch/x86/mm/mktme.c | 5 +++++
> 4 files changed, 31 insertions(+), 4 deletions(-)
> create mode 100644 arch/x86/include/asm/mktme.h
> create mode 100644 arch/x86/mm/mktme.c
>
> diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
> new file mode 100644
> index 000000000000..df31876ec48c
> --- /dev/null
> +++ b/arch/x86/include/asm/mktme.h
> @@ -0,0 +1,16 @@
> +#ifndef _ASM_X86_MKTME_H
> +#define _ASM_X86_MKTME_H
> +
> +#include <linux/types.h>
> +
> +#ifdef CONFIG_X86_INTEL_MKTME
> +extern phys_addr_t mktme_keyid_mask;
> +extern int mktme_nr_keyids;
> +extern int mktme_keyid_shift;
> +#else
> +#define mktme_keyid_mask ((phys_addr_t)0)
> +#define mktme_nr_keyids 0
> +#define mktme_keyid_shift 0
> +#endif
> +
> +#endif
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index bf2caf9d52dd..efc9e9fc47d4 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -573,6 +573,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
>
>  #ifdef CONFIG_X86_INTEL_MKTME
>  	if (mktme_status == MKTME_ENABLED && nr_keyids) {
> +		mktme_nr_keyids = nr_keyids;
> +		mktme_keyid_shift = c->x86_phys_bits - keyid_bits;
> +
>  		/*
>  		 * Mask out bits claimed from KeyID from physical address mask.
>  		 *
> @@ -580,10 +583,8 @@ static void detect_tme(struct cpuinfo_x86 *c)
>  		 * and number of bits claimed for KeyID is 6, bits 51:46 of
>  		 * physical address is unusable.
>  		 */
> -		phys_addr_t keyid_mask;
> -
> -		keyid_mask = GENMASK_ULL(c->x86_phys_bits - 1, c->x86_phys_bits - keyid_bits);
> -		physical_mask &= ~keyid_mask;
> +		mktme_keyid_mask = GENMASK_ULL(c->x86_phys_bits - 1, mktme_keyid_shift);
> +		physical_mask &= ~mktme_keyid_mask;
>  	} else {
>  		/*
>  		 * Reset __PHYSICAL_MASK.
> @@ -591,6 +592,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
>  		 * between CPUs.
>  		 */
>  		physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
> +		mktme_keyid_mask = 0;
> +		mktme_keyid_shift = 0;
> +		mktme_nr_keyids = 0;
>  	}
>  #endif
>
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index 4b101dd6e52f..4ebee899c363 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -53,3 +53,5 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
> obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
> obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
> obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
> +
> +obj-$(CONFIG_X86_INTEL_MKTME) += mktme.o
> diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
> new file mode 100644
> index 000000000000..467f1b26c737
> --- /dev/null
> +++ b/arch/x86/mm/mktme.c
> @@ -0,0 +1,5 @@
> +#include <asm/mktme.h>
> +
> +phys_addr_t mktme_keyid_mask;
> +int mktme_nr_keyids;
> +int mktme_keyid_shift;
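
For reference, this is how the three variables fit together. Reusing the
example from the quoted comment (52 physical address bits, 6 KeyID bits),
mktme_keyid_shift is 46 and mktme_keyid_mask covers bits 51:46. The helpers
below are hypothetical, purely to illustrate the encoding:

static inline phys_addr_t mktme_set_keyid(phys_addr_t paddr, int keyid)
{
	/* Drop any KeyID already encoded, then place the new one. */
	paddr &= ~mktme_keyid_mask;
	return paddr | ((phys_addr_t)keyid << mktme_keyid_shift);
}

static inline int mktme_keyid_from_phys(phys_addr_t paddr)
{
	/* Extract the KeyID bits from the upper part of the address. */
	return (paddr & mktme_keyid_mask) >> mktme_keyid_shift;
}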

2018-08-15 07:49:22

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCHv5 19/19] x86: Introduce CONFIG_X86_INTEL_MKTME

Hi!

> Add a new config option to enable/disable Multi-Key Total Memory
> Encryption support.
>
> MKTME uses MEMORY_PHYSICAL_PADDING to reserve enough space in per-KeyID
> direct mappings for memory hotplug.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/Kconfig | 19 ++++++++++++++++++-
> 1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b6f1785c2176..023a22568c06 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1523,6 +1523,23 @@ config ARCH_USE_MEMREMAP_PROT
> def_bool y
> depends on AMD_MEM_ENCRYPT
>
> +config X86_INTEL_MKTME
> + bool "Intel Multi-Key Total Memory Encryption"
> + select DYNAMIC_PHYSICAL_MASK
> + select PAGE_EXTENSION
> + depends on X86_64 && CPU_SUP_INTEL
> + ---help---
> + Say yes to enable support for Multi-Key Total Memory Encryption.
> + This requires an Intel processor that has support of the feature.
> +
> + Multikey Total Memory Encryption (MKTME) is a technology that allows
> + transparent memory encryption in upcoming Intel platforms.
> +
> + MKTME is built on top of TME. TME allows encryption of the entirety
> + of system memory using a single key. MKTME allows having multiple
> + encryption domains, each having own key -- different memory pages can
> + be encrypted with different keys.
> +
> # Common NUMA Features
> config NUMA
> bool "Numa Memory Allocation and Scheduler Support"

Would it be good to provide documentation, or link to documentation, explaining
what security guarantees this is supposed to provide, and what disadvantages (if any)
it has? I guess it costs a bit of performance...

I see that TME helps with cold boot attacks.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2018-08-17 09:26:38

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv5 19/19] x86: Introduce CONFIG_X86_INTEL_MKTME

On Wed, Aug 15, 2018 at 09:48:12AM +0200, Pavel Machek wrote:
> Hi!
>
> > Add a new config option to enable/disable Multi-Key Total Memory
> > Encryption support.
> >
> > MKTME uses MEMORY_PHYSICAL_PADDING to reserve enough space in per-KeyID
> > direct mappings for memory hotplug.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > arch/x86/Kconfig | 19 ++++++++++++++++++-
> > 1 file changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index b6f1785c2176..023a22568c06 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1523,6 +1523,23 @@ config ARCH_USE_MEMREMAP_PROT
> > def_bool y
> > depends on AMD_MEM_ENCRYPT
> >
> > +config X86_INTEL_MKTME
> > + bool "Intel Multi-Key Total Memory Encryption"
> > + select DYNAMIC_PHYSICAL_MASK
> > + select PAGE_EXTENSION
> > + depends on X86_64 && CPU_SUP_INTEL
> > + ---help---
> > + Say yes to enable support for Multi-Key Total Memory Encryption.
> > + This requires an Intel processor that has support of the feature.
> > +
> > + Multikey Total Memory Encryption (MKTME) is a technology that allows
> > + transparent memory encryption in upcoming Intel platforms.
> > +
> > + MKTME is built on top of TME. TME allows encryption of the entirety
> > + of system memory using a single key. MKTME allows having multiple
> > + encryption domains, each having own key -- different memory pages can
> > + be encrypted with different keys.
> > +
> > # Common NUMA Features
> > config NUMA
> > bool "Numa Memory Allocation and Scheduler Support"
>
> Would it be good to provide documentation, or link to documentation, explaining
> what security guarantees this is supposed to provide, and what disadvantages (if any)
> it has?

The main goal is to add an additional level of isolation between different
tenants of a machine. It is mostly targeted at VMs and protects against
leaking information between guests.

In this design the kernel (or hypervisor) is trusted and has a means to access
encrypted memory as long as the key is programmed into the CPU.

Worth noting that encryption happens in the memory controller, so data in
caches at all levels is plain text.

The spec can be found here:

https://software.intel.com/sites/default/files/managed/a5/16/Multi-Key-Total-Memory-Encryption-Spec.pdf

> I guess it costs a bit of performance...

Most of the overhead is paid on allocation and freeing of encrypted pages:
switching a page between KeyIDs requires cache flushing.

Access time to encrypted memory *shouldn't* be measurably slower.
Encryption overhead is hidden within the other latencies in the memory
pipeline.
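
A rough sketch of where that allocation-time cost comes from.
prep_encrypted_page() is named earlier in the thread; its signature and body
here are assumptions for illustration only:

void prep_encrypted_page(struct page *page, int order, int keyid)
{
	int i;

	for (i = 0; i < (1 << order); i++) {
		/*
		 * The CPU does not keep caches coherent across KeyIDs.
		 * If the page was last used with a different KeyID, flush
		 * the stale cachelines through the old mapping before
		 * handing the page out with the new KeyID.
		 */
		if (page_keyid(page + i) != keyid)
			clflush_cache_range(page_to_virt(page + i), PAGE_SIZE);

		/* Recording the new KeyID in page_ext is omitted here. */
	}
}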

> I see that TME helps with cold boot attacks.

Right.

--
Kirill A. Shutemov