2018-03-28 16:57:37

by Kirill A. Shutemov

Subject: [PATCHv2 00/14] Partial MKTME enabling

Multikey Total Memory Encryption (MKTME)[1] is a technology that allows
transparent memory encryption in upcoming Intel platforms. See overview
below.

Here's an updated version of my patchset that brings support for MKTME.
Functionally it matches what I posted as an RFC before, but I changed a few
things under the hood.

Please review.

This is not yet the full enabling, but all patches except the last one should
be ready to be applied.

v2:
- Store the KeyID of a page in page_ext->flags rather than in anon_vma.
The anon_vma approach turned out to be problematic. The main problem is
that the anon_vma of a page is no longer stable after the last mapcount
has gone. We would like to preserve the last used KeyID even for freed
pages, as it allows us to avoid unnecessary cache flushing on allocation
of an encrypted page. page_ext serves this purpose well enough.

- KeyID is now propagated through the page allocator. There is no need for
GFP_ENCRYPT anymore.

- The patch "Decouple dynamic __PHYSICAL_MASK from AMD SME" has been fixed
to work with AMD SEV (needs to be confirmed by AMD folks).

------------------------------------------------------------------------------

MKTME is built on top of TME. TME allows encryption of the entirety of
system memory using a single key. MKTME allows multiple encryption
domains, each with its own key -- different memory pages can be encrypted
with different keys.

Key design points of Intel MKTME:

- The initial HW implementation will support up to 63 keys (plus one default
TME key). But the number of keys may be as low as 3, depending on the SKU
and BIOS settings.

- To access encrypted memory you need to use a mapping with the proper KeyID
in the page table entry. The KeyID is encoded in the upper bits of the PFN
in the page table entry (see the sketch after this list).

This means we cannot use the direct map to access encrypted memory from
the kernel side. My idea is to re-use the kmap() interface to get a proper
temporary mapping on the kernel side.

- The CPU does not enforce coherency between mappings of the same physical
page with different KeyIDs or encryption keys. We would need to take care
of flushing the cache on allocation of an encrypted page and on returning
it back to the free pool.

- For managing keys, there's the MKTME_KEY_PROGRAM leaf of the new PCONFIG
(platform configuration) instruction. It allows loading and clearing keys
associated with a KeyID. You can also ask the CPU to generate a key for
you or to disable memory encryption when a KeyID is used.
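
To make the KeyID-in-PTE point above concrete, here is a minimal
illustrative sketch (not taken from the patchset; the helper name and
parameters are hypothetical) of how a KeyID could be folded into the
physical-address bits of a page table entry. The real shift and mask are
enumerated in detect_tme() later in the series:

	#include <linux/types.h>

	/*
	 * Hypothetical illustration only: encode a KeyID into the upper
	 * PFN bits of a PTE-sized physical address. keyid_shift and
	 * keyid_mask stand in for the values enumerated by detect_tme().
	 */
	static inline u64 paddr_encode_keyid(u64 paddr, int keyid,
					     int keyid_shift, u64 keyid_mask)
	{
		/* Clear any stale KeyID bits, then place the new KeyID. */
		return (paddr & ~keyid_mask) | ((u64)keyid << keyid_shift);
	}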

[1] https://software.intel.com/sites/default/files/managed/a5/16/Multi-Key-Total-Memory-Encryption-Spec.pdf

Kirill A. Shutemov (14):
x86/mm: Decouple dynamic __PHYSICAL_MASK from AMD SME
x86/mm: Mask out KeyID bits from page table entry pfn
mm/shmem: Zero out unused vma fields in shmem_pseudo_vma_init()
mm: Do no merge vma with different encryption KeyIDs
mm/khugepaged: Do not collapse pages in encrypted VMAs
mm/page_alloc: Propagate encryption KeyID through page allocator
mm/page_alloc: Add hook in page allocation path for encrypted pages
mm/page_ext: Drop definition of unused PAGE_EXT_DEBUG_POISON
x86/mm: Introduce variables to store number, shift and mask of KeyIDs
x86/mm: Preserve KeyID on pte_modify() and pgprot_modify()
x86/mm: Implement vma_is_encrypted() and vma_keyid()
x86/mm: Implement page_keyid() using page_ext
x86/mm: Implement prep_encrypted_page()
x86: Introduce CONFIG_X86_INTEL_MKTME

arch/ia64/hp/common/sba_iommu.c | 2 +-
arch/ia64/include/asm/thread_info.h | 2 +-
arch/ia64/kernel/uncached.c | 2 +-
arch/ia64/sn/pci/pci_dma.c | 2 +-
arch/ia64/sn/pci/tioca_provider.c | 2 +-
arch/powerpc/kernel/dma.c | 2 +-
arch/powerpc/kernel/iommu.c | 4 +-
arch/powerpc/perf/imc-pmu.c | 4 +-
arch/powerpc/platforms/cell/iommu.c | 6 +-
arch/powerpc/platforms/cell/ras.c | 2 +-
arch/powerpc/platforms/powernv/pci-ioda.c | 6 +-
arch/powerpc/sysdev/xive/common.c | 2 +-
arch/sparc/kernel/iommu.c | 6 +-
arch/sparc/kernel/pci_sun4v.c | 2 +-
arch/tile/kernel/machine_kexec.c | 2 +-
arch/tile/mm/homecache.c | 2 +-
arch/x86/Kconfig | 21 ++++++
arch/x86/boot/compressed/kaslr_64.c | 5 ++
arch/x86/events/intel/ds.c | 2 +-
arch/x86/events/intel/pt.c | 2 +-
arch/x86/include/asm/mktme.h | 40 ++++++++++++
arch/x86/include/asm/page.h | 1 +
arch/x86/include/asm/page_types.h | 8 ++-
arch/x86/include/asm/pgtable_types.h | 7 +-
arch/x86/kernel/cpu/intel.c | 27 ++++++++
arch/x86/kernel/espfix_64.c | 6 +-
arch/x86/kernel/irq_32.c | 4 +-
arch/x86/kvm/vmx.c | 2 +-
arch/x86/mm/Makefile | 2 +
arch/x86/mm/mem_encrypt_identity.c | 3 +
arch/x86/mm/mktme.c | 63 ++++++++++++++++++
arch/x86/mm/pgtable.c | 5 ++
block/blk-mq.c | 2 +-
drivers/char/agp/sgi-agp.c | 2 +-
drivers/edac/thunderx_edac.c | 2 +-
drivers/hv/channel.c | 2 +-
drivers/iommu/dmar.c | 3 +-
drivers/iommu/intel-iommu.c | 2 +-
drivers/iommu/intel_irq_remapping.c | 2 +-
drivers/misc/sgi-gru/grufile.c | 2 +-
drivers/misc/sgi-xp/xpc_uv.c | 2 +-
drivers/net/ethernet/amd/xgbe/xgbe-desc.c | 2 +-
drivers/net/ethernet/chelsio/cxgb4/sge.c | 5 +-
drivers/net/ethernet/mellanox/mlx4/icm.c | 2 +-
.../net/ethernet/mellanox/mlx5/core/pagealloc.c | 2 +-
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c | 2 +-
drivers/staging/lustre/lnet/lnet/router.c | 2 +-
drivers/staging/lustre/lnet/selftest/rpc.c | 2 +-
include/linux/gfp.h | 35 +++++-----
include/linux/migrate.h | 2 +-
include/linux/mm.h | 21 ++++++
include/linux/page_ext.h | 22 +++----
include/linux/skbuff.h | 2 +-
kernel/events/ring_buffer.c | 4 +-
kernel/fork.c | 2 +-
kernel/profile.c | 2 +-
kernel/trace/ring_buffer.c | 6 +-
kernel/trace/trace.c | 2 +-
kernel/trace/trace_uprobe.c | 2 +-
lib/dma-direct.c | 2 +-
mm/compaction.c | 2 +-
mm/filemap.c | 2 +-
mm/hugetlb.c | 2 +-
mm/internal.h | 2 +-
mm/khugepaged.c | 4 +-
mm/mempolicy.c | 33 ++++++----
mm/migrate.c | 12 ++--
mm/mmap.c | 3 +-
mm/page_alloc.c | 75 ++++++++++++----------
mm/page_ext.c | 3 +
mm/page_isolation.c | 2 +-
mm/percpu-vm.c | 2 +-
mm/shmem.c | 3 +-
mm/slab.c | 2 +-
mm/slob.c | 2 +-
mm/slub.c | 4 +-
mm/sparse-vmemmap.c | 2 +-
mm/vmalloc.c | 8 ++-
net/core/pktgen.c | 2 +-
net/sunrpc/svc.c | 2 +-
80 files changed, 388 insertions(+), 163 deletions(-)
create mode 100644 arch/x86/include/asm/mktme.h
create mode 100644 arch/x86/mm/mktme.c

--
2.16.2



2018-03-28 16:57:03

by Kirill A. Shutemov

Subject: [PATCHv2 02/14] x86/mm: Mask out KeyID bits from page table entry pfn

MKTME claims several upper bits of the physical address in a page table
entry to encode the KeyID. It effectively shrinks the number of bits
available for the physical address. We should exclude the KeyID bits from
physical addresses.

For instance, if the CPU enumerates 52 physical address bits and the number
of bits claimed for the KeyID is 6, bits 51:46 must not be treated as part
of the physical address.

This patch adjusts __PHYSICAL_MASK during MKTME enumeration.
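
As a worked example of the adjustment below (illustrative only, using the
numbers from the paragraph above):

	u64 keyid_mask = GENMASK_ULL(52 - 1, 52 - 6);
				/* bits 51:46 = 0x000fc00000000000 */
	u64 phys_mask = ((1ULL << 52) - 1) & ~keyid_mask;
				/* bits 45:0  = 0x00003fffffffffff */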

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/cpu/intel.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 6106d11ceb6b..a5b9d3dfa0c1 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -586,6 +586,29 @@ static void detect_tme(struct cpuinfo_x86 *c)
mktme_status = MKTME_ENABLED;
}

+#ifdef CONFIG_X86_INTEL_MKTME
+ if (mktme_status == MKTME_ENABLED && nr_keyids) {
+ /*
+ * Mask out bits claimed from KeyID from physical address mask.
+ *
+ * For instance, if a CPU enumerates 52 physical address bits
+ * and number of bits claimed for KeyID is 6, bits 51:46 of
+ * physical address is unusable.
+ */
+ phys_addr_t keyid_mask;
+
+ keyid_mask = GENMASK_ULL(c->x86_phys_bits - 1, c->x86_phys_bits - keyid_bits);
+ physical_mask &= ~keyid_mask;
+ } else {
+ /*
+ * Reset __PHYSICAL_MASK.
+ * Maybe needed if there's inconsistent configuration
+ * between CPUs.
+ */
+ physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
+ }
+#endif
+
/*
* KeyID bits effectively lower the number of physical address
* bits. Update cpuinfo_x86::x86_phys_bits accordingly.
--
2.16.2


2018-03-28 16:57:24

by Kirill A. Shutemov

Subject: [PATCHv2 13/14] x86/mm: Implement prep_encrypted_page()

The hardware/CPU does not enforce coherency between mappings of the same
physical page with different KeyIDs or encryption keys.
We are responsible for cache management.

We flush the cache before changing the KeyID of a page. The KeyID is
preserved for freed pages to avoid excessive cache flushing.
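
For context, a rough sketch of how the page allocation hook added earlier in
the series might end up calling this helper. The call site below is an
assumption for illustration only; it is not part of this patch:

	/*
	 * Hypothetical call site: switch freshly allocated pages to the
	 * requested KeyID, flushing stale cachelines where needed.
	 */
	static void prep_new_page_sketch(struct page *page, int order, int keyid)
	{
		if (keyid)
			prep_encrypted_page(page, order, keyid);
	}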

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 3 +++
arch/x86/mm/mktme.c | 29 +++++++++++++++++++++++++++++
2 files changed, 32 insertions(+)

diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index 5f440d57aa47..5b22ef0f0ae3 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -11,6 +11,9 @@ extern phys_addr_t mktme_keyid_mask;
extern int mktme_nr_keyids;
extern int mktme_keyid_shift;

+#define prep_encrypted_page prep_encrypted_page
+void prep_encrypted_page(struct page *page, int order, int keyid);
+
#define vma_is_encrypted vma_is_encrypted
bool vma_is_encrypted(struct vm_area_struct *vma);

diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index 3da25212a372..cebec794bae8 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -1,4 +1,5 @@
#include <linux/mm.h>
+#include <linux/highmem.h>
#include <asm/mktme.h>

phys_addr_t mktme_keyid_mask;
@@ -21,6 +22,34 @@ int vma_keyid(struct vm_area_struct *vma)
return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
}

+void prep_encrypted_page(struct page *page, int order, int new_keyid)
+{
+ int i;
+ void *v;
+
+ /*
+ * The hardware/CPU does not enforce coherency between mappings of the
+ * same physical page with different KeyIDs or encryption keys.
+ * We are responsible for cache management.
+ *
+ * We flush cache before changing KeyID of the page. KeyID is preserved
+ * for freed pages to avoid excessive cache flushing.
+ */
+
+ for (i = 0; i < (1 << order); i++) {
+ int old_keyid = page_keyid(page + i);
+
+ if (old_keyid == new_keyid)
+ continue;
+
+ v = kmap_atomic(page + i);
+ clflush_cache_range(v, PAGE_SIZE);
+ kunmap_atomic(v);
+
+ lookup_page_ext(page + i)->keyid = new_keyid;
+ }
+}
+
static bool need_page_mktme(void)
{
/* Make sure keyid doesn't collide with extended page flags */
--
2.16.2


2018-03-28 16:58:48

by Kirill A. Shutemov

Subject: [PATCHv2 06/14] mm/page_alloc: Propagate encryption KeyID through page allocator

Modify several page allocation routines to pass down the encryption KeyID to
be used for the allocated page.

There are two basic use cases:

- alloc_page_vma() uses the VMA's KeyID to allocate the page.

- The page migration and NUMA balancing paths use the KeyID of the original
page as the KeyID for the newly allocated page (see the sketch below).
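
Both use cases map onto the extended alloc_pages_node() signature. A minimal
sketch of the two call shapes (the helper names are mine; the signatures
match the hunks below):

	#include <linux/gfp.h>
	#include <linux/mm.h>

	/* Use case 1: allocate with the KeyID taken from the VMA. */
	static struct page *alloc_for_vma_sketch(struct vm_area_struct *vma,
						 int nid, gfp_t gfp)
	{
		return alloc_pages_node(nid, gfp, 0, vma_keyid(vma));
	}

	/* Use case 2: a migration target inherits the source page's KeyID. */
	static struct page *alloc_migration_target_sketch(struct page *src,
							  int nid, gfp_t gfp)
	{
		return alloc_pages_node(nid, gfp, 0, page_keyid(src));
	}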

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/ia64/hp/common/sba_iommu.c | 2 +-
arch/ia64/include/asm/thread_info.h | 2 +-
arch/ia64/kernel/uncached.c | 2 +-
arch/ia64/sn/pci/pci_dma.c | 2 +-
arch/ia64/sn/pci/tioca_provider.c | 2 +-
arch/powerpc/kernel/dma.c | 2 +-
arch/powerpc/kernel/iommu.c | 4 +--
arch/powerpc/perf/imc-pmu.c | 4 +--
arch/powerpc/platforms/cell/iommu.c | 6 ++--
arch/powerpc/platforms/cell/ras.c | 2 +-
arch/powerpc/platforms/powernv/pci-ioda.c | 6 ++--
arch/powerpc/sysdev/xive/common.c | 2 +-
arch/sparc/kernel/iommu.c | 6 ++--
arch/sparc/kernel/pci_sun4v.c | 2 +-
arch/tile/kernel/machine_kexec.c | 2 +-
arch/tile/mm/homecache.c | 2 +-
arch/x86/events/intel/ds.c | 2 +-
arch/x86/events/intel/pt.c | 2 +-
arch/x86/kernel/espfix_64.c | 6 ++--
arch/x86/kernel/irq_32.c | 4 +--
arch/x86/kvm/vmx.c | 2 +-
block/blk-mq.c | 2 +-
drivers/char/agp/sgi-agp.c | 2 +-
drivers/edac/thunderx_edac.c | 2 +-
drivers/hv/channel.c | 2 +-
drivers/iommu/dmar.c | 3 +-
drivers/iommu/intel-iommu.c | 2 +-
drivers/iommu/intel_irq_remapping.c | 2 +-
drivers/misc/sgi-gru/grufile.c | 2 +-
drivers/misc/sgi-xp/xpc_uv.c | 2 +-
drivers/net/ethernet/amd/xgbe/xgbe-desc.c | 2 +-
drivers/net/ethernet/chelsio/cxgb4/sge.c | 5 ++--
drivers/net/ethernet/mellanox/mlx4/icm.c | 2 +-
.../net/ethernet/mellanox/mlx5/core/pagealloc.c | 2 +-
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c | 2 +-
drivers/staging/lustre/lnet/lnet/router.c | 2 +-
drivers/staging/lustre/lnet/selftest/rpc.c | 2 +-
include/linux/gfp.h | 29 +++++++++----------
include/linux/migrate.h | 2 +-
include/linux/mm.h | 7 +++++
include/linux/skbuff.h | 2 +-
kernel/events/ring_buffer.c | 4 +--
kernel/fork.c | 2 +-
kernel/profile.c | 2 +-
kernel/trace/ring_buffer.c | 6 ++--
kernel/trace/trace.c | 2 +-
kernel/trace/trace_uprobe.c | 2 +-
lib/dma-direct.c | 2 +-
mm/filemap.c | 2 +-
mm/hugetlb.c | 2 +-
mm/khugepaged.c | 2 +-
mm/mempolicy.c | 33 +++++++++++++---------
mm/migrate.c | 12 ++++----
mm/page_alloc.c | 10 +++----
mm/percpu-vm.c | 2 +-
mm/slab.c | 2 +-
mm/slob.c | 2 +-
mm/slub.c | 4 +--
mm/sparse-vmemmap.c | 2 +-
mm/vmalloc.c | 8 ++++--
net/core/pktgen.c | 2 +-
net/sunrpc/svc.c | 2 +-
62 files changed, 132 insertions(+), 113 deletions(-)

diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index aec4a3354abe..96e70dfaed2d 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -1142,7 +1142,7 @@ sba_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
{
struct page *page;

- page = alloc_pages_node(ioc->node, flags, get_order(size));
+ page = alloc_pages_node(ioc->node, flags, get_order(size), 0);
if (unlikely(!page))
return NULL;

diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index 64a1011f6812..ce022719683d 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -83,7 +83,7 @@ struct thread_info {
#define alloc_task_struct_node(node) \
({ \
struct page *page = alloc_pages_node(node, GFP_KERNEL | __GFP_COMP, \
- KERNEL_STACK_SIZE_ORDER); \
+ KERNEL_STACK_SIZE_ORDER, 0); \
struct task_struct *ret = page ? page_address(page) : NULL; \
\
ret; \
diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index 583f7ff6b589..fa1acce41f36 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -100,7 +100,7 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)

page = __alloc_pages_node(nid,
GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
- IA64_GRANULE_SHIFT-PAGE_SHIFT);
+ IA64_GRANULE_SHIFT-PAGE_SHIFT, 0);
if (!page) {
mutex_unlock(&uc_pool->add_chunk_mutex);
return -1;
diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c
index 74c934a997bb..e301cbebe8fc 100644
--- a/arch/ia64/sn/pci/pci_dma.c
+++ b/arch/ia64/sn/pci/pci_dma.c
@@ -93,7 +93,7 @@ static void *sn_dma_alloc_coherent(struct device *dev, size_t size,
node = pcibus_to_node(pdev->bus);
if (likely(node >=0)) {
struct page *p = __alloc_pages_node(node,
- flags, get_order(size));
+ flags, get_order(size), 0);

if (likely(p))
cpuaddr = page_address(p);
diff --git a/arch/ia64/sn/pci/tioca_provider.c b/arch/ia64/sn/pci/tioca_provider.c
index a70b11fd57d6..c5eff2e95f93 100644
--- a/arch/ia64/sn/pci/tioca_provider.c
+++ b/arch/ia64/sn/pci/tioca_provider.c
@@ -122,7 +122,7 @@ tioca_gart_init(struct tioca_kernel *tioca_kern)
tmp =
alloc_pages_node(tioca_kern->ca_closest_node,
GFP_KERNEL | __GFP_ZERO,
- get_order(tioca_kern->ca_gart_size));
+ get_order(tioca_kern->ca_gart_size), 0);

if (!tmp) {
printk(KERN_ERR "%s: Could not allocate "
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index da20569de9d4..5e2bee80cb04 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -105,7 +105,7 @@ void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
};
#endif /* CONFIG_FSL_SOC */

- page = alloc_pages_node(node, flag, get_order(size));
+ page = alloc_pages_node(node, flag, get_order(size), 0);
if (page == NULL)
return NULL;
ret = page_address(page);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index af7a20dc6e09..15f10353659d 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -662,7 +662,7 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
/* number of bytes needed for the bitmap */
sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);

- page = alloc_pages_node(nid, GFP_KERNEL, get_order(sz));
+ page = alloc_pages_node(nid, GFP_KERNEL, get_order(sz), 0);
if (!page)
panic("iommu_init_table: Can't allocate %ld bytes\n", sz);
tbl->it_map = page_address(page);
@@ -857,7 +857,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
return NULL;

/* Alloc enough pages (and possibly more) */
- page = alloc_pages_node(node, flag, order);
+ page = alloc_pages_node(node, flag, order, 0);
if (!page)
return NULL;
ret = page_address(page);
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index d7532e7b9ab5..b1189ae1d991 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -565,7 +565,7 @@ static int core_imc_mem_init(int cpu, int size)
/* We need only vbase for core counters */
mem_info->vbase = page_address(alloc_pages_node(nid,
GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE |
- __GFP_NOWARN, get_order(size)));
+ __GFP_NOWARN, get_order(size), 0));
if (!mem_info->vbase)
return -ENOMEM;

@@ -834,7 +834,7 @@ static int thread_imc_mem_alloc(int cpu_id, int size)
*/
local_mem = page_address(alloc_pages_node(nid,
GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE |
- __GFP_NOWARN, get_order(size)));
+ __GFP_NOWARN, get_order(size), 0));
if (!local_mem)
return -ENOMEM;

diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index 12352a58072a..19e3b6b67b50 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -320,7 +320,7 @@ static void cell_iommu_setup_stab(struct cbe_iommu *iommu,

/* set up the segment table */
stab_size = segments * sizeof(unsigned long);
- page = alloc_pages_node(iommu->nid, GFP_KERNEL, get_order(stab_size));
+ page = alloc_pages_node(iommu->nid, GFP_KERNEL, get_order(stab_size), 0);
BUG_ON(!page);
iommu->stab = page_address(page);
memset(iommu->stab, 0, stab_size);
@@ -345,7 +345,7 @@ static unsigned long *cell_iommu_alloc_ptab(struct cbe_iommu *iommu,
ptab_size = segments * pages_per_segment * sizeof(unsigned long);
pr_debug("%s: iommu[%d]: ptab_size: %lu, order: %d\n", __func__,
iommu->nid, ptab_size, get_order(ptab_size));
- page = alloc_pages_node(iommu->nid, GFP_KERNEL, get_order(ptab_size));
+ page = alloc_pages_node(iommu->nid, GFP_KERNEL, get_order(ptab_size), 0);
BUG_ON(!page);

ptab = page_address(page);
@@ -519,7 +519,7 @@ cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
* This code also assumes that we have a window that starts at 0,
* which is the case on all spider based blades.
*/
- page = alloc_pages_node(iommu->nid, GFP_KERNEL, 0);
+ page = alloc_pages_node(iommu->nid, GFP_KERNEL, 0, 0);
BUG_ON(!page);
iommu->pad_page = page_address(page);
clear_page(iommu->pad_page);
diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c
index 2f704afe9af3..7828fe6d2799 100644
--- a/arch/powerpc/platforms/cell/ras.c
+++ b/arch/powerpc/platforms/cell/ras.c
@@ -125,7 +125,7 @@ static int __init cbe_ptcal_enable_on_node(int nid, int order)
area->order = order;
area->pages = __alloc_pages_node(area->nid,
GFP_KERNEL|__GFP_THISNODE,
- area->order);
+ area->order, 0);

if (!area->pages) {
printk(KERN_WARNING "%s: no page on node %d\n",
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index a6c92c78c9b2..29c4dd645c6b 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1811,7 +1811,7 @@ static int pnv_pci_ioda_dma_64bit_bypass(struct pnv_ioda_pe *pe)
table_size = PAGE_SIZE;

table_pages = alloc_pages_node(pe->phb->hose->node, GFP_KERNEL,
- get_order(table_size));
+ get_order(table_size), 0);
if (!table_pages)
goto err;

@@ -2336,7 +2336,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
*/
tce32_segsz = PNV_IODA1_DMA32_SEGSIZE >> (IOMMU_PAGE_SHIFT_4K - 3);
tce_mem = alloc_pages_node(phb->hose->node, GFP_KERNEL,
- get_order(tce32_segsz * segs));
+ get_order(tce32_segsz * segs), 0);
if (!tce_mem) {
pe_err(pe, " Failed to allocate a 32-bit TCE memory\n");
goto fail;
@@ -2762,7 +2762,7 @@ static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift,
unsigned entries = 1UL << (shift - 3);
long i;

- tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
+ tce_mem = alloc_pages_node(nid, GFP_KERNEL, order, 0);
if (!tce_mem) {
pr_err("Failed to allocate a TCE memory, order=%d\n", order);
return NULL;
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 40c06110821c..c5c52046a56b 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1471,7 +1471,7 @@ __be32 *xive_queue_page_alloc(unsigned int cpu, u32 queue_shift)
__be32 *qpage;

alloc_order = xive_alloc_order(queue_shift);
- pages = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, alloc_order);
+ pages = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, alloc_order, 0);
if (!pages)
return ERR_PTR(-ENOMEM);
qpage = (__be32 *)page_address(pages);
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index b08dc3416f06..d5c000368ffc 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -120,7 +120,7 @@ int iommu_table_init(struct iommu *iommu, int tsbsize,
/* Allocate and initialize the dummy page which we
* set inactive IO PTEs to point to.
*/
- page = alloc_pages_node(numa_node, GFP_KERNEL, 0);
+ page = alloc_pages_node(numa_node, GFP_KERNEL, 0, 0);
if (!page) {
printk(KERN_ERR "IOMMU: Error, gfp(dummy_page) failed.\n");
goto out_free_map;
@@ -131,7 +131,7 @@ int iommu_table_init(struct iommu *iommu, int tsbsize,

/* Now allocate and setup the IOMMU page table itself. */
order = get_order(tsbsize);
- page = alloc_pages_node(numa_node, GFP_KERNEL, order);
+ page = alloc_pages_node(numa_node, GFP_KERNEL, order, 0);
if (!page) {
printk(KERN_ERR "IOMMU: Error, gfp(tsb) failed.\n");
goto out_free_dummy_page;
@@ -212,7 +212,7 @@ static void *dma_4u_alloc_coherent(struct device *dev, size_t size,
return NULL;

nid = dev->archdata.numa_node;
- page = alloc_pages_node(nid, gfp, order);
+ page = alloc_pages_node(nid, gfp, order, 0);
if (unlikely(!page))
return NULL;

diff --git a/arch/sparc/kernel/pci_sun4v.c b/arch/sparc/kernel/pci_sun4v.c
index 249367228c33..28b52a8334a8 100644
--- a/arch/sparc/kernel/pci_sun4v.c
+++ b/arch/sparc/kernel/pci_sun4v.c
@@ -197,7 +197,7 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size,
prot = HV_PCI_MAP_ATTR_RELAXED_ORDER;

nid = dev->archdata.numa_node;
- page = alloc_pages_node(nid, gfp, order);
+ page = alloc_pages_node(nid, gfp, order, 0);
if (unlikely(!page))
return NULL;

diff --git a/arch/tile/kernel/machine_kexec.c b/arch/tile/kernel/machine_kexec.c
index 008aa2faef55..e304595ea3c4 100644
--- a/arch/tile/kernel/machine_kexec.c
+++ b/arch/tile/kernel/machine_kexec.c
@@ -215,7 +215,7 @@ static void kexec_find_and_set_command_line(struct kimage *image)
struct page *kimage_alloc_pages_arch(gfp_t gfp_mask, unsigned int order)
{
gfp_mask |= __GFP_THISNODE | __GFP_NORETRY;
- return alloc_pages_node(0, gfp_mask, order);
+ return alloc_pages_node(0, gfp_mask, order, 0);
}

/*
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 4432f31e8479..99580091830b 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -398,7 +398,7 @@ struct page *homecache_alloc_pages_node(int nid, gfp_t gfp_mask,
{
struct page *page;
BUG_ON(gfp_mask & __GFP_HIGHMEM); /* must be lowmem */
- page = alloc_pages_node(nid, gfp_mask, order);
+ page = alloc_pages_node(nid, gfp_mask, order, 0);
if (page)
homecache_change_page_home(page, order, home);
return page;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index da6780122786..2fbb76e62acc 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -321,7 +321,7 @@ static void *dsalloc_pages(size_t size, gfp_t flags, int cpu)
int node = cpu_to_node(cpu);
struct page *page;

- page = __alloc_pages_node(node, flags | __GFP_ZERO, order);
+ page = __alloc_pages_node(node, flags | __GFP_ZERO, order, 0);
return page ? page_address(page) : NULL;
}

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 81fd41d5a0d9..85b6109680fd 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -586,7 +586,7 @@ static struct topa *topa_alloc(int cpu, gfp_t gfp)
struct topa *topa;
struct page *p;

- p = alloc_pages_node(node, gfp | __GFP_ZERO, 0);
+ p = alloc_pages_node(node, gfp | __GFP_ZERO, 0, 0);
if (!p)
return NULL;

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index e5ec3cafa72e..ea8f4b19b10f 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -172,7 +172,7 @@ void init_espfix_ap(int cpu)
pud_p = &espfix_pud_page[pud_index(addr)];
pud = *pud_p;
if (!pud_present(pud)) {
- struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+ struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0, 0);

pmd_p = (pmd_t *)page_address(page);
pud = __pud(__pa(pmd_p) | (PGTABLE_PROT & ptemask));
@@ -184,7 +184,7 @@ void init_espfix_ap(int cpu)
pmd_p = pmd_offset(&pud, addr);
pmd = *pmd_p;
if (!pmd_present(pmd)) {
- struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+ struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0, 0);

pte_p = (pte_t *)page_address(page);
pmd = __pmd(__pa(pte_p) | (PGTABLE_PROT & ptemask));
@@ -194,7 +194,7 @@ void init_espfix_ap(int cpu)
}

pte_p = pte_offset_kernel(&pmd, addr);
- stack_page = page_address(alloc_pages_node(node, GFP_KERNEL, 0));
+ stack_page = page_address(alloc_pages_node(node, GFP_KERNEL, 0, 0));
pte = __pte(__pa(stack_page) | ((__PAGE_KERNEL_RO | _PAGE_ENC) & ptemask));
for (n = 0; n < ESPFIX_PTE_CLONES; n++)
set_pte(&pte_p[n*PTE_STRIDE], pte);
diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index c1bdbd3d3232..195a6df22780 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -117,12 +117,12 @@ void irq_ctx_init(int cpu)

irqstk = page_address(alloc_pages_node(cpu_to_node(cpu),
THREADINFO_GFP,
- THREAD_SIZE_ORDER));
+ THREAD_SIZE_ORDER, 0));
per_cpu(hardirq_stack, cpu) = irqstk;

irqstk = page_address(alloc_pages_node(cpu_to_node(cpu),
THREADINFO_GFP,
- THREAD_SIZE_ORDER));
+ THREAD_SIZE_ORDER, 0));
per_cpu(softirq_stack, cpu) = irqstk;

printk(KERN_DEBUG "CPU %u irqstacks, hard=%p soft=%p\n",
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c29fe81d4209..ae2fd611efcc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3897,7 +3897,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
struct page *pages;
struct vmcs *vmcs;

- pages = __alloc_pages_node(node, GFP_KERNEL, vmcs_config.order);
+ pages = __alloc_pages_node(node, GFP_KERNEL, vmcs_config.order, 0);
if (!pages)
return NULL;
vmcs = page_address(pages);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 16e83e6df404..25ddcacdecd8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2113,7 +2113,7 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
do {
page = alloc_pages_node(node,
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
- this_order);
+ this_order, 0);
if (page)
break;
if (!this_order--)
diff --git a/drivers/char/agp/sgi-agp.c b/drivers/char/agp/sgi-agp.c
index 3051c73bc383..383fb2e2826e 100644
--- a/drivers/char/agp/sgi-agp.c
+++ b/drivers/char/agp/sgi-agp.c
@@ -46,7 +46,7 @@ static struct page *sgi_tioca_alloc_page(struct agp_bridge_data *bridge)
(struct tioca_kernel *)bridge->dev_private_data;

nid = info->ca_closest_node;
- page = alloc_pages_node(nid, GFP_KERNEL, 0);
+ page = alloc_pages_node(nid, GFP_KERNEL, 0, 0);
if (!page)
return NULL;

diff --git a/drivers/edac/thunderx_edac.c b/drivers/edac/thunderx_edac.c
index 4803c6468bab..a4f935098b4c 100644
--- a/drivers/edac/thunderx_edac.c
+++ b/drivers/edac/thunderx_edac.c
@@ -417,7 +417,7 @@ static ssize_t thunderx_lmc_inject_ecc_write(struct file *file,

atomic_set(&lmc->ecc_int, 0);

- lmc->mem = alloc_pages_node(lmc->node, GFP_KERNEL, 0);
+ lmc->mem = alloc_pages_node(lmc->node, GFP_KERNEL, 0, 0);

if (!lmc->mem)
return -ENOMEM;
diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index ba0a092ae085..31e99a9bb2de 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -98,7 +98,7 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size,
page = alloc_pages_node(cpu_to_node(newchannel->target_cpu),
GFP_KERNEL|__GFP_ZERO,
get_order(send_ringbuffer_size +
- recv_ringbuffer_size));
+ recv_ringbuffer_size), 0);

if (!page)
page = alloc_pages(GFP_KERNEL|__GFP_ZERO,
diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 9a7ffd13c7f0..7dc7252365e2 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1449,7 +1449,8 @@ int dmar_enable_qi(struct intel_iommu *iommu)
qi = iommu->qi;


- desc_page = alloc_pages_node(iommu->node, GFP_ATOMIC | __GFP_ZERO, 0);
+ desc_page = alloc_pages_node(iommu->node, GFP_ATOMIC | __GFP_ZERO,
+ 0, 0);
if (!desc_page) {
kfree(qi);
iommu->qi = NULL;
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 24d1b1b42013..a0a2d71f4d4b 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -635,7 +635,7 @@ static inline void *alloc_pgtable_page(int node)
struct page *page;
void *vaddr = NULL;

- page = alloc_pages_node(node, GFP_ATOMIC | __GFP_ZERO, 0);
+ page = alloc_pages_node(node, GFP_ATOMIC | __GFP_ZERO, 0, 0);
if (page)
vaddr = page_address(page);
return vaddr;
diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 66f69af2c219..528110205d18 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -513,7 +513,7 @@ static int intel_setup_irq_remapping(struct intel_iommu *iommu)
return -ENOMEM;

pages = alloc_pages_node(iommu->node, GFP_KERNEL | __GFP_ZERO,
- INTR_REMAP_PAGE_ORDER);
+ INTR_REMAP_PAGE_ORDER, 0);
if (!pages) {
pr_err("IR%d: failed to allocate pages of order %d\n",
iommu->seq_id, INTR_REMAP_PAGE_ORDER);
diff --git a/drivers/misc/sgi-gru/grufile.c b/drivers/misc/sgi-gru/grufile.c
index 104a05f6b738..7b29cc1f4072 100644
--- a/drivers/misc/sgi-gru/grufile.c
+++ b/drivers/misc/sgi-gru/grufile.c
@@ -276,7 +276,7 @@ static int gru_init_tables(unsigned long gru_base_paddr, void *gru_base_vaddr)
for_each_possible_blade(bid) {
pnode = uv_blade_to_pnode(bid);
nid = uv_blade_to_memory_nid(bid);/* -1 if no memory on blade */
- page = alloc_pages_node(nid, GFP_KERNEL, order);
+ page = alloc_pages_node(nid, GFP_KERNEL, order, 0);
if (!page)
goto fail;
gru_base[bid] = page_address(page);
diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
index 340b44d9e8cf..4f7d15e6370d 100644
--- a/drivers/misc/sgi-xp/xpc_uv.c
+++ b/drivers/misc/sgi-xp/xpc_uv.c
@@ -241,7 +241,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
nid = cpu_to_node(cpu);
page = __alloc_pages_node(nid,
GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
- pg_order);
+ pg_order, 0);
if (page == NULL) {
dev_err(xpc_part, "xpc_create_gru_mq_uv() failed to alloc %d "
"bytes of memory on nid=%d for GRU mq\n", mq_size, nid);
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-desc.c b/drivers/net/ethernet/amd/xgbe/xgbe-desc.c
index cc1e4f820e64..549daa8f7632 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-desc.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-desc.c
@@ -297,7 +297,7 @@ static int xgbe_alloc_pages(struct xgbe_prv_data *pdata,
/* Try to obtain pages, decreasing order if necessary */
gfp = GFP_ATOMIC | __GFP_COMP | __GFP_NOWARN;
while (order >= 0) {
- pages = alloc_pages_node(node, gfp, order);
+ pages = alloc_pages_node(node, gfp, order, 0);
if (pages)
break;

diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 6e310a0da7c9..ec93ff44eec6 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -592,7 +592,8 @@ static unsigned int refill_fl(struct adapter *adap, struct sge_fl *q, int n,
* Prefer large buffers
*/
while (n) {
- pg = alloc_pages_node(node, gfp | __GFP_COMP, s->fl_pg_order);
+ pg = alloc_pages_node(node, gfp | __GFP_COMP,
+ s->fl_pg_order, 0);
if (unlikely(!pg)) {
q->large_alloc_failed++;
break; /* fall back to single pages */
@@ -623,7 +624,7 @@ static unsigned int refill_fl(struct adapter *adap, struct sge_fl *q, int n,

alloc_small_pages:
while (n--) {
- pg = alloc_pages_node(node, gfp, 0);
+ pg = alloc_pages_node(node, gfp, 0, 0);
if (unlikely(!pg)) {
q->alloc_failed++;
break;
diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a56bc5..f8281df897f3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -99,7 +99,7 @@ static int mlx4_alloc_icm_pages(struct scatterlist *mem, int order,
{
struct page *page;

- page = alloc_pages_node(node, gfp_mask, order);
+ page = alloc_pages_node(node, gfp_mask, order, 0);
if (!page) {
page = alloc_pages(gfp_mask, order);
if (!page)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
index e36d3e3675f9..2c0b075e22bb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
@@ -214,7 +214,7 @@ static int alloc_system_page(struct mlx5_core_dev *dev, u16 func_id)
int err;
int nid = dev_to_node(&dev->pdev->dev);

- page = alloc_pages_node(nid, GFP_HIGHUSER, 0);
+ page = alloc_pages_node(nid, GFP_HIGHUSER, 0, 0);
if (!page) {
mlx5_core_warn(dev, "failed to allocate page\n");
return -ENOMEM;
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index ec84edfda271..c7f5c50b2250 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -1101,7 +1101,7 @@ int kiblnd_alloc_pages(struct kib_pages **pp, int cpt, int npages)
for (i = 0; i < npages; i++) {
p->ibp_pages[i] = alloc_pages_node(
cfs_cpt_spread_node(lnet_cpt_table(), cpt),
- GFP_NOFS, 0);
+ GFP_NOFS, 0, 0);
if (!p->ibp_pages[i]) {
CERROR("Can't allocate page %d of %d\n", i, npages);
kiblnd_free_pages(p);
diff --git a/drivers/staging/lustre/lnet/lnet/router.c b/drivers/staging/lustre/lnet/lnet/router.c
index 6504761ca598..5604da4bcc0e 100644
--- a/drivers/staging/lustre/lnet/lnet/router.c
+++ b/drivers/staging/lustre/lnet/lnet/router.c
@@ -1320,7 +1320,7 @@ lnet_new_rtrbuf(struct lnet_rtrbufpool *rbp, int cpt)
for (i = 0; i < npages; i++) {
page = alloc_pages_node(
cfs_cpt_spread_node(lnet_cpt_table(), cpt),
- GFP_KERNEL | __GFP_ZERO, 0);
+ GFP_KERNEL | __GFP_ZERO, 0, 0);
if (!page) {
while (--i >= 0)
__free_page(rb->rb_kiov[i].bv_page);
diff --git a/drivers/staging/lustre/lnet/selftest/rpc.c b/drivers/staging/lustre/lnet/selftest/rpc.c
index f8198ad1046e..2bdf6bc716fe 100644
--- a/drivers/staging/lustre/lnet/selftest/rpc.c
+++ b/drivers/staging/lustre/lnet/selftest/rpc.c
@@ -142,7 +142,7 @@ srpc_alloc_bulk(int cpt, unsigned int bulk_off, unsigned int bulk_npg,
int nob;

pg = alloc_pages_node(cfs_cpt_spread_node(lnet_cpt_table(), cpt),
- GFP_KERNEL, 0);
+ GFP_KERNEL, 0, 0);
if (!pg) {
CERROR("Can't allocate page %d of %d\n", i, bulk_npg);
srpc_free_bulk(bk);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a4582b44d32..d9d45f47447d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -447,13 +447,14 @@ static inline void arch_alloc_page(struct page *page, int order) { }
#endif

struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
- nodemask_t *nodemask);
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int keyid,
+ int preferred_nid, nodemask_t *nodemask);

static inline struct page *
-__alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
+__alloc_pages(gfp_t gfp_mask, unsigned int order, int keyid, int preferred_nid)
{
- return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
+ return __alloc_pages_nodemask(gfp_mask, order, keyid, preferred_nid,
+ NULL);
}

/*
@@ -461,12 +462,12 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
* online. For more general interface, see alloc_pages_node().
*/
static inline struct page *
-__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
+__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order, int keyid)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
VM_WARN_ON(!node_online(nid));

- return __alloc_pages(gfp_mask, order, nid);
+ return __alloc_pages(gfp_mask, order, keyid, nid);
}

/*
@@ -475,12 +476,12 @@ __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
* online.
*/
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
- unsigned int order)
+ unsigned int order, int keyid)
{
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();

- return __alloc_pages_node(nid, gfp_mask, order);
+ return __alloc_pages_node(nid, gfp_mask, order, keyid);
}

#ifdef CONFIG_NUMA
@@ -494,21 +495,19 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
struct vm_area_struct *vma, unsigned long addr,
int node, bool hugepage);
-#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
- alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
#else
#define alloc_pages(gfp_mask, order) \
- alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
- alloc_pages(gfp_mask, order)
-#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
- alloc_pages(gfp_mask, order)
+ alloc_pages_node(numa_node_id(), gfp_mask, order, 0)
+#define alloc_pages_vma(gfp_mask, order, vma, addr, node, hugepage) \
+ alloc_pages_node(numa_node_id(), gfp_mask, order, vma_keyid(vma))
#endif
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
#define alloc_page_vma(gfp_mask, vma, addr) \
alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
#define alloc_page_vma_node(gfp_mask, vma, addr, node) \
alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
+ alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)

extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a2246cf670ba..b8e62d3b3200 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -51,7 +51,7 @@ static inline struct page *new_page_nodemask(struct page *page,
if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
gfp_mask |= __GFP_HIGHMEM;

- new_page = __alloc_pages_nodemask(gfp_mask, order,
+ new_page = __alloc_pages_nodemask(gfp_mask, order, page_keyid(page),
preferred_nid, nodemask);

if (new_page && PageTransHuge(new_page))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b6a72eb82f4b..1287d1a50abf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1493,6 +1493,13 @@ static inline int vma_keyid(struct vm_area_struct *vma)
}
#endif

+#ifndef page_keyid
+static inline int page_keyid(struct page *page)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_SHMEM
/*
* The vma_is_shmem is not inline because it is used only by slow
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 99df17109e1b..d785a7935770 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2669,7 +2669,7 @@ static inline struct page *__dev_alloc_pages(gfp_t gfp_mask,
*/
gfp_mask |= __GFP_COMP | __GFP_MEMALLOC;

- return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
+ return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order, 0);
}

static inline struct page *dev_alloc_pages(unsigned int order)
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 6c6b3c48db71..89b98a80817f 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -529,7 +529,7 @@ static struct page *rb_alloc_aux_page(int node, int order)
order = MAX_ORDER;

do {
- page = alloc_pages_node(node, PERF_AUX_GFP, order);
+ page = alloc_pages_node(node, PERF_AUX_GFP, order, 0);
} while (!page && order--);

if (page && order) {
@@ -706,7 +706,7 @@ static void *perf_mmap_alloc_page(int cpu)
int node;

node = (cpu == -1) ? cpu : cpu_to_node(cpu);
- page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0, 0);
if (!page)
return NULL;

diff --git a/kernel/fork.c b/kernel/fork.c
index e5d9d405ae4e..6fb66ab00f18 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -240,7 +240,7 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
return stack;
#else
struct page *page = alloc_pages_node(node, THREADINFO_GFP,
- THREAD_SIZE_ORDER);
+ THREAD_SIZE_ORDER, 0);

return page ? page_address(page) : NULL;
#endif
diff --git a/kernel/profile.c b/kernel/profile.c
index 9aa2a4445b0d..600b47951492 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -359,7 +359,7 @@ static int profile_prepare_cpu(unsigned int cpu)
if (per_cpu(cpu_profile_hits, cpu)[i])
continue;

- page = __alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+ page = __alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0, 0);
if (!page) {
profile_dead_cpu(cpu);
return -ENOMEM;
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index dcf1c4dd3efe..68f10b4086ce 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1152,7 +1152,7 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
list_add(&bpage->list, pages);

page = alloc_pages_node(cpu_to_node(cpu),
- GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
+ GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0, 0);
if (!page)
goto free_pages;
bpage->page = page_address(page);
@@ -1227,7 +1227,7 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, long nr_pages, int cpu)
rb_check_bpage(cpu_buffer, bpage);

cpu_buffer->reader_page = bpage;
- page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
+ page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0, 0);
if (!page)
goto fail_free_reader;
bpage->page = page_address(page);
@@ -4406,7 +4406,7 @@ void *ring_buffer_alloc_read_page(struct ring_buffer *buffer, int cpu)
goto out;

page = alloc_pages_node(cpu_to_node(cpu),
- GFP_KERNEL | __GFP_NORETRY, 0);
+ GFP_KERNEL | __GFP_NORETRY, 0, 0);
if (!page)
return ERR_PTR(-ENOMEM);

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 300f4ea39646..f98c0062e946 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2176,7 +2176,7 @@ void trace_buffered_event_enable(void)

for_each_tracing_cpu(cpu) {
page = alloc_pages_node(cpu_to_node(cpu),
- GFP_KERNEL | __GFP_NORETRY, 0);
+ GFP_KERNEL | __GFP_NORETRY, 0, 0);
if (!page)
goto failed;

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 2014f4351ae0..c31797eb7936 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -710,7 +710,7 @@ static int uprobe_buffer_init(void)

for_each_possible_cpu(cpu) {
struct page *p = alloc_pages_node(cpu_to_node(cpu),
- GFP_KERNEL, 0);
+ GFP_KERNEL, 0, 0);
if (p == NULL) {
err_cpu = cpu;
goto err;
diff --git a/lib/dma-direct.c b/lib/dma-direct.c
index 1277d293d4da..687238f304b2 100644
--- a/lib/dma-direct.c
+++ b/lib/dma-direct.c
@@ -75,7 +75,7 @@ void *dma_direct_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
}
}
if (!page)
- page = alloc_pages_node(dev_to_node(dev), gfp, page_order);
+ page = alloc_pages_node(dev_to_node(dev), gfp, page_order, 0);

if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
__free_pages(page, page_order);
diff --git a/mm/filemap.c b/mm/filemap.c
index 693f62212a59..89e32eb8bf9a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -937,7 +937,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
do {
cpuset_mems_cookie = read_mems_allowed_begin();
n = cpuset_mem_spread_node();
- page = __alloc_pages_node(n, gfp, 0);
+ page = __alloc_pages_node(n, gfp, 0, 0);
} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));

return page;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 976bbc5646fe..4a65099e1074 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1388,7 +1388,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
gfp_mask |= __GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();
- page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
+ page = __alloc_pages_nodemask(gfp_mask, order, 0, nid, nmask);
if (page)
__count_vm_event(HTLB_BUDDY_PGALLOC);
else
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 42f33fd526a0..2451a379c0ed 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -751,7 +751,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
{
VM_BUG_ON_PAGE(*hpage, *hpage);

- *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
+ *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER, 0);
if (unlikely(!*hpage)) {
count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
*hpage = ERR_PTR(-ENOMEM);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 32cba0332787..c2507aceef96 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -952,14 +952,16 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x

thp = alloc_pages_node(node,
(GFP_TRANSHUGE | __GFP_THISNODE),
- HPAGE_PMD_ORDER);
+ HPAGE_PMD_ORDER, page_keyid(page));
if (!thp)
return NULL;
prep_transhuge_page(thp);
return thp;
- } else
- return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
- __GFP_THISNODE, 0);
+ } else {
+ return __alloc_pages_node(node,
+ GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
+ 0, page_keyid(page));
+ }
}

/*
@@ -1929,11 +1931,11 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
/* Allocate a page in interleaved policy.
Own path because it needs to do special accounting. */
static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
- unsigned nid)
+ unsigned nid, int keyid)
{
struct page *page;

- page = __alloc_pages(gfp, order, nid);
+ page = __alloc_pages(gfp, order, keyid, nid);
/* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
if (!static_branch_likely(&vm_numa_stat_key))
return page;
@@ -1976,15 +1978,17 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
struct page *page;
int preferred_nid;
nodemask_t *nmask;
+ int keyid;

pol = get_vma_policy(vma, addr);
+ keyid = vma_keyid(vma);

if (pol->mode == MPOL_INTERLEAVE) {
unsigned nid;

nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
mpol_cond_put(pol);
- page = alloc_page_interleave(gfp, order, nid);
+ page = alloc_page_interleave(gfp, order, nid, keyid);
goto out;
}

@@ -2009,14 +2013,15 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
if (!nmask || node_isset(hpage_node, *nmask)) {
mpol_cond_put(pol);
page = __alloc_pages_node(hpage_node,
- gfp | __GFP_THISNODE, order);
+ gfp | __GFP_THISNODE,
+ order, keyid);
goto out;
}
}

nmask = policy_nodemask(gfp, pol);
preferred_nid = policy_node(gfp, pol, node);
- page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
+ page = __alloc_pages_nodemask(gfp, order, keyid, preferred_nid, nmask);
mpol_cond_put(pol);
out:
return page;
@@ -2049,12 +2054,14 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
* No reference counting needed for current->mempolicy
* nor system default_policy
*/
- if (pol->mode == MPOL_INTERLEAVE)
- page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
- else
- page = __alloc_pages_nodemask(gfp, order,
+ if (pol->mode == MPOL_INTERLEAVE) {
+ page = alloc_page_interleave(gfp, order,
+ interleave_nodes(pol), 0);
+ } else {
+ page = __alloc_pages_nodemask(gfp, order, 0,
policy_node(gfp, pol, numa_node_id()),
policy_nodemask(gfp, pol));
+ }

return page;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 1e5525a25691..65d01c4479d6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1474,14 +1474,16 @@ static struct page *new_page_node(struct page *p, unsigned long private,

thp = alloc_pages_node(pm->node,
(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
- HPAGE_PMD_ORDER);
+ HPAGE_PMD_ORDER, page_keyid(p));
if (!thp)
return NULL;
prep_transhuge_page(thp);
return thp;
- } else
+ } else {
return __alloc_pages_node(pm->node,
- GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
+ GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
+ 0, page_keyid(p));
+ }
}

/*
@@ -1845,7 +1847,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
(GFP_HIGHUSER_MOVABLE |
__GFP_THISNODE | __GFP_NOMEMALLOC |
__GFP_NORETRY | __GFP_NOWARN) &
- ~__GFP_RECLAIM, 0);
+ ~__GFP_RECLAIM, 0, page_keyid(page));

return newpage;
}
@@ -2019,7 +2021,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,

new_page = alloc_pages_node(node,
(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
- HPAGE_PMD_ORDER);
+ HPAGE_PMD_ORDER, page_keyid(page));
if (!new_page)
goto out_fail;
prep_transhuge_page(new_page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1741dd23e7c1..229cdab065ca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4226,8 +4226,8 @@ static inline void finalise_ac(gfp_t gfp_mask,
* This is the 'heart' of the zoned buddy allocator.
*/
struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
- nodemask_t *nodemask)
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int keyid,
+ int preferred_nid, nodemask_t *nodemask)
{
struct page *page;
unsigned int alloc_flags = ALLOC_WMARK_LOW;
@@ -4346,11 +4346,11 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
__GFP_NOMEMALLOC;
page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
- PAGE_FRAG_CACHE_MAX_ORDER);
+ PAGE_FRAG_CACHE_MAX_ORDER, 0);
nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
#endif
if (unlikely(!page))
- page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
+ page = alloc_pages_node(NUMA_NO_NODE, gfp, 0, 0);

nc->va = page ? page_address(page) : NULL;

@@ -4490,7 +4490,7 @@ EXPORT_SYMBOL(alloc_pages_exact);
void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
{
unsigned int order = get_order(size);
- struct page *p = alloc_pages_node(nid, gfp_mask, order);
+ struct page *p = alloc_pages_node(nid, gfp_mask, order, 0);
if (!p)
return NULL;
return make_alloc_exact((unsigned long)page_address(p), order, size);
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index d8078de912de..9e19197f351c 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -92,7 +92,7 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
for (i = page_start; i < page_end; i++) {
struct page **pagep = &pages[pcpu_page_idx(cpu, i)];

- *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+ *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0, 0);
if (!*pagep)
goto err;
}
diff --git a/mm/slab.c b/mm/slab.c
index 324446621b3e..56f42e4ba507 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1407,7 +1407,7 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,

flags |= cachep->allocflags;

- page = __alloc_pages_node(nodeid, flags, cachep->gfporder);
+ page = __alloc_pages_node(nodeid, flags, cachep->gfporder, 0);
if (!page) {
slab_out_of_memory(cachep, flags, nodeid);
return NULL;
diff --git a/mm/slob.c b/mm/slob.c
index 623e8a5c46ce..062f7acd7248 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -194,7 +194,7 @@ static void *slob_new_pages(gfp_t gfp, int order, int node)

#ifdef CONFIG_NUMA
if (node != NUMA_NO_NODE)
- page = __alloc_pages_node(node, gfp, order);
+ page = __alloc_pages_node(node, gfp, order, 0);
else
#endif
page = alloc_pages(gfp, order);
diff --git a/mm/slub.c b/mm/slub.c
index e381728a3751..287a9b65da67 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1440,7 +1440,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
if (node == NUMA_NO_NODE)
page = alloc_pages(flags, order);
else
- page = __alloc_pages_node(node, flags, order);
+ page = __alloc_pages_node(node, flags, order, 0);

if (page && memcg_charge_slab(page, flags, order, s)) {
__free_pages(page, order);
@@ -3772,7 +3772,7 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
void *ptr = NULL;

flags |= __GFP_COMP;
- page = alloc_pages_node(node, flags, get_order(size));
+ page = alloc_pages_node(node, flags, get_order(size), 0);
if (page)
ptr = page_address(page);

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index bd0276d5f66b..f6648ecb9837 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -58,7 +58,7 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
static bool warned;
struct page *page;

- page = alloc_pages_node(node, gfp_mask, order);
+ page = alloc_pages_node(node, gfp_mask, order, 0);
if (page)
return page_address(page);

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ebff729cc956..33095a17c20e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1695,10 +1695,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
for (i = 0; i < area->nr_pages; i++) {
struct page *page;

- if (node == NUMA_NO_NODE)
+ if (node == NUMA_NO_NODE) {
page = alloc_page(alloc_mask|highmem_mask);
- else
- page = alloc_pages_node(node, alloc_mask|highmem_mask, 0);
+ } else {
+ page = alloc_pages_node(node,
+ alloc_mask|highmem_mask, 0, 0);
+ }

if (unlikely(!page)) {
/* Successfully allocated i pages, free them in __vunmap() */
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index b8ab5c829511..4f8dd0467e1a 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -2651,7 +2651,7 @@ static void pktgen_finalize_skb(struct pktgen_dev *pkt_dev, struct sk_buff *skb,

if (pkt_dev->node >= 0 && (pkt_dev->flags & F_NODE))
node = pkt_dev->node;
- pkt_dev->page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+ pkt_dev->page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0, 0);
if (!pkt_dev->page)
break;
}
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 387cc4add6f6..a4bc01b6305f 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -577,7 +577,7 @@ svc_init_buffer(struct svc_rqst *rqstp, unsigned int size, int node)
if (pages > RPCSVC_MAXPAGES)
pages = RPCSVC_MAXPAGES;
while (pages) {
- struct page *p = alloc_pages_node(node, GFP_KERNEL, 0);
+ struct page *p = alloc_pages_node(node, GFP_KERNEL, 0, 0);
if (!p)
break;
rqstp->rq_pages[arghi++] = p;
--
2.16.2


2018-03-28 16:59:01

by Kirill A. Shutemov

Subject: [PATCHv2 09/14] x86/mm: Introduce variables to store number, shift and mask of KeyIDs

mktme_nr_keyids holds the number of KeyIDs available for MKTME, excluding
KeyID zero, which is used by TME. MKTME KeyIDs start from 1.

mktme_keyid_shift holds the shift of the KeyID within the physical address.

mktme_keyid_mask holds the mask to extract the KeyID from the physical address.
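
Taken together, the three variables let a physical address (or a page table
entry) be split into its KeyID and address bits. A minimal sketch, not part
of the patch:

	#include <asm/mktme.h>

	/* Illustrative only: extract the KeyID bits from a physical address. */
	static inline int paddr_keyid_sketch(phys_addr_t paddr)
	{
		if (!mktme_nr_keyids)
			return 0;	/* MKTME not enumerated: only KeyID-0 */

		return (paddr & mktme_keyid_mask) >> mktme_keyid_shift;
	}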

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 16 ++++++++++++++++
arch/x86/kernel/cpu/intel.c | 12 ++++++++----
arch/x86/mm/Makefile | 2 ++
arch/x86/mm/mktme.c | 5 +++++
4 files changed, 31 insertions(+), 4 deletions(-)
create mode 100644 arch/x86/include/asm/mktme.h
create mode 100644 arch/x86/mm/mktme.c

diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
new file mode 100644
index 000000000000..df31876ec48c
--- /dev/null
+++ b/arch/x86/include/asm/mktme.h
@@ -0,0 +1,16 @@
+#ifndef _ASM_X86_MKTME_H
+#define _ASM_X86_MKTME_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_X86_INTEL_MKTME
+extern phys_addr_t mktme_keyid_mask;
+extern int mktme_nr_keyids;
+extern int mktme_keyid_shift;
+#else
+#define mktme_keyid_mask ((phys_addr_t)0)
+#define mktme_nr_keyids 0
+#define mktme_keyid_shift 0
+#endif
+
+#endif
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index a5b9d3dfa0c1..5de02451c29b 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -588,6 +588,9 @@ static void detect_tme(struct cpuinfo_x86 *c)

#ifdef CONFIG_X86_INTEL_MKTME
if (mktme_status == MKTME_ENABLED && nr_keyids) {
+ mktme_nr_keyids = nr_keyids;
+ mktme_keyid_shift = c->x86_phys_bits - keyid_bits;
+
/*
* Mask out bits claimed from KeyID from physical address mask.
*
@@ -595,10 +598,8 @@ static void detect_tme(struct cpuinfo_x86 *c)
* and number of bits claimed for KeyID is 6, bits 51:46 of
* physical address is unusable.
*/
- phys_addr_t keyid_mask;
-
- keyid_mask = GENMASK_ULL(c->x86_phys_bits - 1, c->x86_phys_bits - keyid_bits);
- physical_mask &= ~keyid_mask;
+ mktme_keyid_mask = GENMASK_ULL(c->x86_phys_bits - 1, mktme_keyid_shift);
+ physical_mask &= ~mktme_keyid_mask;
} else {
/*
* Reset __PHYSICAL_MASK.
@@ -606,6 +607,9 @@ static void detect_tme(struct cpuinfo_x86 *c)
* between CPUs.
*/
physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
+ mktme_keyid_mask = 0;
+ mktme_keyid_shift = 0;
+ mktme_nr_keyids = 0;
}
#endif

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 4b101dd6e52f..4ebee899c363 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -53,3 +53,5 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
+
+obj-$(CONFIG_X86_INTEL_MKTME) += mktme.o
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
new file mode 100644
index 000000000000..467f1b26c737
--- /dev/null
+++ b/arch/x86/mm/mktme.c
@@ -0,0 +1,5 @@
+#include <asm/mktme.h>
+
+phys_addr_t mktme_keyid_mask;
+int mktme_nr_keyids;
+int mktme_keyid_shift;
--
2.16.2


2018-03-28 16:59:07

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 12/14] x86/mm: Implement page_keyid() using page_ext

Store the KeyID in bits 31:16 of the extended page flags. These bits are
unused.

We don't yet set the KeyID for any page. That will come in the following
patch, which implements prep_encrypted_page(). All pages have KeyID-0 for
now.
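
As a rough sketch of where this is heading (illustrative only; the setter
below is made up, the real one arrives with prep_encrypted_page()):

static inline void page_set_keyid(struct page *page, int keyid)
{
        /* Record the KeyID in the page_ext union introduced below */
        lookup_page_ext(page)->keyid = keyid;
}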

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 12 ++++++++++++
arch/x86/include/asm/page.h | 1 +
arch/x86/mm/mktme.c | 12 ++++++++++++
include/linux/page_ext.h | 11 ++++++++++-
mm/page_ext.c | 3 +++
5 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index 08f613953207..5f440d57aa47 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -2,6 +2,7 @@
#define _ASM_X86_MKTME_H

#include <linux/types.h>
+#include <linux/page_ext.h>

struct vm_area_struct;

@@ -16,6 +17,17 @@ bool vma_is_encrypted(struct vm_area_struct *vma);
#define vma_keyid vma_keyid
int vma_keyid(struct vm_area_struct *vma);

+extern struct page_ext_operations page_mktme_ops;
+
+#define page_keyid page_keyid
+static inline int page_keyid(struct page *page)
+{
+ if (!mktme_nr_keyids)
+ return 0;
+
+ return lookup_page_ext(page)->keyid;
+}
+
#else
#define mktme_keyid_mask ((phys_addr_t)0)
#define mktme_nr_keyids 0
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 7555b48803a8..39af59487d5f 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -19,6 +19,7 @@
struct page;

#include <linux/range.h>
+#include <asm/mktme.h>
extern struct range pfn_mapped[];
extern int nr_pfn_mapped;

diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index 3b2f28a21d99..3da25212a372 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -20,3 +20,15 @@ int vma_keyid(struct vm_area_struct *vma)
prot = pgprot_val(vma->vm_page_prot);
return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
}
+
+static bool need_page_mktme(void)
+{
+ /* Make sure keyid doesn't collide with extended page flags */
+ BUILD_BUG_ON(__NR_PAGE_EXT_FLAGS > 16);
+
+ return true;
+}
+
+struct page_ext_operations page_mktme_ops = {
+ .need = need_page_mktme,
+};
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index bbec618a614b..5b9cb41ec1ca 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -23,6 +23,7 @@ enum page_ext_flags {
PAGE_EXT_YOUNG,
PAGE_EXT_IDLE,
#endif
+ __NR_PAGE_EXT_FLAGS
};

/*
@@ -33,7 +34,15 @@ enum page_ext_flags {
* then the page_ext for pfn always exists.
*/
struct page_ext {
- unsigned long flags;
+ union {
+ unsigned long flags;
+#ifdef CONFIG_X86_INTEL_MKTME
+ struct {
+ unsigned short __pad;
+ unsigned short keyid;
+ };
+#endif
+ };
};

extern void pgdat_page_ext_init(struct pglist_data *pgdat);
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 5295ef331165..be0d7da8f5ea 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -68,6 +68,9 @@ static struct page_ext_operations *page_ext_ops[] = {
#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
&page_idle_ops,
#endif
+#ifdef CONFIG_X86_INTEL_MKTME
+ &page_mktme_ops,
+#endif
};

static unsigned long total_usage;
--
2.16.2


2018-03-28 16:59:52

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 01/14] x86/mm: Decouple dynamic __PHYSICAL_MASK from AMD SME

AMD SME claims one bit from the physical address to indicate whether the
page is encrypted or not. To achieve that we clear out the bit from
__PHYSICAL_MASK.

The capability to adjust __PHYSICAL_MASK is required beyond AMD SME.
For instance, it is needed for the upcoming Intel Multi-Key Total Memory
Encryption.

Factor it out into a separate feature with its own Kconfig handle.

It also helps with the overhead of AMD SME. It saves more than 3k of .text
on defconfig + AMD_MEM_ENCRYPT:

add/remove: 3/2 grow/shrink: 5/110 up/down: 189/-3753 (-3564)

We would need to return to this once we have infrastructure to patch
constants in code. That's a good candidate for it.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 4 ++++
arch/x86/boot/compressed/kaslr_64.c | 5 +++++
arch/x86/include/asm/page_types.h | 8 +++++++-
arch/x86/mm/mem_encrypt_identity.c | 3 +++
arch/x86/mm/pgtable.c | 5 +++++
5 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 27fede438959..bf68138662c8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -332,6 +332,9 @@ config ARCH_SUPPORTS_UPROBES
config FIX_EARLYCON_MEM
def_bool y

+config DYNAMIC_PHYSICAL_MASK
+ bool
+
config PGTABLE_LEVELS
int
default 5 if X86_5LEVEL
@@ -1503,6 +1506,7 @@ config ARCH_HAS_MEM_ENCRYPT
config AMD_MEM_ENCRYPT
bool "AMD Secure Memory Encryption (SME) support"
depends on X86_64 && CPU_SUP_AMD
+ select DYNAMIC_PHYSICAL_MASK
---help---
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/kaslr_64.c b/arch/x86/boot/compressed/kaslr_64.c
index 522d11431433..748456c365f4 100644
--- a/arch/x86/boot/compressed/kaslr_64.c
+++ b/arch/x86/boot/compressed/kaslr_64.c
@@ -69,6 +69,8 @@ static struct alloc_pgt_data pgt_data;
/* The top level page table entry pointer. */
static unsigned long top_level_pgt;

+phys_addr_t physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
+
/*
* Mapping information structure passed to kernel_ident_mapping_init().
* Due to relocation, pointers must be assigned at run time not build time.
@@ -81,6 +83,9 @@ void initialize_identity_maps(void)
/* If running as an SEV guest, the encryption mask is required. */
set_sev_encryption_mask();

+ /* Exclude the encryption mask from __PHYSICAL_MASK */
+ physical_mask &= ~sme_me_mask;
+
/* Init mapping_info with run-time function/buffer pointers. */
mapping_info.alloc_pgt_page = alloc_pgt_page;
mapping_info.context = &pgt_data;
diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 1e53560a84bb..c85e15010f48 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -17,7 +17,6 @@
#define PUD_PAGE_SIZE (_AC(1, UL) << PUD_SHIFT)
#define PUD_PAGE_MASK (~(PUD_PAGE_SIZE-1))

-#define __PHYSICAL_MASK ((phys_addr_t)(__sme_clr((1ULL << __PHYSICAL_MASK_SHIFT) - 1)))
#define __VIRTUAL_MASK ((1UL << __VIRTUAL_MASK_SHIFT) - 1)

/* Cast *PAGE_MASK to a signed type so that it is sign-extended if
@@ -55,6 +54,13 @@

#ifndef __ASSEMBLY__

+#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
+extern phys_addr_t physical_mask;
+#define __PHYSICAL_MASK physical_mask
+#else
+#define __PHYSICAL_MASK ((phys_addr_t)((1ULL << __PHYSICAL_MASK_SHIFT) - 1))
+#endif
+
extern int devmem_is_allowed(unsigned long pagenr);

extern unsigned long max_low_pfn_mapped;
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 1b2197d13832..7ae36868aed2 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -527,6 +527,7 @@ void __init sme_enable(struct boot_params *bp)
/* SEV state cannot be controlled by a command line option */
sme_me_mask = me_mask;
sev_enabled = true;
+ physical_mask &= ~sme_me_mask;
return;
}

@@ -561,4 +562,6 @@ void __init sme_enable(struct boot_params *bp)
sme_me_mask = 0;
else
sme_me_mask = active_by_default ? me_mask : 0;
+
+ physical_mask &= ~sme_me_mask;
}
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 34cda7e0551b..0199b94e6b40 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -7,6 +7,11 @@
#include <asm/fixmap.h>
#include <asm/mtrr.h>

+#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
+phys_addr_t physical_mask __ro_after_init = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
+EXPORT_SYMBOL(physical_mask);
+#endif
+
#define PGALLOC_GFP (GFP_KERNEL_ACCOUNT | __GFP_ZERO)

#ifdef CONFIG_HIGHPTE
--
2.16.2


2018-03-28 17:00:10

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 04/14] mm: Do no merge vma with different encryption KeyIDs

VMAs with different KeyIDs must not be merged together. Only VMAs with the
same KeyID are compatible.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 7 +++++++
mm/mmap.c | 3 ++-
2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42adb1a..6c50f77c75d5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1479,6 +1479,13 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma)
return !vma->vm_ops;
}

+#ifndef vma_keyid
+static inline int vma_keyid(struct vm_area_struct *vma)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_SHMEM
/*
* The vma_is_shmem is not inline because it is used only by slow
diff --git a/mm/mmap.c b/mm/mmap.c
index 9efdc021ad22..fa218d1c6bfa 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1208,7 +1208,8 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *
mpol_equal(vma_policy(a), vma_policy(b)) &&
a->vm_file == b->vm_file &&
!((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC|VM_SOFTDIRTY)) &&
- b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
+ b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT) &&
+ vma_keyid(a) == vma_keyid(b);
}

/*
--
2.16.2


2018-03-28 17:01:11

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 03/14] mm/shmem: Zero out unused vma fields in shmem_pseudo_vma_init()

shmem/tmpfs uses a pseudo vma to allocate pages with the correct NUMA policy.

The pseudo vma doesn't have vm_page_prot set. We are going to encode the
encryption KeyID in vm_page_prot. Having garbage there causes problems.

Zero out all unused fields in the pseudo vma.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/shmem.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index b85919243399..387b89d9e17a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1404,10 +1404,9 @@ static void shmem_pseudo_vma_init(struct vm_area_struct *vma,
struct shmem_inode_info *info, pgoff_t index)
{
/* Create a pseudo vma that just contains the policy */
- vma->vm_start = 0;
+ memset(vma, 0, sizeof(*vma));
/* Bias interleave by inode number to distribute better across nodes */
vma->vm_pgoff = index + info->vfs_inode.i_ino;
- vma->vm_ops = NULL;
vma->vm_policy = mpol_shared_policy_lookup(&info->policy, index);
}

--
2.16.2


2018-03-28 17:01:40

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 05/14] mm/khugepaged: Do not collapse pages in encrypted VMAs

Pages for encrypted VMAs have to be allocated in a special way:
we would need to propagate down not only the desired NUMA node but also
whether the page is encrypted.

It complicates the not-so-trivial routine of huge page allocation in
khugepaged even more. It also puts more pressure on the page allocator:
we cannot re-use pages allocated for an encrypted VMA to collapse a
page in an unencrypted one, or vice versa.

I think for now it is worth skipping encrypted VMAs. We can return
to this topic later.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 7 +++++++
mm/khugepaged.c | 2 ++
2 files changed, 9 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6c50f77c75d5..b6a72eb82f4b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1479,6 +1479,13 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma)
return !vma->vm_ops;
}

+#ifndef vma_is_encrypted
+static inline bool vma_is_encrypted(struct vm_area_struct *vma)
+{
+ return false;
+}
+#endif
+
#ifndef vma_keyid
static inline int vma_keyid(struct vm_area_struct *vma)
{
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e42568284e06..42f33fd526a0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -835,6 +835,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
return false;
if (is_vma_temporary_stack(vma))
return false;
+ if (vma_is_encrypted(vma))
+ return false;
return !(vma->vm_flags & VM_NO_KHUGEPAGED);
}

--
2.16.2


2018-03-28 17:01:55

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 11/14] x86/mm: Implement vma_is_encrypted() and vma_keyid()

We store the KeyID in the upper bits of vm_page_prot, matching the position
of the KeyID in the PTE. vma_keyid() extracts the KeyID from vm_page_prot.

A VMA is encrypted if its KeyID is non-zero. vma_is_encrypted() checks that.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 9 +++++++++
arch/x86/mm/mktme.c | 17 +++++++++++++++++
2 files changed, 26 insertions(+)

diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index df31876ec48c..08f613953207 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -3,10 +3,19 @@

#include <linux/types.h>

+struct vm_area_struct;
+
#ifdef CONFIG_X86_INTEL_MKTME
extern phys_addr_t mktme_keyid_mask;
extern int mktme_nr_keyids;
extern int mktme_keyid_shift;
+
+#define vma_is_encrypted vma_is_encrypted
+bool vma_is_encrypted(struct vm_area_struct *vma);
+
+#define vma_keyid vma_keyid
+int vma_keyid(struct vm_area_struct *vma);
+
#else
#define mktme_keyid_mask ((phys_addr_t)0)
#define mktme_nr_keyids 0
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index 467f1b26c737..3b2f28a21d99 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -1,5 +1,22 @@
+#include <linux/mm.h>
#include <asm/mktme.h>

phys_addr_t mktme_keyid_mask;
int mktme_nr_keyids;
int mktme_keyid_shift;
+
+bool vma_is_encrypted(struct vm_area_struct *vma)
+{
+ return pgprot_val(vma->vm_page_prot) & mktme_keyid_mask;
+}
+
+int vma_keyid(struct vm_area_struct *vma)
+{
+ pgprotval_t prot;
+
+ if (!vma_is_anonymous(vma))
+ return 0;
+
+ prot = pgprot_val(vma->vm_page_prot);
+ return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
+}
--
2.16.2


2018-03-28 17:14:31

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 07/14] mm/page_alloc: Add hook in page allocation path for encrypted pages

Intel MKTME requires cache flushing when changing the encryption KeyID of
a page.

Add a prep_encrypted_page() hook for this. We need to pass the KeyID down
to it through the page allocation path.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/gfp.h | 6 +++++
mm/compaction.c | 2 +-
mm/internal.h | 2 +-
mm/page_alloc.c | 65 ++++++++++++++++++++++++++++-------------------------
mm/page_isolation.c | 2 +-
5 files changed, 44 insertions(+), 33 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index d9d45f47447d..aff798de9c97 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -446,6 +446,12 @@ static inline void arch_free_page(struct page *page, int order) { }
static inline void arch_alloc_page(struct page *page, int order) { }
#endif

+#ifndef prep_encrypted_page
+static inline void prep_encrypted_page(struct page *page, int order, int keyid)
+{
+}
+#endif
+
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int keyid,
int preferred_nid, nodemask_t *nodemask);
diff --git a/mm/compaction.c b/mm/compaction.c
index 2c8999d027ab..cb69620fdf34 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -77,7 +77,7 @@ static void map_pages(struct list_head *list)
order = page_private(page);
nr_pages = 1 << order;

- post_alloc_hook(page, order, __GFP_MOVABLE);
+ post_alloc_hook(page, order, page_keyid(page), __GFP_MOVABLE);
if (order)
split_page(page, order);

diff --git a/mm/internal.h b/mm/internal.h
index e6bd35182dae..d896c8e67669 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -164,7 +164,7 @@ extern int __isolate_free_page(struct page *page, unsigned int order);
extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
unsigned int order);
extern void prep_compound_page(struct page *page, unsigned int order);
-extern void post_alloc_hook(struct page *page, unsigned int order,
+extern void post_alloc_hook(struct page *page, unsigned int order, int keyid,
gfp_t gfp_flags);
extern int user_min_free_kbytes;

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 229cdab065ca..a5097d9c2a51 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1792,7 +1792,7 @@ static bool check_new_pages(struct page *page, unsigned int order)
return false;
}

-inline void post_alloc_hook(struct page *page, unsigned int order,
+inline void post_alloc_hook(struct page *page, unsigned int order, int keyid,
gfp_t gfp_flags)
{
set_page_private(page, 0);
@@ -1803,14 +1803,15 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
kernel_poison_pages(page, 1 << order, 1);
kasan_alloc_pages(page, order);
set_page_owner(page, order, gfp_flags);
+ prep_encrypted_page(page, order, keyid);
}

-static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
- unsigned int alloc_flags)
+static void prep_new_page(struct page *page, unsigned int order, int keyid,
+ gfp_t gfp_flags, unsigned int alloc_flags)
{
int i;

- post_alloc_hook(page, order, gfp_flags);
+ post_alloc_hook(page, order, keyid, gfp_flags);

if (!free_pages_prezeroed() && (gfp_flags & __GFP_ZERO))
for (i = 0; i < (1 << order); i++)
@@ -3151,8 +3152,8 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
* a page.
*/
static struct page *
-get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
- const struct alloc_context *ac)
+get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int keyid,
+ int alloc_flags, const struct alloc_context *ac)
{
struct zoneref *z = ac->preferred_zoneref;
struct zone *zone;
@@ -3236,7 +3237,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
page = rmqueue(ac->preferred_zoneref->zone, zone, order,
gfp_mask, alloc_flags, ac->migratetype);
if (page) {
- prep_new_page(page, order, gfp_mask, alloc_flags);
+ prep_new_page(page, order, keyid, gfp_mask,
+ alloc_flags);

/*
* If this is a high-order atomic allocation then check
@@ -3314,27 +3316,27 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
}

static inline struct page *
-__alloc_pages_cpuset_fallback(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_cpuset_fallback(gfp_t gfp_mask, unsigned int order, int keyid,
unsigned int alloc_flags,
const struct alloc_context *ac)
{
struct page *page;

- page = get_page_from_freelist(gfp_mask, order,
+ page = get_page_from_freelist(gfp_mask, order, keyid,
alloc_flags|ALLOC_CPUSET, ac);
/*
* fallback to ignore cpuset restriction if our nodes
* are depleted
*/
if (!page)
- page = get_page_from_freelist(gfp_mask, order,
+ page = get_page_from_freelist(gfp_mask, order, keyid,
alloc_flags, ac);

return page;
}

static inline struct page *
-__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int keyid,
const struct alloc_context *ac, unsigned long *did_some_progress)
{
struct oom_control oc = {
@@ -3366,7 +3368,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
* allocation which will never fail due to oom_lock already held.
*/
page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
- ~__GFP_DIRECT_RECLAIM, order,
+ ~__GFP_DIRECT_RECLAIM, order, keyid,
ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
if (page)
goto out;
@@ -3414,7 +3416,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
*/
if (gfp_mask & __GFP_NOFAIL)
page = __alloc_pages_cpuset_fallback(gfp_mask, order,
- ALLOC_NO_WATERMARKS, ac);
+ keyid, ALLOC_NO_WATERMARKS, ac);
}
out:
mutex_unlock(&oom_lock);
@@ -3430,7 +3432,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
#ifdef CONFIG_COMPACTION
/* Try memory compaction for high-order allocations before reclaim */
static struct page *
-__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, int keyid,
unsigned int alloc_flags, const struct alloc_context *ac,
enum compact_priority prio, enum compact_result *compact_result)
{
@@ -3454,7 +3456,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
*/
count_vm_event(COMPACTSTALL);

- page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+ page = get_page_from_freelist(gfp_mask, order, keyid, alloc_flags, ac);

if (page) {
struct zone *zone = page_zone(page);
@@ -3547,7 +3549,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
}
#else
static inline struct page *
-__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, int keyid,
unsigned int alloc_flags, const struct alloc_context *ac,
enum compact_priority prio, enum compact_result *compact_result)
{
@@ -3656,7 +3658,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,

/* The really slow allocator path where we enter direct reclaim */
static inline struct page *
-__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order, int keyid,
unsigned int alloc_flags, const struct alloc_context *ac,
unsigned long *did_some_progress)
{
@@ -3668,7 +3670,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
return NULL;

retry:
- page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+ page = get_page_from_freelist(gfp_mask, order, keyid, alloc_flags, ac);

/*
* If an allocation failed after direct reclaim, it could be because
@@ -3914,7 +3916,7 @@ check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac)
}

static inline struct page *
-__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, int keyid,
struct alloc_context *ac)
{
bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
@@ -3979,7 +3981,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* The adjusted alloc_flags might result in immediate success, so try
* that first
*/
- page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+ page = get_page_from_freelist(gfp_mask, order, keyid, alloc_flags, ac);
if (page)
goto got_pg;

@@ -3996,7 +3998,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
(costly_order ||
(order > 0 && ac->migratetype != MIGRATE_MOVABLE))
&& !gfp_pfmemalloc_allowed(gfp_mask)) {
- page = __alloc_pages_direct_compact(gfp_mask, order,
+ page = __alloc_pages_direct_compact(gfp_mask, order, keyid,
alloc_flags, ac,
INIT_COMPACT_PRIORITY,
&compact_result);
@@ -4049,7 +4051,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
}

/* Attempt with potentially adjusted zonelist and alloc_flags */
- page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+ page = get_page_from_freelist(gfp_mask, order, keyid, alloc_flags, ac);
if (page)
goto got_pg;

@@ -4062,14 +4064,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto nopage;

/* Try direct reclaim and then allocating */
- page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
- &did_some_progress);
+ page = __alloc_pages_direct_reclaim(gfp_mask, order, keyid, alloc_flags,
+ ac, &did_some_progress);
if (page)
goto got_pg;

/* Try direct compaction and then allocating */
- page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
- compact_priority, &compact_result);
+ page = __alloc_pages_direct_compact(gfp_mask, order, keyid, alloc_flags,
+ ac, compact_priority, &compact_result);
if (page)
goto got_pg;

@@ -4106,7 +4108,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto retry_cpuset;

/* Reclaim has failed us, start killing things */
- page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+ page = __alloc_pages_may_oom(gfp_mask, order, keyid,
+ ac, &did_some_progress);
if (page)
goto got_pg;

@@ -4160,7 +4163,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* could deplete whole memory reserves which would just make
* the situation worse
*/
- page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
+ page = __alloc_pages_cpuset_fallback(gfp_mask, order, keyid,
+ ALLOC_HARDER, ac);
if (page)
goto got_pg;

@@ -4242,7 +4246,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int keyid,
finalise_ac(gfp_mask, order, &ac);

/* First allocation attempt */
- page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
+ page = get_page_from_freelist(alloc_mask, order, keyid,
+ alloc_flags, &ac);
if (likely(page))
goto out;

@@ -4262,7 +4267,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int keyid,
if (unlikely(ac.nodemask != nodemask))
ac.nodemask = nodemask;

- page = __alloc_pages_slowpath(alloc_mask, order, &ac);
+ page = __alloc_pages_slowpath(alloc_mask, order, keyid, &ac);

out:
if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 165ed8117bd1..8bf0f9677093 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -131,7 +131,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
out:
spin_unlock_irqrestore(&zone->lock, flags);
if (isolated_page) {
- post_alloc_hook(page, order, __GFP_MOVABLE);
+ post_alloc_hook(page, order, page_keyid(page), __GFP_MOVABLE);
__free_pages(page, order);
}
}
--
2.16.2


2018-03-28 17:15:57

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2 12/14] x86/mm: Implement page_keyid() using page_ext

On 03/28/2018 09:55 AM, Kirill A. Shutemov wrote:
> +static inline int page_keyid(struct page *page)
> +{
> + if (!mktme_nr_keyids)
> + return 0;
> +
> + return lookup_page_ext(page)->keyid;
> +}

This doesn't look very optimized. Don't we normally try to use
X86_FEATURE_* for these checks so that we get the runtime patching *and*
compile-time optimizations?

2018-03-28 17:16:42

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 08/14] mm/page_ext: Drop definition of unused PAGE_EXT_DEBUG_POISON

After bd33ef368135 ("mm: enable page poisoning early at boot")
PAGE_EXT_DEBUG_POISON is no longer used. Remove it.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Vinayak Menon <[email protected]>
---
include/linux/page_ext.h | 11 -----------
1 file changed, 11 deletions(-)

diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index ca5461efae2f..bbec618a614b 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -16,18 +16,7 @@ struct page_ext_operations {

#ifdef CONFIG_PAGE_EXTENSION

-/*
- * page_ext->flags bits:
- *
- * PAGE_EXT_DEBUG_POISON is set for poisoned pages. This is used to
- * implement generic debug pagealloc feature. The pages are filled with
- * poison patterns and set this flag after free_pages(). The poisoned
- * pages are verified whether the patterns are not corrupted and clear
- * the flag before alloc_pages().
- */
-
enum page_ext_flags {
- PAGE_EXT_DEBUG_POISON, /* Page is poisoned */
PAGE_EXT_DEBUG_GUARD,
PAGE_EXT_OWNER,
#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
--
2.16.2


2018-03-28 17:26:19

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 10/14] x86/mm: Preserve KeyID on pte_modify() and pgprot_modify()

An encrypted VMA will have its KeyID stored in vma->vm_page_prot. This way
we don't need to do anything special to set up encrypted page table entries
and don't need to reserve space for the KeyID in the VMA.

This patch changes _PAGE_CHG_MASK to include the KeyID bits. Otherwise they
would be stripped from vm_page_prot on the first pgprot_modify().

Define PTE_PFN_MASK_MAX similarly to PTE_PFN_MASK, but based on
__PHYSICAL_MASK_SHIFT. This way we include the whole range of bits
architecturally available for the PFN without referencing the physical_mask
and mktme_keyid_mask variables.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index acfe755562a6..9ea5ba83fc0b 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -121,8 +121,13 @@
* protection key is treated like _PAGE_RW, for
* instance, and is *not* included in this mask since
* pte_modify() does modify it.
+ *
+ * It includes full range of PFN bits regardless if they were claimed for KeyID
+ * or not: we want to preserve KeyID on pte_modify() and pgprot_modify().
*/
-#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
+#define PTE_PFN_MASK_MAX \
+ (((signed long)PAGE_MASK) & ((1UL << __PHYSICAL_MASK_SHIFT) - 1))
+#define _PAGE_CHG_MASK (PTE_PFN_MASK_MAX | _PAGE_PCD | _PAGE_PWT | \
_PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
_PAGE_SOFT_DIRTY)
#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
--
2.16.2


2018-03-28 17:27:20

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2 06/14] mm/page_alloc: Propagate encryption KeyID through page allocator

On 03/28/2018 09:55 AM, Kirill A. Shutemov wrote:
> @@ -51,7 +51,7 @@ static inline struct page *new_page_nodemask(struct page *page,
> if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
> gfp_mask |= __GFP_HIGHMEM;
>
> - new_page = __alloc_pages_nodemask(gfp_mask, order,
> + new_page = __alloc_pages_nodemask(gfp_mask, order, page_keyid(page),
> preferred_nid, nodemask);

You're not going to like this suggestion.

Am I looking at this too superficially, or does every single site into
which you pass keyid also take a node and gfpmask and often an order? I
think you need to run this by the keepers of page_alloc.c and see if
they'd rather do something more drastic.

2018-03-28 17:39:15

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 14/14] x86: Introduce CONFIG_X86_INTEL_MKTME

Add a new config option to enable/disable Multi-Key Total Memory
Encryption support.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bf68138662c8..489674c9b2f6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1530,6 +1530,23 @@ config ARCH_USE_MEMREMAP_PROT
def_bool y
depends on AMD_MEM_ENCRYPT

+config X86_INTEL_MKTME
+ bool "Intel Multi-Key Total Memory Encryption"
+ select DYNAMIC_PHYSICAL_MASK
+ select PAGE_EXTENSION
+ depends on X86_64 && CPU_SUP_INTEL
+ ---help---
+ Say yes to enable support for Multi-Key Total Memory Encryption.
+ This requires Intel processor that has support of the feature.
+
+ Multikey Total Memory Encryption (MKTME) is a technology that allows
+ transparent memory encryption in upcoming Intel platforms.
+
+ MKTME is built on top of TME. TME allows encryption of the entirety
+ of system memory using a single key. MKTME allows to have multiple
+ encryption domains, each having own key -- different memory pages can
+ be encrypted with different keys.
+
# Common NUMA Features
config NUMA
bool "Numa Memory Allocation and Scheduler Support"
--
2.16.2


2018-03-29 05:33:52

by Vinayak Menon

[permalink] [raw]
Subject: Re: [PATCHv2 08/14] mm/page_ext: Drop definition of unused PAGE_EXT_DEBUG_POISON

On 3/28/2018 10:25 PM, Kirill A. Shutemov wrote:
> After bd33ef368135 ("mm: enable page poisoning early at boot")
> PAGE_EXT_DEBUG_POISON is not longer used. Remove it.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Cc: Vinayak Menon <[email protected]>
> ---
> include/linux/page_ext.h | 11 -----------
> 1 file changed, 11 deletions(-)
>
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index ca5461efae2f..bbec618a614b 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -16,18 +16,7 @@ struct page_ext_operations {
>
> #ifdef CONFIG_PAGE_EXTENSION
>
> -/*
> - * page_ext->flags bits:
> - *
> - * PAGE_EXT_DEBUG_POISON is set for poisoned pages. This is used to
> - * implement generic debug pagealloc feature. The pages are filled with
> - * poison patterns and set this flag after free_pages(). The poisoned
> - * pages are verified whether the patterns are not corrupted and clear
> - * the flag before alloc_pages().
> - */
> -
> enum page_ext_flags {
> - PAGE_EXT_DEBUG_POISON, /* Page is poisoned */
> PAGE_EXT_DEBUG_GUARD,
> PAGE_EXT_OWNER,
> #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)

Reviewed-by: Vinayak Menon <[email protected]>


2018-03-29 11:23:46

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCHv2 06/14] mm/page_alloc: Propagate encryption KeyID through page allocator

On Wed 28-03-18 19:55:32, Kirill A. Shutemov wrote:
> Modify several page allocation routines to pass down encryption KeyID to
> be used for the allocated page.
>
> There are two basic use cases:
>
> - alloc_page_vma() use VMA's KeyID to allocate the page.
>
> - Page migration and NUMA balancing path use KeyID of original page as
> KeyID for newly allocated page.

I am sorry but I am out of time to look closer, but this just raised my
eyebrows. This looks like a no-go. The basic allocator has no business
in fancy stuff like an encryption key. If you need something like that
then just build a special allocator API on top. This looks like a no-go
to me.
--
Michal Hocko
SUSE Labs

2018-03-29 12:39:16

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 06/14] mm/page_alloc: Propagate encryption KeyID through page allocator

On Thu, Mar 29, 2018 at 01:20:34PM +0200, Michal Hocko wrote:
> On Wed 28-03-18 19:55:32, Kirill A. Shutemov wrote:
> > Modify several page allocation routines to pass down encryption KeyID to
> > be used for the allocated page.
> >
> > There are two basic use cases:
> >
> > - alloc_page_vma() use VMA's KeyID to allocate the page.
> >
> > - Page migration and NUMA balancing path use KeyID of original page as
> > KeyID for newly allocated page.
>
> I am sorry but I am out of time to look closer but this just raised my
> eyebrows. This looks like a no-go. The basic allocator has no business
> in fancy stuff like a encryption key. If you need something like that
> then just build a special allocator API on top. This looks like a no-go
> to me.

The goal is to make memory encryption a first-class citizen in memory
management and not to invent a parallel subsystem (as we did with hugetlb).

Making memory encryption an integral part of the Linux VM would involve
handling encrypted pages everywhere we expect an anonymous page to appear.

We can deal with encrypted page allocation with wrappers, but it doesn't
make sense if we are going to use them instead of the original API
everywhere.

--
Kirill A. Shutemov

2018-03-29 12:40:29

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 06/14] mm/page_alloc: Propagate encryption KeyID through page allocator

On Wed, Mar 28, 2018 at 10:15:02AM -0700, Dave Hansen wrote:
> On 03/28/2018 09:55 AM, Kirill A. Shutemov wrote:
> > @@ -51,7 +51,7 @@ static inline struct page *new_page_nodemask(struct page *page,
> > if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
> > gfp_mask |= __GFP_HIGHMEM;
> >
> > - new_page = __alloc_pages_nodemask(gfp_mask, order,
> > + new_page = __alloc_pages_nodemask(gfp_mask, order, page_keyid(page),
> > preferred_nid, nodemask);
>
> You're not going to like this suggestion.
>
> Am I looking at this too superficially, or does every single site into
> which you pass keyid also take a node and gfpmask and often an order? I
> think you need to run this by the keepers of page_alloc.c and see if
> they'd rather do something more drastic.

Are you talking about having some kind of struct that would indicate the
page allocation context -- gfp_mask + order + node + keyid?
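
Purely as an illustration (no such struct exists in the tree; the name and
fields below are made up):

struct alloc_request {
        gfp_t           gfp_mask;       /* allocation flags */
        unsigned int    order;          /* allocation order */
        int             preferred_nid;  /* preferred NUMA node */
        int             keyid;          /* encryption KeyID, 0 == no encryption */
        nodemask_t      *nodemask;      /* allowed nodes */
};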

--
Kirill A. Shutemov

2018-03-29 12:45:42

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 12/14] x86/mm: Implement page_keyid() using page_ext

On Wed, Mar 28, 2018 at 09:59:23AM -0700, Dave Hansen wrote:
> On 03/28/2018 09:55 AM, Kirill A. Shutemov wrote:
> > +static inline int page_keyid(struct page *page)
> > +{
> > + if (!mktme_nr_keyids)
> > + return 0;
> > +
> > + return lookup_page_ext(page)->keyid;
> > +}
>
> This doesn't look very optimized. Don't we normally try to use
> X86_FEATURE_* for these checks so that we get the runtime patching *and*
> compile-time optimizations?

I didn't go for micro-optimization just yet. I would like to see the whole
stack functioning first.

It doesn't make sense to use cpu_feature_enabled(X86_FEATURE_TME) as it
would produce false positives: MKTME enumeration requires an MSR read.

We may change the mktme_nr_keyids check to a static key here. But this is
not urgent.
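
For reference, a rough sketch of what the static-key variant could look like
(untested; the key name is made up and it would have to be defined and
enabled during MKTME enumeration):

DECLARE_STATIC_KEY_FALSE(mktme_enabled_key);

static inline int page_keyid(struct page *page)
{
        /* Patched out at runtime when MKTME is not enabled */
        if (!static_branch_unlikely(&mktme_enabled_key))
                return 0;

        return lookup_page_ext(page)->keyid;
}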

--
Kirill A. Shutemov

2018-03-29 12:53:58

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCHv2 06/14] mm/page_alloc: Propagate encryption KeyID through page allocator

On Thu 29-03-18 15:37:12, Kirill A. Shutemov wrote:
> On Thu, Mar 29, 2018 at 01:20:34PM +0200, Michal Hocko wrote:
> > On Wed 28-03-18 19:55:32, Kirill A. Shutemov wrote:
> > > Modify several page allocation routines to pass down encryption KeyID to
> > > be used for the allocated page.
> > >
> > > There are two basic use cases:
> > >
> > > - alloc_page_vma() use VMA's KeyID to allocate the page.
> > >
> > > - Page migration and NUMA balancing path use KeyID of original page as
> > > KeyID for newly allocated page.
> >
> > I am sorry but I am out of time to look closer but this just raised my
> > eyebrows. This looks like a no-go. The basic allocator has no business
> > in fancy stuff like a encryption key. If you need something like that
> > then just build a special allocator API on top. This looks like a no-go
> > to me.
>
> The goal is to make memory encryption first class citizen in memory
> management and not to invent parallel subsysustem (as we did with hugetlb).

How do you get a page_keyid for random kernel allocation?

> Making memory encryption integral part of Linux VM would involve handing
> encrypted page everywhere we expect anonymous page to appear.

How many architectures will implement this feature?
--
Michal Hocko
SUSE Labs

2018-03-29 13:15:36

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 06/14] mm/page_alloc: Propagate encryption KeyID through page allocator

On Thu, Mar 29, 2018 at 02:52:27PM +0200, Michal Hocko wrote:
> On Thu 29-03-18 15:37:12, Kirill A. Shutemov wrote:
> > On Thu, Mar 29, 2018 at 01:20:34PM +0200, Michal Hocko wrote:
> > > On Wed 28-03-18 19:55:32, Kirill A. Shutemov wrote:
> > > > Modify several page allocation routines to pass down encryption KeyID to
> > > > be used for the allocated page.
> > > >
> > > > There are two basic use cases:
> > > >
> > > > - alloc_page_vma() use VMA's KeyID to allocate the page.
> > > >
> > > > - Page migration and NUMA balancing path use KeyID of original page as
> > > > KeyID for newly allocated page.
> > >
> > > I am sorry but I am out of time to look closer but this just raised my
> > > eyebrows. This looks like a no-go. The basic allocator has no business
> > > in fancy stuff like a encryption key. If you need something like that
> > > then just build a special allocator API on top. This looks like a no-go
> > > to me.
> >
> > The goal is to make memory encryption first class citizen in memory
> > management and not to invent parallel subsysustem (as we did with hugetlb).
>
> How do you get a page_keyid for random kernel allocation?

Initial feature enabling only targets userspace anonymous memory, but we
can definitely use the same technology in the future for kernel hardening
if we choose to do so.

For anonymous memory, we can get the KeyID from the VMA or from the other
page (migration case).

> > Making memory encryption integral part of Linux VM would involve handing
> > encrypted page everywhere we expect anonymous page to appear.
>
> How many architectures will implement this feature?

I can't read the future.

I'm only aware of one architecture so far.

--
Kirill A. Shutemov

2018-03-29 13:39:06

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCHv2 06/14] mm/page_alloc: Propagate encryption KeyID through page allocator

On Thu 29-03-18 16:13:08, Kirill A. Shutemov wrote:
> On Thu, Mar 29, 2018 at 02:52:27PM +0200, Michal Hocko wrote:
> > On Thu 29-03-18 15:37:12, Kirill A. Shutemov wrote:
> > > On Thu, Mar 29, 2018 at 01:20:34PM +0200, Michal Hocko wrote:
> > > > On Wed 28-03-18 19:55:32, Kirill A. Shutemov wrote:
> > > > > Modify several page allocation routines to pass down encryption KeyID to
> > > > > be used for the allocated page.
> > > > >
> > > > > There are two basic use cases:
> > > > >
> > > > > - alloc_page_vma() use VMA's KeyID to allocate the page.
> > > > >
> > > > > - Page migration and NUMA balancing path use KeyID of original page as
> > > > > KeyID for newly allocated page.
> > > >
> > > > I am sorry but I am out of time to look closer but this just raised my
> > > > eyebrows. This looks like a no-go. The basic allocator has no business
> > > > in fancy stuff like a encryption key. If you need something like that
> > > > then just build a special allocator API on top. This looks like a no-go
> > > > to me.
> > >
> > > The goal is to make memory encryption first class citizen in memory
> > > management and not to invent parallel subsysustem (as we did with hugetlb).
> >
> > How do you get a page_keyid for random kernel allocation?
>
> Initial feature enabling only targets userspace anonymous memory, but we
> can definately use the same technology in the future for kernel hardening
> if we would choose so.

So what kind of key are you going to use for those allocations? Moreover,
why can't you simply wrap those few places which are actually using the
encryption now?

> For anonymous memory, we can get KeyID from VMA or from other page
> (migration case).
>
> > > Making memory encryption integral part of Linux VM would involve handing
> > > encrypted page everywhere we expect anonymous page to appear.
> >
> > How many architectures will implement this feature?
>
> I can't read the future.

Fair enough, only a few of us can, but you are proposing generic code
changes based on a single architecture's design, so we had better make
sure other architectures can work with that approach without major
refactoring.
--
Michal Hocko
SUSE Labs

2018-03-29 14:35:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 06/14] mm/page_alloc: Propagate encryption KeyID through page allocator

On Thu, Mar 29, 2018 at 03:37:00PM +0200, Michal Hocko wrote:
> On Thu 29-03-18 16:13:08, Kirill A. Shutemov wrote:
> > On Thu, Mar 29, 2018 at 02:52:27PM +0200, Michal Hocko wrote:
> > > On Thu 29-03-18 15:37:12, Kirill A. Shutemov wrote:
> > > > On Thu, Mar 29, 2018 at 01:20:34PM +0200, Michal Hocko wrote:
> > > > > On Wed 28-03-18 19:55:32, Kirill A. Shutemov wrote:
> > > > > > Modify several page allocation routines to pass down encryption KeyID to
> > > > > > be used for the allocated page.
> > > > > >
> > > > > > There are two basic use cases:
> > > > > >
> > > > > > - alloc_page_vma() use VMA's KeyID to allocate the page.
> > > > > >
> > > > > > - Page migration and NUMA balancing path use KeyID of original page as
> > > > > > KeyID for newly allocated page.
> > > > >
> > > > > I am sorry but I am out of time to look closer but this just raised my
> > > > > eyebrows. This looks like a no-go. The basic allocator has no business
> > > > > in fancy stuff like a encryption key. If you need something like that
> > > > > then just build a special allocator API on top. This looks like a no-go
> > > > > to me.
> > > >
> > > > The goal is to make memory encryption first class citizen in memory
> > > > management and not to invent parallel subsysustem (as we did with hugetlb).
> > >
> > > How do you get a page_keyid for random kernel allocation?
> >
> > Initial feature enabling only targets userspace anonymous memory, but we
> > can definately use the same technology in the future for kernel hardening
> > if we would choose so.
>
> So what kind of key are you going to use for those allocations.

KeyID zero is the default. You can think of it as do-not-encrypt.

In the MKTME case it means that this memory is encrypted with the TME key
(randomly generated at boot).

> Moreover why cannot you simply wrap those few places which are actually
> using the encryption now?

We can wrap these few places. And I tried this approach. It proved to be
slow.

Hardware doesn't enforce coherency between mappings of the same physical
page with different KeyIDs. The OS is responsible for cache management: the
cache has to be flushed before switching the page to another KeyID.

As we allocate encrypted and unencrypted pages from the same pool, the
wrapper approach forces us to flush the cache for an encrypted page both on
allocation (to switch it to a non-zero KeyID) *and* on freeing (to switch it
back to KeyID-0). We don't know whether the page will be allocated through
the wrapper next time, so we have to play it safe.
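
Roughly, the wrapper approach would look like this (hypothetical; nothing
like alloc_encrypted_page() exists in the tree, it's only here to show where
the flushes end up):

struct page *alloc_encrypted_page(gfp_t gfp_mask, unsigned int order, int keyid)
{
        struct page *page = alloc_pages(gfp_mask, order); /* stock allocator */

        if (page)
                prep_encrypted_page(page, order, keyid);  /* flush + switch to keyid */
        return page;
}

void free_encrypted_page(struct page *page, unsigned int order)
{
        prep_encrypted_page(page, order, 0);              /* flush + back to KeyID-0 */
        __free_pages(page, order);
}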

With the wrapper approach it's about 4-6 times slower to allocate and free
an encrypted page compared to an unencrypted one. On a macrobenchmark, I see
about a 15% slowdown.

With the approach I propose we can often avoid cache flushing: we only flush
the cache on allocation, and only if the page had a different KeyID the last
time it was allocated. It brings the slowdown on the macrobenchmark down to
3.6%, which is more reasonable (more optimizations are possible).
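
A very rough sketch of that idea (not the actual patch; clflush_cache_range()
is used only as an example of the flush):

void prep_encrypted_page(struct page *page, int order, int keyid)
{
        int i;

        for (i = 0; i < (1 << order); i++) {
                /* Flush only if the page was last used with a different KeyID */
                if (page_keyid(page + i) != keyid) {
                        clflush_cache_range(page_address(page + i), PAGE_SIZE);
                        lookup_page_ext(page + i)->keyid = keyid;
                }
        }
}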

Another way would be to keep a separate pool of encrypted pages within the
page allocator. I think it would cause more trouble...

I would be glad to find a less intrusive way to get reasonable performance.
Any suggestions?

> > For anonymous memory, we can get KeyID from VMA or from other page
> > (migration case).
> >
> > > > Making memory encryption integral part of Linux VM would involve handing
> > > > encrypted page everywhere we expect anonymous page to appear.
> > >
> > > How many architectures will implement this feature?
> >
> > I can't read the future.
>
> Fair enough, only few of us can, but you are proposing a generic code
> changes based on a single architecture design so we should better make
> sure other architectures can work with that approach without a major
> refactoring.

I tried to keep the implementation as generic as possible: a VMA may be
encrypted, you can point to the desired key with an integer (KeyID), and
allocation of an encrypted page *may* require special handling.

The only assumption I made is that KeyID 0 is special, meaning
no-encryption. I think it's a reasonable assumption, but it is easily
fixable if it proves to be wrong.

If you see other places where I made the abstraction too tied to a specific
HW implementation, let me know.

--
Kirill A. Shutemov

2018-03-30 08:09:39

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 06/14] mm/page_alloc: Propagate encryption KeyID through page allocator

On Wed, Mar 28, 2018 at 07:55:32PM +0300, Kirill A. Shutemov wrote:
> diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
> index 4432f31e8479..99580091830b 100644
> --- a/arch/tile/mm/homecache.c
> +++ b/arch/tile/mm/homecache.c
> @@ -398,7 +398,7 @@ struct page *homecache_alloc_pages_node(int nid, gfp_t gfp_mask,
> {
> struct page *page;
> BUG_ON(gfp_mask & __GFP_HIGHMEM); /* must be lowmem */
> - page = alloc_pages_node(nid, gfp_mask, order);
> + page = alloc_pages_node(rch/x86/events/intel/pt.cnid, gfp_mask, order, 0);
> if (page)
> homecache_change_page_home(page, order, home);
> return page;

Ouch. Fixup:

diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 99580091830b..9eb14da556a8 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -398,7 +398,7 @@ struct page *homecache_alloc_pages_node(int nid, gfp_t gfp_mask,
{
struct page *page;
BUG_ON(gfp_mask & __GFP_HIGHMEM); /* must be lowmem */
- page = alloc_pages_node(rch/x86/events/intel/pt.cnid, gfp_mask, order, 0);
+ page = alloc_pages_node(nid, gfp_mask, order, 0);
if (page)
homecache_change_page_home(page, order, home);
return page;
--
Kirill A. Shutemov

2018-04-02 21:14:28

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCHv2 01/14] x86/mm: Decouple dynamic __PHYSICAL_MASK from AMD SME



On 3/28/2018 11:55 AM, Kirill A. Shutemov wrote:
> AMD SME claims one bit from physical address to indicate whether the
> page is encrypted or not. To achieve that we clear out the bit from
> __PHYSICAL_MASK.
>
> The capability to adjust __PHYSICAL_MASK is required beyond AMD SME.
> For instance for upcoming Intel Multi-Key Total Memory Encryption.
>
> Factor it out into a separate feature with own Kconfig handle.
>
> It also helps with overhead of AMD SME. It saves more than 3k in .text
> on defconfig + AMD_MEM_ENCRYPT:
>
> add/remove: 3/2 grow/shrink: 5/110 up/down: 189/-3753 (-3564)
>
> We would need to return to this once we have infrastructure to patch
> constants in code. That's good candidate for it.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

Reviewed-by: Tom Lendacky <[email protected]>

> ---
> arch/x86/Kconfig | 4 ++++
> arch/x86/boot/compressed/kaslr_64.c | 5 +++++
> arch/x86/include/asm/page_types.h | 8 +++++++-
> arch/x86/mm/mem_encrypt_identity.c | 3 +++
> arch/x86/mm/pgtable.c | 5 +++++
> 5 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 27fede438959..bf68138662c8 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -332,6 +332,9 @@ config ARCH_SUPPORTS_UPROBES
> config FIX_EARLYCON_MEM
> def_bool y
>
> +config DYNAMIC_PHYSICAL_MASK
> + bool
> +
> config PGTABLE_LEVELS
> int
> default 5 if X86_5LEVEL
> @@ -1503,6 +1506,7 @@ config ARCH_HAS_MEM_ENCRYPT
> config AMD_MEM_ENCRYPT
> bool "AMD Secure Memory Encryption (SME) support"
> depends on X86_64 && CPU_SUP_AMD
> + select DYNAMIC_PHYSICAL_MASK
> ---help---
> Say yes to enable support for the encryption of system memory.
> This requires an AMD processor that supports Secure Memory
> diff --git a/arch/x86/boot/compressed/kaslr_64.c b/arch/x86/boot/compressed/kaslr_64.c
> index 522d11431433..748456c365f4 100644
> --- a/arch/x86/boot/compressed/kaslr_64.c
> +++ b/arch/x86/boot/compressed/kaslr_64.c
> @@ -69,6 +69,8 @@ static struct alloc_pgt_data pgt_data;
> /* The top level page table entry pointer. */
> static unsigned long top_level_pgt;
>
> +phys_addr_t physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
> +
> /*
> * Mapping information structure passed to kernel_ident_mapping_init().
> * Due to relocation, pointers must be assigned at run time not build time.
> @@ -81,6 +83,9 @@ void initialize_identity_maps(void)
> /* If running as an SEV guest, the encryption mask is required. */
> set_sev_encryption_mask();
>
> + /* Exclude the encryption mask from __PHYSICAL_MASK */
> + physical_mask &= ~sme_me_mask;
> +
> /* Init mapping_info with run-time function/buffer pointers. */
> mapping_info.alloc_pgt_page = alloc_pgt_page;
> mapping_info.context = &pgt_data;
> diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
> index 1e53560a84bb..c85e15010f48 100644
> --- a/arch/x86/include/asm/page_types.h
> +++ b/arch/x86/include/asm/page_types.h
> @@ -17,7 +17,6 @@
> #define PUD_PAGE_SIZE (_AC(1, UL) << PUD_SHIFT)
> #define PUD_PAGE_MASK (~(PUD_PAGE_SIZE-1))
>
> -#define __PHYSICAL_MASK ((phys_addr_t)(__sme_clr((1ULL << __PHYSICAL_MASK_SHIFT) - 1)))
> #define __VIRTUAL_MASK ((1UL << __VIRTUAL_MASK_SHIFT) - 1)
>
> /* Cast *PAGE_MASK to a signed type so that it is sign-extended if
> @@ -55,6 +54,13 @@
>
> #ifndef __ASSEMBLY__
>
> +#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
> +extern phys_addr_t physical_mask;
> +#define __PHYSICAL_MASK physical_mask
> +#else
> +#define __PHYSICAL_MASK ((phys_addr_t)((1ULL << __PHYSICAL_MASK_SHIFT) - 1))
> +#endif
> +
> extern int devmem_is_allowed(unsigned long pagenr);
>
> extern unsigned long max_low_pfn_mapped;
> diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
> index 1b2197d13832..7ae36868aed2 100644
> --- a/arch/x86/mm/mem_encrypt_identity.c
> +++ b/arch/x86/mm/mem_encrypt_identity.c
> @@ -527,6 +527,7 @@ void __init sme_enable(struct boot_params *bp)
> /* SEV state cannot be controlled by a command line option */
> sme_me_mask = me_mask;
> sev_enabled = true;
> + physical_mask &= ~sme_me_mask;
> return;
> }
>
> @@ -561,4 +562,6 @@ void __init sme_enable(struct boot_params *bp)
> sme_me_mask = 0;
> else
> sme_me_mask = active_by_default ? me_mask : 0;
> +
> + physical_mask &= ~sme_me_mask;
> }
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 34cda7e0551b..0199b94e6b40 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -7,6 +7,11 @@
> #include <asm/fixmap.h>
> #include <asm/mtrr.h>
>
> +#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
> +phys_addr_t physical_mask __ro_after_init = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
> +EXPORT_SYMBOL(physical_mask);
> +#endif
> +
> #define PGALLOC_GFP (GFP_KERNEL_ACCOUNT | __GFP_ZERO)
>
> #ifdef CONFIG_HIGHPTE
>