From: Mike Rapoport <[email protected]>
Hi,
This is the second version of the "secret" mappings implementation backed by a
file descriptor.
The file descriptor is created using the memfd_create() syscall with a new
MFD_SECRET flag. The file descriptor should then be configured using ioctl() to
define the desired protection, after which mmap() of the fd will create a
"secret" memory mapping. The pages in that mapping will be marked as not
present in the direct map and will have the desired protection bits set in the
user page table. For instance, the current implementation allows uncached
mappings.
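For illustration, here is a minimal sketch of the intended userspace flow. It
assumes the MFD_SECRET and MFD_SECRET_UNCACHED definitions added by this
series are visible via an updated <linux/memfd.h>, and MAP_SIZE is an
arbitrary size picked for the example:

	/* build with -D_GNU_SOURCE; memfd_create() needs glibc >= 2.27 */
	#include <linux/memfd.h>
	#include <sys/mman.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	#define MAP_SIZE	(4UL * 4096)

	int main(void)
	{
		void *ptr;
		int fd;

		fd = memfd_create("secret", MFD_SECRET);
		if (fd < 0)
			return 1;

		/* the protection mode must be set before mmap() is allowed */
		if (ioctl(fd, MFD_SECRET_UNCACHED) < 0)
			return 1;

		if (ftruncate(fd, MAP_SIZE) < 0)
			return 1;

		ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		if (ptr == MAP_FAILED)
			return 1;

		/* the pages backing ptr are not present in the direct map */
		return 0;
	}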
Hiding secret memory mappings behind an anonymous file allows (ab)use of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.
The anonymous file may also be used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs.
As the fragmentation of the direct map was one of the major concerns raised
during the previous postings, I've added an amortizing cache of PMD-size
pages to each file descriptor and the ability to reserve large chunks of the
physical memory at boot time and then use this memory as an allocation pool
for the secret memory areas.
In addition, I've tried to find some numbers that show the benefit of using
larger pages in the direct map, but I couldn't find anything, so I ran a
couple of benchmarks from phoronix-test-suite on my laptop (i7-8650U with
32G RAM).
I've tested three variants: the default, with 28G of the physical memory
covered with 1G pages; a run with 1G pages disabled using "nogbpages" on the
kernel command line; and finally a run with the entire direct map forced to
use 4K pages by a simple patch to arch/x86/mm/init.c, roughly along the lines
of the sketch below.
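That patch is not part of the series; it could look roughly like this (a
sketch only, assuming the current structure of probe_page_size_mask() in
arch/x86/mm/init.c):

	--- a/arch/x86/mm/init.c
	+++ b/arch/x86/mm/init.c
	@@ ... @@ static void __init probe_page_size_mask(void)
	-	if (boot_cpu_has(X86_FEATURE_PSE) && !debug_pagealloc_enabled())
	-		page_size_mask |= 1 << PG_LEVEL_2M;
	-	else
	-		direct_gbpages = 0;
	+	/*
	+	 * Never set PG_LEVEL_2M in page_size_mask and keep gbpages
	+	 * disabled, so the direct map is built from 4K pages only.
	+	 */
	+	direct_gbpages = 0;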
I ran the benchmarks on both SSD and tmpfs.
Surprisingly, the results do not show a huge advantage for large pages. For
instance, here are the results for a kernel build with 'make -j8', in seconds:
| 1G | 2M | 4K
------------------------+--------+--------+---------
ssd, mitigations=on | 308.75 | 317.37 | 314.9
ssd, mitigations=off | 305.25 | 295.32 | 304.92
ram, mitigations=on | 301.58 | 322.49 | 306.54
ram, mitigations=off | 299.32 | 288.44 | 310.65
All the results I have are available at [1].
If anybody is interested in plain text, please let me know.
[1] https://docs.google.com/spreadsheets/d/1tdD-cu8e93vnfGsTFxZ5YdaEfs2E1GELlvWNOGkJV2U/edit?usp=sharing
Mike Rapoport (5):
mm: make HPAGE_PxD_{SHIFT,MASK,SIZE} always available
mmap: make mlock_future_check() global
mm: extend memfd with ability to create "secret" memory areas
mm: secretmem: use PMD-size pages to amortize direct map fragmentation
mm: secretmem: add ability to reserve memory at boot
include/linux/huge_mm.h | 10 +-
include/linux/memfd.h | 9 +
include/uapi/linux/magic.h | 1 +
include/uapi/linux/memfd.h | 6 +
mm/Kconfig | 3 +
mm/Makefile | 1 +
mm/internal.h | 3 +
mm/memfd.c | 10 +-
mm/mmap.c | 5 +-
mm/secretmem.c | 445 +++++++++++++++++++++++++++++++++++++
10 files changed, 480 insertions(+), 13 deletions(-)
create mode 100644 mm/secretmem.c
base-commit: 7c30b859a947535f2213277e827d7ac7dcff9c84
--
2.26.2
From: Mike Rapoport <[email protected]>
Removing a PAGE_SIZE page from the direct map every time such a page is
allocated for a secret memory mapping will cause severe fragmentation of
the direct map. This fragmentation can be reduced by using PMD-size pages
as a pool of small pages for secret memory mappings.
Add a gen_pool per secretmem inode and lazily populate this pool with
PMD-size pages.
Signed-off-by: Mike Rapoport <[email protected]>
---
mm/secretmem.c | 107 ++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 88 insertions(+), 19 deletions(-)
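[ For readers unfamiliar with genalloc, the pattern used in this patch, in
isolation -- a sketch, with error handling omitted:

	/* a pool handing out chunks at PAGE_SIZE granularity */
	struct gen_pool *pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);

	/*
	 * Refill: grab a PMD-size block from the buddy allocator, split
	 * it so that each 4K page has its own refcount, and donate the
	 * whole range to the pool.
	 */
	struct page *page = alloc_pages(GFP_KERNEL, HPAGE_PMD_ORDER);

	split_page(page, HPAGE_PMD_ORDER);
	gen_pool_add(pool, (unsigned long)page_address(page),
		     HPAGE_PMD_SIZE, NUMA_NO_NODE);

	/* hand out and take back single pages without touching the buddy */
	unsigned long addr = gen_pool_alloc(pool, PAGE_SIZE);

	gen_pool_free(pool, addr, PAGE_SIZE);
]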
diff --git a/mm/secretmem.c b/mm/secretmem.c
index df8f8c958cc2..c6fcf6d76951 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -5,6 +5,7 @@
#include <linux/memfd.h>
#include <linux/printk.h>
#include <linux/pagemap.h>
+#include <linux/genalloc.h>
#include <linux/pseudo_fs.h>
#include <linux/set_memory.h>
#include <linux/sched/signal.h>
@@ -23,24 +24,66 @@
#define SECRETMEM_UNCACHED 0x2
struct secretmem_ctx {
+ struct gen_pool *pool;
unsigned int mode;
};
-static struct page *secretmem_alloc_page(gfp_t gfp)
+static int secretmem_pool_increase(struct secretmem_ctx *ctx, gfp_t gfp)
{
- /*
- * FIXME: use a cache of large pages to reduce the direct map
- * fragmentation
- */
- return alloc_page(gfp);
+ unsigned long nr_pages = (1 << HPAGE_PMD_ORDER);
+ struct gen_pool *pool = ctx->pool;
+ unsigned long addr;
+ struct page *page;
+ int err;
+
+ page = alloc_pages(gfp, HPAGE_PMD_ORDER);
+ if (!page)
+ return -ENOMEM;
+
+ addr = (unsigned long)page_address(page);
+ split_page(page, HPAGE_PMD_ORDER);
+
+ err = gen_pool_add(pool, addr, HPAGE_PMD_SIZE, NUMA_NO_NODE);
+ if (err) {
+ __free_pages(page, HPAGE_PMD_ORDER);
+ return err;
+ }
+
+ __kernel_map_pages(page, nr_pages, 0);
+
+ return 0;
+}
+
+static struct page *secretmem_alloc_page(struct secretmem_ctx *ctx,
+ gfp_t gfp)
+{
+ struct gen_pool *pool = ctx->pool;
+ unsigned long addr;
+ struct page *page;
+ int err;
+
+ if (gen_pool_avail(pool) < PAGE_SIZE) {
+ err = secretmem_pool_increase(ctx, gfp);
+ if (err)
+ return NULL;
+ }
+
+ addr = gen_pool_alloc(pool, PAGE_SIZE);
+ if (!addr)
+ return NULL;
+
+ page = virt_to_page(addr);
+ get_page(page);
+
+ return page;
}
static vm_fault_t secretmem_fault(struct vm_fault *vmf)
{
+ struct secretmem_ctx *ctx = vmf->vma->vm_file->private_data;
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
struct inode *inode = file_inode(vmf->vma->vm_file);
pgoff_t offset = vmf->pgoff;
- unsigned long addr;
struct page *page;
int ret = 0;
@@ -49,7 +92,7 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
page = find_get_entry(mapping, offset);
if (!page) {
- page = secretmem_alloc_page(vmf->gfp_mask);
+ page = secretmem_alloc_page(ctx, vmf->gfp_mask);
if (!page)
return vmf_error(-ENOMEM);
@@ -57,14 +100,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
if (unlikely(ret))
goto err_put_page;
- ret = set_direct_map_invalid_noflush(page);
- if (ret)
- goto err_del_page_cache;
-
- addr = (unsigned long)page_address(page);
- flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-
__SetPageUptodate(page);
+ set_page_private(page, (unsigned long)ctx);
ret = VM_FAULT_LOCKED;
}
@@ -72,8 +109,6 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
vmf->page = page;
return ret;
-err_del_page_cache:
- delete_from_page_cache(page);
err_put_page:
put_page(page);
return vmf_error(ret);
@@ -155,7 +190,11 @@ static int secretmem_migratepage(struct address_space *mapping,
static void secretmem_freepage(struct page *page)
{
- set_direct_map_default_noflush(page);
+ unsigned long addr = (unsigned long)page_address(page);
+ struct secretmem_ctx *ctx = (struct secretmem_ctx *)page_private(page);
+ struct gen_pool *pool = ctx->pool;
+
+ gen_pool_free(pool, addr, PAGE_SIZE);
}
static const struct address_space_operations secretmem_aops = {
@@ -179,13 +218,18 @@ struct file *secretmem_file_create(const char *name, unsigned int flags)
if (!ctx)
goto err_free_inode;
+ ctx->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
+ if (!ctx->pool)
+ goto err_free_ctx;
+
file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
O_RDWR, &secretmem_fops);
if (IS_ERR(file))
- goto err_free_ctx;
+ goto err_free_pool;
mapping_set_unevictable(inode->i_mapping);
+ inode->i_private = ctx;
inode->i_mapping->private_data = ctx;
inode->i_mapping->a_ops = &secretmem_aops;
@@ -197,6 +241,8 @@ struct file *secretmem_file_create(const char *name, unsigned int flags)
return file;
+err_free_pool:
+ gen_pool_destroy(ctx->pool);
err_free_ctx:
kfree(ctx);
err_free_inode:
@@ -204,11 +250,34 @@ struct file *secretmem_file_create(const char *name, unsigned int flags)
return file;
}
+static void secretmem_cleanup_chunk(struct gen_pool *pool,
+ struct gen_pool_chunk *chunk, void *data)
+{
+ unsigned long start = chunk->start_addr;
+ unsigned long end = chunk->end_addr;
+ unsigned long nr_pages, addr;
+
+ nr_pages = (end - start + 1) / PAGE_SIZE;
+ __kernel_map_pages(virt_to_page(start), nr_pages, 1);
+
+ for (addr = start; addr < end; addr += PAGE_SIZE)
+ put_page(virt_to_page(addr));
+}
+
+static void secretmem_cleanup_pool(struct secretmem_ctx *ctx)
+{
+ struct gen_pool *pool = ctx->pool;
+
+ gen_pool_for_each_chunk(pool, secretmem_cleanup_chunk, ctx);
+ gen_pool_destroy(pool);
+}
+
static void secretmem_evict_inode(struct inode *inode)
{
struct secretmem_ctx *ctx = inode->i_private;
truncate_inode_pages_final(&inode->i_data);
+ secretmem_cleanup_pool(ctx);
clear_inode(inode);
kfree(ctx);
}
--
2.26.2
From: Mike Rapoport <[email protected]>
Taking pages out of the direct map and bringing them back may create
undesired fragmentation and usage of smaller pages in the direct
mapping of the physical memory.
This can be avoided if a sufficiently large area of the physical memory
is reserved for secretmem purposes at boot time.
Add the ability to reserve physical memory for secretmem at boot time using
the "secretmem" kernel parameter (e.g. "secretmem=512M") and then use that
reserved memory as a global pool for secret memory needs.
Signed-off-by: Mike Rapoport <[email protected]>
---
mm/secretmem.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 137 insertions(+), 8 deletions(-)
diff --git a/mm/secretmem.c b/mm/secretmem.c
index c6fcf6d76951..d50ac9ea1844 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -6,6 +6,8 @@
#include <linux/printk.h>
#include <linux/pagemap.h>
#include <linux/genalloc.h>
+#include <linux/memblock.h>
+#include <linux/mm_inline.h>
#include <linux/pseudo_fs.h>
#include <linux/set_memory.h>
#include <linux/sched/signal.h>
@@ -28,6 +30,39 @@ struct secretmem_ctx {
unsigned int mode;
};
+struct secretmem_pool {
+ struct gen_pool *pool;
+ unsigned long reserved_size;
+ void *reserved;
+};
+
+static struct secretmem_pool secretmem_pool;
+
+static struct page *secretmem_alloc_huge_page(gfp_t gfp)
+{
+ struct gen_pool *pool = secretmem_pool.pool;
+ unsigned long addr = 0;
+ struct page *page = NULL;
+
+ if (pool) {
+ if (gen_pool_avail(pool) < HPAGE_PMD_SIZE)
+ return NULL;
+
+ addr = gen_pool_alloc(pool, HPAGE_PMD_SIZE);
+ if (!addr)
+ return NULL;
+
+ page = virt_to_page(addr);
+ } else {
+ page = alloc_pages(gfp, HPAGE_PMD_ORDER);
+
+ if (page)
+ split_page(page, HPAGE_PMD_ORDER);
+ }
+
+ return page;
+}
+
static int secretmem_pool_increase(struct secretmem_ctx *ctx, gfp_t gfp)
{
unsigned long nr_pages = (1 << HPAGE_PMD_ORDER);
@@ -36,12 +71,11 @@ static int secretmem_pool_increase(struct secretmem_ctx *ctx, gfp_t gfp)
struct page *page;
int err;
- page = alloc_pages(gfp, HPAGE_PMD_ORDER);
+ page = secretmem_alloc_huge_page(gfp);
if (!page)
return -ENOMEM;
addr = (unsigned long)page_address(page);
- split_page(page, HPAGE_PMD_ORDER);
err = gen_pool_add(pool, addr, HPAGE_PMD_SIZE, NUMA_NO_NODE);
if (err) {
@@ -250,11 +284,23 @@ struct file *secretmem_file_create(const char *name, unsigned int flags)
return file;
}
-static void secretmem_cleanup_chunk(struct gen_pool *pool,
- struct gen_pool_chunk *chunk, void *data)
+static void secretmem_recycle_range(unsigned long start, unsigned long end)
+{
+ unsigned long addr;
+
+ for (addr = start; addr < end; addr += PAGE_SIZE) {
+ struct page *page = virt_to_page(addr);
+ struct lruvec *lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+ __ClearPageLRU(page);
+ del_page_from_lru_list(page, lruvec, page_off_lru(page));
+ }
+
+ gen_pool_free(secretmem_pool.pool, start, HPAGE_PMD_SIZE);
+}
+
+static void secretmem_release_range(unsigned long start, unsigned long end)
{
- unsigned long start = chunk->start_addr;
- unsigned long end = chunk->end_addr;
unsigned long nr_pages, addr;
nr_pages = (end - start + 1) / PAGE_SIZE;
@@ -264,6 +310,18 @@ static void secretmem_cleanup_chunk(struct gen_pool *pool,
put_page(virt_to_page(addr));
}
+static void secretmem_cleanup_chunk(struct gen_pool *pool,
+ struct gen_pool_chunk *chunk, void *data)
+{
+ unsigned long start = chunk->start_addr;
+ unsigned long end = chunk->end_addr;
+
+ if (secretmem_pool.pool)
+ secretmem_recycle_range(start, end);
+ else
+ secretmem_release_range(start, end);
+}
+
static void secretmem_cleanup_pool(struct secretmem_ctx *ctx)
{
struct gen_pool *pool = ctx->pool;
@@ -303,14 +361,85 @@ static struct file_system_type secretmem_fs = {
.kill_sb = kill_anon_super,
};
+static int secretmem_reserved_mem_init(void)
+{
+ struct gen_pool *pool;
+ struct page *page;
+ void *addr;
+ int err;
+
+ if (!secretmem_pool.reserved)
+ return 0;
+
+ pool = gen_pool_create(PMD_SHIFT, NUMA_NO_NODE);
+ if (!pool)
+ return -ENOMEM;
+
+ err = gen_pool_add(pool, (unsigned long)secretmem_pool.reserved,
+ secretmem_pool.reserved_size, NUMA_NO_NODE);
+ if (err)
+ goto err_destroy_pool;
+
+ for (addr = secretmem_pool.reserved;
+ addr < secretmem_pool.reserved + secretmem_pool.reserved_size;
+ addr += PAGE_SIZE) {
+ page = virt_to_page(addr);
+ __ClearPageReserved(page);
+ set_page_count(page, 1);
+ }
+
+ secretmem_pool.pool = pool;
+ page = virt_to_page(secretmem_pool.reserved);
+ __kernel_map_pages(page, secretmem_pool.reserved_size / PAGE_SIZE, 0);
+ return 0;
+
+err_destroy_pool:
+ gen_pool_destroy(pool);
+ return err;
+}
+
static int secretmem_init(void)
{
- int ret = 0;
+ int ret;
+
+ ret = secretmem_reserved_mem_init();
+ if (ret)
+ return ret;
secretmem_mnt = kern_mount(&secretmem_fs);
- if (IS_ERR(secretmem_mnt))
+ if (IS_ERR(secretmem_mnt)) {
+ gen_pool_destroy(secretmem_pool.pool);
ret = PTR_ERR(secretmem_mnt);
+ }
return ret;
}
fs_initcall(secretmem_init);
+
+static int __init secretmem_setup(char *str)
+{
+ phys_addr_t align = HPAGE_PMD_SIZE;
+ unsigned long reserved_size;
+ void *reserved;
+
+ reserved_size = memparse(str, NULL);
+ if (!reserved_size)
+ return 0;
+
+ if (reserved_size * 2 > HPAGE_PUD_SIZE)
+ align = HPAGE_PUD_SIZE;
+
+ reserved = memblock_alloc(reserved_size, align);
+ if (!reserved) {
+ pr_err("failed to reserve %zu bytes\n", secretmem_pool.reserved_size);
+ return 0;
+ }
+
+ secretmem_pool.reserved_size = reserved_size;
+ secretmem_pool.reserved = reserved;
+
+ pr_info("reserved %zuM\n", reserved_size >> 20);
+
+ return 1;
+}
+__setup("secretmem=", secretmem_setup);
--
2.26.2
From: Mike Rapoport <[email protected]>
The definitions of the shift, mask and size for the second and the third level
of the leaf pages are available only when CONFIG_TRANSPARENT_HUGEPAGE is
set. Otherwise they evaluate to BUILD_BUG().
There is no explanation, either in the code or in the changelog, why the
usage of, e.g., HPAGE_PMD_SIZE should only be allowed with THP and forbidden
otherwise, while the definitions of HPAGE_PMD_SIZE and HPAGE_PUD_SIZE
express the sizes better than the ambiguous HPAGE_SIZE.
Make HPAGE_PxD_{SHIFT,MASK,SIZE} definitions available unconditionally.
Signed-off-by: Mike Rapoport <[email protected]>
---
include/linux/huge_mm.h | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 71f20776b06c..1f4b44a76e31 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -115,7 +115,6 @@ extern struct kobj_attribute shmem_enabled_attr;
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define HPAGE_PMD_SHIFT PMD_SHIFT
#define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
#define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
@@ -124,6 +123,8 @@ extern struct kobj_attribute shmem_enabled_attr;
#define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
#define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+
extern unsigned long transparent_hugepage_flags;
/*
@@ -316,13 +317,6 @@ static inline struct list_head *page_deferred_list(struct page *page)
}
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
-#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
-#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
-#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
-
-#define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
-#define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
-#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
static inline int hpage_nr_pages(struct page *page)
{
--
2.26.2
From: Mike Rapoport <[email protected]>
It will be used by the upcoming secret memory implementation.
Signed-off-by: Mike Rapoport <[email protected]>
---
mm/internal.h | 3 +++
mm/mmap.c | 5 ++---
2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 9886db20d94f..af0a92f8f6bc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -349,6 +349,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
extern void mlock_vma_page(struct page *page);
extern unsigned int munlock_vma_page(struct page *page);
+extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+ unsigned long len);
+
/*
* Clear the page's PageMlocked(). This can be useful in a situation where
* we want to unconditionally remove a page from the pagecache -- e.g.,
diff --git a/mm/mmap.c b/mm/mmap.c
index 59a4682ebf3f..4dd40a4fedfb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1310,9 +1310,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
return hint;
}
-static inline int mlock_future_check(struct mm_struct *mm,
- unsigned long flags,
- unsigned long len)
+int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+ unsigned long len)
{
unsigned long locked, lock_limit;
--
2.26.2
From: Mike Rapoport <[email protected]>
Extend the memfd_create() system call with the ability to create memory areas
visible only in the context of the owning process and mapped neither into
other processes nor into the kernel page tables.
The user will create a file descriptor using the memfd_create() system call.
The user then has to use ioctl() to define the desired protection mode for
the memory associated with that file descriptor, and only once the mode is
set is it possible to mmap() the memory. For instance, the following
example will create an uncached mapping (error handling is omitted):
fd = memfd_create("secret", MFD_SECRET);
ioctl(fd, MFD_SECRET_UNCACHED);
ftruncate(fd, MAP_SIZE);
ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
fd, 0);
Signed-off-by: Mike Rapoport <[email protected]>
---
include/linux/memfd.h | 9 ++
include/uapi/linux/magic.h | 1 +
include/uapi/linux/memfd.h | 6 +
mm/Kconfig | 3 +
mm/Makefile | 1 +
mm/memfd.c | 10 +-
mm/secretmem.c | 247 +++++++++++++++++++++++++++++++++++++
7 files changed, 275 insertions(+), 2 deletions(-)
create mode 100644 mm/secretmem.c
diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..d3ca7285f51a 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -13,4 +13,13 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
}
#endif
+#ifdef CONFIG_MEMFD_SECRETMEM
+extern struct file *secretmem_file_create(const char *name, unsigned int flags);
+#else
+static inline struct file *secretmem_file_create(const char *name, unsigned int flags)
+{
+ return ERR_PTR(-EINVAL);
+}
+#endif
+
#endif /* __LINUX_MEMFD_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
#define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
#define Z3FOLD_MAGIC 0x33
#define PPC_CMM_MAGIC 0xc7571590
+#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
#endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..3320a79b638d 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,12 @@
#define MFD_CLOEXEC 0x0001U
#define MFD_ALLOW_SEALING 0x0002U
#define MFD_HUGETLB 0x0004U
+#define MFD_SECRET 0x0008U
+
+/* ioctls for secret memory */
+#define MFD_SECRET_IOCTL '-'
+#define MFD_SECRET_EXCLUSIVE _IOW(MFD_SECRET_IOCTL, 0x13, unsigned long)
+#define MFD_SECRET_UNCACHED _IOW(MFD_SECRET_IOCTL, 0x14, unsigned long)
/*
* Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/Kconfig b/mm/Kconfig
index f2104cc0d35c..20dfcc54cc7a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,7 @@ config ARCH_HAS_HUGEPD
config MAPPING_DIRTY_HELPERS
bool
+config MEMFD_SECRETMEM
+ def_bool MEMFD_CREATE && ARCH_HAS_SET_DIRECT_MAP
+
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 6e9d46b2efc9..a9459c8a655a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -121,3 +121,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_MEMFD_SECRETMEM) += secretmem.o
diff --git a/mm/memfd.c b/mm/memfd.c
index 2647c898990c..3e1cc37e0389 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -245,7 +245,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_SECRET_MASK (MFD_CLOEXEC | MFD_SECRET)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_SECRET)
SYSCALL_DEFINE2(memfd_create,
const char __user *, uname,
@@ -257,6 +258,9 @@ SYSCALL_DEFINE2(memfd_create,
char *name;
long len;
+ if (flags & ~(unsigned int)MFD_SECRET_MASK)
+ return -EINVAL;
+
if (!(flags & MFD_HUGETLB)) {
if (flags & ~(unsigned int)MFD_ALL_FLAGS)
return -EINVAL;
@@ -296,7 +300,9 @@ SYSCALL_DEFINE2(memfd_create,
goto err_name;
}
- if (flags & MFD_HUGETLB) {
+ if (flags & MFD_SECRET) {
+ file = secretmem_file_create(name, flags);
+ } else if (flags & MFD_HUGETLB) {
struct user_struct *user = NULL;
file = hugetlb_file_setup(name, 0, VM_NORESERVE, &user,
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..df8f8c958cc2
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,247 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/memfd.h>
+#include <linux/printk.h>
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+#include <linux/set_memory.h>
+#include <linux/sched/signal.h>
+
+#include <uapi/linux/memfd.h>
+#include <uapi/linux/magic.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+#define SECRETMEM_EXCLUSIVE 0x1
+#define SECRETMEM_UNCACHED 0x2
+
+struct secretmem_ctx {
+ unsigned int mode;
+};
+
+static struct page *secretmem_alloc_page(gfp_t gfp)
+{
+ /*
+ * FIXME: use a cache of large pages to reduce the direct map
+ * fragmentation
+ */
+ return alloc_page(gfp);
+}
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ pgoff_t offset = vmf->pgoff;
+ unsigned long addr;
+ struct page *page;
+ int ret = 0;
+
+ if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+ return vmf_error(-EINVAL);
+
+ page = find_get_entry(mapping, offset);
+ if (!page) {
+ page = secretmem_alloc_page(vmf->gfp_mask);
+ if (!page)
+ return vmf_error(-ENOMEM);
+
+ ret = add_to_page_cache_lru(page, mapping, offset, vmf->gfp_mask);
+ if (unlikely(ret))
+ goto err_put_page;
+
+ ret = set_direct_map_invalid_noflush(page);
+ if (ret)
+ goto err_del_page_cache;
+
+ addr = (unsigned long)page_address(page);
+ flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+ __SetPageUptodate(page);
+
+ ret = VM_FAULT_LOCKED;
+ }
+
+ vmf->page = page;
+ return ret;
+
+err_del_page_cache:
+ delete_from_page_cache(page);
+err_put_page:
+ put_page(page);
+ return vmf_error(ret);
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+ .fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct secretmem_ctx *ctx = file->private_data;
+ unsigned long mode = ctx->mode;
+ unsigned long len = vma->vm_end - vma->vm_start;
+
+ if (!mode)
+ return -EINVAL;
+
+ if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+ return -EAGAIN;
+
+ switch (mode) {
+ case SECRETMEM_UNCACHED:
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ fallthrough;
+ case SECRETMEM_EXCLUSIVE:
+ vma->vm_ops = &secretmem_vm_ops;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ vma->vm_flags |= VM_LOCKED;
+
+ return 0;
+}
+
+static long secretmem_ioctl(struct file *file, unsigned cmd, unsigned long arg)
+{
+ struct secretmem_ctx *ctx = file->private_data;
+ unsigned long mode = ctx->mode;
+
+ if (mode)
+ return -EINVAL;
+
+ switch (cmd) {
+ case MFD_SECRET_EXCLUSIVE:
+ mode = SECRETMEM_EXCLUSIVE;
+ break;
+ case MFD_SECRET_UNCACHED:
+ mode = SECRETMEM_UNCACHED;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ ctx->mode = mode;
+
+ return 0;
+}
+
+const struct file_operations secretmem_fops = {
+ .mmap = secretmem_mmap,
+ .unlocked_ioctl = secretmem_ioctl,
+ .compat_ioctl = secretmem_ioctl,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+ return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode)
+{
+ return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+ set_direct_map_default_noflush(page);
+}
+
+static const struct address_space_operations secretmem_aops = {
+ .freepage = secretmem_freepage,
+ .migratepage = secretmem_migratepage,
+ .isolate_page = secretmem_isolate_page,
+};
+
+static struct vfsmount *secretmem_mnt;
+
+struct file *secretmem_file_create(const char *name, unsigned int flags)
+{
+ struct inode *inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+ struct file *file = ERR_PTR(-ENOMEM);
+ struct secretmem_ctx *ctx;
+
+ if (IS_ERR(inode))
+ return ERR_CAST(inode);
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ goto err_free_inode;
+
+ file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+ O_RDWR, &secretmem_fops);
+ if (IS_ERR(file))
+ goto err_free_ctx;
+
+ mapping_set_unevictable(inode->i_mapping);
+
+ inode->i_mapping->private_data = ctx;
+ inode->i_mapping->a_ops = &secretmem_aops;
+
+ /* pretend we are a normal file with zero size */
+ inode->i_mode |= S_IFREG;
+ inode->i_size = 0;
+
+ file->private_data = ctx;
+
+ return file;
+
+err_free_ctx:
+ kfree(ctx);
+err_free_inode:
+ iput(inode);
+ return file;
+}
+
+static void secretmem_evict_inode(struct inode *inode)
+{
+ struct secretmem_ctx *ctx = inode->i_private;
+
+ truncate_inode_pages_final(&inode->i_data);
+ clear_inode(inode);
+ kfree(ctx);
+}
+
+static const struct super_operations secretmem_super_ops = {
+ .evict_inode = secretmem_evict_inode,
+};
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+ struct pseudo_fs_context *ctx = init_pseudo(fc, SECRETMEM_MAGIC);
+
+ if (!ctx)
+ return -ENOMEM;
+ ctx->ops = &secretmem_super_ops;
+
+ return 0;
+}
+
+static struct file_system_type secretmem_fs = {
+ .name = "secretmem",
+ .init_fs_context = secretmem_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+ int ret = 0;
+
+ secretmem_mnt = kern_mount(&secretmem_fs);
+ if (IS_ERR(secretmem_mnt))
+ ret = PTR_ERR(secretmem_mnt);
+
+ return ret;
+}
+fs_initcall(secretmem_init);
--
2.26.2
On Mon, 6 Jul 2020, Mike Rapoport wrote:
> From: Mike Rapoport <[email protected]>
>
> The definitions of shift, mask and size for the second and the third level
> of the leaf pages are available only when CONFIG_TRANSPARENT_HUGEPAGE is
> set. Otherwise they evaluate to BUILD_BUG().
>
> There is no explanation neither in the code nor in the changelog why the
> usage of, e.g. HPAGE_PMD_SIZE should be only allowed with THP and forbidden
> otherwise while the definitions of HPAGE_PMD_SIZE and HPAGE_PUD_SIZE
> express the sizes better than ambiguous HPAGE_SIZE.
>
> Make HPAGE_PxD_{SHIFT,MASK,SIZE} definitions available unconditionally.
Adding Andrea to Cc, he's the one who structured it that way,
and should be consulted.
I'm ambivalent myself. Many's the time I've been irritated by the
BUILD_BUG() in HPAGE_etc, and it's responsible for very many #ifdef
CONFIG_TRANSPARENT_HUGEPAGEs or IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)s
that you find uglily scattered around the source.
But that's the point of it: it's warning when you write code peculiar
to THP, that is going to bloat the build of kernels without any THP.
So although I've often been tempted to do as you suggest, I've always
ended up respecting Andrea's intention, and worked around it instead
(sometimes with #ifdef or IS_ENABLED(), sometimes with
PMD_{SHIFT,MASK,SIZE}, sometimes with a local definition).
Hugh
>
> Signed-off-by: Mike Rapoport <[email protected]>
> ---
> include/linux/huge_mm.h | 10 ++--------
> 1 file changed, 2 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 71f20776b06c..1f4b44a76e31 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -115,7 +115,6 @@ extern struct kobj_attribute shmem_enabled_attr;
> #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> #define HPAGE_PMD_SHIFT PMD_SHIFT
> #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
> #define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
> @@ -124,6 +123,8 @@ extern struct kobj_attribute shmem_enabled_attr;
> #define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
> #define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +
> extern unsigned long transparent_hugepage_flags;
>
> /*
> @@ -316,13 +317,6 @@ static inline struct list_head *page_deferred_list(struct page *page)
> }
>
> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> -#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> -#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
> -#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
> -
> -#define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
> -#define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
> -#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
>
> static inline int hpage_nr_pages(struct page *page)
> {
> --
> 2.26.2
Hi Hugh,
On Mon, Jul 06, 2020 at 10:07:34PM -0700, Hugh Dickins wrote:
> On Mon, 6 Jul 2020, Mike Rapoport wrote:
> > From: Mike Rapoport <[email protected]>
> >
> > The definitions of shift, mask and size for the second and the third level
> > of the leaf pages are available only when CONFIG_TRANSPARENT_HUGEPAGE is
> > set. Otherwise they evaluate to BUILD_BUG().
> >
> > There is no explanation neither in the code nor in the changelog why the
> > usage of, e.g. HPAGE_PMD_SIZE should be only allowed with THP and forbidden
> > otherwise while the definitions of HPAGE_PMD_SIZE and HPAGE_PUD_SIZE
> > express the sizes better than ambiguous HPAGE_SIZE.
> >
> > Make HPAGE_PxD_{SHIFT,MASK,SIZE} definitions available unconditionally.
>
> Adding Andrea to Cc, he's the one who structured it that way,
> and should be consulted.
>
> I'm ambivalent myself. Many's the time I've been irritated by the
> BUILD_BUG() in HPAGE_etc, and it's responsible for very many #ifdef
> CONFIG_TRANSPARENT_HUGEPAGEs or IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)s
> that you find uglily scattered around the source.
>
> But that's the point of it: it's warning when you write code peculiar
> to THP, that is going to bloat the build of kernels without any THP.
>
> So although I've often been tempted to do as you suggest, I've always
> ended up respecting Andrea's intention, and worked around it instead
> (sometimes with #ifdef or IS_ENABLED(), sometimes with
> PMD_{SHIFT,MASK_SIZE}, sometimes with a local definition).
I could do with a local definition as well, but I think HPAGE_PxD_SHIFT
is better and more descriptive than the ambiguous HPAGE_SHIFT, and I was
thinking about a wider change to use the "THP" defines rather than the
"hugetlb" defines wherever possible.
In the end, HPAGE_PMD_SIZE does not have to be associated with THP and
limited to it; it just says what the size of a leaf page at the PMD level
is.
> Hugh
>
> >
> > Signed-off-by: Mike Rapoport <[email protected]>
> > ---
> > include/linux/huge_mm.h | 10 ++--------
> > 1 file changed, 2 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 71f20776b06c..1f4b44a76e31 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -115,7 +115,6 @@ extern struct kobj_attribute shmem_enabled_attr;
> > #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> > #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> >
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > #define HPAGE_PMD_SHIFT PMD_SHIFT
> > #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
> > #define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
> > @@ -124,6 +123,8 @@ extern struct kobj_attribute shmem_enabled_attr;
> > #define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
> > #define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
> >
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +
> > extern unsigned long transparent_hugepage_flags;
> >
> > /*
> > @@ -316,13 +317,6 @@ static inline struct list_head *page_deferred_list(struct page *page)
> > }
> >
> > #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> > -#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> > -#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
> > -#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
> > -
> > -#define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
> > -#define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
> > -#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
> >
> > static inline int hpage_nr_pages(struct page *page)
> > {
> > --
> > 2.26.2
--
Sincerely yours,
Mike.
Hello Hugh and Mike,
On Mon, Jul 06, 2020 at 10:07:34PM -0700, Hugh Dickins wrote:
> Adding Andrea to Cc, he's the one who structured it that way,
> and should be consulted.
>
> I'm ambivalent myself. Many's the time I've been irritated by the
> BUILD_BUG() in HPAGE_etc, and it's responsible for very many #ifdef
> CONFIG_TRANSPARENT_HUGEPAGEs or IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)s
> that you find uglily scattered around the source.
>
> But that's the point of it: it's warning when you write code peculiar
> to THP, that is going to bloat the build of kernels without any THP.
>
> So although I've often been tempted to do as you suggest, I've always
> ended up respecting Andrea's intention, and worked around it instead
> (sometimes with #ifdef or IS_ENABLED(), sometimes with
> PMD_{SHIFT,MASK_SIZE}, sometimes with a local definition).
The only other reason that comes to mind, in addition to optimizing
the bloat away at build time, is to make it easier to identify the THP
code and to make it explicit that hugetlbfs shouldn't use it, as it
could be wrong on some arches.
However for this case the BUILD_BUG() looks right and this doesn't
look like a false positive.
This patchset has nothing to do with THP, so it'd be more correct to use
MAX_ORDER whenever the fragmentation is about the buddy allocator (which
doesn't look like the case here), or PUD_SIZE/ORDER/PMD_SIZE/ORDER if the
objective is not to unnecessarily split extra and unrelated
hugepuds/hugepmds in the direct mapping (as in this case).
The real issue exposed by the BUILD_BUG is the lack of a PMD_ORDER
definition; fs/dax.c already ran into it and solved it locally in the
dax.c file:
/* The order of a PMD entry */
#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
The fact that it's not just this patch but also dax.c that ran into the
same issue makes me think PMD_ORDER should be defined, and then you
can use PMD_* and PUD_* for this non-THP purpose.
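With a generic PMD_ORDER in place, the pool refill in this series could
then read, for instance (a sketch):

	page = alloc_pages(gfp, PMD_ORDER);
	if (!page)
		return -ENOMEM;

	addr = (unsigned long)page_address(page);
	split_page(page, PMD_ORDER);

	err = gen_pool_add(pool, addr, PMD_SIZE, NUMA_NO_NODE);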
Then the question of whether to remove the BUILD_BUG becomes orthogonal to
this patchset, but I don't see much value in retaining HPAGE_PMD/PUD_*
unless the BUILD_BUG is retained too, because this patchset already
hints that without the BUILD_BUG() the HPAGE_PMD_* definitions would
likely spill into non-THP paths, and they would also lose the only
value left (the ability to localize the THP code paths). So I wouldn't
be against removing the BUILD_BUG if it's causing maintenance
overhead, but then I would drop HPAGE_PMD_* along with it too, or it
may just cause confusion.
Thanks,
Andrea
On Fri, Jul 10, 2020 at 12:40:37PM -0400, Andrea Arcangeli wrote:
> Hello Hugh and Mike,
>
> On Mon, Jul 06, 2020 at 10:07:34PM -0700, Hugh Dickins wrote:
> > Adding Andrea to Cc, he's the one who structured it that way,
> > and should be consulted.
> >
> > I'm ambivalent myself. Many's the time I've been irritated by the
> > BUILD_BUG() in HPAGE_etc, and it's responsible for very many #ifdef
> > CONFIG_TRANSPARENT_HUGEPAGEs or IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)s
> > that you find uglily scattered around the source.
> >
> > But that's the point of it: it's warning when you write code peculiar
> > to THP, that is going to bloat the build of kernels without any THP.
> >
> > So although I've often been tempted to do as you suggest, I've always
> > ended up respecting Andrea's intention, and worked around it instead
> > (sometimes with #ifdef or IS_ENABLED(), sometimes with
> > PMD_{SHIFT,MASK_SIZE}, sometimes with a local definition).
>
> The only other reasons that comes to mind in addition of optimizing
> the bloat away at build time is to make it easier to identify the THP
> code and to make it explicit that hugetlbfs shouldn't us it or it
> could be wrong on some arches.
>
> However for this case the BUILD_BUG() looks right and this doesn't
> look like a false positive.
>
> This patchset has nothing to do THP, so it'd be more correct to use
> MAX_ORDER whenever the fragmentation is about the buddy (doesn't look
> the case here) or PUD_SIZE/ORDER/PMD_SIZE/ORDER if the objective is
> not to unnecessarily split extra and unrelated hugepud/hugepmds in the
> direct mapping (as in this case).
>
> The real issue exposed by the BUILD_BUG is the lack of PMD_ORDER
> definition and fs/dax.c already run into and it solved it locally in the
> dax.c file:
>
> /* The order of a PMD entry */
> #define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
>
> The fact it's not just this patch but also dax.c that run into the
> same issue, makes me think PMD_ORDER should be defined and then you
> can use PMD_* and PUD_* for this non-THP purpose.
We'll run into some namespace issues.
arch/arm/kernel/head.S:#define PMD_ORDER 3
arch/arm/kernel/head.S:#define PMD_ORDER 2
arch/mips/include/asm/pgtable-32.h:#define PMD_ORDER aieeee_attempt_to_allocate_pmd
arch/mips/include/asm/pgtable-64.h:#define PMD_ORDER 0
arch/parisc/include/asm/pgtable.h:#define PMD_ORDER 1 /* Number of pages per pmd */
> Then the question if to remove the BUILD_BUG becomes orthogonal to
> this patchset, but I don't see much value in retaining HPAGE_PMD/PUD_*
> unless the BUILD_BUG is retained too, because this patchset already
> hints that without the BUILD_BUG() the HPAGE_PMD_* definitions would
> likely spill into non THP paths and they would lose also the only
> value left (the ability to localize the THP code paths). So I wouldn't
> be against removing the BUILD_BUG if it's causing maintenance
> overhead, but then I would drop HPAGE_PMD_* too along with it or it
> may just cause confusion.
btw, using the hpage_ prefix already caused one problem in the hugetlb
code:
https://lore.kernel.org/linux-mm/[email protected]/
I'd suggest we rename these to THP_PMD_* and THP_PUD_* to make it clear
they're only for the THP case.
On Fri, Jul 10, 2020 at 05:57:46PM +0100, Matthew Wilcox wrote:
> btw, using the hpage_ prefix already caused one problem in the hugetlb
> code:
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> I'd suggest we rename these to THP_PMD_* and THP_PUD_* to make it clear
> they're only for the THP case.
The confusion seems to have happened only about hpage_nr_pages, not
about HPAGE_PMD_*. It's just the hpage_ prefix alone that is commonly
used only by hugetlbfs, so it's not surprising it caused confusion.
So I certainly agree hpage_nr_pages would better be renamed to
something more THP-specific (either hpage_pmd_nr_pages or
trans_huge_nr_pages or as you wish), but HPAGE_PMD_* doesn't look too
confusing about the fact that it's only for the THP case, since the
non-THP case won't necessarily care about PMDs.
Thanks,
Andrea
On Fri, Jul 10, 2020 at 05:57:46PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 10, 2020 at 12:40:37PM -0400, Andrea Arcangeli wrote:
> > Hello Hugh and Mike,
> >
> > On Mon, Jul 06, 2020 at 10:07:34PM -0700, Hugh Dickins wrote:
> > > Adding Andrea to Cc, he's the one who structured it that way,
> > > and should be consulted.
> > >
> > > I'm ambivalent myself. Many's the time I've been irritated by the
> > > BUILD_BUG() in HPAGE_etc, and it's responsible for very many #ifdef
> > > CONFIG_TRANSPARENT_HUGEPAGEs or IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)s
> > > that you find uglily scattered around the source.
> > >
> > > But that's the point of it: it's warning when you write code peculiar
> > > to THP, that is going to bloat the build of kernels without any THP.
> > >
> > > So although I've often been tempted to do as you suggest, I've always
> > > ended up respecting Andrea's intention, and worked around it instead
> > > (sometimes with #ifdef or IS_ENABLED(), sometimes with
> > > PMD_{SHIFT,MASK_SIZE}, sometimes with a local definition).
> >
> > The only other reasons that comes to mind in addition of optimizing
> > the bloat away at build time is to make it easier to identify the THP
> > code and to make it explicit that hugetlbfs shouldn't us it or it
> > could be wrong on some arches.
> >
> > However for this case the BUILD_BUG() looks right and this doesn't
> > look like a false positive.
> >
> > This patchset has nothing to do THP, so it'd be more correct to use
> > MAX_ORDER whenever the fragmentation is about the buddy (doesn't look
> > the case here) or PUD_SIZE/ORDER/PMD_SIZE/ORDER if the objective is
> > not to unnecessarily split extra and unrelated hugepud/hugepmds in the
> > direct mapping (as in this case).
> >
> > The real issue exposed by the BUILD_BUG is the lack of PMD_ORDER
> > definition and fs/dax.c already run into and it solved it locally in the
> > dax.c file:
> >
> > /* The order of a PMD entry */
> > #define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
> >
> > The fact it's not just this patch but also dax.c that run into the
> > same issue, makes me think PMD_ORDER should be defined and then you
> > can use PMD_* and PUD_* for this non-THP purpose.
>
> We'll run into some namespace issues.
>
> arch/arm/kernel/head.S:#define PMD_ORDER 3
> arch/arm/kernel/head.S:#define PMD_ORDER 2
> arch/mips/include/asm/pgtable-32.h:#define PMD_ORDER aieeee_attempt_to_allocate_pmd
> arch/mips/include/asm/pgtable-64.h:#define PMD_ORDER 0
> arch/parisc/include/asm/pgtable.h:#define PMD_ORDER 1 /* Number of pages per pmd */
This can be easily solved with, e.g.
#define PMD_PAGE_ORDER (PMD_SHIFT - PAGE_SHIFT)
or by renaming the current defines to PMD_ALLOC_ORDER.
> > Then the question if to remove the BUILD_BUG becomes orthogonal to
> > this patchset, but I don't see much value in retaining HPAGE_PMD/PUD_*
> > unless the BUILD_BUG is retained too, because this patchset already
> > hints that without the BUILD_BUG() the HPAGE_PMD_* definitions would
> > likely spill into non THP paths and they would lose also the only
> > value left (the ability to localize the THP code paths). So I wouldn't
> > be against removing the BUILD_BUG if it's causing maintenance
> > overhead, but then I would drop HPAGE_PMD_* too along with it or it
> > may just cause confusion.
>
> btw, using the hpage_ prefix already caused one problem in the hugetlb
> code:
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> I'd suggest we rename these to THP_PMD_* and THP_PUD_* to make it clear
> they're only for the THP case.
I agree that THP_PMD_* and THP_PUD_* would be less confusing if we are
to differentiate THP and non-THP usage of 2nd and 3rd level leaf pages.
--
Sincerely yours,
Mike.
On Mon, Jul 06, 2020 at 08:20:49PM +0300, Mike Rapoport wrote:
> From: Mike Rapoport <[email protected]>
>
> Extend memfd_create() system call with the ability to create memory areas
> visible only in the context of the owning process and not mapped not only
> to other processes but in the kernel page tables as well.
>
> The user will create a file descriptor using the memfd_create system call.
> The user than has to use ioctl() to define the desired protection mode for
> the memory associated with that file descriptor and only when the mode is
> set it is possible to mmap() the memory. For instance, the following
> exapmple will create an uncached mapping (error handling is omitted):
>
> fd = memfd_create("secret", MFD_SECRET);
I'm not convinced that it belongs in memfd. You don't share anything with
memfd but the syscall.
> ioctl(fd, MFD_SECRET_UNCACHED);
> ftruncate(fd. MAP_SIZE);
Mix of tabs and spaces?
> ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
> fd, 0);
>
> Signed-off-by: Mike Rapoport <[email protected]>
> ---
> include/linux/memfd.h | 9 ++
> include/uapi/linux/magic.h | 1 +
> include/uapi/linux/memfd.h | 6 +
> mm/Kconfig | 3 +
> mm/Makefile | 1 +
> mm/memfd.c | 10 +-
> mm/secretmem.c | 247 +++++++++++++++++++++++++++++++++++++
> 7 files changed, 275 insertions(+), 2 deletions(-)
> create mode 100644 mm/secretmem.c
>
> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> index 4f1600413f91..d3ca7285f51a 100644
> --- a/include/linux/memfd.h
> +++ b/include/linux/memfd.h
> @@ -13,4 +13,13 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
> }
> #endif
>
> +#ifdef CONFIG_MEMFD_SECRETMEM
> +extern struct file *secretmem_file_create(const char *name, unsigned int flags);
> +#else
> +static inline struct file *secretmem_file_create(const char *name, unsigned int flags)
> +{
> + return ERR_PTR(-EINVAL);
> +}
> +#endif
> +
> #endif /* __LINUX_MEMFD_H */
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index f3956fc11de6..35687dcb1a42 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -97,5 +97,6 @@
> #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
> #define Z3FOLD_MAGIC 0x33
> #define PPC_CMM_MAGIC 0xc7571590
> +#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
>
> #endif /* __LINUX_MAGIC_H__ */
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 7a8a26751c23..3320a79b638d 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -8,6 +8,12 @@
> #define MFD_CLOEXEC 0x0001U
> #define MFD_ALLOW_SEALING 0x0002U
> #define MFD_HUGETLB 0x0004U
> +#define MFD_SECRET 0x0008U
> +
> +/* ioctls for secret memory */
> +#define MFD_SECRET_IOCTL '-'
> +#define MFD_SECRET_EXCLUSIVE _IOW(MFD_SECRET_IOCTL, 0x13, unsigned long)
> +#define MFD_SECRET_UNCACHED _IOW(MFD_SECRET_IOCTL, 0x14, unsigned long)
>
> /*
> * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/mm/Kconfig b/mm/Kconfig
> index f2104cc0d35c..20dfcc54cc7a 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -872,4 +872,7 @@ config ARCH_HAS_HUGEPD
> config MAPPING_DIRTY_HELPERS
> bool
>
> +config MEMFD_SECRETMEM
> + def_bool MEMFD_CREATE && ARCH_HAS_SET_DIRECT_MAP
> +
> endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index 6e9d46b2efc9..a9459c8a655a 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -121,3 +121,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
> obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
> obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
> obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
> +obj-$(CONFIG_MEMFD_SECRETMEM) += secretmem.o
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 2647c898990c..3e1cc37e0389 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -245,7 +245,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
> #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> +#define MFD_SECRET_MASK (MFD_CLOEXEC | MFD_SECRET)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_SECRET)
>
> SYSCALL_DEFINE2(memfd_create,
> const char __user *, uname,
> @@ -257,6 +258,9 @@ SYSCALL_DEFINE2(memfd_create,
> char *name;
> long len;
>
> + if (flags & ~(unsigned int)MFD_SECRET_MASK)
> + return -EINVAL;
> +
Didn't you just break MFD_ALLOW_SEALING and MFD_HUGETLB with this?
I guess the check has to be under the 'if (flags & MFD_SECRET) {' check, no?
And the (unsigned int) cast looks redundant to me.
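I mean something along these lines (a sketch of the suggested structure,
reusing the existing MFD_HUGETLB branch):

	if (flags & MFD_SECRET) {
		if (flags & ~MFD_SECRET_MASK)
			return -EINVAL;
	} else if (!(flags & MFD_HUGETLB)) {
		if (flags & ~MFD_ALL_FLAGS)
			return -EINVAL;
	} else {
		/* Allow huge page size encoding in flags. */
		if (flags & ~(MFD_ALL_FLAGS |
			      (MFD_HUGE_MASK << MFD_HUGE_SHIFT)))
			return -EINVAL;
	}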
> if (!(flags & MFD_HUGETLB)) {
> if (flags & ~(unsigned int)MFD_ALL_FLAGS)
> return -EINVAL;
> @@ -296,7 +300,9 @@ SYSCALL_DEFINE2(memfd_create,
> goto err_name;
> }
>
> - if (flags & MFD_HUGETLB) {
> + if (flags & MFD_SECRET) {
> + file = secretmem_file_create(name, flags);
> + } else if (flags & MFD_HUGETLB) {
> struct user_struct *user = NULL;
>
> file = hugetlb_file_setup(name, 0, VM_NORESERVE, &user,
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> new file mode 100644
> index 000000000000..df8f8c958cc2
> --- /dev/null
> +++ b/mm/secretmem.c
> @@ -0,0 +1,247 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/mm.h>
> +#include <linux/fs.h>
> +#include <linux/mount.h>
> +#include <linux/memfd.h>
> +#include <linux/printk.h>
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/set_memory.h>
> +#include <linux/sched/signal.h>
> +
> +#include <uapi/linux/memfd.h>
> +#include <uapi/linux/magic.h>
> +
> +#include <asm/tlbflush.h>
> +
> +#include "internal.h"
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt) "secretmem: " fmt
> +
> +#define SECRETMEM_EXCLUSIVE 0x1
> +#define SECRETMEM_UNCACHED 0x2
> +
> +struct secretmem_ctx {
> + unsigned int mode;
> +};
> +
> +static struct page *secretmem_alloc_page(gfp_t gfp)
> +{
> + /*
> + * FIXME: use a cache of large pages to reduce the direct map
> + * fragmentation
> + */
> + return alloc_page(gfp);
> +}
> +
> +static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> +{
> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> + struct inode *inode = file_inode(vmf->vma->vm_file);
> + pgoff_t offset = vmf->pgoff;
> + unsigned long addr;
> + struct page *page;
> + int ret = 0;
> +
> + if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> + return vmf_error(-EINVAL);
> +
> + page = find_get_entry(mapping, offset);
> + if (!page) {
> + page = secretmem_alloc_page(vmf->gfp_mask);
> + if (!page)
> + return vmf_error(-ENOMEM);
> +
> + ret = add_to_page_cache_lru(page, mapping, offset, vmf->gfp_mask);
> + if (unlikely(ret))
> + goto err_put_page;
What's the reason to add it to the LRU? These pages are never evictable. Do
we have some PageLRU() check that needs to be satisfied, or what?
> +
> + ret = set_direct_map_invalid_noflush(page);
> + if (ret)
> + goto err_del_page_cache;
> +
> + addr = (unsigned long)page_address(page);
> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> +
> + __SetPageUptodate(page);
> +
> + ret = VM_FAULT_LOCKED;
> + }
> +
> + vmf->page = page;
> + return ret;
> +
> +err_del_page_cache:
> + delete_from_page_cache(page);
> +err_put_page:
> + put_page(page);
> + return vmf_error(ret);
> +}
> +
> +static const struct vm_operations_struct secretmem_vm_ops = {
> + .fault = secretmem_fault,
> +};
> +
> +static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct secretmem_ctx *ctx = file->private_data;
> + unsigned long mode = ctx->mode;
> + unsigned long len = vma->vm_end - vma->vm_start;
> +
> + if (!mode)
> + return -EINVAL;
> +
> + if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
> + return -EAGAIN;
> +
> + switch (mode) {
> + case SECRETMEM_UNCACHED:
> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> + fallthrough;
> + case SECRETMEM_EXCLUSIVE:
> + vma->vm_ops = &secretmem_vm_ops;
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + vma->vm_flags |= VM_LOCKED;
> +
> + return 0;
> +}
> +
> +static long secretmem_ioctl(struct file *file, unsigned cmd, unsigned long arg)
> +{
> + struct secretmem_ctx *ctx = file->private_data;
> + unsigned long mode = ctx->mode;
> +
> + if (mode)
> + return -EINVAL;
> +
> + switch (cmd) {
> + case MFD_SECRET_EXCLUSIVE:
> + mode = SECRETMEM_EXCLUSIVE;
> + break;
> + case MFD_SECRET_UNCACHED:
> + mode = SECRETMEM_UNCACHED;
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + ctx->mode = mode;
> +
> + return 0;
> +}
> +
> +const struct file_operations secretmem_fops = {
> + .mmap = secretmem_mmap,
> + .unlocked_ioctl = secretmem_ioctl,
> + .compat_ioctl = secretmem_ioctl,
> +};
> +
> +static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
> +{
> + return false;
> +}
> +
> +static int secretmem_migratepage(struct address_space *mapping,
> + struct page *newpage, struct page *page,
> + enum migrate_mode mode)
> +{
> + return -EBUSY;
> +}
> +
> +static void secretmem_freepage(struct page *page)
> +{
> + set_direct_map_default_noflush(page);
> +}
> +
> +static const struct address_space_operations secretmem_aops = {
> + .freepage = secretmem_freepage,
> + .migratepage = secretmem_migratepage,
> + .isolate_page = secretmem_isolate_page,
> +};
> +
> +static struct vfsmount *secretmem_mnt;
> +
> +struct file *secretmem_file_create(const char *name, unsigned int flags)
> +{
> + struct inode *inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
> + struct file *file = ERR_PTR(-ENOMEM);
> + struct secretmem_ctx *ctx;
> +
> + if (IS_ERR(inode))
> + return ERR_CAST(inode);
> +
> + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> + if (!ctx)
> + goto err_free_inode;
> +
> + file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
> + O_RDWR, &secretmem_fops);
> + if (IS_ERR(file))
> + goto err_free_ctx;
> +
> + mapping_set_unevictable(inode->i_mapping);
> +
> + inode->i_mapping->private_data = ctx;
> + inode->i_mapping->a_ops = &secretmem_aops;
> +
> + /* pretend we are a normal file with zero size */
> + inode->i_mode |= S_IFREG;
> + inode->i_size = 0;
> +
> + file->private_data = ctx;
> +
> + return file;
> +
> +err_free_ctx:
> + kfree(ctx);
> +err_free_inode:
> + iput(inode);
> + return file;
> +}
> +
> +static void secretmem_evict_inode(struct inode *inode)
> +{
> + struct secretmem_ctx *ctx = inode->i_private;
> +
> + truncate_inode_pages_final(&inode->i_data);
> + clear_inode(inode);
> + kfree(ctx);
> +}
> +
> +static const struct super_operations secretmem_super_ops = {
> + .evict_inode = secretmem_evict_inode,
> +};
> +
> +static int secretmem_init_fs_context(struct fs_context *fc)
> +{
> + struct pseudo_fs_context *ctx = init_pseudo(fc, SECRETMEM_MAGIC);
> +
> + if (!ctx)
> + return -ENOMEM;
> + ctx->ops = &secretmem_super_ops;
> +
> + return 0;
> +}
> +
> +static struct file_system_type secretmem_fs = {
> + .name = "secretmem",
> + .init_fs_context = secretmem_init_fs_context,
> + .kill_sb = kill_anon_super,
> +};
> +
> +static int secretmem_init(void)
> +{
> + int ret = 0;
> +
> + secretmem_mnt = kern_mount(&secretmem_fs);
> + if (IS_ERR(secretmem_mnt))
> + ret = PTR_ERR(secretmem_mnt);
> +
> + return ret;
> +}
> +fs_initcall(secretmem_init);
> --
> 2.26.2
>
--
Kirill A. Shutemov
On Mon, Jul 06, 2020 at 08:20:50PM +0300, Mike Rapoport wrote:
> From: Mike Rapoport <[email protected]>
>
> Removing a PAGE_SIZE page from the direct map every time such page is
> allocated for a secret memory mapping will cause severe fragmentation of
> the direct map. This fragmentation can be reduced by using PMD-size pages
> as a pool for small pages for secret memory mappings.
>
> Add a gen_pool per secretmem inode and lazily populate this pool with
> PMD-size pages.
>
> Signed-off-by: Mike Rapoport <[email protected]>
> ---
> mm/secretmem.c | 107 ++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 88 insertions(+), 19 deletions(-)
>
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index df8f8c958cc2..c6fcf6d76951 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -5,6 +5,7 @@
> #include <linux/memfd.h>
> #include <linux/printk.h>
> #include <linux/pagemap.h>
> +#include <linux/genalloc.h>
> #include <linux/pseudo_fs.h>
> #include <linux/set_memory.h>
> #include <linux/sched/signal.h>
> @@ -23,24 +24,66 @@
> #define SECRETMEM_UNCACHED 0x2
>
> struct secretmem_ctx {
> + struct gen_pool *pool;
> unsigned int mode;
> };
>
> -static struct page *secretmem_alloc_page(gfp_t gfp)
> +static int secretmem_pool_increase(struct secretmem_ctx *ctx, gfp_t gfp)
> {
> - /*
> - * FIXME: use a cache of large pages to reduce the direct map
> - * fragmentation
> - */
> - return alloc_page(gfp);
> + unsigned long nr_pages = (1 << HPAGE_PMD_ORDER);
> + struct gen_pool *pool = ctx->pool;
> + unsigned long addr;
> + struct page *page;
> + int err;
> +
> + page = alloc_pages(gfp, HPAGE_PMD_ORDER);
> + if (!page)
> + return -ENOMEM;
> +
> + addr = (unsigned long)page_address(page);
> + split_page(page, HPAGE_PMD_ORDER);
> +
> + err = gen_pool_add(pool, addr, HPAGE_PMD_SIZE, NUMA_NO_NODE);
> + if (err) {
> + __free_pages(page, HPAGE_PMD_ORDER);
> + return err;
> + }
> +
> + __kernel_map_pages(page, nr_pages, 0);
It's worth noting that unlike flush_tlb_kernel_range(),
__kernel_map_pages() only flushes the local TLB, so other CPUs may still
have access to the page. It shouldn't be a blocker, but it deserves a comment.
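Something like this would do (an untested sketch; whether a global flush
is actually wanted here is a judgement call for the real patch):

	/*
	 * Remove the pages from the direct map. Note that
	 * __kernel_map_pages() flushes only the local TLB, so other
	 * CPUs may keep stale translations for these pages until a
	 * global flush of the range is done.
	 */
	__kernel_map_pages(page, nr_pages, 0);
	flush_tlb_kernel_range(addr, addr + HPAGE_PMD_SIZE);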
> +
> + return 0;
> +}
> +
> +static struct page *secretmem_alloc_page(struct secretmem_ctx *ctx,
> + gfp_t gfp)
> +{
> + struct gen_pool *pool = ctx->pool;
> + unsigned long addr;
> + struct page *page;
> + int err;
> +
> + if (gen_pool_avail(pool) < PAGE_SIZE) {
> + err = secretmem_pool_increase(ctx, gfp);
> + if (err)
> + return NULL;
> + }
> +
> + addr = gen_pool_alloc(pool, PAGE_SIZE);
> + if (!addr)
> + return NULL;
> +
> + page = virt_to_page(addr);
> + get_page(page);
> +
> + return page;
> }
>
> static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> {
> + struct secretmem_ctx *ctx = vmf->vma->vm_file->private_data;
> struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> struct inode *inode = file_inode(vmf->vma->vm_file);
> pgoff_t offset = vmf->pgoff;
> - unsigned long addr;
> struct page *page;
> int ret = 0;
>
> @@ -49,7 +92,7 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
>
> page = find_get_entry(mapping, offset);
> if (!page) {
> - page = secretmem_alloc_page(vmf->gfp_mask);
> + page = secretmem_alloc_page(ctx, vmf->gfp_mask);
> if (!page)
> return vmf_error(-ENOMEM);
>
> @@ -57,14 +100,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> if (unlikely(ret))
> goto err_put_page;
>
> - ret = set_direct_map_invalid_noflush(page);
> - if (ret)
> - goto err_del_page_cache;
> -
> - addr = (unsigned long)page_address(page);
> - flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> -
> __SetPageUptodate(page);
> + set_page_private(page, (unsigned long)ctx);
>
> ret = VM_FAULT_LOCKED;
> }
> @@ -72,8 +109,6 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> vmf->page = page;
> return ret;
>
> -err_del_page_cache:
> - delete_from_page_cache(page);
> err_put_page:
> put_page(page);
> return vmf_error(ret);
> @@ -155,7 +190,11 @@ static int secretmem_migratepage(struct address_space *mapping,
>
> static void secretmem_freepage(struct page *page)
> {
> - set_direct_map_default_noflush(page);
> + unsigned long addr = (unsigned long)page_address(page);
> + struct secretmem_ctx *ctx = (struct secretmem_ctx *)page_private(page);
> + struct gen_pool *pool = ctx->pool;
> +
> + gen_pool_free(pool, addr, PAGE_SIZE);
> }
>
> static const struct address_space_operations secretmem_aops = {
> @@ -179,13 +218,18 @@ struct file *secretmem_file_create(const char *name, unsigned int flags)
> if (!ctx)
> goto err_free_inode;
>
> + ctx->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
> + if (!ctx->pool)
> + goto err_free_ctx;
> +
> file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
> O_RDWR, &secretmem_fops);
> if (IS_ERR(file))
> - goto err_free_ctx;
> + goto err_free_pool;
>
> mapping_set_unevictable(inode->i_mapping);
>
> + inode->i_private = ctx;
> inode->i_mapping->private_data = ctx;
> inode->i_mapping->a_ops = &secretmem_aops;
>
> @@ -197,6 +241,8 @@ struct file *secretmem_file_create(const char *name, unsigned int flags)
>
> return file;
>
> +err_free_pool:
> + gen_pool_destroy(ctx->pool);
> err_free_ctx:
> kfree(ctx);
> err_free_inode:
> @@ -204,11 +250,34 @@ struct file *secretmem_file_create(const char *name, unsigned int flags)
> return file;
> }
>
> +static void secretmem_cleanup_chunk(struct gen_pool *pool,
> + struct gen_pool_chunk *chunk, void *data)
> +{
> + unsigned long start = chunk->start_addr;
> + unsigned long end = chunk->end_addr;
> + unsigned long nr_pages, addr;
> +
> + nr_pages = (end - start + 1) / PAGE_SIZE;
> + __kernel_map_pages(virt_to_page(start), nr_pages, 1);
> +
> + for (addr = start; addr < end; addr += PAGE_SIZE)
> + put_page(virt_to_page(addr));
> +}
> +
> +static void secretmem_cleanup_pool(struct secretmem_ctx *ctx)
> +{
> + struct gen_pool *pool = ctx->pool;
> +
> + gen_pool_for_each_chunk(pool, secretmem_cleanup_chunk, ctx);
> + gen_pool_destroy(pool);
> +}
> +
> static void secretmem_evict_inode(struct inode *inode)
> {
> struct secretmem_ctx *ctx = inode->i_private;
>
> truncate_inode_pages_final(&inode->i_data);
> + secretmem_cleanup_pool(ctx);
> clear_inode(inode);
> kfree(ctx);
> }
> --
> 2.26.2
>
--
Kirill A. Shutemov
On Mon, Jul 13, 2020 at 01:58:12PM +0300, Kirill A. Shutemov wrote:
> On Mon, Jul 06, 2020 at 08:20:49PM +0300, Mike Rapoport wrote:
> > From: Mike Rapoport <[email protected]>
> >
> > Extend the memfd_create() system call with the ability to create memory
> > areas visible only in the context of the owning process and mapped neither
> > in other processes nor in the kernel page tables.
> >
> > The user will create a file descriptor using the memfd_create() system call.
> > The user then has to use ioctl() to define the desired protection mode for
> > the memory associated with that file descriptor, and only once the mode is
> > set is it possible to mmap() the memory. For instance, the following
> > example will create an uncached mapping (error handling is omitted):
> >
> > fd = memfd_create("secret", MFD_SECRET);
>
> I'm not convinced that it belongs in memfd. You don't share anything with
> memfd but the syscall.
I've chosen memfd because it implements a file descriptor for memory
access. Indeed, there the similarities end, and this could be an entirely
new system call.
And, TBH, I was too lazy to wire up a new syscall for an RFC :)
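If this does become its own syscall, the wiring would be roughly the
following (a hypothetical sketch, not part of this series; flag
validation is hand-waved):

	SYSCALL_DEFINE1(memfd_secret, unsigned int, flags)
	{
		struct file *file;
		int fd;

		/* reserve a descriptor, then back it with a secretmem file */
		fd = get_unused_fd_flags(flags & MFD_CLOEXEC ? O_CLOEXEC : 0);
		if (fd < 0)
			return fd;

		file = secretmem_file_create("secretmem", flags);
		if (IS_ERR(file)) {
			put_unused_fd(fd);
			return PTR_ERR(file);
		}

		fd_install(fd, file);
		return fd;
	}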
> > ioctl(fd, MFD_SECRET_UNCACHED);
> > ftruncate(fd, MAP_SIZE);
>
> Mix of tabs and spaces?
Will fix.
> > ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
> > fd, 0);
> >
> > Signed-off-by: Mike Rapoport <[email protected]>
> > ---
> > include/linux/memfd.h | 9 ++
> > include/uapi/linux/magic.h | 1 +
> > include/uapi/linux/memfd.h | 6 +
> > mm/Kconfig | 3 +
> > mm/Makefile | 1 +
> > mm/memfd.c | 10 +-
> > mm/secretmem.c | 247 +++++++++++++++++++++++++++++++++++++
> > 7 files changed, 275 insertions(+), 2 deletions(-)
> > create mode 100644 mm/secretmem.c
> >
> > diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> > index 4f1600413f91..d3ca7285f51a 100644
> > --- a/include/linux/memfd.h
> > +++ b/include/linux/memfd.h
> > @@ -13,4 +13,13 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
> > }
> > #endif
> >
> > +#ifdef CONFIG_MEMFD_SECRETMEM
> > +extern struct file *secretmem_file_create(const char *name, unsigned int flags);
> > +#else
> > +static inline struct file *secretmem_file_create(const char *name, unsigned int flags)
> > +{
> > + return ERR_PTR(-EINVAL);
> > +}
> > +#endif
> > +
> > #endif /* __LINUX_MEMFD_H */
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index f3956fc11de6..35687dcb1a42 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -97,5 +97,6 @@
> > #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
> > #define Z3FOLD_MAGIC 0x33
> > #define PPC_CMM_MAGIC 0xc7571590
> > +#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
> >
> > #endif /* __LINUX_MAGIC_H__ */
> > diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> > index 7a8a26751c23..3320a79b638d 100644
> > --- a/include/uapi/linux/memfd.h
> > +++ b/include/uapi/linux/memfd.h
> > @@ -8,6 +8,12 @@
> > #define MFD_CLOEXEC 0x0001U
> > #define MFD_ALLOW_SEALING 0x0002U
> > #define MFD_HUGETLB 0x0004U
> > +#define MFD_SECRET 0x0008U
> > +
> > +/* ioctls for secret memory */
> > +#define MFD_SECRET_IOCTL '-'
> > +#define MFD_SECRET_EXCLUSIVE _IOW(MFD_SECRET_IOCTL, 0x13, unsigned long)
> > +#define MFD_SECRET_UNCACHED _IOW(MFD_SECRET_IOCTL, 0x14, unsigned long)
> >
> > /*
> > * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index f2104cc0d35c..20dfcc54cc7a 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -872,4 +872,7 @@ config ARCH_HAS_HUGEPD
> > config MAPPING_DIRTY_HELPERS
> > bool
> >
> > +config MEMFD_SECRETMEM
> > + def_bool MEMFD_CREATE && ARCH_HAS_SET_DIRECT_MAP
> > +
> > endmenu
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 6e9d46b2efc9..a9459c8a655a 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -121,3 +121,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
> > obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
> > obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
> > obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
> > +obj-$(CONFIG_MEMFD_SECRETMEM) += secretmem.o
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 2647c898990c..3e1cc37e0389 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -245,7 +245,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
> > #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> > #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
> >
> > -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> > +#define MFD_SECRET_MASK (MFD_CLOEXEC | MFD_SECRET)
> > +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_SECRET)
> >
> > SYSCALL_DEFINE2(memfd_create,
> > const char __user *, uname,
> > @@ -257,6 +258,9 @@ SYSCALL_DEFINE2(memfd_create,
> > char *name;
> > long len;
> >
> > + if (flags & ~(unsigned int)MFD_SECRET_MASK)
> > + return -EINVAL;
> > +
>
> Didn't you just break MFD_ALLOW_SEALING and MFD_HUGETLB with this?
> I guess the check has to go under an 'if (flags & MFD_SECRET)' check, no?
Apparently I did. Thanks!
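Something along these lines should keep the old behaviour for the other
flags (an untested sketch for the next version):

	if (flags & MFD_SECRET) {
		if (flags & ~MFD_SECRET_MASK)
			return -EINVAL;
	} else if (!(flags & MFD_HUGETLB)) {
		if (flags & ~MFD_ALL_FLAGS)
			return -EINVAL;
	} else {
		/* Allow huge page size encoding in flags. */
		if (flags & ~(MFD_ALL_FLAGS |
				(MFD_HUGE_MASK << MFD_HUGE_SHIFT)))
			return -EINVAL;
	}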
> And the (unsigned int) cast looks redundant to me.
>
> > if (!(flags & MFD_HUGETLB)) {
> > if (flags & ~(unsigned int)MFD_ALL_FLAGS)
> > return -EINVAL;
> > @@ -296,7 +300,9 @@ SYSCALL_DEFINE2(memfd_create,
> > goto err_name;
> > }
> >
> > - if (flags & MFD_HUGETLB) {
> > + if (flags & MFD_SECRET) {
> > + file = secretmem_file_create(name, flags);
> > + } else if (flags & MFD_HUGETLB) {
> > struct user_struct *user = NULL;
> >
> > file = hugetlb_file_setup(name, 0, VM_NORESERVE, &user,
> > diff --git a/mm/secretmem.c b/mm/secretmem.c
> > new file mode 100644
> > index 000000000000..df8f8c958cc2
> > --- /dev/null
> > +++ b/mm/secretmem.c
> > @@ -0,0 +1,247 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/mm.h>
> > +#include <linux/fs.h>
> > +#include <linux/mount.h>
> > +#include <linux/memfd.h>
> > +#include <linux/printk.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/set_memory.h>
> > +#include <linux/sched/signal.h>
> > +
> > +#include <uapi/linux/memfd.h>
> > +#include <uapi/linux/magic.h>
> > +
> > +#include <asm/tlbflush.h>
> > +
> > +#include "internal.h"
> > +
> > +#undef pr_fmt
> > +#define pr_fmt(fmt) "secretmem: " fmt
> > +
> > +#define SECRETMEM_EXCLUSIVE 0x1
> > +#define SECRETMEM_UNCACHED 0x2
> > +
> > +struct secretmem_ctx {
> > + unsigned int mode;
> > +};
> > +
> > +static struct page *secretmem_alloc_page(gfp_t gfp)
> > +{
> > + /*
> > + * FIXME: use a cache of large pages to reduce the direct map
> > + * fragmentation
> > + */
> > + return alloc_page(gfp);
> > +}
> > +
> > +static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > +{
> > + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> > + struct inode *inode = file_inode(vmf->vma->vm_file);
> > + pgoff_t offset = vmf->pgoff;
> > + unsigned long addr;
> > + struct page *page;
> > + int ret = 0;
> > +
> > + if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> > + return vmf_error(-EINVAL);
> > +
> > + page = find_get_entry(mapping, offset);
> > + if (!page) {
> > + page = secretmem_alloc_page(vmf->gfp_mask);
> > + if (!page)
> > + return vmf_error(-ENOMEM);
> > +
> > + ret = add_to_page_cache_lru(page, mapping, offset, vmf->gfp_mask);
> > + if (unlikely(ret))
> > + goto err_put_page;
>
> What's the reason to add it to the LRU? These pages are never evictable. Do
> we have some PageLRU() check that needs to be satisfied, or what?
Hmm, most probably not. I'll recheck and replace it with
add_to_page_cache().
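The change would be confined to the fault path, something along these
lines (untested):

	page = secretmem_alloc_page(vmf->gfp_mask);
	if (!page)
		return vmf_error(-ENOMEM);

	/* insert into the page cache without putting the page on the LRU */
	ret = add_to_page_cache(page, mapping, offset, vmf->gfp_mask);
	if (unlikely(ret))
		goto err_put_page;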
> > +
> > + ret = set_direct_map_invalid_noflush(page);
> > + if (ret)
> > + goto err_del_page_cache;
> > +
> > + addr = (unsigned long)page_address(page);
> > + flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> > +
> > + __SetPageUptodate(page);
> > +
> > + ret = VM_FAULT_LOCKED;
> > + }
> > +
> > + vmf->page = page;
> > + return ret;
> > +
> > +err_del_page_cache:
> > + delete_from_page_cache(page);
> > +err_put_page:
> > + put_page(page);
> > + return vmf_error(ret);
> > +}
> > +
> > +static const struct vm_operations_struct secretmem_vm_ops = {
> > + .fault = secretmem_fault,
> > +};
> > +
> > +static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > + struct secretmem_ctx *ctx = file->private_data;
> > + unsigned long mode = ctx->mode;
> > + unsigned long len = vma->vm_end - vma->vm_start;
> > +
> > + if (!mode)
> > + return -EINVAL;
> > +
> > + if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
> > + return -EAGAIN;
> > +
> > + switch (mode) {
> > + case SECRETMEM_UNCACHED:
> > + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > + fallthrough;
> > + case SECRETMEM_EXCLUSIVE:
> > + vma->vm_ops = &secretmem_vm_ops;
> > + break;
> > + default:
> > + return -EINVAL;
> > + }
> > +
> > + vma->vm_flags |= VM_LOCKED;
> > +
> > + return 0;
> > +}
> > +
> > +static long secretmem_ioctl(struct file *file, unsigned cmd, unsigned long arg)
> > +{
> > + struct secretmem_ctx *ctx = file->private_data;
> > + unsigned long mode = ctx->mode;
> > +
> > + if (mode)
> > + return -EINVAL;
> > +
> > + switch (cmd) {
> > + case MFD_SECRET_EXCLUSIVE:
> > + mode = SECRETMEM_EXCLUSIVE;
> > + break;
> > + case MFD_SECRET_UNCACHED:
> > + mode = SECRETMEM_UNCACHED;
> > + break;
> > + default:
> > + return -EINVAL;
> > + }
> > +
> > + ctx->mode = mode;
> > +
> > + return 0;
> > +}
> > +
> > +const struct file_operations secretmem_fops = {
> > + .mmap = secretmem_mmap,
> > + .unlocked_ioctl = secretmem_ioctl,
> > + .compat_ioctl = secretmem_ioctl,
> > +};
> > +
> > +static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
> > +{
> > + return false;
> > +}
> > +
> > +static int secretmem_migratepage(struct address_space *mapping,
> > + struct page *newpage, struct page *page,
> > + enum migrate_mode mode)
> > +{
> > + return -EBUSY;
> > +}
> > +
> > +static void secretmem_freepage(struct page *page)
> > +{
> > + set_direct_map_default_noflush(page);
> > +}
> > +
> > +static const struct address_space_operations secretmem_aops = {
> > + .freepage = secretmem_freepage,
> > + .migratepage = secretmem_migratepage,
> > + .isolate_page = secretmem_isolate_page,
> > +};
> > +
> > +static struct vfsmount *secretmem_mnt;
> > +
> > +struct file *secretmem_file_create(const char *name, unsigned int flags)
> > +{
> > + struct inode *inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
> > + struct file *file = ERR_PTR(-ENOMEM);
> > + struct secretmem_ctx *ctx;
> > +
> > + if (IS_ERR(inode))
> > + return ERR_CAST(inode);
> > +
> > + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> > + if (!ctx)
> > + goto err_free_inode;
> > +
> > + file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
> > + O_RDWR, &secretmem_fops);
> > + if (IS_ERR(file))
> > + goto err_free_ctx;
> > +
> > + mapping_set_unevictable(inode->i_mapping);
> > +
> > + inode->i_mapping->private_data = ctx;
> > + inode->i_mapping->a_ops = &secretmem_aops;
> > +
> > + /* pretend we are a normal file with zero size */
> > + inode->i_mode |= S_IFREG;
> > + inode->i_size = 0;
> > +
> > + file->private_data = ctx;
> > +
> > + return file;
> > +
> > +err_free_ctx:
> > + kfree(ctx);
> > +err_free_inode:
> > + iput(inode);
> > + return file;
> > +}
> > +
> > +static void secretmem_evict_inode(struct inode *inode)
> > +{
> > + struct secretmem_ctx *ctx = inode->i_private;
> > +
> > + truncate_inode_pages_final(&inode->i_data);
> > + clear_inode(inode);
> > + kfree(ctx);
> > +}
> > +
> > +static const struct super_operations secretmem_super_ops = {
> > + .evict_inode = secretmem_evict_inode,
> > +};
> > +
> > +static int secretmem_init_fs_context(struct fs_context *fc)
> > +{
> > + struct pseudo_fs_context *ctx = init_pseudo(fc, SECRETMEM_MAGIC);
> > +
> > + if (!ctx)
> > + return -ENOMEM;
> > + ctx->ops = &secretmem_super_ops;
> > +
> > + return 0;
> > +}
> > +
> > +static struct file_system_type secretmem_fs = {
> > + .name = "secretmem",
> > + .init_fs_context = secretmem_init_fs_context,
> > + .kill_sb = kill_anon_super,
> > +};
> > +
> > +static int secretmem_init(void)
> > +{
> > + int ret = 0;
> > +
> > + secretmem_mnt = kern_mount(&secretmem_fs);
> > + if (IS_ERR(secretmem_mnt))
> > + ret = PTR_ERR(secretmem_mnt);
> > +
> > + return ret;
> > +}
> > +fs_initcall(secretmem_init);
> > --
> > 2.26.2
> >
>
> --
> Kirill A. Shutemov
--
Sincerely yours,
Mike.
On Mon, Jul 13, 2020 at 02:05:05PM +0300, Kirill A. Shutemov wrote:
> On Mon, Jul 06, 2020 at 08:20:50PM +0300, Mike Rapoport wrote:
> > From: Mike Rapoport <[email protected]>
> >
> > Removing a PAGE_SIZE page from the direct map every time such page is
> > allocated for a secret memory mapping will cause severe fragmentation of
> > the direct map. This fragmentation can be reduced by using PMD-size pages
> > as a pool for small pages for secret memory mappings.
> >
> > Add a gen_pool per secretmem inode and lazily populate this pool with
> > PMD-size pages.
> >
> > Signed-off-by: Mike Rapoport <[email protected]>
> > ---
> > mm/secretmem.c | 107 ++++++++++++++++++++++++++++++++++++++++---------
> > 1 file changed, 88 insertions(+), 19 deletions(-)
> >
> > diff --git a/mm/secretmem.c b/mm/secretmem.c
> > index df8f8c958cc2..c6fcf6d76951 100644
> > --- a/mm/secretmem.c
> > +++ b/mm/secretmem.c
> > @@ -5,6 +5,7 @@
> > #include <linux/memfd.h>
> > #include <linux/printk.h>
> > #include <linux/pagemap.h>
> > +#include <linux/genalloc.h>
> > #include <linux/pseudo_fs.h>
> > #include <linux/set_memory.h>
> > #include <linux/sched/signal.h>
> > @@ -23,24 +24,66 @@
> > #define SECRETMEM_UNCACHED 0x2
> >
> > struct secretmem_ctx {
> > + struct gen_pool *pool;
> > unsigned int mode;
> > };
> >
> > -static struct page *secretmem_alloc_page(gfp_t gfp)
> > +static int secretmem_pool_increase(struct secretmem_ctx *ctx, gfp_t gfp)
> > {
> > - /*
> > - * FIXME: use a cache of large pages to reduce the direct map
> > - * fragmentation
> > - */
> > - return alloc_page(gfp);
> > + unsigned long nr_pages = (1 << HPAGE_PMD_ORDER);
> > + struct gen_pool *pool = ctx->pool;
> > + unsigned long addr;
> > + struct page *page;
> > + int err;
> > +
> > + page = alloc_pages(gfp, HPAGE_PMD_ORDER);
> > + if (!page)
> > + return -ENOMEM;
> > +
> > + addr = (unsigned long)page_address(page);
> > + split_page(page, HPAGE_PMD_ORDER);
> > +
> > + err = gen_pool_add(pool, addr, HPAGE_PMD_SIZE, NUMA_NO_NODE);
> > + if (err) {
> > + __free_pages(page, HPAGE_PMD_ORDER);
> > + return err;
> > + }
> > +
> > + __kernel_map_pages(page, nr_pages, 0);
>
> It's worth noting that unlike flush_tlb_kernel_range(),
> __kernel_map_pages() only flushes the local TLB, so other CPUs may still have
> access to the page. It shouldn't be a blocker, but it deserves a comment.
Sure.
> > +
> > + return 0;
> > +}
> > +
--
Sincerely yours,
Mike.
Hi!
> This is a second version of "secret" mappings implementation backed by a
> file descriptor.
>
> The file descriptor is created using memfd_create() syscall with a new
> MFD_SECRET flag. The file descriptor should be configured using ioctl() to
> define the desired protection and then mmap() of the fd will create a
> "secret" memory mapping. The pages in that mapping will be marked as not
> present in the direct map and will have desired protection bits set in the
> user page table. For instance, current implementation allows uncached
> mappings.
>
> Hiding secret memory mappings behind an anonymous file allows (ab)use of
> the page cache for tracking pages allocated for the "secret" mappings as
> well as using address_space_operations for e.g. page migration callbacks.
>
> The anonymous file may be also used implicitly, like hugetlb files, to
> implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
> ABIs.
I believe unix userspace normally requires mappings to be... well... protected from
other users. How is this "secret" thing different? How do you explain the difference
to userland programmers?
Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Fri, 2020-07-17 at 10:36 +0200, Pavel Machek wrote:
> Hi!
>
> > This is a second version of "secret" mappings implementation backed
> > by a file descriptor.
> >
> > The file descriptor is created using memfd_create() syscall with a
> > new MFD_SECRET flag. The file descriptor should be configured using
> > ioctl() to define the desired protection and then mmap() of the fd
> > will create a "secret" memory mapping. The pages in that mapping
> > will be marked as not present in the direct map and will have
> > desired protection bits set in the user page table. For instance,
> > current implementation allows uncached mappings.
> >
> > Hiding secret memory mappings behind an anonymous file allows
> > (ab)use of the page cache for tracking pages allocated for the
> > "secret" mappings as well as using address_space_operations for
> > e.g. page migration callbacks.
> >
> > The anonymous file may be also used implicitly, like hugetlb files,
> > to implement mmap(MAP_SECRET) and use the secret memory areas with
> > "native" mm ABIs.
>
> I believe unix userspace normally requires mappings to be... well...
> protected from other users. How is this "secret" thing different? How
> do you explain the difference to userland programmers?
That's true in the normal case, but for the container cloud the threat
model we're considering is another, hostile tenant trying to trick the
kernel into giving them access to your mappings. In the FOSDEM talk we
gave about this:
https://fosdem.org/2020/schedule/event/kernel_address_space_isolation/
We demonstrated the case where the hostile tenant obtained host root
and then tried to get access via ptrace. The point is that pushing
the pages out of the direct map means that even root can't get access
to the secret by any means the OS provides. If you want to play with
this yourself, we have a userspace library:
https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git/
It does two things: the first is to act as a preloader for OpenSSL,
redirecting all the OPENSSL_malloc() calls to secret memory so that any
secret keys get automatically protected; the second is to expose the API
to the user who needs it. I anticipate that a lot of the use cases would
be like the OpenSSL one: many toolkits that deal with secret keys already
have special handling for this memory to try to give it greater
protection, so this would simply be pluggable into the toolkits without
any need for user application modification.
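For illustration, the allocation path of such a preloader against the API
in this RFC could be as simple as the following (a much-simplified,
hypothetical sketch; the real library adds pooling and error reporting):

	#include <sys/mman.h>
	#include <sys/ioctl.h>
	#include <unistd.h>
	#include <linux/memfd.h>

	static void *secret_alloc(size_t size)
	{
		int fd = memfd_create("secret", MFD_SECRET);
		void *ptr;

		if (fd < 0)
			return NULL;

		/* pick the protection mode, then size the backing file */
		if (ioctl(fd, MFD_SECRET_EXCLUSIVE) < 0 ||
		    ftruncate(fd, size) < 0) {
			close(fd);
			return NULL;
		}

		ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		close(fd);
		return ptr == MAP_FAILED ? NULL : ptr;
	}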
James