From: Mike Rapoport <[email protected]>
Hi,
This is an implementation of "secret" mappings backed by a file descriptor.
I've dropped the boot time reservation patch for now as it is not strictly
required for the basic usage and can be easily added later either with or
without CMA.
v6 changes:
* Silence the warning about missing syscall, thanks to Qian Cai
* Replace spaces with tabs in Kconfig additions, per Randy
* Add a selftest.
v5 changes:
* rebase on v5.9-rc5
* drop boot time memory reservation patch
v4 changes:
* rebase on v5.9-rc1
* Do not redefine PMD_PAGE_ORDER in fs/dax.c, thanks Kirill
* Make secret mappings exclusive by default and only require flags to
memfd_secret() system call for uncached mappings, thanks again Kirill :)
v3 changes:
* Squash kernel-parameters.txt update into the commit that added the
command line option.
* Make uncached mode explicitly selectable by architectures. For now enable
it only on x86.
v2 changes:
* Follow Michael's suggestion and name the new system call 'memfd_secret'
* Add kernel-parameters documentation about the boot option
* Fix i386-tinyconfig regression reported by the kbuild bot.
CONFIG_SECRETMEM now depends on !EMBEDDED to disable it on small systems
from one side and still make it available unconditionally on
architectures that support SET_DIRECT_MAP.
The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call The desired protection mode for the
memory is configured using flags parameter of the system call. The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping. The pages in that mapping will be marked as not present in
the direct map and will have desired protection bits set in the user page
table. For instance, current implementation allows uncached mappings.
Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants
mappings.
Additionally, the secret mappings may be used as a mean to protect guest
memory in a virtual machine host.
For demonstration of secret memory usage we've created a userspace library
[1] that does two things: the first is act as a preloader for openssl to
redirect all the OPENSSL_malloc calls to secret memory meaning any secret
keys get automatically protected this way and the other thing it does is
expose the API to the user who needs it. We anticipate that a lot of the
use cases would be like the openssl one: many toolkits that deal with
secret keys already have special handling for the memory to try to give
them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.
I've hesitated whether to continue to use new flags to memfd_create() or to
add a new system call and I've decided to use a new system call after I've
started to look into man pages update. There would have been two completely
independent descriptions and I think it would have been very confusing.
Hiding secret memory mappings behind an anonymous file allows (ab)use of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.
The anonymous file may be also used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs in the future.
As the fragmentation of the direct map was one of the major concerns raised
during the previous postings, I've added an amortizing cache of PMD-size
pages to each file descriptor that is used as an allocation pool for the
secret memory areas.
v5: https://lore.kernel.org/lkml/[email protected]
v4: https://lore.kernel.org/lkml/[email protected]
v3: https://lore.kernel.org/lkml/[email protected]
v2: https://lore.kernel.org/lkml/[email protected]
v1: https://lore.kernel.org/lkml/[email protected]
Mike Rapoport (6):
mm: add definition of PMD_PAGE_ORDER
mmap: make mlock_future_check() global
mm: introduce memfd_secret system call to create "secret" memory areas
arch, mm: wire up memfd_secret system call were relevant
mm: secretmem: use PMD-size pages to amortize direct map fragmentation
secretmem: test: add basic selftest for memfd_secret(2)
arch/Kconfig | 7 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/arm64/include/uapi/asm/unistd.h | 1 +
arch/riscv/include/asm/unistd.h | 1 +
arch/x86/Kconfig | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/dax.c | 11 +-
include/linux/pgtable.h | 3 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 7 +-
include/uapi/linux/magic.h | 1 +
include/uapi/linux/secretmem.h | 8 +
kernel/sys_ni.c | 2 +
mm/Kconfig | 4 +
mm/Makefile | 1 +
mm/internal.h | 3 +
mm/mmap.c | 5 +-
mm/secretmem.c | 333 ++++++++++++++++++++++
scripts/checksyscalls.sh | 4 +
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 3 +-
tools/testing/selftests/vm/memfd_secret.c | 296 +++++++++++++++++++
tools/testing/selftests/vm/run_vmtests | 17 ++
25 files changed, 703 insertions(+), 13 deletions(-)
create mode 100644 include/uapi/linux/secretmem.h
create mode 100644 mm/secretmem.c
create mode 100644 tools/testing/selftests/vm/memfd_secret.c
--
2.28.0
From: Mike Rapoport <[email protected]>
It will be used by the upcoming secret memory implementation.
Signed-off-by: Mike Rapoport <[email protected]>
---
mm/internal.h | 3 +++
mm/mmap.c | 5 ++---
2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 10c677655912..40544fbf49c9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -350,6 +350,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
extern void mlock_vma_page(struct page *page);
extern unsigned int munlock_vma_page(struct page *page);
+extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+ unsigned long len);
+
/*
* Clear the page's PageMlocked(). This can be useful in a situation where
* we want to unconditionally remove a page from the pagecache -- e.g.,
diff --git a/mm/mmap.c b/mm/mmap.c
index 40248d84ad5f..190761920142 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1310,9 +1310,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
return hint;
}
-static inline int mlock_future_check(struct mm_struct *mm,
- unsigned long flags,
- unsigned long len)
+int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+ unsigned long len)
{
unsigned long locked, lock_limit;
--
2.28.0
From: Mike Rapoport <[email protected]>
Introduce "memfd_secret" system call with the ability to create memory
areas visible only in the context of the owning process and not mapped not
only to other processes but in the kernel page tables as well.
The user will create a file descriptor using the memfd_secret() system call
where flags supplied as a parameter to this system call will define the
desired protection mode for the memory associated with that file
descriptor.
Currently there are two protection modes:
* exclusive - the memory area is unmapped from the kernel direct map and it
is present only in the page tables of the owning mm.
* uncached - the memory area is present only in the page tables of the
owning mm and it is mapped there as uncached.
The "exclusive" mode is enabled implicitly and it is the default mode for
memfd_secret().
The "uncached" mode requires architecture support and an architecture
should opt-in for this mode using HAVE_SECRETMEM_UNCACHED configuration
option.
For instance, the following example will create an uncached mapping (error
handling is omitted):
fd = memfd_secret(SECRETMEM_UNCACHED);
ftruncate(fd, MAP_SIZE);
ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
Signed-off-by: Mike Rapoport <[email protected]>
---
arch/Kconfig | 7 +
arch/x86/Kconfig | 1 +
include/uapi/linux/magic.h | 1 +
include/uapi/linux/secretmem.h | 8 +
kernel/sys_ni.c | 2 +
mm/Kconfig | 4 +
mm/Makefile | 1 +
mm/secretmem.c | 264 +++++++++++++++++++++++++++++++++
8 files changed, 288 insertions(+)
create mode 100644 include/uapi/linux/secretmem.h
create mode 100644 mm/secretmem.c
diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..d3f11b2d03e8 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -975,6 +975,13 @@ config HAVE_SPARSE_SYSCALL_NR
config ARCH_HAS_VDSO_DATA
bool
+config HAVE_SECRETMEM_UNCACHED
+ bool
+ help
+ An architecture can select this if its semantics of non-cached
+ mappings can be used to prevent speculative loads and it is
+ useful for secret protection.
+
source "kernel/gcov/Kconfig"
source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..38ead8bd9909 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -220,6 +220,7 @@ config X86
select HAVE_UNSTABLE_SCHED_CLOCK
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
+ select HAVE_SECRETMEM_UNCACHED
select HOTPLUG_SMT if SMP
select IRQ_FORCED_THREADING
select NEED_SG_DMA_LENGTH
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
#define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
#define Z3FOLD_MAGIC 0x33
#define PPC_CMM_MAGIC 0xc7571590
+#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
#endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/secretmem.h b/include/uapi/linux/secretmem.h
new file mode 100644
index 000000000000..2b9675f5dea9
--- /dev/null
+++ b/include/uapi/linux/secretmem.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_SECRERTMEM_H
+#define _UAPI_LINUX_SECRERTMEM_H
+
+/* secretmem operation modes */
+#define SECRETMEM_UNCACHED 0x1
+
+#endif /* _UAPI_LINUX_SECRERTMEM_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 4d59775ea79c..8ae8d0c2d381 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -349,6 +349,8 @@ COND_SYSCALL(pkey_mprotect);
COND_SYSCALL(pkey_alloc);
COND_SYSCALL(pkey_free);
+/* memfd_secret */
+COND_SYSCALL(memfd_secret);
/*
* Architecture specific weak syscall entries.
diff --git a/mm/Kconfig b/mm/Kconfig
index 6c974888f86f..d2fc73ccc183 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -868,4 +868,8 @@ config ARCH_HAS_HUGEPD
config MAPPING_DIRTY_HELPERS
bool
+config SECRETMEM
+ def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
+ select GENERIC_ALLOCATOR
+
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d5649f1c12c0..cae063dc8298 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -121,3 +121,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_SECRETMEM) += secretmem.o
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..3293f761076e
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,264 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2020
+ *
+ * Author: Mike Rapoport <[email protected]>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/memfd.h>
+#include <linux/bitops.h>
+#include <linux/printk.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/pseudo_fs.h>
+#include <linux/set_memory.h>
+#include <linux/sched/signal.h>
+
+#include <uapi/linux/secretmem.h>
+#include <uapi/linux/magic.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+/*
+ * Secret memory areas are always exclusive to owning mm and they are
+ * removed from the direct map.
+ */
+#ifdef CONFIG_HAVE_SECRETMEM_UNCACHED
+#define SECRETMEM_MODE_MASK (SECRETMEM_UNCACHED)
+#else
+#define SECRETMEM_MODE_MASK (0x0)
+#endif
+
+#define SECRETMEM_FLAGS_MASK SECRETMEM_MODE_MASK
+
+struct secretmem_ctx {
+ unsigned int mode;
+};
+
+static struct page *secretmem_alloc_page(gfp_t gfp)
+{
+ /*
+ * FIXME: use a cache of large pages to reduce the direct map
+ * fragmentation
+ */
+ return alloc_page(gfp);
+}
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ pgoff_t offset = vmf->pgoff;
+ unsigned long addr;
+ struct page *page;
+ int ret = 0;
+
+ if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+ return vmf_error(-EINVAL);
+
+ page = find_get_entry(mapping, offset);
+ if (!page) {
+ page = secretmem_alloc_page(vmf->gfp_mask);
+ if (!page)
+ return vmf_error(-ENOMEM);
+
+ ret = add_to_page_cache(page, mapping, offset, vmf->gfp_mask);
+ if (unlikely(ret))
+ goto err_put_page;
+
+ ret = set_direct_map_invalid_noflush(page);
+ if (ret)
+ goto err_del_page_cache;
+
+ addr = (unsigned long)page_address(page);
+ flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+ __SetPageUptodate(page);
+
+ ret = VM_FAULT_LOCKED;
+ }
+
+ vmf->page = page;
+ return ret;
+
+err_del_page_cache:
+ delete_from_page_cache(page);
+err_put_page:
+ put_page(page);
+ return vmf_error(ret);
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+ .fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct secretmem_ctx *ctx = file->private_data;
+ unsigned long len = vma->vm_end - vma->vm_start;
+
+ if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+ return -EINVAL;
+
+ if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+ return -EAGAIN;
+
+ if (ctx->mode & SECRETMEM_UNCACHED)
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+ vma->vm_ops = &secretmem_vm_ops;
+ vma->vm_flags |= VM_LOCKED;
+
+ return 0;
+}
+
+const struct file_operations secretmem_fops = {
+ .mmap = secretmem_mmap,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+ return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode)
+{
+ return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+ set_direct_map_default_noflush(page);
+}
+
+static const struct address_space_operations secretmem_aops = {
+ .freepage = secretmem_freepage,
+ .migratepage = secretmem_migratepage,
+ .isolate_page = secretmem_isolate_page,
+};
+
+static struct vfsmount *secretmem_mnt;
+
+static struct file *secretmem_file_create(unsigned long flags)
+{
+ struct file *file = ERR_PTR(-ENOMEM);
+ struct secretmem_ctx *ctx;
+ struct inode *inode;
+
+ inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+ if (IS_ERR(inode))
+ return ERR_CAST(inode);
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ goto err_free_inode;
+
+ file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+ O_RDWR, &secretmem_fops);
+ if (IS_ERR(file))
+ goto err_free_ctx;
+
+ mapping_set_unevictable(inode->i_mapping);
+
+ inode->i_mapping->private_data = ctx;
+ inode->i_mapping->a_ops = &secretmem_aops;
+
+ /* pretend we are a normal file with zero size */
+ inode->i_mode |= S_IFREG;
+ inode->i_size = 0;
+
+ file->private_data = ctx;
+
+ ctx->mode = flags & SECRETMEM_MODE_MASK;
+
+ return file;
+
+err_free_ctx:
+ kfree(ctx);
+err_free_inode:
+ iput(inode);
+ return file;
+}
+
+SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
+{
+ struct file *file;
+ int fd, err;
+
+ /* make sure local flags do not confict with global fcntl.h */
+ BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC);
+
+ if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
+ return -EINVAL;
+
+ fd = get_unused_fd_flags(flags & O_CLOEXEC);
+ if (fd < 0)
+ return fd;
+
+ file = secretmem_file_create(flags);
+ if (IS_ERR(file)) {
+ err = PTR_ERR(file);
+ goto err_put_fd;
+ }
+
+ file->f_flags |= O_LARGEFILE;
+
+ fd_install(fd, file);
+ return fd;
+
+err_put_fd:
+ put_unused_fd(fd);
+ return err;
+}
+
+static void secretmem_evict_inode(struct inode *inode)
+{
+ struct secretmem_ctx *ctx = inode->i_private;
+
+ truncate_inode_pages_final(&inode->i_data);
+ clear_inode(inode);
+ kfree(ctx);
+}
+
+static const struct super_operations secretmem_super_ops = {
+ .evict_inode = secretmem_evict_inode,
+};
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+ struct pseudo_fs_context *ctx = init_pseudo(fc, SECRETMEM_MAGIC);
+
+ if (!ctx)
+ return -ENOMEM;
+ ctx->ops = &secretmem_super_ops;
+
+ return 0;
+}
+
+static struct file_system_type secretmem_fs = {
+ .name = "secretmem",
+ .init_fs_context = secretmem_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+ int ret = 0;
+
+ secretmem_mnt = kern_mount(&secretmem_fs);
+ if (IS_ERR(secretmem_mnt))
+ ret = PTR_ERR(secretmem_mnt);
+
+ return ret;
+}
+fs_initcall(secretmem_init);
--
2.28.0
From: Mike Rapoport <[email protected]>
Wire up memfd_secret system call on architectures that define
ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86.
Signed-off-by: Mike Rapoport <[email protected]>
Acked-by: Palmer Dabbelt <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
---
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 ++
arch/arm64/include/uapi/asm/unistd.h | 1 +
arch/riscv/include/asm/unistd.h | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 7 ++++++-
scripts/checksyscalls.sh | 4 ++++
9 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 3b859596840d..b3b2019f8d16 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 440
+#define __NR_compat_syscalls 441
#endif
#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 734860ac7cf9..ce0838fc7a5c 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -887,6 +887,8 @@ __SYSCALL(__NR_openat2, sys_openat2)
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
#define __NR_faccessat2 439
__SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_memfd_secret 440
+__SYSCALL(__NR_memfd_secret, sys_memfd_secret)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/arm64/include/uapi/asm/unistd.h b/arch/arm64/include/uapi/asm/unistd.h
index f83a70e07df8..ce2ee8f1e361 100644
--- a/arch/arm64/include/uapi/asm/unistd.h
+++ b/arch/arm64/include/uapi/asm/unistd.h
@@ -20,5 +20,6 @@
#define __ARCH_WANT_SET_GET_RLIMIT
#define __ARCH_WANT_TIME32_SYSCALLS
#define __ARCH_WANT_SYS_CLONE3
+#define __ARCH_WANT_MEMFD_SECRET
#include <asm-generic/unistd.h>
diff --git a/arch/riscv/include/asm/unistd.h b/arch/riscv/include/asm/unistd.h
index 977ee6181dab..6c316093a1e5 100644
--- a/arch/riscv/include/asm/unistd.h
+++ b/arch/riscv/include/asm/unistd.h
@@ -9,6 +9,7 @@
*/
#define __ARCH_WANT_SYS_CLONE
+#define __ARCH_WANT_MEMFD_SECRET
#include <uapi/asm/unistd.h>
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 9d1102873666..e7a58a360732 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -444,3 +444,4 @@
437 i386 openat2 sys_openat2
438 i386 pidfd_getfd sys_pidfd_getfd
439 i386 faccessat2 sys_faccessat2
+440 i386 memfd_secret sys_memfd_secret
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f30d6ae9a688..635d7aa2bb9a 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -361,6 +361,7 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common memfd_secret sys_memfd_secret
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 75ac7f8ae93c..78afb99c6892 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1006,6 +1006,7 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
siginfo_t __user *info,
unsigned int flags);
asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
+asmlinkage long sys_memfd_secret(unsigned long flags);
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 995b36c2ea7d..d063e37dbb4a 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -860,8 +860,13 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
#define __NR_faccessat2 439
__SYSCALL(__NR_faccessat2, sys_faccessat2)
+#ifdef __ARCH_WANT_MEMFD_SECRET
+#define __NR_memfd_secret 440
+__SYSCALL(__NR_memfd_secret, sys_memfd_secret)
+#endif
+
#undef __NR_syscalls
-#define __NR_syscalls 440
+#define __NR_syscalls 441
/*
* 32 bit systems traditionally used different
diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh
index a18b47695f55..b7609958ee36 100755
--- a/scripts/checksyscalls.sh
+++ b/scripts/checksyscalls.sh
@@ -40,6 +40,10 @@ cat << EOF
#define __IGNORE_setrlimit /* setrlimit */
#endif
+#ifndef __ARCH_WANT_MEMFD_SECRET
+#define __IGNORE_memfd_secret
+#endif
+
/* Missing flags argument */
#define __IGNORE_renameat /* renameat2 */
--
2.28.0
From: Mike Rapoport <[email protected]>
Removing a PAGE_SIZE page from the direct map every time such page is
allocated for a secret memory mapping will cause severe fragmentation of
the direct map. This fragmentation can be reduced by using PMD-size pages
as a pool for small pages for secret memory mappings.
Add a gen_pool per secretmem inode and lazily populate this pool with
PMD-size pages.
Signed-off-by: Mike Rapoport <[email protected]>
---
mm/secretmem.c | 107 ++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 88 insertions(+), 19 deletions(-)
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 3293f761076e..333eb18fb483 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -12,6 +12,7 @@
#include <linux/bitops.h>
#include <linux/printk.h>
#include <linux/pagemap.h>
+#include <linux/genalloc.h>
#include <linux/syscalls.h>
#include <linux/pseudo_fs.h>
#include <linux/set_memory.h>
@@ -40,24 +41,66 @@
#define SECRETMEM_FLAGS_MASK SECRETMEM_MODE_MASK
struct secretmem_ctx {
+ struct gen_pool *pool;
unsigned int mode;
};
-static struct page *secretmem_alloc_page(gfp_t gfp)
+static int secretmem_pool_increase(struct secretmem_ctx *ctx, gfp_t gfp)
{
- /*
- * FIXME: use a cache of large pages to reduce the direct map
- * fragmentation
- */
- return alloc_page(gfp);
+ unsigned long nr_pages = (1 << PMD_PAGE_ORDER);
+ struct gen_pool *pool = ctx->pool;
+ unsigned long addr;
+ struct page *page;
+ int err;
+
+ page = alloc_pages(gfp, PMD_PAGE_ORDER);
+ if (!page)
+ return -ENOMEM;
+
+ addr = (unsigned long)page_address(page);
+ split_page(page, PMD_PAGE_ORDER);
+
+ err = gen_pool_add(pool, addr, PMD_SIZE, NUMA_NO_NODE);
+ if (err) {
+ __free_pages(page, PMD_PAGE_ORDER);
+ return err;
+ }
+
+ __kernel_map_pages(page, nr_pages, 0);
+
+ return 0;
+}
+
+static struct page *secretmem_alloc_page(struct secretmem_ctx *ctx,
+ gfp_t gfp)
+{
+ struct gen_pool *pool = ctx->pool;
+ unsigned long addr;
+ struct page *page;
+ int err;
+
+ if (gen_pool_avail(pool) < PAGE_SIZE) {
+ err = secretmem_pool_increase(ctx, gfp);
+ if (err)
+ return NULL;
+ }
+
+ addr = gen_pool_alloc(pool, PAGE_SIZE);
+ if (!addr)
+ return NULL;
+
+ page = virt_to_page(addr);
+ get_page(page);
+
+ return page;
}
static vm_fault_t secretmem_fault(struct vm_fault *vmf)
{
+ struct secretmem_ctx *ctx = vmf->vma->vm_file->private_data;
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
struct inode *inode = file_inode(vmf->vma->vm_file);
pgoff_t offset = vmf->pgoff;
- unsigned long addr;
struct page *page;
int ret = 0;
@@ -66,7 +109,7 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
page = find_get_entry(mapping, offset);
if (!page) {
- page = secretmem_alloc_page(vmf->gfp_mask);
+ page = secretmem_alloc_page(ctx, vmf->gfp_mask);
if (!page)
return vmf_error(-ENOMEM);
@@ -74,14 +117,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
if (unlikely(ret))
goto err_put_page;
- ret = set_direct_map_invalid_noflush(page);
- if (ret)
- goto err_del_page_cache;
-
- addr = (unsigned long)page_address(page);
- flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-
__SetPageUptodate(page);
+ set_page_private(page, (unsigned long)ctx);
ret = VM_FAULT_LOCKED;
}
@@ -89,8 +126,6 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
vmf->page = page;
return ret;
-err_del_page_cache:
- delete_from_page_cache(page);
err_put_page:
put_page(page);
return vmf_error(ret);
@@ -138,7 +173,11 @@ static int secretmem_migratepage(struct address_space *mapping,
static void secretmem_freepage(struct page *page)
{
- set_direct_map_default_noflush(page);
+ unsigned long addr = (unsigned long)page_address(page);
+ struct secretmem_ctx *ctx = (struct secretmem_ctx *)page_private(page);
+ struct gen_pool *pool = ctx->pool;
+
+ gen_pool_free(pool, addr, PAGE_SIZE);
}
static const struct address_space_operations secretmem_aops = {
@@ -163,13 +202,18 @@ static struct file *secretmem_file_create(unsigned long flags)
if (!ctx)
goto err_free_inode;
+ ctx->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
+ if (!ctx->pool)
+ goto err_free_ctx;
+
file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
O_RDWR, &secretmem_fops);
if (IS_ERR(file))
- goto err_free_ctx;
+ goto err_free_pool;
mapping_set_unevictable(inode->i_mapping);
+ inode->i_private = ctx;
inode->i_mapping->private_data = ctx;
inode->i_mapping->a_ops = &secretmem_aops;
@@ -183,6 +227,8 @@ static struct file *secretmem_file_create(unsigned long flags)
return file;
+err_free_pool:
+ gen_pool_destroy(ctx->pool);
err_free_ctx:
kfree(ctx);
err_free_inode:
@@ -221,11 +267,34 @@ SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
return err;
}
+static void secretmem_cleanup_chunk(struct gen_pool *pool,
+ struct gen_pool_chunk *chunk, void *data)
+{
+ unsigned long start = chunk->start_addr;
+ unsigned long end = chunk->end_addr;
+ unsigned long nr_pages, addr;
+
+ nr_pages = (end - start + 1) / PAGE_SIZE;
+ __kernel_map_pages(virt_to_page(start), nr_pages, 1);
+
+ for (addr = start; addr < end; addr += PAGE_SIZE)
+ put_page(virt_to_page(addr));
+}
+
+static void secretmem_cleanup_pool(struct secretmem_ctx *ctx)
+{
+ struct gen_pool *pool = ctx->pool;
+
+ gen_pool_for_each_chunk(pool, secretmem_cleanup_chunk, ctx);
+ gen_pool_destroy(pool);
+}
+
static void secretmem_evict_inode(struct inode *inode)
{
struct secretmem_ctx *ctx = inode->i_private;
truncate_inode_pages_final(&inode->i_data);
+ secretmem_cleanup_pool(ctx);
clear_inode(inode);
kfree(ctx);
}
--
2.28.0
From: Mike Rapoport <[email protected]>
The definition of PMD_PAGE_ORDER denoting the number of base pages in the
second-level leaf page is already used by DAX and maybe handy in other
cases as well.
Several architectures already have definition of PMD_ORDER as the size of
second level page table, so to avoid conflict with these definitions use
PMD_PAGE_ORDER name and update DAX respectively.
Signed-off-by: Mike Rapoport <[email protected]>
---
fs/dax.c | 11 ++++-------
include/linux/pgtable.h | 3 +++
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 994ab66a9907..c0b9aa4bda9e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -49,9 +49,6 @@ static inline unsigned int pe_order(enum page_entry_size pe_size)
#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)
#define PG_PMD_NR (PMD_SIZE >> PAGE_SHIFT)
-/* The order of a PMD entry */
-#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
-
static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
static int __init init_dax_wait_table(void)
@@ -98,7 +95,7 @@ static bool dax_is_locked(void *entry)
static unsigned int dax_entry_order(void *entry)
{
if (xa_to_value(entry) & DAX_PMD)
- return PMD_ORDER;
+ return PMD_PAGE_ORDER;
return 0;
}
@@ -1455,7 +1452,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
{
struct vm_area_struct *vma = vmf->vma;
struct address_space *mapping = vma->vm_file->f_mapping;
- XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_ORDER);
+ XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_PAGE_ORDER);
unsigned long pmd_addr = vmf->address & PMD_MASK;
bool write = vmf->flags & FAULT_FLAG_WRITE;
bool sync;
@@ -1514,7 +1511,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
* entry is already in the array, for instance), it will return
* VM_FAULT_FALLBACK.
*/
- entry = grab_mapping_entry(&xas, mapping, PMD_ORDER);
+ entry = grab_mapping_entry(&xas, mapping, PMD_PAGE_ORDER);
if (xa_is_internal(entry)) {
result = xa_to_internal(entry);
goto fallback;
@@ -1680,7 +1677,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
if (order == 0)
ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
#ifdef CONFIG_FS_DAX_PMD
- else if (order == PMD_ORDER)
+ else if (order == PMD_PAGE_ORDER)
ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
#endif
else
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..b0389078df39 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -28,6 +28,9 @@
#define USER_PGTABLES_CEILING 0UL
#endif
+/* Number of base pages in a second level leaf page */
+#define PMD_PAGE_ORDER (PMD_SHIFT - PAGE_SHIFT)
+
/*
* A page table page can be thought of an array like this: pXd_t[PTRS_PER_PxD]
*
--
2.28.0
From: Mike Rapoport <[email protected]>
The test verifies that file descriptor created with memfd_secret does
not allow read/write operations, that secret memory mappings respect
RLIMIT_MEMLOCK and that remote accesses with process_vm_read() and
ptrace() to the secret memory fail.
Signed-off-by: Mike Rapoport <[email protected]>
---
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 3 +-
tools/testing/selftests/vm/memfd_secret.c | 301 ++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests | 17 ++
4 files changed, 321 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/vm/memfd_secret.c
diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 849e8226395a..8a951fed3c3f 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -20,3 +20,4 @@ va_128TBswitch
map_fixed_noreplace
write_to_hugetlbfs
hmm-tests
+memfd_secret
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index a9026706d597..937afee6a8af 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -21,6 +21,7 @@ TEST_GEN_FILES += thuge-gen
TEST_GEN_FILES += transhuge-stress
TEST_GEN_FILES += userfaultfd
TEST_GEN_FILES += khugepaged
+TEST_GEN_FILES += memfd_secret
ifeq ($(ARCH),x86_64)
CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh $(CC) ../x86/trivial_32bit_program.c -m32)
@@ -112,4 +113,4 @@ endif
$(OUTPUT)/userfaultfd: LDLIBS += -lpthread
-$(OUTPUT)/mlock-random-test: LDLIBS += -lcap
+$(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap
diff --git a/tools/testing/selftests/vm/memfd_secret.c b/tools/testing/selftests/vm/memfd_secret.c
new file mode 100644
index 000000000000..81e1a8689241
--- /dev/null
+++ b/tools/testing/selftests/vm/memfd_secret.c
@@ -0,0 +1,301 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2020, Mike Rapoport, IBM Corporation.
+ */
+
+#define _GNU_SOURCE
+#include <sys/uio.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <sys/types.h>
+#include <sys/ptrace.h>
+#include <sys/syscall.h>
+#include <sys/resource.h>
+#include <sys/capability.h>
+
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+
+#include "../kselftest.h"
+
+#define fail(fmt, ...) ksft_test_result_fail(fmt, ##__VA_ARGS__)
+#define pass(fmt, ...) ksft_test_result_pass(fmt, ##__VA_ARGS__)
+#define skip(fmt, ...) ksft_test_result_skip(fmt, ##__VA_ARGS__)
+
+#ifdef __NR_memfd_secret
+
+#include <linux/secretmem.h>
+
+#define PATTERN 0x55
+
+static const int prot = PROT_READ | PROT_WRITE;
+static const int mode = MAP_SHARED;
+
+static unsigned long page_size;
+static unsigned long mlock_limit_cur;
+static unsigned long mlock_limit_max;
+
+static int memfd_secret(unsigned long flags)
+{
+ return syscall(__NR_memfd_secret, flags);
+}
+
+static void test_file_apis(int fd)
+{
+ char buf[64];
+
+ if ((read(fd, buf, sizeof(buf)) >= 0) ||
+ (write(fd, buf, sizeof(buf)) >= 0) ||
+ (pread(fd, buf, sizeof(buf), 0) >= 0) ||
+ (pwrite(fd, buf, sizeof(buf), 0) >= 0))
+ fail("unexpected file IO\n");
+ else
+ pass("file IO is blocked as expected\n");
+}
+
+static void test_mlock_limit(int fd)
+{
+ size_t len;
+ char *mem;
+
+ len = mlock_limit_cur;
+ mem = mmap(NULL, len, prot, mode, fd, 0);
+ if (mem == MAP_FAILED) {
+ fail("unable to mmap secret memory\n");
+ return;
+ }
+ munmap(mem, len);
+
+ len = mlock_limit_max * 2;
+ mem = mmap(NULL, len, prot, mode, fd, 0);
+ if (mem != MAP_FAILED) {
+ fail("unexpected mlock limit violation\n");
+ munmap(mem, len);
+ return;
+ }
+
+ pass("mlock limit is respected\n");
+}
+
+static void try_process_vm_read(int fd, int pipefd[2])
+{
+ struct iovec liov, riov;
+ char buf[64];
+ char *mem;
+
+ if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+ fail("pipe write: %s\n", strerror(errno));
+ exit(KSFT_FAIL);
+ }
+
+ liov.iov_len = riov.iov_len = sizeof(buf);
+ liov.iov_base = buf;
+ riov.iov_base = mem;
+
+ if (process_vm_readv(getppid(), &liov, 1, &riov, 1, 0) < 0) {
+ if (errno == ENOSYS)
+ exit(KSFT_SKIP);
+ exit(KSFT_PASS);
+ }
+
+ exit(KSFT_FAIL);
+}
+
+static void try_ptrace(int fd, int pipefd[2])
+{
+ pid_t ppid = getppid();
+ int status;
+ char *mem;
+ long ret;
+
+ if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+ perror("pipe write");
+ exit(KSFT_FAIL);
+ }
+
+ ret = ptrace(PTRACE_ATTACH, ppid, 0, 0);
+ if (ret) {
+ perror("ptrace_attach");
+ exit(KSFT_FAIL);
+ }
+
+ ret = waitpid(ppid, &status, WUNTRACED);
+ if ((ret != ppid) || !(WIFSTOPPED(status))) {
+ fprintf(stderr, "weird waitppid result %ld stat %x\n",
+ ret, status);
+ exit(KSFT_FAIL);
+ }
+
+ /* this access should fail and the task should be killed */
+ ret = ptrace(PTRACE_PEEKDATA, ppid, mem, 0);
+ if (ret < 0) {
+ perror("ptrace_peek");
+ exit(KSFT_FAIL);
+ }
+
+ /* we shouldn't survive PTRACE_PEEKDATA */
+ exit(KSFT_FAIL);
+}
+
+static void check_child_status(pid_t pid, const char *name)
+{
+ int status;
+
+ waitpid(pid, &status, 0);
+
+ if (WIFEXITED(status) && WEXITSTATUS(status) == KSFT_SKIP) {
+ skip("%s is not supported\n", name);
+ return;
+ }
+
+ if ((WIFEXITED(status) && WEXITSTATUS(status) == KSFT_PASS) ||
+ WIFSIGNALED(status)) {
+ pass("%s failed as expected\n", name);
+ return;
+ }
+
+ fail("%s: unexpected memory access\n", name);
+}
+
+static void test_remote_access(int fd, const char *name,
+ void (*func)(int fd, int pipefd[2]))
+{
+ int pipefd[2];
+ pid_t pid;
+ char *mem;
+
+ if (pipe(pipefd)) {
+ fail("pipe failed: %s\n", strerror(errno));
+ return;
+ }
+
+ pid = fork();
+ if (pid < 0) {
+ fail("fork failed: %s\n", strerror(errno));
+ return;
+ }
+
+ if (pid == 0) {
+ func(fd, pipefd);
+ return;
+ }
+
+ mem = mmap(NULL, page_size, prot, mode, fd, 0);
+ if (mem == MAP_FAILED) {
+ fail("Unable to mmap secret memory\n");
+ return;
+ }
+
+ ftruncate(fd, page_size);
+ memset(mem, PATTERN, page_size);
+
+ if (write(pipefd[1], &mem, sizeof(mem)) < 0) {
+ fail("pipe write: %s\n", strerror(errno));
+ return;
+ }
+
+ check_child_status(pid, name);
+}
+
+static void test_process_vm_read(int fd)
+{
+ test_remote_access(fd, "process_vm_read", try_process_vm_read);
+}
+
+static void test_ptrace(int fd)
+{
+ test_remote_access(fd, "ptrace", try_ptrace);
+}
+
+static int set_cap_limits(rlim_t max)
+{
+ struct rlimit new;
+ cap_t cap = cap_init();
+
+ new.rlim_cur = max;
+ new.rlim_max = max;
+ if (setrlimit(RLIMIT_MEMLOCK, &new)) {
+ perror("setrlimit() returns error");
+ return -1;
+ }
+
+ /* drop capabilities including CAP_IPC_LOCK */
+ if (cap_set_proc(cap)) {
+ perror("cap_set_proc() returns error");
+ return -2;
+ }
+
+ return 0;
+}
+
+static void prepare(void)
+{
+ struct rlimit rlim;
+
+ page_size = sysconf(_SC_PAGE_SIZE);
+ if (!page_size)
+ ksft_exit_fail_msg("Failed to get page size %s\n",
+ strerror(errno));
+
+ if (getrlimit(RLIMIT_MEMLOCK, &rlim))
+ ksft_exit_fail_msg("Unable to detect mlock limit: %s\n",
+ strerror(errno));
+
+ mlock_limit_cur = rlim.rlim_cur;
+ mlock_limit_max = rlim.rlim_max;
+
+ printf("page_size: %ld, mlock.soft: %ld, mlock.hard: %ld\n",
+ page_size, mlock_limit_cur, mlock_limit_max);
+
+ if (page_size > mlock_limit_cur)
+ mlock_limit_cur = page_size;
+ if (page_size > mlock_limit_max)
+ mlock_limit_max = page_size;
+
+ if (set_cap_limits(mlock_limit_max))
+ ksft_exit_fail_msg("Unable to set mlock limit: %s\n",
+ strerror(errno));
+}
+
+#define NUM_TESTS 4
+
+int main(int argc, char *argv[])
+{
+ int fd;
+
+ prepare();
+
+ ksft_print_header();
+ ksft_set_plan(NUM_TESTS);
+
+ fd = memfd_secret(0);
+ if (fd < 0) {
+ if (errno == ENOSYS)
+ ksft_exit_skip("memfd_secret is not supported\n");
+ else
+ ksft_exit_fail_msg("memfd_secret failed: %s\n",
+ strerror(errno));
+ }
+
+ test_mlock_limit(fd);
+ test_file_apis(fd);
+ test_process_vm_read(fd);
+ test_ptrace(fd);
+
+ close(fd);
+
+ ksft_exit(!ksft_get_fail_cnt());
+}
+
+#else /* __NR_memfd_secret */
+
+int main(int argc, char *argv[])
+{
+ printf("skip: skipping memfd_secret test (missing __NR_memfd_secret)\n");
+ return KSFT_SKIP;
+}
+
+#endif /* __NR_memfd_secret */
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index a3f4f30f0a2e..bee7365f3cc2 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -323,4 +323,21 @@ else
exitcode=1
fi
+echo "running memfd_secret test"
+echo "------------------------------------"
+./memfd_secret
+ret_val=$?
+
+if [ $ret_val -eq 0 ]; then
+ echo "[PASS]"
+elif [ $ret_val -eq $ksft_skip ]; then
+ echo "[SKIP]"
+ exitcode=$ksft_skip
+else
+ echo "[FAIL]"
+ exitcode=1
+fi
+
+exit $exitcode
+
exit $exitcode
--
2.28.0
From: Mike Rapoport <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
---
man2/memfd_secret.2 | 176 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 176 insertions(+)
create mode 100644 man2/memfd_secret.2
diff --git a/man2/memfd_secret.2 b/man2/memfd_secret.2
new file mode 100644
index 000000000..e4ecd3662
--- /dev/null
+++ b/man2/memfd_secret.2
@@ -0,0 +1,176 @@
+.\" Copyright (c) 2020, IBM Corporation.
+.\" Written by Mike Rapoport <[email protected]>
+.\"
+.\" Based on memfd_create(2) man page
+.\" Copyright (C) 2014 Michael Kerrisk <[email protected]>
+.\" and Copyright (C) 2014 David Herrmann <[email protected]>
+.\"
+.\" %%%LICENSE_START(GPLv2+)
+.\"
+.\" This program is free software; you can redistribute it and/or modify
+.\" it under the terms of the GNU General Public License as published by
+.\" the Free Software Foundation; either version 2 of the License, or
+.\" (at your option) any later version.
+.\"
+.\" This program is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public
+.\" License along with this manual; if not, see
+.\" <http://www.gnu.org/licenses/>.
+.\" %%%LICENSE_END
+.\"
+.TH MEMFD_SECRET 2 2020-08-02 Linux "Linux Programmer's Manual"
+.SH NAME
+memfd_secret \- create an anonymous file to map secret memory regions
+.SH SYNOPSIS
+.nf
+.B #include <linux/secretmem.h>
+.PP
+.BI "int memfd_secret(unsigned long " flags ");"
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call; see NOTES.
+.SH DESCRIPTION
+.BR memfd_secret ()
+creates an anonymous file and returns a file descriptor that refers to it.
+The file can only be memory-mapped;
+the memory in such mapping
+will have stronger protection than usual memory mapped files,
+and so it can be used to store application secrets.
+Unlike a regular file, a file created with
+.BR memfd_secret ()
+lives in RAM and has a volatile backing storage.
+Once all references to the file are dropped, it is automatically released.
+The initial size of the file is set to 0.
+Following the call, the file size should be set using
+.BR ftruncate (2).
+.PP
+The memory areas obtained with
+.BR mmap (2)
+from the file descriptor are exclusive to the owning context.
+These areas are removed from the kernel page tables
+and only the page table of the process holding the file descriptor
+maps the corresponding physical memory.
+.PP
+The following values may be bitwise ORed in
+.IR flags
+to control the behavior of
+.BR memfd_secret (2):
+.TP
+.BR FD_CLOEXEC
+Set the close-on-exec flag on the new file descriptor.
+See the description of the
+.B O_CLOEXEC
+flag in
+.BR open (2)
+for reasons why this may be useful.
+.PP
+.TP
+.BR SECRETMEM_UNCACHED
+In addition to excluding memory areas from the kernel page tables,
+mark the memory mappings uncached in the page table of the owning process.
+Such mappings can be used to prevent speculative loads
+and cache-based side channels.
+This mode of
+.BR memfd_secret ()
+is not supported on all architectures.
+.PP
+See also NOTES below.
+.PP
+As its return value,
+.BR memfd_secret ()
+returns a new file descriptor that can be used to refer to an anonymous file.
+This file descriptor is opened for both reading and writing
+.RB ( O_RDWR )
+and
+.B O_LARGEFILE
+is set for the file descriptor.
+.PP
+With respect to
+.BR fork (2)
+and
+.BR execve (2),
+the usual semantics apply for the file descriptor created by
+.BR memfd_secret ().
+A copy of the file descriptor is inherited by the child produced by
+.BR fork (2)
+and refers to the same file.
+The file descriptor is preserved across
+.BR execve (2),
+unless the close-on-exec flag has been set.
+.PP
+The memory regions backed with
+.BR memfd_secret ()
+are locked in the same way as
+.BR mlock (2),
+however the implementation will not try to
+populate the whole range during the
+.BR mmap ()
+call.
+The amount of memory allowed for memory mappings
+of the file descriptor obeys the same rules as
+.BR mlock (2)
+and cannot exceed
+.B RLIMIT_MEMLOCK.
+.SH RETURN VALUE
+On success,
+.BR memfd_secret ()
+returns a new file descriptor.
+On error, \-1 is returned and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B ENOSYS
+.BR memfd_secret ()
+is not implemented on this architecture.
+.TP
+.B EINVAL
+.I flags
+included unknown bits.
+.TP
+.B EMFILE
+The per-process limit on the number of open file descriptors has been reached.
+.TP
+.B EMFILE
+The system-wide limit on the total number of open files has been reached.
+.TP
+.B ENOMEM
+There was insufficient memory to create a new anonymous file.
+.SH VERSIONS
+The
+.BR memfd_secret (2)
+system call first appeared in Linux 5.X;
+.SH CONFORMING TO
+The
+.BR memfd_secret (2)
+system call is Linux-specific.
+.SH NOTES
+The
+.BR memfd_secret (2)
+system call provides an ability to hide information
+from the operating system.
+Normally Linux userspace mappings are protected from other users,
+but they are visible to the privileged code.
+The mappings created using
+.BR memfd_secret ()
+are hidden from the kernel as well.
+.PP
+If an architecture supports
+.B SECRETMEM_UNCACHED,
+the mappings also have protection from speculative execution vulnerabilties,
+at the expense of increased memory access latency.
+Care should be taken when using
+.B
+SECRETMEM_UNCACHED
+to avoid degrading application performance.
+.SH SEE ALSO
+.BR fcntl (2),
+.BR ftruncate (2),
+.BR mlock (2),
+.BR mmap (2),
+.BR setrlimit (2),
--
2.25.4
* Mike Rapoport:
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call; see NOTES.
You added a reference to NOTES, but then in notes there is nothing about
it. I guess you wanted to add the following to NOTES (taken from
membarrier.2):
.PP
Glibc does not provide a wrapper for this system call; call it using
.BR syscall (2).
Cheers,
Alex
On Thu, 24 Sep 2020 16:28:58 +0300 Mike Rapoport <[email protected]> wrote:
> From: Mike Rapoport <[email protected]>
>
> Hi,
>
> This is an implementation of "secret" mappings backed by a file descriptor.
> I've dropped the boot time reservation patch for now as it is not strictly
> required for the basic usage and can be easily added later either with or
> without CMA.
>
> ...
>
> The file descriptor backing secret memory mappings is created using a
> dedicated memfd_secret system call The desired protection mode for the
> memory is configured using flags parameter of the system call. The mmap()
> of the file descriptor created with memfd_secret() will create a "secret"
> memory mapping. The pages in that mapping will be marked as not present in
> the direct map and will have desired protection bits set in the user page
> table. For instance, current implementation allows uncached mappings.
>
> Although normally Linux userspace mappings are protected from other users,
> such secret mappings are useful for environments where a hostile tenant is
> trying to trick the kernel into giving them access to other tenants
> mappings.
>
> Additionally, the secret mappings may be used as a mean to protect guest
> memory in a virtual machine host.
>
> For demonstration of secret memory usage we've created a userspace library
> [1] that does two things: the first is act as a preloader for openssl to
I can find no [1].
I'm not a fan of the enumerated footnote thing. Why not inline the url
right here so readers don't need to jump around?
On Thu, Sep 24, 2020 at 07:34:28PM -0700, Andrew Morton wrote:
> On Thu, 24 Sep 2020 16:28:58 +0300 Mike Rapoport <[email protected]> wrote:
>
> > From: Mike Rapoport <[email protected]>
> >
> > Hi,
> >
> > This is an implementation of "secret" mappings backed by a file descriptor.
> > I've dropped the boot time reservation patch for now as it is not strictly
> > required for the basic usage and can be easily added later either with or
> > without CMA.
> >
> > ...
> >
> > The file descriptor backing secret memory mappings is created using a
> > dedicated memfd_secret system call The desired protection mode for the
> > memory is configured using flags parameter of the system call. The mmap()
> > of the file descriptor created with memfd_secret() will create a "secret"
> > memory mapping. The pages in that mapping will be marked as not present in
> > the direct map and will have desired protection bits set in the user page
> > table. For instance, current implementation allows uncached mappings.
> >
> > Although normally Linux userspace mappings are protected from other users,
> > such secret mappings are useful for environments where a hostile tenant is
> > trying to trick the kernel into giving them access to other tenants
> > mappings.
> >
> > Additionally, the secret mappings may be used as a mean to protect guest
> > memory in a virtual machine host.
> >
> > For demonstration of secret memory usage we've created a userspace library
> > [1] that does two things: the first is act as a preloader for openssl to
>
> I can find no [1].
Oops, sorry. It's
https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git/
> I'm not a fan of the enumerated footnote thing. Why not inline the url
> right here so readers don't need to jump around?
>
>
--
Sincerely yours,
Mike.
On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> From: Mike Rapoport <[email protected]>
>
> Removing a PAGE_SIZE page from the direct map every time such page is
> allocated for a secret memory mapping will cause severe fragmentation of
> the direct map. This fragmentation can be reduced by using PMD-size pages
> as a pool for small pages for secret memory mappings.
>
> Add a gen_pool per secretmem inode and lazily populate this pool with
> PMD-size pages.
What's the actual efficacy of this? Since the pmd is per inode, all I
need is a lot of inodes and we're in business to destroy the directmap,
no?
Afaict there's no privs needed to use this, all a process needs is to
stay below the mlock limit, so a 'fork-bomb' that maps a single secret
page will utterly destroy the direct map.
I really don't like this, at all.
IIRC Kirill looked at merging the directmap. I think he ran into
performance issues there, but we really need something like that before
something like this lands.
On 25.09.20 09:41, Peter Zijlstra wrote:
> On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
>> From: Mike Rapoport <[email protected]>
>>
>> Removing a PAGE_SIZE page from the direct map every time such page is
>> allocated for a secret memory mapping will cause severe fragmentation of
>> the direct map. This fragmentation can be reduced by using PMD-size pages
>> as a pool for small pages for secret memory mappings.
>>
>> Add a gen_pool per secretmem inode and lazily populate this pool with
>> PMD-size pages.
>
> What's the actual efficacy of this? Since the pmd is per inode, all I
> need is a lot of inodes and we're in business to destroy the directmap,
> no?
>
> Afaict there's no privs needed to use this, all a process needs is to
> stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> page will utterly destroy the direct map.
>
> I really don't like this, at all.
As I expressed earlier, I would prefer allowing allocation of secretmem
only from a previously defined CMA area. This would physically locally
limit the pain.
But my suggestion was not well received :)
--
Thanks,
David / dhildenb
On Fri, Sep 25, 2020 at 11:00:30AM +0200, David Hildenbrand wrote:
> On 25.09.20 09:41, Peter Zijlstra wrote:
> > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> >> From: Mike Rapoport <[email protected]>
> >>
> >> Removing a PAGE_SIZE page from the direct map every time such page is
> >> allocated for a secret memory mapping will cause severe fragmentation of
> >> the direct map. This fragmentation can be reduced by using PMD-size pages
> >> as a pool for small pages for secret memory mappings.
> >>
> >> Add a gen_pool per secretmem inode and lazily populate this pool with
> >> PMD-size pages.
> >
> > What's the actual efficacy of this? Since the pmd is per inode, all I
> > need is a lot of inodes and we're in business to destroy the directmap,
> > no?
> >
> > Afaict there's no privs needed to use this, all a process needs is to
> > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > page will utterly destroy the direct map.
> >
> > I really don't like this, at all.
>
> As I expressed earlier, I would prefer allowing allocation of secretmem
> only from a previously defined CMA area. This would physically locally
> limit the pain.
Given that this thing doesn't have a migrate hook, that seems like an
eminently reasonable contraint. Because not only will it mess up the
directmap, it will also destroy the ability of the page-allocator /
compaction to re-form high order blocks by sprinkling holes throughout.
Also, this is all very close to XPFO, yet I don't see that mentioned
anywhere.
Further still, it has this HAVE_SECRETMEM_UNCACHED nonsense which is
completely unused. I'm not at all sure exposing UNCACHED to random
userspace is a sane idea.
Hi,
Sorry to come to this so late; I've been meaning to provide feedback on
this for a while but have been indisposed for a bit due to an injury.
On Fri, Sep 25, 2020 at 11:50:29AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 25, 2020 at 11:00:30AM +0200, David Hildenbrand wrote:
> > On 25.09.20 09:41, Peter Zijlstra wrote:
> > > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > >> From: Mike Rapoport <[email protected]>
> > >>
> > >> Removing a PAGE_SIZE page from the direct map every time such page is
> > >> allocated for a secret memory mapping will cause severe fragmentation of
> > >> the direct map. This fragmentation can be reduced by using PMD-size pages
> > >> as a pool for small pages for secret memory mappings.
> > >>
> > >> Add a gen_pool per secretmem inode and lazily populate this pool with
> > >> PMD-size pages.
> > >
> > > What's the actual efficacy of this? Since the pmd is per inode, all I
> > > need is a lot of inodes and we're in business to destroy the directmap,
> > > no?
> > >
> > > Afaict there's no privs needed to use this, all a process needs is to
> > > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > > page will utterly destroy the direct map.
> > >
> > > I really don't like this, at all.
> >
> > As I expressed earlier, I would prefer allowing allocation of secretmem
> > only from a previously defined CMA area. This would physically locally
> > limit the pain.
>
> Given that this thing doesn't have a migrate hook, that seems like an
> eminently reasonable contraint. Because not only will it mess up the
> directmap, it will also destroy the ability of the page-allocator /
> compaction to re-form high order blocks by sprinkling holes throughout.
>
> Also, this is all very close to XPFO, yet I don't see that mentioned
> anywhere.
Agreed. I think if we really need something like this, something between
XPFO and DEBUG_PAGEALLOC would be generally better, since:
* Secretmem puts userspace in charge of kernel internals (AFAICT without
any ulimits?), so that seems like an avenue for malicious or buggy
userspace to exploit and trigger DoS, etc. The other approaches leave
the kernel in charge at all times, and it's a system-level choice
which is easier to reason about and test.
* Secretmem interaction with existing ABIs is unclear. Should uaccess
primitives work for secretmem? If so, this means that it's not valid
to transform direct uaccesses in syscalls etc into accesses via the
linear/direct map. If not, how do we prevent syscalls? The other
approaches are clear that this should always work, but the kernel
should avoid mappings wherever possible.
* The uncached option doesn't work in a number of situations, such as
systems which are purely cache coherent at all times, or where the
hypervisor has overridden attributes. The kernel cannot even know that
whther this works as intended. On its own this doens't solve a
particular problem, and I think this is a solution looking for a
problem.
... and fundamentally, this seems like a "more security, please" option
that is going to be abused, since everyone wants security, regardless of
how we say it *should* be used. The few use-cases that may make sense
(e.g. protection of ketys and/or crypto secrrets), aren't going to be
able to rely on this (since e.g. other uses may depelete memory pools),
so this is going to be best-effort. With all that in mind, I struggle to
beleive that this is going to be worth the maintenance cost (e.g. with
any issues arising from uaccess, IO, etc).
Overall, I would prefer to not see this syscall in the kernel.
> Further still, it has this HAVE_SECRETMEM_UNCACHED nonsense which is
> completely unused. I'm not at all sure exposing UNCACHED to random
> userspace is a sane idea.
I agree the uncached stuff should be removed. It is at best misleading
since the kernel can't guarantee it does what it says, I think it's
liable to lead to issues in future (e.g. since it can cause memory
operations to raise different exceptions relative to what they can
today), and as above it seems like a solution looking for a problem.
Thanks,
Mark.
On Fri, Sep 25, 2020 at 11:31:14AM +0100, Mark Rutland wrote:
> Hi,
>
> Sorry to come to this so late; I've been meaning to provide feedback on
> this for a while but have been indisposed for a bit due to an injury.
>
> On Fri, Sep 25, 2020 at 11:50:29AM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 25, 2020 at 11:00:30AM +0200, David Hildenbrand wrote:
> > > On 25.09.20 09:41, Peter Zijlstra wrote:
> > > > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > > >> From: Mike Rapoport <[email protected]>
> > > >>
> > > >> Removing a PAGE_SIZE page from the direct map every time such page is
> > > >> allocated for a secret memory mapping will cause severe fragmentation of
> > > >> the direct map. This fragmentation can be reduced by using PMD-size pages
> > > >> as a pool for small pages for secret memory mappings.
> > > >>
> > > >> Add a gen_pool per secretmem inode and lazily populate this pool with
> > > >> PMD-size pages.
> > > >
> > > > What's the actual efficacy of this? Since the pmd is per inode, all I
> > > > need is a lot of inodes and we're in business to destroy the directmap,
> > > > no?
> > > >
> > > > Afaict there's no privs needed to use this, all a process needs is to
> > > > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > > > page will utterly destroy the direct map.
> > > >
> > > > I really don't like this, at all.
> > >
> > > As I expressed earlier, I would prefer allowing allocation of secretmem
> > > only from a previously defined CMA area. This would physically locally
> > > limit the pain.
> >
> > Given that this thing doesn't have a migrate hook, that seems like an
> > eminently reasonable contraint. Because not only will it mess up the
> > directmap, it will also destroy the ability of the page-allocator /
> > compaction to re-form high order blocks by sprinkling holes throughout.
> >
> > Also, this is all very close to XPFO, yet I don't see that mentioned
> > anywhere.
>
> Agreed. I think if we really need something like this, something between
> XPFO and DEBUG_PAGEALLOC would be generally better, since:
Perhaps we can brainstorm on this? XPFO has mostly been abandoned
because there's no good/safe way to make it faster. There was work on
eliminating TLB flushes, but that waters down the protection. When I
was last thinking about it in anger, it just seemed like it was
destined to be slow, especially on $large_num_cores machines, since
you have to flush everyone else's map too.
I think the idea of "opt in to XPFO" is mostly attractive because then
people only have to pay the slowness cost for memory they really care
about. But if there's some way to make XPFO, or some alternative
design, that may be better.
Tycho
On Thu, 2020-09-24 at 16:29 +0300, Mike Rapoport wrote:
> Introduce "memfd_secret" system call with the ability to create
> memory
> areas visible only in the context of the owning process and not
> mapped not
> only to other processes but in the kernel page tables as well.
>
> The user will create a file descriptor using the memfd_secret()
> system call
> where flags supplied as a parameter to this system call will define
> the
> desired protection mode for the memory associated with that file
> descriptor.
>
> Currently there are two protection modes:
>
> * exclusive - the memory area is unmapped from the kernel direct map
> and it
> is present only in the page tables of the owning mm.
Seems like there were some concerns raised around direct map
efficiency, but in case you are going to rework this...how does this
memory work for the existing kernel functionality that does things like
this?
get_user_pages(, &page);
ptr = kmap(page);
foo = *ptr;
Not sure if I'm missing something, but I think apps could cause the
kernel to access a not-present page and oops.
On Fri, Sep 25, 2020 at 09:41:25AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > From: Mike Rapoport <[email protected]>
> >
> > Removing a PAGE_SIZE page from the direct map every time such page is
> > allocated for a secret memory mapping will cause severe fragmentation of
> > the direct map. This fragmentation can be reduced by using PMD-size pages
> > as a pool for small pages for secret memory mappings.
> >
> > Add a gen_pool per secretmem inode and lazily populate this pool with
> > PMD-size pages.
>
> What's the actual efficacy of this? Since the pmd is per inode, all I
> need is a lot of inodes and we're in business to destroy the directmap,
> no?
>
> Afaict there's no privs needed to use this, all a process needs is to
> stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> page will utterly destroy the direct map.
This indeed will cause 1G pages in the direct map to be split into 2M
chunks, but I disagree with 'destroy' term here. Citing the cover letter
of an earlier version of this series:
I've tried to find some numbers that show the benefit of using larger
pages in the direct map, but I couldn't find anything so I've run a
couple of benchmarks from phoronix-test-suite on my laptop (i7-8650U
with 32G RAM).
I've tested three variants: the default with 28G of the physical
memory covered with 1G pages, then I disabled 1G pages using
"nogbpages" in the kernel command line and at last I've forced the
entire direct map to use 4K pages using a simple patch to
arch/x86/mm/init.c. I've made runs of the benchmarks with SSD and
tmpfs.
Surprisingly, the results does not show huge advantage for large
pages. For instance, here the results for kernel build with
'make -j8', in seconds:
| 1G | 2M | 4K
----------------------+--------+--------+---------
ssd, mitigations=on | 308.75 | 317.37 | 314.9
ssd, mitigations=off | 305.25 | 295.32 | 304.92
ram, mitigations=on | 301.58 | 322.49 | 306.54
ram, mitigations=off | 299.32 | 288.44 | 310.65
All the results I have are available here:
https://docs.google.com/spreadsheets/d/1tdD-cu8e93vnfGsTFxZ5YdaEfs2E1GELlvWNOGkJV2U/edit?usp=sharing
The numbers suggest that using smaller pages in the direct map does not
necessarily leads to performance degradation and some runs produced
better results with smaller pages in the direct map.
> I really don't like this, at all.
>
> IIRC Kirill looked at merging the directmap. I think he ran into
> performance issues there, but we really need something like that before
> something like this lands.
--
Sincerely yours,
Mike.
On Tue, Sep 29, 2020 at 04:58:44AM +0000, Edgecombe, Rick P wrote:
> On Thu, 2020-09-24 at 16:29 +0300, Mike Rapoport wrote:
> > Introduce "memfd_secret" system call with the ability to create
> > memory
> > areas visible only in the context of the owning process and not
> > mapped not
> > only to other processes but in the kernel page tables as well.
> >
> > The user will create a file descriptor using the memfd_secret()
> > system call
> > where flags supplied as a parameter to this system call will define
> > the
> > desired protection mode for the memory associated with that file
> > descriptor.
> >
> > Currently there are two protection modes:
> >
> > * exclusive - the memory area is unmapped from the kernel direct map
> > and it
> > is present only in the page tables of the owning mm.
>
> Seems like there were some concerns raised around direct map
> efficiency, but in case you are going to rework this...how does this
> memory work for the existing kernel functionality that does things like
> this?
>
> get_user_pages(, &page);
> ptr = kmap(page);
> foo = *ptr;
>
> Not sure if I'm missing something, but I think apps could cause the
> kernel to access a not-present page and oops.
The idea is that this memory should not be accessible by the kernel, so
the sequence you describe should indeed fail.
Probably oops would be to noisy and in this case the report needs to be
less verbose.
--
Sincerely yours,
Mike.
On Fri, Sep 25, 2020 at 11:50:29AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 25, 2020 at 11:00:30AM +0200, David Hildenbrand wrote:
> > On 25.09.20 09:41, Peter Zijlstra wrote:
> > > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > >> From: Mike Rapoport <[email protected]>
> > >>
> > >> Removing a PAGE_SIZE page from the direct map every time such page is
> > >> allocated for a secret memory mapping will cause severe fragmentation of
> > >> the direct map. This fragmentation can be reduced by using PMD-size pages
> > >> as a pool for small pages for secret memory mappings.
> > >>
> > >> Add a gen_pool per secretmem inode and lazily populate this pool with
> > >> PMD-size pages.
> > >
> > > What's the actual efficacy of this? Since the pmd is per inode, all I
> > > need is a lot of inodes and we're in business to destroy the directmap,
> > > no?
> > >
> > > Afaict there's no privs needed to use this, all a process needs is to
> > > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > > page will utterly destroy the direct map.
> > >
> > > I really don't like this, at all.
> >
> > As I expressed earlier, I would prefer allowing allocation of secretmem
> > only from a previously defined CMA area. This would physically locally
> > limit the pain.
>
> Given that this thing doesn't have a migrate hook, that seems like an
> eminently reasonable contraint. Because not only will it mess up the
> directmap, it will also destroy the ability of the page-allocator /
> compaction to re-form high order blocks by sprinkling holes throughout.
>
> Also, this is all very close to XPFO, yet I don't see that mentioned
> anywhere.
It's close to XPFO in the sense it removes pages from the kernel page
table. But unlike XPFO memfd_secret() does not mean allowing access to
these pages in the kernel until they are freed by the user. And, unlike
XPFO, it does not require TLB flushing all over the place.
> Further still, it has this HAVE_SECRETMEM_UNCACHED nonsense which is
> completely unused. I'm not at all sure exposing UNCACHED to random
> userspace is a sane idea.
The uncached mappings were originally proposed as a mean "... to prevent
or considerably restrict speculation on such pages" [1] as a comment to
my initial proposal to use mmap(MAP_EXCLUSIVE).
I've added the ability to create uncached mappings into the fd-based
implementation of the exclusive mappings as it is indeed can reduce
availability of side channels and the implementation was quite straight
forward.
[1] https://lore.kernel.org/linux-mm/2236FBA76BA1254E88B949DDB74E612BA4EEC0CE@IRSMSX102.ger.corp.intel.com/
--
Sincerely yours,
Mike.
On Fri, Sep 25, 2020 at 11:00:30AM +0200, David Hildenbrand wrote:
> On 25.09.20 09:41, Peter Zijlstra wrote:
> > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> >> From: Mike Rapoport <[email protected]>
> >>
> >> Removing a PAGE_SIZE page from the direct map every time such page is
> >> allocated for a secret memory mapping will cause severe fragmentation of
> >> the direct map. This fragmentation can be reduced by using PMD-size pages
> >> as a pool for small pages for secret memory mappings.
> >>
> >> Add a gen_pool per secretmem inode and lazily populate this pool with
> >> PMD-size pages.
> >
> > What's the actual efficacy of this? Since the pmd is per inode, all I
> > need is a lot of inodes and we're in business to destroy the directmap,
> > no?
> >
> > Afaict there's no privs needed to use this, all a process needs is to
> > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > page will utterly destroy the direct map.
> >
> > I really don't like this, at all.
>
> As I expressed earlier, I would prefer allowing allocation of secretmem
> only from a previously defined CMA area. This would physically locally
> limit the pain.
The prevois version contained a patch that allowed reserving a memory
pool for the secretmem at boot time to avpoid splitting pages from the
direct map
> But my suggestion was not well received :)
The disagreemet was only whether to use CMA or simple boot time
reservation :-P
> --
> Thanks,
>
> David / dhildenb
>
--
Sincerely yours,
Mike.
On Fri, Sep 25, 2020 at 11:31:14AM +0100, Mark Rutland wrote:
> Hi,
>
> Agreed. I think if we really need something like this, something between
> XPFO and DEBUG_PAGEALLOC would be generally better, since:
>
> * Secretmem puts userspace in charge of kernel internals (AFAICT without
> any ulimits?), so that seems like an avenue for malicious or buggy
> userspace to exploit and trigger DoS, etc. The other approaches leave
> the kernel in charge at all times, and it's a system-level choice
> which is easier to reason about and test.
Secretmem obeys RLIMIT_MLOCK.
I don't see why it "puts userpspace in charge of kernel internals" more
than other system calls. The fact that memory is dropped from
linear/direct mapping does not make userspace in charge of the kernel
internals. The fact that this is not system-level actually makes it more
controllable and tunable, IMHO.
> * Secretmem interaction with existing ABIs is unclear. Should uaccess
> primitives work for secretmem? If so, this means that it's not valid
> to transform direct uaccesses in syscalls etc into accesses via the
> linear/direct map. If not, how do we prevent syscalls? The other
> approaches are clear that this should always work, but the kernel
> should avoid mappings wherever possible.
Our idea was that direct uaccess in the context of the process that owns
the secretmem should work and that transforming the direct uaccesses
into accesses via the linear map would be valid only when allowed
explicitly. E.g with addition of FOLL_SOMETHING to gup.
Yet, this would be required for any implementation of memory areas that
excludes pages from the linear mapping.
> * The uncached option doesn't work in a number of situations, such as
> systems which are purely cache coherent at all times, or where the
> hypervisor has overridden attributes. The kernel cannot even know that
> whther this works as intended. On its own this doens't solve a
> particular problem, and I think this is a solution looking for a
> problem.
As we discussed at one of the previous iterations, the uncached makes
sense for x86 to reduce availability of side channels and I've only
enabled uncached mappings on x86.
> ... and fundamentally, this seems like a "more security, please" option
> that is going to be abused, since everyone wants security, regardless of
> how we say it *should* be used. The few use-cases that may make sense
> (e.g. protection of ketys and/or crypto secrrets), aren't going to be
> able to rely on this (since e.g. other uses may depelete memory pools),
> so this is going to be best-effort. With all that in mind, I struggle to
> beleive that this is going to be worth the maintenance cost (e.g. with
> any issues arising from uaccess, IO, etc).
I think that making secretmem a file descriptor that only allows mmap()
already makes it quite self contained and simple. There could be several
cases that will need special treatment, but I don't think it will have
large maintenance cost.
I've run syzkaller for some time with memfd_secret() enabled and I never
hit a crash because of it.
> Thanks,
> Mark.
--
Sincerely yours,
Mike.
On Tue, Sep 29, 2020 at 04:05:29PM +0300, Mike Rapoport wrote:
> On Fri, Sep 25, 2020 at 09:41:25AM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > > From: Mike Rapoport <[email protected]>
> > >
> > > Removing a PAGE_SIZE page from the direct map every time such page is
> > > allocated for a secret memory mapping will cause severe fragmentation of
> > > the direct map. This fragmentation can be reduced by using PMD-size pages
> > > as a pool for small pages for secret memory mappings.
> > >
> > > Add a gen_pool per secretmem inode and lazily populate this pool with
> > > PMD-size pages.
> >
> > What's the actual efficacy of this? Since the pmd is per inode, all I
> > need is a lot of inodes and we're in business to destroy the directmap,
> > no?
> >
> > Afaict there's no privs needed to use this, all a process needs is to
> > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > page will utterly destroy the direct map.
>
> This indeed will cause 1G pages in the direct map to be split into 2M
> chunks, but I disagree with 'destroy' term here. Citing the cover letter
> of an earlier version of this series:
It will drop them down to 4k pages. Given enough inodes, and allocating
only a single sekrit page per pmd, we'll shatter the directmap into 4k.
> I've tried to find some numbers that show the benefit of using larger
> pages in the direct map, but I couldn't find anything so I've run a
> couple of benchmarks from phoronix-test-suite on my laptop (i7-8650U
> with 32G RAM).
Existing benchmarks suck at this, but FB had a load that had a
deterministic enough performance regression to bisect to a directmap
issue, fixed by:
7af0145067bc ("x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text")
> I've tested three variants: the default with 28G of the physical
> memory covered with 1G pages, then I disabled 1G pages using
> "nogbpages" in the kernel command line and at last I've forced the
> entire direct map to use 4K pages using a simple patch to
> arch/x86/mm/init.c. I've made runs of the benchmarks with SSD and
> tmpfs.
>
> Surprisingly, the results does not show huge advantage for large
> pages. For instance, here the results for kernel build with
> 'make -j8', in seconds:
Your benchmark should stress the TLB of your uarch, such that additional
pressure added by the shattered directmap shows up.
And no, I don't have one either.
> | 1G | 2M | 4K
> ----------------------+--------+--------+---------
> ssd, mitigations=on | 308.75 | 317.37 | 314.9
> ssd, mitigations=off | 305.25 | 295.32 | 304.92
> ram, mitigations=on | 301.58 | 322.49 | 306.54
> ram, mitigations=off | 299.32 | 288.44 | 310.65
These results lack error data, but assuming the reults are significant,
then this very much makes a case for 1G mappings. 5s on a kernel builds
is pretty good.
On 9/29/20 7:12 AM, Peter Zijlstra wrote:
>> | 1G | 2M | 4K
>> ----------------------+--------+--------+---------
>> ssd, mitigations=on | 308.75 | 317.37 | 314.9
>> ssd, mitigations=off | 305.25 | 295.32 | 304.92
>> ram, mitigations=on | 301.58 | 322.49 | 306.54
>> ram, mitigations=off | 299.32 | 288.44 | 310.65
> These results lack error data, but assuming the reults are significant,
> then this very much makes a case for 1G mappings. 5s on a kernel builds
> is pretty good.
Is something like secretmem all or nothing?
This seems like a similar situation to the side-channel mitigations. We
know what the most "secure" thing to do is. But, folks also disagree
about how much pain that security is worth.
That seems to indicate we're never going to come up with a
one-size-fits-all solution to this. Apps are going to have to live
without secretmem being around if they want to run on old kernels
anyway, so it seems like something we should be able to enable or
disable without ABI concerns.
Do we just include it, but disable it by default so it doesn't eat
performance? But, allow it to be reenabled by the folks who generally
prioritize hardening over performance, like Chromebooks for instance.
On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 29, 2020 at 04:05:29PM +0300, Mike Rapoport wrote:
> > On Fri, Sep 25, 2020 at 09:41:25AM +0200, Peter Zijlstra wrote:
> > > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > > > From: Mike Rapoport <[email protected]>
> > > >
> > > > Removing a PAGE_SIZE page from the direct map every time such page is
> > > > allocated for a secret memory mapping will cause severe fragmentation of
> > > > the direct map. This fragmentation can be reduced by using PMD-size pages
> > > > as a pool for small pages for secret memory mappings.
> > > >
> > > > Add a gen_pool per secretmem inode and lazily populate this pool with
> > > > PMD-size pages.
> > >
> > > What's the actual efficacy of this? Since the pmd is per inode, all I
> > > need is a lot of inodes and we're in business to destroy the directmap,
> > > no?
> > >
> > > Afaict there's no privs needed to use this, all a process needs is to
> > > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > > page will utterly destroy the direct map.
> >
> > This indeed will cause 1G pages in the direct map to be split into 2M
> > chunks, but I disagree with 'destroy' term here. Citing the cover letter
> > of an earlier version of this series:
>
> It will drop them down to 4k pages. Given enough inodes, and allocating
> only a single sekrit page per pmd, we'll shatter the directmap into 4k.
Why? Secretmem allocates PMD-size page per inode and uses it as a pool
of 4K pages for that inode. This way it ensures that
__kernel_map_pages() is always called on PMD boundaries.
--
Sincerely yours,
Mike.
On Tue, 2020-09-29 at 16:12 +0200, Peter Zijlstra wrote:
> On Tue, Sep 29, 2020 at 04:05:29PM +0300, Mike Rapoport wrote:
> > On Fri, Sep 25, 2020 at 09:41:25AM +0200, Peter Zijlstra wrote:
> > > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > > > From: Mike Rapoport <[email protected]>
> > > >
> > > > Removing a PAGE_SIZE page from the direct map every time such
> > > > page is allocated for a secret memory mapping will cause severe
> > > > fragmentation of the direct map. This fragmentation can be
> > > > reduced by using PMD-size pages as a pool for small pages for
> > > > secret memory mappings.
> > > >
> > > > Add a gen_pool per secretmem inode and lazily populate this
> > > > pool with PMD-size pages.
> > >
> > > What's the actual efficacy of this? Since the pmd is per inode,
> > > all I need is a lot of inodes and we're in business to destroy
> > > the directmap, no?
> > >
> > > Afaict there's no privs needed to use this, all a process needs
> > > is to stay below the mlock limit, so a 'fork-bomb' that maps a
> > > single secret page will utterly destroy the direct map.
> >
> > This indeed will cause 1G pages in the direct map to be split into
> > 2M chunks, but I disagree with 'destroy' term here. Citing the
> > cover letter of an earlier version of this series:
>
> It will drop them down to 4k pages. Given enough inodes, and
> allocating only a single sekrit page per pmd, we'll shatter the
> directmap into 4k.
Since the only requirement is 2M, even if this happens, which I'm not
sure it does, it's fixable to only fragment down to 2M, right?
We could also enforce a global limit in the secretmem syscall, so the
fork bomb problem can be made to go away.
Lastly, we could go back to boot time allocation as the previous patch
did, so this isn't even a fundamental problem with the patch set.
That said, I think investigation of the importance of direct map tiling
is useful, since it does fragment for other reasons, and fixing or
proving that the fragmentation doesn't matter is also something we'll
keep on investigating. But it would be useful in the meantime to
explore things which may be more fundamental issues with the approach.
Regards,
James
On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
> On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
> > It will drop them down to 4k pages. Given enough inodes, and allocating
> > only a single sekrit page per pmd, we'll shatter the directmap into 4k.
>
> Why? Secretmem allocates PMD-size page per inode and uses it as a pool
> of 4K pages for that inode. This way it ensures that
> __kernel_map_pages() is always called on PMD boundaries.
Oh, you unmap the 2m page upfront? I read it like you did the unmap at
the sekrit page alloc, not the pool alloc side of things.
Then yes, but then you're wasting gobs of memory. Basically you can pin
2M per inode while only accounting a single page.
On Tue, 2020-09-29 at 16:06 +0300, Mike Rapoport wrote:
> On Tue, Sep 29, 2020 at 04:58:44AM +0000, Edgecombe, Rick P wrote:
> > On Thu, 2020-09-24 at 16:29 +0300, Mike Rapoport wrote:
> > > Introduce "memfd_secret" system call with the ability to create
> > > memory
> > > areas visible only in the context of the owning process and not
> > > mapped not
> > > only to other processes but in the kernel page tables as well.
> > >
> > > The user will create a file descriptor using the memfd_secret()
> > > system call
> > > where flags supplied as a parameter to this system call will
> > > define
> > > the
> > > desired protection mode for the memory associated with that file
> > > descriptor.
> > >
> > > Currently there are two protection modes:
> > >
> > > * exclusive - the memory area is unmapped from the kernel direct
> > > map
> > > and it
> > > is present only in the page tables of the owning
> > > mm.
> >
> > Seems like there were some concerns raised around direct map
> > efficiency, but in case you are going to rework this...how does
> > this
> > memory work for the existing kernel functionality that does things
> > like
> > this?
> >
> > get_user_pages(, &page);
> > ptr = kmap(page);
> > foo = *ptr;
> >
> > Not sure if I'm missing something, but I think apps could cause the
> > kernel to access a not-present page and oops.
>
> The idea is that this memory should not be accessible by the kernel,
> so
> the sequence you describe should indeed fail.
>
> Probably oops would be to noisy and in this case the report needs to
> be
> less verbose.
I was more concerned that it could cause kernel instabilities.
I see, so it should not be accessed even at the userspace address? I
wonder if it should be prevented somehow then. At least
get_user_pages() should be prevented I think. Blocking copy_*_user()
access might not be simple.
I'm also not so sure that a user would never have any possible reason
to copy data from this memory into the kernel, even if it's just
convenience. In which case a user setup could break if a specific
kernel implementation switched to get_user_pages()/kmap() from using
copy_*_user(). So seems maybe a bit thorny without fully blocking
access from the kernel, or deprecating that pattern.
You should probably call out these "no passing data to/from the kernel"
expectations, unless I missed them somewhere.
On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 29, 2020 at 04:05:29PM +0300, Mike Rapoport wrote:
> > On Fri, Sep 25, 2020 at 09:41:25AM +0200, Peter Zijlstra wrote:
> > > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > > > From: Mike Rapoport <[email protected]>
> > > >
> > > > Removing a PAGE_SIZE page from the direct map every time such page is
> > > > allocated for a secret memory mapping will cause severe fragmentation of
> > > > the direct map. This fragmentation can be reduced by using PMD-size pages
> > > > as a pool for small pages for secret memory mappings.
> > > >
> > > > Add a gen_pool per secretmem inode and lazily populate this pool with
> > > > PMD-size pages.
> > >
> > > What's the actual efficacy of this? Since the pmd is per inode, all I
> > > need is a lot of inodes and we're in business to destroy the directmap,
> > > no?
> > >
> > > Afaict there's no privs needed to use this, all a process needs is to
> > > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > > page will utterly destroy the direct map.
> >
> > This indeed will cause 1G pages in the direct map to be split into 2M
> > chunks, but I disagree with 'destroy' term here. Citing the cover letter
> > of an earlier version of this series:
>
> It will drop them down to 4k pages. Given enough inodes, and allocating
> only a single sekrit page per pmd, we'll shatter the directmap into 4k.
>
> > I've tried to find some numbers that show the benefit of using larger
> > pages in the direct map, but I couldn't find anything so I've run a
> > couple of benchmarks from phoronix-test-suite on my laptop (i7-8650U
> > with 32G RAM).
>
> Existing benchmarks suck at this, but FB had a load that had a
I tried to dig the regression report in the mailing list, and the best I
could find is
https://lore.kernel.org/lkml/[email protected]/
which does not mention the actual performance regression but it only
complaints about kernel text mapping being split into 4K pages.
Any chance you have the regression report handy?
> deterministic enough performance regression to bisect to a directmap
> issue, fixed by:
>
> 7af0145067bc ("x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text")
This commit talks about large page split for the text and mentions iTLB
performance.
Could it be that for data the behavoiur is different?
> > I've tested three variants: the default with 28G of the physical
> > memory covered with 1G pages, then I disabled 1G pages using
> > "nogbpages" in the kernel command line and at last I've forced the
> > entire direct map to use 4K pages using a simple patch to
> > arch/x86/mm/init.c. I've made runs of the benchmarks with SSD and
> > tmpfs.
> >
> > Surprisingly, the results does not show huge advantage for large
> > pages. For instance, here the results for kernel build with
> > 'make -j8', in seconds:
>
> Your benchmark should stress the TLB of your uarch, such that additional
> pressure added by the shattered directmap shows up.
I understand that the benchmark should stress the TLB, but it's not that
we can add something like random access to a large working set as a
kernel module and insmod it. The userspace should do something that will
cause the stress to the TLB so that entries corresponding to the direct
map will be evicted frequently. And, frankly,
> And no, I don't have one either.
>
> > | 1G | 2M | 4K
> > ----------------------+--------+--------+---------
> > ssd, mitigations=on | 308.75 | 317.37 | 314.9
> > ssd, mitigations=off | 305.25 | 295.32 | 304.92
> > ram, mitigations=on | 301.58 | 322.49 | 306.54
> > ram, mitigations=off | 299.32 | 288.44 | 310.65
>
> These results lack error data, but assuming the reults are significant,
> then this very much makes a case for 1G mappings. 5s on a kernel builds
> is pretty good.
The standard error for those are between 2.5 and 4.5 out of 3 runs for
each variant.
For kernel build 1G mappings perform better, but here 5s is only 1.6% of
300s and the direct map fragmentation was taken to the extreme here.
I'm not saying that the direct map fragmentation comes with no cost, but
the cost is not so big to dismiss features that cause the fragmentation
out of hand.
There were also benchmarks that actually performed better with 2M pages
in the direct map, so I'm still not convinced that 1G pages in the
direct map are the clear cut winner.
--
Sincerely yours,
Mike.
On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
> > On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
>
> > > It will drop them down to 4k pages. Given enough inodes, and allocating
> > > only a single sekrit page per pmd, we'll shatter the directmap into 4k.
> >
> > Why? Secretmem allocates PMD-size page per inode and uses it as a pool
> > of 4K pages for that inode. This way it ensures that
> > __kernel_map_pages() is always called on PMD boundaries.
>
> Oh, you unmap the 2m page upfront? I read it like you did the unmap at
> the sekrit page alloc, not the pool alloc side of things.
>
> Then yes, but then you're wasting gobs of memory. Basically you can pin
> 2M per inode while only accounting a single page.
Right, quite like THP :)
I considered using a global pool of 2M pages for secretmem and handing
4K pages to each inode from that global pool. But I've decided to waste
memory in favor of simplicity.
The prevoius version of this set included additional patch that allowed
reserving chunk of the physical memory for a global secretmem pool at
boot time. We didn't reach an agreement with David H. about whether this
pool should be allocated directly from memblock or from CMA and I've
dropped the boot time reservation patch because it can always be added on
top.
--
Sincerely yours,
Mike.
On Tue, Sep 29, 2020 at 08:06:03PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2020-09-29 at 16:06 +0300, Mike Rapoport wrote:
> > On Tue, Sep 29, 2020 at 04:58:44AM +0000, Edgecombe, Rick P wrote:
> > > On Thu, 2020-09-24 at 16:29 +0300, Mike Rapoport wrote:
> > > > Introduce "memfd_secret" system call with the ability to create
> > > > memory
> > > > areas visible only in the context of the owning process and not
> > > > mapped not
> > > > only to other processes but in the kernel page tables as well.
> > > >
> > > > The user will create a file descriptor using the memfd_secret()
> > > > system call
> > > > where flags supplied as a parameter to this system call will
> > > > define
> > > > the
> > > > desired protection mode for the memory associated with that file
> > > > descriptor.
> > > >
> > > > Currently there are two protection modes:
> > > >
> > > > * exclusive - the memory area is unmapped from the kernel direct
> > > > map
> > > > and it
> > > > is present only in the page tables of the owning
> > > > mm.
> > >
> > > Seems like there were some concerns raised around direct map
> > > efficiency, but in case you are going to rework this...how does
> > > this
> > > memory work for the existing kernel functionality that does things
> > > like
> > > this?
> > >
> > > get_user_pages(, &page);
> > > ptr = kmap(page);
> > > foo = *ptr;
> > >
> > > Not sure if I'm missing something, but I think apps could cause the
> > > kernel to access a not-present page and oops.
> >
> > The idea is that this memory should not be accessible by the kernel,
> > so
> > the sequence you describe should indeed fail.
> >
> > Probably oops would be to noisy and in this case the report needs to
> > be
> > less verbose.
>
> I was more concerned that it could cause kernel instabilities.
I think kernel recovers nicely from such sort of page fault, at least on
x86.
> I see, so it should not be accessed even at the userspace address? I
> wonder if it should be prevented somehow then. At least
> get_user_pages() should be prevented I think. Blocking copy_*_user()
> access might not be simple.
>
> I'm also not so sure that a user would never have any possible reason
> to copy data from this memory into the kernel, even if it's just
> convenience. In which case a user setup could break if a specific
> kernel implementation switched to get_user_pages()/kmap() from using
> copy_*_user(). So seems maybe a bit thorny without fully blocking
> access from the kernel, or deprecating that pattern.
>
> You should probably call out these "no passing data to/from the kernel"
> expectations, unless I missed them somewhere.
You are right, I should have been more explicit in the description of
the expected behavoir.
Our thinking was that copy_*user() would work in the context of the
process that "owns" the secretmem and gup() would not allow access in
general, unless requested with certail (yet another) FOLL_ flag.
--
Sincerely yours,
Mike.
On Wed, Sep 30, 2020 at 01:20:31PM +0300, Mike Rapoport wrote:
> I tried to dig the regression report in the mailing list, and the best I
> could find is
>
> https://lore.kernel.org/lkml/[email protected]/
>
> which does not mention the actual performance regression but it only
> complaints about kernel text mapping being split into 4K pages.
>
> Any chance you have the regression report handy?
I think the saga started here:
[email protected]
[email protected]
[email protected]
After that Thomas did the patch I referred to earlier and I endeavoured
to rewrite x86-ftrace.
I added Song to CC, maybe he can remember more.
On Wed, 2020-09-30 at 13:27 +0300, Mike Rapoport wrote:
> On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
> > > On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
> > > > It will drop them down to 4k pages. Given enough inodes, and
> > > > allocating only a single sekrit page per pmd, we'll shatter the
> > > > directmap into 4k.
> > >
> > > Why? Secretmem allocates PMD-size page per inode and uses it as a
> > > pool of 4K pages for that inode. This way it ensures that
> > > __kernel_map_pages() is always called on PMD boundaries.
> >
> > Oh, you unmap the 2m page upfront? I read it like you did the unmap
> > at the sekrit page alloc, not the pool alloc side of things.
> >
> > Then yes, but then you're wasting gobs of memory. Basically you can
> > pin 2M per inode while only accounting a single page.
>
> Right, quite like THP :)
>
> I considered using a global pool of 2M pages for secretmem and
> handing 4K pages to each inode from that global pool. But I've
> decided to waste memory in favor of simplicity.
I can also add that the user space consumer of this we wrote does its
user pool allocation at a 2M granularity, so nothing is actually
wasted.
https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git/
James
On 30.09.20 16:39, James Bottomley wrote:
> On Wed, 2020-09-30 at 13:27 +0300, Mike Rapoport wrote:
>> On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
>>> On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
>>>> On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
>>>>> It will drop them down to 4k pages. Given enough inodes, and
>>>>> allocating only a single sekrit page per pmd, we'll shatter the
>>>>> directmap into 4k.
>>>>
>>>> Why? Secretmem allocates PMD-size page per inode and uses it as a
>>>> pool of 4K pages for that inode. This way it ensures that
>>>> __kernel_map_pages() is always called on PMD boundaries.
>>>
>>> Oh, you unmap the 2m page upfront? I read it like you did the unmap
>>> at the sekrit page alloc, not the pool alloc side of things.
>>>
>>> Then yes, but then you're wasting gobs of memory. Basically you can
>>> pin 2M per inode while only accounting a single page.
>>
>> Right, quite like THP :)
>>
>> I considered using a global pool of 2M pages for secretmem and
>> handing 4K pages to each inode from that global pool. But I've
>> decided to waste memory in favor of simplicity.
>
> I can also add that the user space consumer of this we wrote does its
> user pool allocation at a 2M granularity, so nothing is actually
> wasted.
... for that specific user space consumer. (or am I missing something?)
--
Thanks,
David / dhildenb
On Wed, Sep 30, 2020 at 01:27:45PM +0300, Mike Rapoport wrote:
> On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
> > > On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
> >
> > > > It will drop them down to 4k pages. Given enough inodes, and allocating
> > > > only a single sekrit page per pmd, we'll shatter the directmap into 4k.
> > >
> > > Why? Secretmem allocates PMD-size page per inode and uses it as a pool
> > > of 4K pages for that inode. This way it ensures that
> > > __kernel_map_pages() is always called on PMD boundaries.
> >
> > Oh, you unmap the 2m page upfront? I read it like you did the unmap at
> > the sekrit page alloc, not the pool alloc side of things.
> >
> > Then yes, but then you're wasting gobs of memory. Basically you can pin
> > 2M per inode while only accounting a single page.
>
> Right, quite like THP :)
Huh? THP accounts every page it allocates. If you allocate 2MB,
it accounts 512 pages. And THP are reclaimable by vmscan, this is
obviously not.
On Wed, 2020-09-30 at 16:45 +0200, David Hildenbrand wrote:
> On 30.09.20 16:39, James Bottomley wrote:
> > On Wed, 2020-09-30 at 13:27 +0300, Mike Rapoport wrote:
> > > On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
> > > > > On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra
> > > > > wrote:
> > > > > > It will drop them down to 4k pages. Given enough inodes,
> > > > > > and allocating only a single sekrit page per pmd, we'll
> > > > > > shatter the directmap into 4k.
> > > > >
> > > > > Why? Secretmem allocates PMD-size page per inode and uses it
> > > > > as a pool of 4K pages for that inode. This way it ensures
> > > > > that __kernel_map_pages() is always called on PMD boundaries.
> > > >
> > > > Oh, you unmap the 2m page upfront? I read it like you did the
> > > > unmap at the sekrit page alloc, not the pool alloc side of
> > > > things.
> > > >
> > > > Then yes, but then you're wasting gobs of memory. Basically you
> > > > can pin 2M per inode while only accounting a single page.
> > >
> > > Right, quite like THP :)
> > >
> > > I considered using a global pool of 2M pages for secretmem and
> > > handing 4K pages to each inode from that global pool. But I've
> > > decided to waste memory in favor of simplicity.
> >
> > I can also add that the user space consumer of this we wrote does
> > its user pool allocation at a 2M granularity, so nothing is
> > actually wasted.
>
> ... for that specific user space consumer. (or am I missing
> something?)
I'm not sure I understand what you mean? It's designed to be either
the standard wrapper or an example of how to do the standard wrapper
for the syscall. It uses the same allocator system glibc uses for
malloc/free ... which pretty much everyone uses instead of calling
sys_brk directly. If you look at the granularity glibc uses for
sys_brk, it's not 4k either.
James
On 30.09.20 17:17, James Bottomley wrote:
> On Wed, 2020-09-30 at 16:45 +0200, David Hildenbrand wrote:
>> On 30.09.20 16:39, James Bottomley wrote:
>>> On Wed, 2020-09-30 at 13:27 +0300, Mike Rapoport wrote:
>>>> On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
>>>>> On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
>>>>>> On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra
>>>>>> wrote:
>>>>>>> It will drop them down to 4k pages. Given enough inodes,
>>>>>>> and allocating only a single sekrit page per pmd, we'll
>>>>>>> shatter the directmap into 4k.
>>>>>>
>>>>>> Why? Secretmem allocates PMD-size page per inode and uses it
>>>>>> as a pool of 4K pages for that inode. This way it ensures
>>>>>> that __kernel_map_pages() is always called on PMD boundaries.
>>>>>
>>>>> Oh, you unmap the 2m page upfront? I read it like you did the
>>>>> unmap at the sekrit page alloc, not the pool alloc side of
>>>>> things.
>>>>>
>>>>> Then yes, but then you're wasting gobs of memory. Basically you
>>>>> can pin 2M per inode while only accounting a single page.
>>>>
>>>> Right, quite like THP :)
>>>>
>>>> I considered using a global pool of 2M pages for secretmem and
>>>> handing 4K pages to each inode from that global pool. But I've
>>>> decided to waste memory in favor of simplicity.
>>>
>>> I can also add that the user space consumer of this we wrote does
>>> its user pool allocation at a 2M granularity, so nothing is
>>> actually wasted.
>>
>> ... for that specific user space consumer. (or am I missing
>> something?)
>
> I'm not sure I understand what you mean? It's designed to be either
> the standard wrapper or an example of how to do the standard wrapper
> for the syscall. It uses the same allocator system glibc uses for
> malloc/free ... which pretty much everyone uses instead of calling
> sys_brk directly. If you look at the granularity glibc uses for
> sys_brk, it's not 4k either.
Okay thanks, "the user space consumer of this we wrote" didn't sound as
generic to me as "the standard wrapper".
--
Thanks,
David / dhildenb
On Wed, 2020-09-30 at 13:35 +0300, Mike Rapoport wrote:
> On Tue, Sep 29, 2020 at 08:06:03PM +0000, Edgecombe, Rick P wrote:
> > On Tue, 2020-09-29 at 16:06 +0300, Mike Rapoport wrote:
> > > On Tue, Sep 29, 2020 at 04:58:44AM +0000, Edgecombe, Rick P
> > > wrote:
> > > > On Thu, 2020-09-24 at 16:29 +0300, Mike Rapoport wrote:
> > > > > Introduce "memfd_secret" system call with the ability to
> > > > > create
> > > > > memory
> > > > > areas visible only in the context of the owning process and
> > > > > not
> > > > > mapped not
> > > > > only to other processes but in the kernel page tables as
> > > > > well.
> > > > >
> > > > > The user will create a file descriptor using the
> > > > > memfd_secret()
> > > > > system call
> > > > > where flags supplied as a parameter to this system call will
> > > > > define
> > > > > the
> > > > > desired protection mode for the memory associated with that
> > > > > file
> > > > > descriptor.
> > > > >
> > > > > Currently there are two protection modes:
> > > > >
> > > > > * exclusive - the memory area is unmapped from the kernel
> > > > > direct
> > > > > map
> > > > > and it
> > > > > is present only in the page tables of the
> > > > > owning
> > > > > mm.
> > > >
> > > > Seems like there were some concerns raised around direct map
> > > > efficiency, but in case you are going to rework this...how does
> > > > this
> > > > memory work for the existing kernel functionality that does
> > > > things
> > > > like
> > > > this?
> > > >
> > > > get_user_pages(, &page);
> > > > ptr = kmap(page);
> > > > foo = *ptr;
> > > >
> > > > Not sure if I'm missing something, but I think apps could cause
> > > > the
> > > > kernel to access a not-present page and oops.
> > >
> > > The idea is that this memory should not be accessible by the
> > > kernel,
> > > so
> > > the sequence you describe should indeed fail.
> > >
> > > Probably oops would be to noisy and in this case the report needs
> > > to
> > > be
> > > less verbose.
> >
> > I was more concerned that it could cause kernel instabilities.
>
> I think kernel recovers nicely from such sort of page fault, at least
> on
> x86.
We are talking about the kernel taking a direct map NP fault and
oopsing? Hmm, I thought it should often recover, but stability should
be considered reduced. How could the kernel know whether to release
locks or clean up other state? Pretty sure I've seen deadlocks in this
case.
> > I see, so it should not be accessed even at the userspace address?
> > I
> > wonder if it should be prevented somehow then. At least
> > get_user_pages() should be prevented I think. Blocking
> > copy_*_user()
> > access might not be simple.
> >
> > I'm also not so sure that a user would never have any possible
> > reason
> > to copy data from this memory into the kernel, even if it's just
> > convenience. In which case a user setup could break if a specific
> > kernel implementation switched to get_user_pages()/kmap() from
> > using
> > copy_*_user(). So seems maybe a bit thorny without fully blocking
> > access from the kernel, or deprecating that pattern.
> >
> > You should probably call out these "no passing data to/from the
> > kernel"
> > expectations, unless I missed them somewhere.
>
> You are right, I should have been more explicit in the description of
> the expected behavoir.
>
> Our thinking was that copy_*user() would work in the context of the
> process that "owns" the secretmem and gup() would not allow access in
> general, unless requested with certail (yet another) FOLL_ flag.
Hmm, yes. I think one easier thing about this design over the series
Kirill sent out is that the actual page will never transition to and
from unmapped while it's mapped in userspace. If it could transition,
you'd have to worry about a race window between
get_user_pages(FOLL_foo) and the kmap() where the page might get
unmapped.
Without the ability to transition pages though, using this for KVM
guests memory remains a not completely worked through problem since it
has the get_user_pages()/kmap() pattern quite a bit. Did you have an
idea for that? (I thought I saw that use case mentioned somewhere).
On Wed, Sep 30, 2020 at 04:09:28PM +0100, Matthew Wilcox wrote:
> On Wed, Sep 30, 2020 at 01:27:45PM +0300, Mike Rapoport wrote:
> > On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
> > > On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
> > > > On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
> > >
> > > > > It will drop them down to 4k pages. Given enough inodes, and allocating
> > > > > only a single sekrit page per pmd, we'll shatter the directmap into 4k.
> > > >
> > > > Why? Secretmem allocates PMD-size page per inode and uses it as a pool
> > > > of 4K pages for that inode. This way it ensures that
> > > > __kernel_map_pages() is always called on PMD boundaries.
> > >
> > > Oh, you unmap the 2m page upfront? I read it like you did the unmap at
> > > the sekrit page alloc, not the pool alloc side of things.
> > >
> > > Then yes, but then you're wasting gobs of memory. Basically you can pin
> > > 2M per inode while only accounting a single page.
> >
> > Right, quite like THP :)
>
> Huh? THP accounts every page it allocates. If you allocate 2MB,
> it accounts 512 pages.
I meant that secremem allocates 2M in advance like THP and not that it
similar because only page is accounted.
Anyway, the intention was to account the entrire 2M chunk (512 pages),
so I'll recheck the accounting and I'll fix it if I missed something.
> And THP are reclaimable by vmscan, this is obviously not.
True, this is more like mlock in that sense.
--
Sincerely yours,
Mike.
Hi Mike and Michael,
Ping. :)
Thanks,
Alex
On 2020-09-24 16:55, Alejandro Colomar wrote:
> * Mike Rapoport:
> > +.PP
> > +.IR Note :
> > +There is no glibc wrapper for this system call; see NOTES.
>
> You added a reference to NOTES, but then in notes there is nothing about
> it. I guess you wanted to add the following to NOTES (taken from
> membarrier.2):
>
> .PP
> Glibc does not provide a wrapper for this system call; call it using
> .BR syscall (2).
>
> Cheers,
>
> Alex
Hi Alex,
On Sat, Oct 03, 2020 at 11:32:43AM +0200, Alejandro Colomar wrote:
> Hi Mike and Michael,
I'll add the note to the man page, thanks!
> Ping. :)
>
> Thanks,
>
> Alex
>
> On 2020-09-24 16:55, Alejandro Colomar wrote:
> > * Mike Rapoport:
> > > +.PP
> > > +.IR Note :
> > > +There is no glibc wrapper for this system call; see NOTES.
> >
> > You added a reference to NOTES, but then in notes there is nothing about
> > it.? I guess you wanted to add the following to NOTES (taken from
> > membarrier.2):
> >
> > .PP
> > Glibc does not provide a wrapper for this system call; call it using
> > .BR syscall (2).
> >
> > Cheers,
> >
> > Alex
--
Sincerely yours,
Mike.
On Wed, Sep 30, 2020 at 08:11:28PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2020-09-30 at 13:35 +0300, Mike Rapoport wrote:
> >
> > Our thinking was that copy_*user() would work in the context of the
> > process that "owns" the secretmem and gup() would not allow access in
> > general, unless requested with certail (yet another) FOLL_ flag.
>
> Hmm, yes. I think one easier thing about this design over the series
> Kirill sent out is that the actual page will never transition to and
> from unmapped while it's mapped in userspace. If it could transition,
> you'd have to worry about a race window between
> get_user_pages(FOLL_foo) and the kmap() where the page might get
> unmapped.
>
> Without the ability to transition pages though, using this for KVM
> guests memory remains a not completely worked through problem since it
> has the get_user_pages()/kmap() pattern quite a bit. Did you have an
> idea for that? (I thought I saw that use case mentioned somewhere).
I've mentioned the KVM usecase because it was dicussed at the hallway
track at KVM Forum last year and also after looking at Kirill's patches
I though that "KVM protected" memory could be implemented on top of
secretmem. Can't say I have enough expertise in KVM to have a completely
worked through solution for that.
--
Sincerely yours,
Mike.
* Mike Rapoport | 2020-09-24 16:28:58 [+0300]:
>This is an implementation of "secret" mappings backed by a file descriptor.
>I've dropped the boot time reservation patch for now as it is not strictly
>required for the basic usage and can be easily added later either with or
>without CMA.
Isn't memfd_secret currently *unnecessarily* designed to be a "one task
feature"? memfd_secret fulfills exactly two (generic) features:
- address space isolation from kernel (aka SECRET_EXCLUSIVE, not in kernel's
direct map) - hide from kernel, great
- disabling processor's memory caches against speculative-execution vulnerabilities
(spectre and friends, aka SECRET_UNCACHED), also great
But, what about the following use-case: implementing a hardened IPC mechanism
where even the kernel is not aware of any data and optionally via SECRET_UNCACHED
even the hardware caches are bypassed! With the patches we are so close to
achieving this.
How? Shared, SECRET_EXCLUSIVE and SECRET_UNCACHED mmaped pages for IPC
involved tasks required to know this mapping (and memfd_secret fd). After IPC
is done, tasks can copy sensitive data from IPC pages into memfd_secret()
pages, un-sensitive data can be used/copied everywhere.
One missing piece is still the secure zeroization of the page(s) if the
mapping is closed by last process to guarantee a secure cleanup. This can
probably done as an general mmap feature, not coupled to memfd_secret() and
can be done independently ("reverse" MAP_UNINITIALIZED feature).
PS: thank you Mike for your effort!
See the following pseudo-code as an example:
// simple assume file-descriptor and mapping is inherited
// by child for simplicity, ptr is
int fd = memfd_secret(SECRETMEM_UNCACHED);
ftruncate(fd, PAGE_SIZE);
uint32_t *ptr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
pid_t pid_other;
void signal_handler(int sig)
{
// update IPC data on shared, uncachaed, exclusive mapped page
*ptr += 1;
// inform other
sleep(1);
kill(pid_other, SIGUSR1);
}
void ipc_loop(void)
{
signal(SIGUSR1, signal_handler);
while (1) {
sleep(1);
}
}
int main(void)
{
pid_t child_pid;
switch (child_pid = fork()) {
case 0:
pid_other = getppid();
break;
default:
pid_other = child_pid
break;
}
ipc_loop();
}
Hagen
On 24.09.20 15:28, Mike Rapoport wrote:
> From: Mike Rapoport <[email protected]>
>
> Hi,
>
> This is an implementation of "secret" mappings backed by a file descriptor.
> I've dropped the boot time reservation patch for now as it is not strictly
> required for the basic usage and can be easily added later either with or
> without CMA.
Hi Mike,
I'd like to stress again that I'd prefer *any* secretmem allocations
going via CMA as long as these pages are unmovable. The user can
allocate a non-significant amount of unmovable allocations only fenced
by the mlock limit, which behave very different to mlocked pages - they
are not movable for page compaction/migration.
Assume you have a system with quite some ZONE_MOVABLE memory (esp. in
virtualized environments), eating up a significant amount of
!ZONE_MOVABLE memory dynamically at runtime can lead to non-obvious
issues. It looks like you have plenty of free memory, but the kernel
might still OOM when trying to do kernel allocations e.g., for
pagetables. With CMA we at least know what we're dealing with - it
behaves like ZONE_MOVABLE except for the owner that can place unmovable
pages there. We can use it to compute statically the amount of
ZONE_MOVABLE memory we can have in the system without doing harm to the
system.
Ideally, we would want to support page migration/compaction and allow
for allocation from ZONE_MOVABLE as well. Would involve temporarily
mapping, copying, unmapping. Sounds feasible, but not sure which
roadblocks we would find on the way.
[...]
>
> The file descriptor backing secret memory mappings is created using a
> dedicated memfd_secret system call The desired protection mode for the
> memory is configured using flags parameter of the system call. The mmap()
> of the file descriptor created with memfd_secret() will create a "secret"
> memory mapping. The pages in that mapping will be marked as not present in
> the direct map and will have desired protection bits set in the user page
> table. For instance, current implementation allows uncached mappings.
>
> Although normally Linux userspace mappings are protected from other users,
> such secret mappings are useful for environments where a hostile tenant is
> trying to trick the kernel into giving them access to other tenants
> mappings.
>
> Additionally, the secret mappings may be used as a mean to protect guest
> memory in a virtual machine host.
>
> For demonstration of secret memory usage we've created a userspace library
> [1] that does two things: the first is act as a preloader for openssl to
> redirect all the OPENSSL_malloc calls to secret memory meaning any secret
> keys get automatically protected this way and the other thing it does is
> expose the API to the user who needs it. We anticipate that a lot of the
> use cases would be like the openssl one: many toolkits that deal with
> secret keys already have special handling for the memory to try to give
> them greater protection, so this would simply be pluggable into the
> toolkits without any need for user application modification.
>
> I've hesitated whether to continue to use new flags to memfd_create() or to
> add a new system call and I've decided to use a new system call after I've
> started to look into man pages update. There would have been two completely
> independent descriptions and I think it would have been very confusing.
This was also raised on lwn.net by "dullfire" [1]. I do wonder if it
would be the right place as well.
[1] https://lwn.net/Articles/835342/#Comments
>
> Hiding secret memory mappings behind an anonymous file allows (ab)use of
> the page cache for tracking pages allocated for the "secret" mappings as
> well as using address_space_operations for e.g. page migration callbacks.
>
> The anonymous file may be also used implicitly, like hugetlb files, to
> implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
> ABIs in the future.
>
> As the fragmentation of the direct map was one of the major concerns raised
> during the previous postings, I've added an amortizing cache of PMD-size
> pages to each file descriptor that is used as an allocation pool for the
> secret memory areas.
--
Thanks,
David / dhildenb
On 02.11.20 10:11, David Hildenbrand wrote:
> On 24.09.20 15:28, Mike Rapoport wrote:
>> From: Mike Rapoport <[email protected]>
>>
>> Hi,
>>
>> This is an implementation of "secret" mappings backed by a file descriptor.
>> I've dropped the boot time reservation patch for now as it is not strictly
>> required for the basic usage and can be easily added later either with or
>> without CMA.
>
> Hi Mike,
>
> I'd like to stress again that I'd prefer *any* secretmem allocations
> going via CMA as long as these pages are unmovable. The user can
> allocate a non-significant amount of unmovable allocations only fenced
lol, "non-neglectable" or "significant". Guess I need another coffee :)
--
Thanks,
David / dhildenb
On Sun, Nov 01, 2020 at 12:09:35PM +0100, Hagen Paul Pfeifer wrote:
> * Mike Rapoport | 2020-09-24 16:28:58 [+0300]:
>
> >This is an implementation of "secret" mappings backed by a file descriptor.
> >I've dropped the boot time reservation patch for now as it is not strictly
> >required for the basic usage and can be easily added later either with or
> >without CMA.
>
> Isn't memfd_secret currently *unnecessarily* designed to be a "one task
> feature"? memfd_secret fulfills exactly two (generic) features:
>
> - address space isolation from kernel (aka SECRET_EXCLUSIVE, not in kernel's
> direct map) - hide from kernel, great
> - disabling processor's memory caches against speculative-execution vulnerabilities
> (spectre and friends, aka SECRET_UNCACHED), also great
>
> But, what about the following use-case: implementing a hardened IPC mechanism
> where even the kernel is not aware of any data and optionally via SECRET_UNCACHED
> even the hardware caches are bypassed! With the patches we are so close to
> achieving this.
>
> How? Shared, SECRET_EXCLUSIVE and SECRET_UNCACHED mmaped pages for IPC
> involved tasks required to know this mapping (and memfd_secret fd). After IPC
> is done, tasks can copy sensitive data from IPC pages into memfd_secret()
> pages, un-sensitive data can be used/copied everywhere.
As long as the task share the file descriptor, they can share the
secretmem pages, pretty much like normal memfd.
> One missing piece is still the secure zeroization of the page(s) if the
> mapping is closed by last process to guarantee a secure cleanup. This can
> probably done as an general mmap feature, not coupled to memfd_secret() and
> can be done independently ("reverse" MAP_UNINITIALIZED feature).
There are "init_on_alloc" and "init_on_free" kernel parameters that
enable zeroing of the pages on alloc and on free globally.
Anyway, I'll add zeroing of the freed memory to secretmem.
> PS: thank you Mike for your effort!
>
> See the following pseudo-code as an example:
>
>
> // simple assume file-descriptor and mapping is inherited
> // by child for simplicity, ptr is
> int fd = memfd_secret(SECRETMEM_UNCACHED);
> ftruncate(fd, PAGE_SIZE);
> uint32_t *ptr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
The ptr here will be visible to both parent and child.
> pid_t pid_other;
>
> void signal_handler(int sig)
> {
> // update IPC data on shared, uncachaed, exclusive mapped page
> *ptr += 1;
> // inform other
> sleep(1);
> kill(pid_other, SIGUSR1);
> }
>
> void ipc_loop(void)
> {
> signal(SIGUSR1, signal_handler);
> while (1) {
> sleep(1);
> }
> }
>
> int main(void)
> {
> pid_t child_pid;
>
> switch (child_pid = fork()) {
> case 0:
> pid_other = getppid();
> break;
> default:
> pid_other = child_pid
> break;
> }
>
> ipc_loop();
> }
>
>
> Hagen
>
--
Sincerely yours,
Mike.
On Mon, Nov 02, 2020 at 10:11:12AM +0100, David Hildenbrand wrote:
> On 24.09.20 15:28, Mike Rapoport wrote:
> > From: Mike Rapoport <[email protected]>
> >
> > Hi,
> >
> > This is an implementation of "secret" mappings backed by a file descriptor.
> > I've dropped the boot time reservation patch for now as it is not strictly
> > required for the basic usage and can be easily added later either with or
> > without CMA.
>
> Hi Mike,
>
> I'd like to stress again that I'd prefer *any* secretmem allocations going
> via CMA as long as these pages are unmovable. The user can allocate a
> non-significant amount of unmovable allocations only fenced by the mlock
> limit, which behave very different to mlocked pages - they are not movable
> for page compaction/migration.
>
> Assume you have a system with quite some ZONE_MOVABLE memory (esp. in
> virtualized environments), eating up a significant amount of !ZONE_MOVABLE
> memory dynamically at runtime can lead to non-obvious issues. It looks like
> you have plenty of free memory, but the kernel might still OOM when trying
> to do kernel allocations e.g., for pagetables. With CMA we at least know
> what we're dealing with - it behaves like ZONE_MOVABLE except for the owner
> that can place unmovable pages there. We can use it to compute statically
> the amount of ZONE_MOVABLE memory we can have in the system without doing
> harm to the system.
Why would you say that secretmem allocates from !ZONE_MOVABLE?
If we put boot time reservations aside, the memory allocation for
secretmem follows the same rules as the memory allocations for any file
descriptor. That means we allocate memory with GFP_HIGHUSER_MOVABLE.
After the allocation the memory indeed becomes unmovable but it's not
like we are eating memory from other zones here.
Maybe I'm missing something, but it seems to me that using CMA for any
secretmem allocation would needlessly complicate things.
> Ideally, we would want to support page migration/compaction and allow for
> allocation from ZONE_MOVABLE as well. Would involve temporarily mapping,
> copying, unmapping. Sounds feasible, but not sure which roadblocks we would
> find on the way.
We can support migration/compaction with temporary mapping. The first
roadblock I've hit there was that migration allocates 4K destination
page and if we use it in secret map we are back to scrambling the direct
map into 4K pieces. It still sounds feasible but not as trivial :)
But again, there is nothing in the current form of secretmem that
prevents allocation from ZONE_MOVABLE.
> [...]
>
> > I've hesitated whether to continue to use new flags to memfd_create() or to
> > add a new system call and I've decided to use a new system call after I've
> > started to look into man pages update. There would have been two completely
> > independent descriptions and I think it would have been very confusing.
>
> This was also raised on lwn.net by "dullfire" [1]. I do wonder if it would
> be the right place as well.
I lean towards a dedicated syscall because, as I said, to me it would
seem less confusing.
> [1] https://lwn.net/Articles/835342/#Comments
>
> >
> > Hiding secret memory mappings behind an anonymous file allows (ab)use of
> > the page cache for tracking pages allocated for the "secret" mappings as
> > well as using address_space_operations for e.g. page migration callbacks.
> >
> > The anonymous file may be also used implicitly, like hugetlb files, to
> > implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
> > ABIs in the future.
> >
> > As the fragmentation of the direct map was one of the major concerns raised
> > during the previous postings, I've added an amortizing cache of PMD-size
> > pages to each file descriptor that is used as an allocation pool for the
> > secret memory areas.
--
Sincerely yours,
Mike.
>> Assume you have a system with quite some ZONE_MOVABLE memory (esp. in
>> virtualized environments), eating up a significant amount of !ZONE_MOVABLE
>> memory dynamically at runtime can lead to non-obvious issues. It looks like
>> you have plenty of free memory, but the kernel might still OOM when trying
>> to do kernel allocations e.g., for pagetables. With CMA we at least know
>> what we're dealing with - it behaves like ZONE_MOVABLE except for the owner
>> that can place unmovable pages there. We can use it to compute statically
>> the amount of ZONE_MOVABLE memory we can have in the system without doing
>> harm to the system.
>
> Why would you say that secretmem allocates from !ZONE_MOVABLE?
> If we put boot time reservations aside, the memory allocation for
> secretmem follows the same rules as the memory allocations for any file
> descriptor. That means we allocate memory with GFP_HIGHUSER_MOVABLE.
Oh, okay - I missed that! I had the impression that pages are unmovable
and allocating from ZONE_MOVABLE would be a violation of that?
> After the allocation the memory indeed becomes unmovable but it's not
> like we are eating memory from other zones here.
... and here you have your problem. That's a no-no. We only allow it in
very special cases where it can't be avoided - e.g., vfio having to pin
guest memory when passing through memory to VMs.
Hotplug memory, online it to ZONE_MOVABLE. Allocate secretmem. Try to
unplug the memory again -> endless loop in offline_pages().
Or have a CMA area that gets used with GFP_HIGHUSER_MOVABLE. Allocate
secretmem. The owner of the area tries to allocate memory - always
fails. Purpose of CMA destroyed.
>
>> Ideally, we would want to support page migration/compaction and allow for
>> allocation from ZONE_MOVABLE as well. Would involve temporarily mapping,
>> copying, unmapping. Sounds feasible, but not sure which roadblocks we would
>> find on the way.
>
> We can support migration/compaction with temporary mapping. The first
> roadblock I've hit there was that migration allocates 4K destination
> page and if we use it in secret map we are back to scrambling the direct
> map into 4K pieces. It still sounds feasible but not as trivial :)
That sounds like the proper way for me to do it then.
>
> But again, there is nothing in the current form of secretmem that
> prevents allocation from ZONE_MOVABLE.
Oh, there is something: That the pages are not movable.
--
Thanks,
David / dhildenb
On Mon, Nov 02, 2020 at 06:51:09PM +0100, David Hildenbrand wrote:
> > > Assume you have a system with quite some ZONE_MOVABLE memory (esp. in
> > > virtualized environments), eating up a significant amount of !ZONE_MOVABLE
> > > memory dynamically at runtime can lead to non-obvious issues. It looks like
> > > you have plenty of free memory, but the kernel might still OOM when trying
> > > to do kernel allocations e.g., for pagetables. With CMA we at least know
> > > what we're dealing with - it behaves like ZONE_MOVABLE except for the owner
> > > that can place unmovable pages there. We can use it to compute statically
> > > the amount of ZONE_MOVABLE memory we can have in the system without doing
> > > harm to the system.
> >
> > Why would you say that secretmem allocates from !ZONE_MOVABLE?
> > If we put boot time reservations aside, the memory allocation for
> > secretmem follows the same rules as the memory allocations for any file
> > descriptor. That means we allocate memory with GFP_HIGHUSER_MOVABLE.
>
> Oh, okay - I missed that! I had the impression that pages are unmovable and
> allocating from ZONE_MOVABLE would be a violation of that?
>
> > After the allocation the memory indeed becomes unmovable but it's not
> > like we are eating memory from other zones here.
>
> ... and here you have your problem. That's a no-no. We only allow it in very
> special cases where it can't be avoided - e.g., vfio having to pin guest
> memory when passing through memory to VMs.
>
> Hotplug memory, online it to ZONE_MOVABLE. Allocate secretmem. Try to unplug
> the memory again -> endless loop in offline_pages().
>
> Or have a CMA area that gets used with GFP_HIGHUSER_MOVABLE. Allocate
> secretmem. The owner of the area tries to allocate memory - always fails.
> Purpose of CMA destroyed.
>
> >
> > > Ideally, we would want to support page migration/compaction and allow for
> > > allocation from ZONE_MOVABLE as well. Would involve temporarily mapping,
> > > copying, unmapping. Sounds feasible, but not sure which roadblocks we would
> > > find on the way.
> >
> > We can support migration/compaction with temporary mapping. The first
> > roadblock I've hit there was that migration allocates 4K destination
> > page and if we use it in secret map we are back to scrambling the direct
> > map into 4K pieces. It still sounds feasible but not as trivial :)
>
> That sounds like the proper way for me to do it then.
Although migration of secretmem pages sounds feasible now, there maybe
other issues I didn't see because I'm not very familiar with
migration/compaction code.
I've looked again at CMA and I'm inclined to agree with you that using
CMA for secretmem allocations could be the right thing.
--
Sincerely yours,
Mike.
On 03.11.20 10:52, Mike Rapoport wrote:
> On Mon, Nov 02, 2020 at 06:51:09PM +0100, David Hildenbrand wrote:
>>>> Assume you have a system with quite some ZONE_MOVABLE memory (esp. in
>>>> virtualized environments), eating up a significant amount of !ZONE_MOVABLE
>>>> memory dynamically at runtime can lead to non-obvious issues. It looks like
>>>> you have plenty of free memory, but the kernel might still OOM when trying
>>>> to do kernel allocations e.g., for pagetables. With CMA we at least know
>>>> what we're dealing with - it behaves like ZONE_MOVABLE except for the owner
>>>> that can place unmovable pages there. We can use it to compute statically
>>>> the amount of ZONE_MOVABLE memory we can have in the system without doing
>>>> harm to the system.
>>>
>>> Why would you say that secretmem allocates from !ZONE_MOVABLE?
>>> If we put boot time reservations aside, the memory allocation for
>>> secretmem follows the same rules as the memory allocations for any file
>>> descriptor. That means we allocate memory with GFP_HIGHUSER_MOVABLE.
>>
>> Oh, okay - I missed that! I had the impression that pages are unmovable and
>> allocating from ZONE_MOVABLE would be a violation of that?
>>
>>> After the allocation the memory indeed becomes unmovable but it's not
>>> like we are eating memory from other zones here.
>>
>> ... and here you have your problem. That's a no-no. We only allow it in very
>> special cases where it can't be avoided - e.g., vfio having to pin guest
>> memory when passing through memory to VMs.
>>
>> Hotplug memory, online it to ZONE_MOVABLE. Allocate secretmem. Try to unplug
>> the memory again -> endless loop in offline_pages().
>>
>> Or have a CMA area that gets used with GFP_HIGHUSER_MOVABLE. Allocate
>> secretmem. The owner of the area tries to allocate memory - always fails.
>> Purpose of CMA destroyed.
>>
>>>
>>>> Ideally, we would want to support page migration/compaction and allow for
>>>> allocation from ZONE_MOVABLE as well. Would involve temporarily mapping,
>>>> copying, unmapping. Sounds feasible, but not sure which roadblocks we would
>>>> find on the way.
>>>
>>> We can support migration/compaction with temporary mapping. The first
>>> roadblock I've hit there was that migration allocates 4K destination
>>> page and if we use it in secret map we are back to scrambling the direct
>>> map into 4K pieces. It still sounds feasible but not as trivial :)
>>
>> That sounds like the proper way for me to do it then.
>
> Although migration of secretmem pages sounds feasible now, there maybe
> other issues I didn't see because I'm not very familiar with
> migration/compaction code.
Migration of PMDs might also be feasible - and it would be even
cleaner. But I agree that that might require more work and starting with
something simpler (!movable) is the right way to move forward.
--
Thanks,
David / dhildenb
> On 11/02/2020 4:40 PM Mike Rapoport <[email protected]> wrote:
> > Isn't memfd_secret currently *unnecessarily* designed to be a "one task
> > feature"? memfd_secret fulfills exactly two (generic) features:
> >
> > - address space isolation from kernel (aka SECRET_EXCLUSIVE, not in kernel's
> > direct map) - hide from kernel, great
> > - disabling processor's memory caches against speculative-execution vulnerabilities
> > (spectre and friends, aka SECRET_UNCACHED), also great
> >
> > But, what about the following use-case: implementing a hardened IPC mechanism
> > where even the kernel is not aware of any data and optionally via SECRET_UNCACHED
> > even the hardware caches are bypassed! With the patches we are so close to
> > achieving this.
> >
> > How? Shared, SECRET_EXCLUSIVE and SECRET_UNCACHED mmaped pages for IPC
> > involved tasks required to know this mapping (and memfd_secret fd). After IPC
> > is done, tasks can copy sensitive data from IPC pages into memfd_secret()
> > pages, un-sensitive data can be used/copied everywhere.
>
> As long as the task share the file descriptor, they can share the
> secretmem pages, pretty much like normal memfd.
Including process_vm_readv() and process_vm_writev()? Let's take a hypothetical
"dbus-daemon-secure" service that receives data from process A and wants to
copy/distribute it to data areas of N other processes. Much like dbus but without
SOCK_DGRAM rather direct copy into secretmem/mmap pages (ring-buffer). Should be
possible, right?
> > One missing piece is still the secure zeroization of the page(s) if the
> > mapping is closed by last process to guarantee a secure cleanup. This can
> > probably done as an general mmap feature, not coupled to memfd_secret() and
> > can be done independently ("reverse" MAP_UNINITIALIZED feature).
>
> There are "init_on_alloc" and "init_on_free" kernel parameters that
> enable zeroing of the pages on alloc and on free globally.
> Anyway, I'll add zeroing of the freed memory to secretmem.
Great, this allows page-specific (thus runtime-performance-optimized) zeroing
of secured pages. init_on_free lowers the performance to much and is not precice
enough.
Hagen
On Tue, Nov 03, 2020 at 02:52:14PM +0100, Hagen Paul Pfeifer wrote:
> > On 11/02/2020 4:40 PM Mike Rapoport <[email protected]> wrote:
>
> > > Isn't memfd_secret currently *unnecessarily* designed to be a "one task
> > > feature"? memfd_secret fulfills exactly two (generic) features:
> > >
> > > - address space isolation from kernel (aka SECRET_EXCLUSIVE, not in kernel's
> > > direct map) - hide from kernel, great
> > > - disabling processor's memory caches against speculative-execution vulnerabilities
> > > (spectre and friends, aka SECRET_UNCACHED), also great
> > >
> > > But, what about the following use-case: implementing a hardened IPC mechanism
> > > where even the kernel is not aware of any data and optionally via SECRET_UNCACHED
> > > even the hardware caches are bypassed! With the patches we are so close to
> > > achieving this.
> > >
> > > How? Shared, SECRET_EXCLUSIVE and SECRET_UNCACHED mmaped pages for IPC
> > > involved tasks required to know this mapping (and memfd_secret fd). After IPC
> > > is done, tasks can copy sensitive data from IPC pages into memfd_secret()
> > > pages, un-sensitive data can be used/copied everywhere.
> >
> > As long as the task share the file descriptor, they can share the
> > secretmem pages, pretty much like normal memfd.
>
> Including process_vm_readv() and process_vm_writev()? Let's take a hypothetical
> "dbus-daemon-secure" service that receives data from process A and wants to
> copy/distribute it to data areas of N other processes. Much like dbus but without
> SOCK_DGRAM rather direct copy into secretmem/mmap pages (ring-buffer). Should be
> possible, right?
I'm not sure I follow you here.
For process_vm_readv() and process_vm_writev() secremem will be only
accessible on the local part, but not on the remote.
So copying data to secretmem pages using process_vm_writev wouldn't
work.
> > > One missing piece is still the secure zeroization of the page(s) if the
> > > mapping is closed by last process to guarantee a secure cleanup. This can
> > > probably done as an general mmap feature, not coupled to memfd_secret() and
> > > can be done independently ("reverse" MAP_UNINITIALIZED feature).
> >
> > There are "init_on_alloc" and "init_on_free" kernel parameters that
> > enable zeroing of the pages on alloc and on free globally.
> > Anyway, I'll add zeroing of the freed memory to secretmem.
>
> Great, this allows page-specific (thus runtime-performance-optimized) zeroing
> of secured pages. init_on_free lowers the performance to much and is not precice
> enough.
>
> Hagen
--
Sincerely yours,
Mike.
> On 11/03/2020 5:30 PM Mike Rapoport <[email protected]> wrote:
>
> > > As long as the task share the file descriptor, they can share the
> > > secretmem pages, pretty much like normal memfd.
> >
> > Including process_vm_readv() and process_vm_writev()? Let's take a hypothetical
> > "dbus-daemon-secure" service that receives data from process A and wants to
> > copy/distribute it to data areas of N other processes. Much like dbus but without
> > SOCK_DGRAM rather direct copy into secretmem/mmap pages (ring-buffer). Should be
> > possible, right?
>
> I'm not sure I follow you here.
> For process_vm_readv() and process_vm_writev() secremem will be only
> accessible on the local part, but not on the remote.
> So copying data to secretmem pages using process_vm_writev wouldn't
> work.
A hypothetical "dbus-daemon-secure" service will not be *process related* with communication
peers. E.g. a password-input process (reading a password into secured-memory page) will
transfer the password to dbus-daemon-secure and this service will hand-over the password to
two additional applications: a IPsec process on CPU0 und CPU1 (which itself use a
secured-memory page).
So four applications IPC chain:
password-input -> dbus-daemon-secure -> {IPsec0, IPsec1}
- password-input: uses a secured page to read/save the password locally after reading from TTY
- dbus-daemon-secure: uses a secured page for IPC (legitimate user can write and read into the secured page)
- IPSecN has secured page to save the password locally (and probably other data as well), IPC memory is memset'ed after copy
Goal: the whole password is never saved/touched on non secured pages during IPC transfer.
Question: maybe a *file-descriptor passing* mechanism can do the trick? I.e. dbus-daemon-secure
allocates via memfd_secret/mmap secure pages and permitted processes will get the descriptor/mmaped-page
passed so they can use the pages directly?
Hagen
On Wed, Nov 04, 2020 at 12:39:13PM +0100, Hagen Paul Pfeifer wrote:
> > On 11/03/2020 5:30 PM Mike Rapoport <[email protected]> wrote:
> >
> > > > As long as the task share the file descriptor, they can share the
> > > > secretmem pages, pretty much like normal memfd.
> > >
> > > Including process_vm_readv() and process_vm_writev()? Let's take a hypothetical
> > > "dbus-daemon-secure" service that receives data from process A and wants to
> > > copy/distribute it to data areas of N other processes. Much like dbus but without
> > > SOCK_DGRAM rather direct copy into secretmem/mmap pages (ring-buffer). Should be
> > > possible, right?
> >
> > I'm not sure I follow you here.
> > For process_vm_readv() and process_vm_writev() secremem will be only
> > accessible on the local part, but not on the remote.
> > So copying data to secretmem pages using process_vm_writev wouldn't
> > work.
>
> A hypothetical "dbus-daemon-secure" service will not be *process related* with communication
> peers. E.g. a password-input process (reading a password into secured-memory page) will
> transfer the password to dbus-daemon-secure and this service will hand-over the password to
> two additional applications: a IPsec process on CPU0 und CPU1 (which itself use a
> secured-memory page).
>
> So four applications IPC chain:
> password-input -> dbus-daemon-secure -> {IPsec0, IPsec1}
>
> - password-input: uses a secured page to read/save the password locally after reading from TTY
> - dbus-daemon-secure: uses a secured page for IPC (legitimate user can write and read into the secured page)
> - IPSecN has secured page to save the password locally (and probably other data as well), IPC memory is memset'ed after copy
>
> Goal: the whole password is never saved/touched on non secured pages during IPC transfer.
>
> Question: maybe a *file-descriptor passing* mechanism can do the trick? I.e. dbus-daemon-secure
> allocates via memfd_secret/mmap secure pages and permitted processes will get the descriptor/mmaped-page
> passed so they can use the pages directly?
Yes, this will work. The processes that share the memfd_secret file
descriptor will have access to the same memory pages, pretty much like
with shared memory.
> Hagen
--
Sincerely yours,
Mike.
> On 11/04/2020 6:02 PM Mike Rapoport <[email protected]> wrote:
>
> Yes, this will work. The processes that share the memfd_secret file
> descriptor will have access to the same memory pages, pretty much like
> with shared memory.
Perfect!
Acked-by: Hagen Paul Pfeifer <[email protected]>
Thank you for the effort Mike, if zeroize feature will also included it will
be great! The memset-all-pages after use is just overkill, a dedicated flag for
memfd_secret (or mmap) would be superior.
Hagen
From: Mike Rapoport <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
Cowritten-by: Alejandro Colomar <[email protected]>
Acked-by: Alejandro Colomar <[email protected]>
Signed-off-by: Alejandro Colomar <[email protected]>
---
Hi Mike,
I added that note about not having a wrapper,
fixed a few minor formatting and wording issues,
and sorted ERRORS alphabetically.
Cheers,
Alex
man2/memfd_secret.2 | 178 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 178 insertions(+)
create mode 100644 man2/memfd_secret.2
diff --git a/man2/memfd_secret.2 b/man2/memfd_secret.2
new file mode 100644
index 000000000..4e617aa0e
--- /dev/null
+++ b/man2/memfd_secret.2
@@ -0,0 +1,178 @@
+.\" Copyright (c) 2020, IBM Corporation.
+.\" Written by Mike Rapoport <[email protected]>
+.\"
+.\" Based on memfd_create(2) man page
+.\" Copyright (C) 2014 Michael Kerrisk <[email protected]>
+.\" and Copyright (C) 2014 David Herrmann <[email protected]>
+.\"
+.\" %%%LICENSE_START(GPLv2+)
+.\"
+.\" This program is free software; you can redistribute it and/or modify
+.\" it under the terms of the GNU General Public License as published by
+.\" the Free Software Foundation; either version 2 of the License, or
+.\" (at your option) any later version.
+.\"
+.\" This program is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public
+.\" License along with this manual; if not, see
+.\" <http://www.gnu.org/licenses/>.
+.\" %%%LICENSE_END
+.\"
+.TH MEMFD_SECRET 2 2020-08-02 Linux "Linux Programmer's Manual"
+.SH NAME
+memfd_secret \- create an anonymous file to map secret memory regions
+.SH SYNOPSIS
+.nf
+.B #include <linux/secretmem.h>
+.PP
+.BI "int memfd_secret(unsigned long " flags ");"
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call; see NOTES.
+.SH DESCRIPTION
+.BR memfd_secret ()
+creates an anonymous file and returns a file descriptor that refers to it.
+The file can only be memory-mapped;
+the memory in such mapping
+will have stronger protection than usual memory mapped files,
+and so it can be used to store application secrets.
+Unlike a regular file, a file created with
+.BR memfd_secret ()
+lives in RAM and has a volatile backing storage.
+Once all references to the file are dropped, it is automatically released.
+The initial size of the file is set to 0.
+Following the call, the file size should be set using
+.BR ftruncate (2).
+.PP
+The memory areas obtained with
+.BR mmap (2)
+from the file descriptor are exclusive to the owning context.
+These areas are removed from the kernel page tables
+and only the page table of the process holding the file descriptor
+maps the corresponding physical memory.
+.PP
+The following values may be bitwise ORed in
+.IR flags
+to control the behavior of
+.BR memfd_secret (2):
+.TP
+.BR FD_CLOEXEC
+Set the close-on-exec flag on the new file descriptor.
+See the description of the
+.B O_CLOEXEC
+flag in
+.BR open (2)
+for reasons why this may be useful.
+.PP
+.TP
+.BR SECRETMEM_UNCACHED
+In addition to excluding memory areas from the kernel page tables,
+mark the memory mappings uncached in the page table of the owning process.
+Such mappings can be used to prevent speculative loads
+and cache-based side channels.
+This mode of
+.BR memfd_secret ()
+is not supported on all architectures.
+.PP
+See also NOTES below.
+.PP
+As its return value,
+.BR memfd_secret ()
+returns a new file descriptor that can be used to refer to an anonymous file.
+This file descriptor is opened for both reading and writing
+.RB ( O_RDWR )
+and
+.B O_LARGEFILE
+is set for the file descriptor.
+.PP
+With respect to
+.BR fork (2)
+and
+.BR execve (2),
+the usual semantics apply for the file descriptor created by
+.BR memfd_secret ().
+A copy of the file descriptor is inherited by the child produced by
+.BR fork (2)
+and refers to the same file.
+The file descriptor is preserved across
+.BR execve (2),
+unless the close-on-exec flag has been set.
+.PP
+The memory regions backed with
+.BR memfd_secret ()
+are locked in the same way as
+.BR mlock (2),
+however the implementation will not try to
+populate the whole range during the
+.BR mmap ()
+call.
+The amount of memory allowed for memory mappings
+of the file descriptor obeys the same rules as
+.BR mlock (2)
+and cannot exceed
+.BR RLIMIT_MEMLOCK .
+.SH RETURN VALUE
+On success,
+.BR memfd_secret ()
+returns a new file descriptor.
+On error, \-1 is returned and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EINVAL
+.I flags
+included unknown bits.
+.TP
+.B EMFILE
+The per-process limit on the number of open file descriptors has been reached.
+.TP
+.B EMFILE
+The system-wide limit on the total number of open files has been reached.
+.TP
+.B ENOMEM
+There was insufficient memory to create a new anonymous file.
+.TP
+.B ENOSYS
+.BR memfd_secret ()
+is not implemented on this architecture.
+.SH VERSIONS
+The
+.BR memfd_secret (2)
+system call first appeared in Linux 5.X;
+.SH CONFORMING TO
+The
+.BR memfd_secret (2)
+system call is Linux-specific.
+.SH NOTES
+The
+.BR memfd_secret (2)
+system call provides an ability to hide information
+from the operating system.
+Normally Linux userspace mappings are protected from other users,
+but they are visible to privileged code.
+The mappings created using
+.BR memfd_secret ()
+are hidden from the kernel as well.
+.PP
+If an architecture supports
+.BR SECRETMEM_UNCACHED ,
+the mappings also have protection from speculative execution vulnerabilties,
+at the expense of increased memory access latency.
+Care should be taken when using
+.B SECRETMEM_UNCACHED
+to avoid degrading application performance.
+.PP
+Glibc does not provide a wrapper for this system call; call it using
+.BR syscall (2).
+.SH SEE ALSO
+.BR fcntl (2),
+.BR ftruncate (2),
+.BR mlock (2),
+.BR mmap (2),
+.BR setrlimit (2)
--
2.29.2
On Mon, Nov 16, 2020 at 10:01:37PM +0100, Alejandro Colomar wrote:
> From: Mike Rapoport <[email protected]>
>
> Signed-off-by: Mike Rapoport <[email protected]>
> Cowritten-by: Alejandro Colomar <[email protected]>
> Acked-by: Alejandro Colomar <[email protected]>
> Signed-off-by: Alejandro Colomar <[email protected]>
> ---
>
> Hi Mike,
>
> I added that note about not having a wrapper,
> fixed a few minor formatting and wording issues,
> and sorted ERRORS alphabetically.
Thanks, Alejandro!
> Cheers,
>
> Alex
>
> man2/memfd_secret.2 | 178 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 178 insertions(+)
> create mode 100644 man2/memfd_secret.2
>
> diff --git a/man2/memfd_secret.2 b/man2/memfd_secret.2
> new file mode 100644
> index 000000000..4e617aa0e
> --- /dev/null
> +++ b/man2/memfd_secret.2
> @@ -0,0 +1,178 @@
> +.\" Copyright (c) 2020, IBM Corporation.
> +.\" Written by Mike Rapoport <[email protected]>
> +.\"
> +.\" Based on memfd_create(2) man page
> +.\" Copyright (C) 2014 Michael Kerrisk <[email protected]>
> +.\" and Copyright (C) 2014 David Herrmann <[email protected]>
> +.\"
> +.\" %%%LICENSE_START(GPLv2+)
> +.\"
> +.\" This program is free software; you can redistribute it and/or modify
> +.\" it under the terms of the GNU General Public License as published by
> +.\" the Free Software Foundation; either version 2 of the License, or
> +.\" (at your option) any later version.
> +.\"
> +.\" This program is distributed in the hope that it will be useful,
> +.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
> +.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +.\" GNU General Public License for more details.
> +.\"
> +.\" You should have received a copy of the GNU General Public
> +.\" License along with this manual; if not, see
> +.\" <http://www.gnu.org/licenses/>.
> +.\" %%%LICENSE_END
> +.\"
> +.TH MEMFD_SECRET 2 2020-08-02 Linux "Linux Programmer's Manual"
> +.SH NAME
> +memfd_secret \- create an anonymous file to map secret memory regions
> +.SH SYNOPSIS
> +.nf
> +.B #include <linux/secretmem.h>
> +.PP
> +.BI "int memfd_secret(unsigned long " flags ");"
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call; see NOTES.
> +.SH DESCRIPTION
> +.BR memfd_secret ()
> +creates an anonymous file and returns a file descriptor that refers to it.
> +The file can only be memory-mapped;
> +the memory in such mapping
> +will have stronger protection than usual memory mapped files,
> +and so it can be used to store application secrets.
> +Unlike a regular file, a file created with
> +.BR memfd_secret ()
> +lives in RAM and has a volatile backing storage.
> +Once all references to the file are dropped, it is automatically released.
> +The initial size of the file is set to 0.
> +Following the call, the file size should be set using
> +.BR ftruncate (2).
> +.PP
> +The memory areas obtained with
> +.BR mmap (2)
> +from the file descriptor are exclusive to the owning context.
> +These areas are removed from the kernel page tables
> +and only the page table of the process holding the file descriptor
> +maps the corresponding physical memory.
> +.PP
> +The following values may be bitwise ORed in
> +.IR flags
> +to control the behavior of
> +.BR memfd_secret (2):
> +.TP
> +.BR FD_CLOEXEC
> +Set the close-on-exec flag on the new file descriptor.
> +See the description of the
> +.B O_CLOEXEC
> +flag in
> +.BR open (2)
> +for reasons why this may be useful.
> +.PP
> +.TP
> +.BR SECRETMEM_UNCACHED
> +In addition to excluding memory areas from the kernel page tables,
> +mark the memory mappings uncached in the page table of the owning process.
> +Such mappings can be used to prevent speculative loads
> +and cache-based side channels.
> +This mode of
> +.BR memfd_secret ()
> +is not supported on all architectures.
> +.PP
> +See also NOTES below.
> +.PP
> +As its return value,
> +.BR memfd_secret ()
> +returns a new file descriptor that can be used to refer to an anonymous file.
> +This file descriptor is opened for both reading and writing
> +.RB ( O_RDWR )
> +and
> +.B O_LARGEFILE
> +is set for the file descriptor.
> +.PP
> +With respect to
> +.BR fork (2)
> +and
> +.BR execve (2),
> +the usual semantics apply for the file descriptor created by
> +.BR memfd_secret ().
> +A copy of the file descriptor is inherited by the child produced by
> +.BR fork (2)
> +and refers to the same file.
> +The file descriptor is preserved across
> +.BR execve (2),
> +unless the close-on-exec flag has been set.
> +.PP
> +The memory regions backed with
> +.BR memfd_secret ()
> +are locked in the same way as
> +.BR mlock (2),
> +however the implementation will not try to
> +populate the whole range during the
> +.BR mmap ()
> +call.
> +The amount of memory allowed for memory mappings
> +of the file descriptor obeys the same rules as
> +.BR mlock (2)
> +and cannot exceed
> +.BR RLIMIT_MEMLOCK .
> +.SH RETURN VALUE
> +On success,
> +.BR memfd_secret ()
> +returns a new file descriptor.
> +On error, \-1 is returned and
> +.I errno
> +is set to indicate the error.
> +.SH ERRORS
> +.TP
> +.B EINVAL
> +.I flags
> +included unknown bits.
> +.TP
> +.B EMFILE
> +The per-process limit on the number of open file descriptors has been reached.
> +.TP
> +.B EMFILE
> +The system-wide limit on the total number of open files has been reached.
> +.TP
> +.B ENOMEM
> +There was insufficient memory to create a new anonymous file.
> +.TP
> +.B ENOSYS
> +.BR memfd_secret ()
> +is not implemented on this architecture.
> +.SH VERSIONS
> +The
> +.BR memfd_secret (2)
> +system call first appeared in Linux 5.X;
> +.SH CONFORMING TO
> +The
> +.BR memfd_secret (2)
> +system call is Linux-specific.
> +.SH NOTES
> +The
> +.BR memfd_secret (2)
> +system call provides an ability to hide information
> +from the operating system.
> +Normally Linux userspace mappings are protected from other users,
> +but they are visible to privileged code.
> +The mappings created using
> +.BR memfd_secret ()
> +are hidden from the kernel as well.
> +.PP
> +If an architecture supports
> +.BR SECRETMEM_UNCACHED ,
> +the mappings also have protection from speculative execution vulnerabilties,
> +at the expense of increased memory access latency.
> +Care should be taken when using
> +.B SECRETMEM_UNCACHED
> +to avoid degrading application performance.
> +.PP
> +Glibc does not provide a wrapper for this system call; call it using
> +.BR syscall (2).
> +.SH SEE ALSO
> +.BR fcntl (2),
> +.BR ftruncate (2),
> +.BR mlock (2),
> +.BR mmap (2),
> +.BR setrlimit (2)
> --
> 2.29.2
>
--
Sincerely yours,
Mike.