2021-02-08 08:58:24

Subject: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas

From: Mike Rapoport <[email protected]>

Hi,

@Andrew, this is based on v5.11-rc5-mmotm-2021-01-27-23-30, with secretmem
and related patches dropped from there. I can rebase whichever way you
prefer.

This is an implementation of "secret" mappings backed by a file descriptor.

The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call. The desired protection mode for the
memory is configured using the flags parameter of the system call. An mmap()
of the file descriptor created with memfd_secret() creates a "secret"
memory mapping. The pages in that mapping are marked as not present in
the direct map and are present only in the page table of the owning mm.
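
The flow described above (create the descriptor, size it, mmap() it) could
look roughly like the userspace sketch below. It is only an illustration
under stated assumptions: the helper names secret_alloc() and secret_free()
are hypothetical, the fallback syscall number 447 is the value used by later
mainline kernels and is not defined by this posting, and the use of
MAP_SHARED follows the selftest in this series as far as I can tell. The
call fails cleanly when the kernel lacks the syscall or has secretmem
disabled.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Assumption: 447 is the number memfd_secret has in later mainline
 * kernels. Check your own headers before relying on it.
 */
#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447
#endif

/*
 * Hypothetical helper: map len bytes of secret memory.
 * Returns NULL with errno set if any step fails, for example ENOSYS
 * when the syscall is not built in or not enabled at boot.
 */
void *secret_alloc(size_t len)
{
	int fd, saved;
	void *p;

	assert(len > 0);

	fd = (int)syscall(SYS_memfd_secret, 0UL);
	if (fd < 0)
		return NULL;

	/* The new file is empty, so give it a size before mapping it. */
	if (ftruncate(fd, (off_t)len) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return NULL;
	}

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	saved = errno;
	close(fd);	/* the mapping keeps the file alive */
	if (p == MAP_FAILED) {
		errno = saved;
		return NULL;
	}
	return p;
}

/* Hypothetical helper: wipe the area, then unmap it. */
int secret_free(void *p, size_t len)
{
	memset(p, 0, len);
	return munmap(p, len);
}
```

A caller would treat a NULL return as "secret memory unavailable" and fall
back to ordinary allocation or refuse to handle the key material. The
mapping is expected to count against RLIMIT_MEMLOCK, so large requests may
fail even on a kernel with secretmem enabled.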

Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants'
mappings.

Additionally, in the future the secret mappings may be used as a means to
protect guest memory in a virtual machine host.

To demonstrate secret memory usage we've created a userspace library

https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git

that does two things. First, it acts as a preloader for openssl that
redirects all OPENSSL_malloc calls to secret memory, so any secret keys
are automatically protected. Second, it exposes the API to users who need
it. We anticipate that many use cases will resemble the openssl one: many
toolkits that deal with secret keys already have special handling for the
memory to try to give the keys greater protection, so this would simply be
pluggable into the toolkits without any need for user application
modification.

Hiding secret memory mappings behind an anonymous file allows usage of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.

The anonymous file may also be used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs in the future.

Removing pages from the direct map may fragment it on architectures that
use large pages to map the physical memory, which affects system
performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.

In addition, there is also a long term goal to improve management of the
direct map.

[1] https://lore.kernel.org/linux-mm/[email protected]/

v17:
* Remove pool of large pages backing secretmem allocations, per Michal Hocko
* Add secretmem pages to unevictable LRU, per Michal Hocko
* Use GFP_HIGHUSER as secretmem mapping mask, per Michal Hocko
* Make secretmem an opt-in feature that is disabled by default

v16:
* Fix memory leak introduced in v15
* Clean the data left from previous page user before handing the page to
the userspace

v15: https://lore.kernel.org/lkml/[email protected]
* Add riscv/Kconfig update to disable set_memory operations for nommu
builds (patch 3)
* Update the code around add_to_page_cache() per Matthew's comments
(patches 6,7)
* Add fixups for build/checkpatch errors discovered by CI systems

v14: https://lore.kernel.org/lkml/[email protected]
* Finally s/mod_node_page_state/mod_lruvec_page_state/

v13: https://lore.kernel.org/lkml/[email protected]
* Added Reviewed-by, thanks Catalin and David
* s/mod_node_page_state/mod_lruvec_page_state/ as Shakeel suggested

Older history:
v12: https://lore.kernel.org/lkml/[email protected]
v11: https://lore.kernel.org/lkml/[email protected]
v10: https://lore.kernel.org/lkml/[email protected]
v9: https://lore.kernel.org/lkml/[email protected]
v8: https://lore.kernel.org/lkml/[email protected]
v7: https://lore.kernel.org/lkml/[email protected]
v6: https://lore.kernel.org/lkml/[email protected]
v5: https://lore.kernel.org/lkml/[email protected]
v4: https://lore.kernel.org/lkml/[email protected]
v3: https://lore.kernel.org/lkml/[email protected]
v2: https://lore.kernel.org/lkml/[email protected]
v1: https://lore.kernel.org/lkml/[email protected]
rfc-v2: https://lore.kernel.org/lkml/[email protected]/
rfc-v1: https://lore.kernel.org/lkml/20200130162340.GA14232@rapoport-lnx/
rfc-v0: https://lore.kernel.org/lkml/[email protected]/

Arnd Bergmann (1):
arm64: kfence: fix header inclusion

Mike Rapoport (9):
mm: add definition of PMD_PAGE_ORDER
mmap: make mlock_future_check() global
riscv/Kconfig: make direct map manipulation options depend on MMU
set_memory: allow set_direct_map_*_noflush() for multiple pages
set_memory: allow querying whether set_direct_map_*() is actually enabled
mm: introduce memfd_secret system call to create "secret" memory areas
PM: hibernate: disable when there are active secretmem users
arch, mm: wire up memfd_secret system call where relevant
secretmem: test: add basic selftest for memfd_secret(2)

arch/arm64/include/asm/Kbuild | 1 -
arch/arm64/include/asm/cacheflush.h | 6 -
arch/arm64/include/asm/kfence.h | 2 +-
arch/arm64/include/asm/set_memory.h | 17 ++
arch/arm64/include/uapi/asm/unistd.h | 1 +
arch/arm64/kernel/machine_kexec.c | 1 +
arch/arm64/mm/mmu.c | 6 +-
arch/arm64/mm/pageattr.c | 23 +-
arch/riscv/Kconfig | 4 +-
arch/riscv/include/asm/set_memory.h | 4 +-
arch/riscv/include/asm/unistd.h | 1 +
arch/riscv/mm/pageattr.c | 8 +-
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/set_memory.h | 4 +-
arch/x86/mm/pat/set_memory.c | 8 +-
fs/dax.c | 11 +-
include/linux/pgtable.h | 3 +
include/linux/secretmem.h | 30 +++
include/linux/set_memory.h | 16 +-
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 6 +-
include/uapi/linux/magic.h | 1 +
kernel/power/hibernate.c | 5 +-
kernel/power/snapshot.c | 4 +-
kernel/sys_ni.c | 2 +
mm/Kconfig | 3 +
mm/Makefile | 1 +
mm/gup.c | 10 +
mm/internal.h | 3 +
mm/mlock.c | 3 +-
mm/mmap.c | 5 +-
mm/secretmem.c | 261 +++++++++++++++++++
mm/vmalloc.c | 5 +-
scripts/checksyscalls.sh | 4 +
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 3 +-
tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests | 17 ++
39 files changed, 726 insertions(+), 53 deletions(-)
create mode 100644 arch/arm64/include/asm/set_memory.h
create mode 100644 include/linux/secretmem.h
create mode 100644 mm/secretmem.c
create mode 100644 tools/testing/selftests/vm/memfd_secret.c

--
2.28.0

2021-02-08 08:59:23

by Mike Rapoport

Subject: [PATCH v17 02/10] mmap: make mlock_future_check() global

From: Mike Rapoport <[email protected]>

It will be used by the upcoming secret memory implementation.

Signed-off-by: Mike Rapoport <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Elena Reshetova <[email protected]>
Cc: Hagen Paul Pfeifer <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Will Deacon <[email protected]>
---
mm/internal.h | 3 +++
mm/mmap.c | 5 ++---
2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 9902648f2206..8e9c660f33ca 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -353,6 +353,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
extern void mlock_vma_page(struct page *page);
extern unsigned int munlock_vma_page(struct page *page);

+extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+ unsigned long len);
+
/*
* Clear the page's PageMlocked(). This can be useful in a situation where
* we want to unconditionally remove a page from the pagecache -- e.g.,
diff --git a/mm/mmap.c b/mm/mmap.c
index 28ef5e29152a..10b9b8b88913 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1346,9 +1346,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
return hint;
}

-static inline int mlock_future_check(struct mm_struct *mm,
- unsigned long flags,
- unsigned long len)
+int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+ unsigned long len)
{
unsigned long locked, lock_limit;

--
2.28.0

2021-02-08 08:59:23

by Mike Rapoport

Subject: [PATCH v17 03/10] riscv/Kconfig: make direct map manipulation options depend on MMU

From: Mike Rapoport <[email protected]>

ARCH_HAS_SET_DIRECT_MAP and ARCH_HAS_SET_MEMORY configuration options have
no meaning when CONFIG_MMU is disabled and there is no point in enabling
them for the nommu case.

Add an explicit dependency on MMU for these options.

Signed-off-by: Mike Rapoport <[email protected]>
Reported-by: kernel test robot <[email protected]>
---
arch/riscv/Kconfig | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e338e9579f3e..9d941794f11c 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -25,8 +25,8 @@ config RISCV
select ARCH_HAS_KCOV
select ARCH_HAS_MMIOWB
select ARCH_HAS_PTE_SPECIAL
- select ARCH_HAS_SET_DIRECT_MAP
- select ARCH_HAS_SET_MEMORY
+ select ARCH_HAS_SET_DIRECT_MAP if MMU
+ select ARCH_HAS_SET_MEMORY if MMU
select ARCH_HAS_STRICT_KERNEL_RWX if MMU
select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
--
2.28.0

2021-02-08 08:59:46

by Mike Rapoport

Subject: [PATCH v17 04/10] set_memory: allow set_direct_map_*_noflush() for multiple pages

From: Mike Rapoport <[email protected]>

The underlying implementations of set_direct_map_invalid_noflush() and
set_direct_map_default_noflush() allow updating multiple contiguous pages
at once.

Add a numpages parameter to set_direct_map_*_noflush() to expose this
ability through these APIs.

Signed-off-by: Mike Rapoport <[email protected]>
Acked-by: Catalin Marinas <[email protected]> [arm64]
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Elena Reshetova <[email protected]>
Cc: Hagen Paul Pfeifer <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Will Deacon <[email protected]>
---
arch/arm64/include/asm/cacheflush.h | 4 ++--
arch/arm64/mm/pageattr.c | 10 ++++++----
arch/riscv/include/asm/set_memory.h | 4 ++--
arch/riscv/mm/pageattr.c | 8 ++++----
arch/x86/include/asm/set_memory.h | 4 ++--
arch/x86/mm/pat/set_memory.c | 8 ++++----
include/linux/set_memory.h | 4 ++--
kernel/power/snapshot.c | 4 ++--
mm/vmalloc.c | 5 +++--
9 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 45217f21f1fe..d3598419a284 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -138,8 +138,8 @@ static __always_inline void __flush_icache_all(void)

int set_memory_valid(unsigned long addr, int numpages, int enable);

-int set_direct_map_invalid_noflush(struct page *page);
-int set_direct_map_default_noflush(struct page *page);
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
bool kernel_page_present(struct page *page);

#include <asm-generic/cacheflush.h>
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 92eccaf595c8..b53ef37bf95a 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -148,34 +148,36 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
__pgprot(PTE_VALID));
}

-int set_direct_map_invalid_noflush(struct page *page)
+int set_direct_map_invalid_noflush(struct page *page, int numpages)
{
struct page_change_data data = {
.set_mask = __pgprot(0),
.clear_mask = __pgprot(PTE_VALID),
};
+ unsigned long size = PAGE_SIZE * numpages;

if (!debug_pagealloc_enabled() && !rodata_full)
return 0;

return apply_to_page_range(&init_mm,
(unsigned long)page_address(page),
- PAGE_SIZE, change_page_range, &data);
+ size, change_page_range, &data);
}

-int set_direct_map_default_noflush(struct page *page)
+int set_direct_map_default_noflush(struct page *page, int numpages)
{
struct page_change_data data = {
.set_mask = __pgprot(PTE_VALID | PTE_WRITE),
.clear_mask = __pgprot(PTE_RDONLY),
};
+ unsigned long size = PAGE_SIZE * numpages;

if (!debug_pagealloc_enabled() && !rodata_full)
return 0;

return apply_to_page_range(&init_mm,
(unsigned long)page_address(page),
- PAGE_SIZE, change_page_range, &data);
+ size, change_page_range, &data);
}

#ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/arch/riscv/include/asm/set_memory.h b/arch/riscv/include/asm/set_memory.h
index 211eb8244a45..1aaf2720b8f6 100644
--- a/arch/riscv/include/asm/set_memory.h
+++ b/arch/riscv/include/asm/set_memory.h
@@ -26,8 +26,8 @@ static inline void protect_kernel_text_data(void) {};
static inline int set_memory_rw_nx(unsigned long addr, int numpages) { return 0; }
#endif

-int set_direct_map_invalid_noflush(struct page *page);
-int set_direct_map_default_noflush(struct page *page);
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
bool kernel_page_present(struct page *page);

#endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index 5e49e4b4a4cc..9618181b70be 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -156,11 +156,11 @@ int set_memory_nx(unsigned long addr, int numpages)
return __set_memory(addr, numpages, __pgprot(0), __pgprot(_PAGE_EXEC));
}

-int set_direct_map_invalid_noflush(struct page *page)
+int set_direct_map_invalid_noflush(struct page *page, int numpages)
{
int ret;
unsigned long start = (unsigned long)page_address(page);
- unsigned long end = start + PAGE_SIZE;
+ unsigned long end = start + PAGE_SIZE * numpages;
struct pageattr_masks masks = {
.set_mask = __pgprot(0),
.clear_mask = __pgprot(_PAGE_PRESENT)
@@ -173,11 +173,11 @@ int set_direct_map_invalid_noflush(struct page *page)
return ret;
}

-int set_direct_map_default_noflush(struct page *page)
+int set_direct_map_default_noflush(struct page *page, int numpages)
{
int ret;
unsigned long start = (unsigned long)page_address(page);
- unsigned long end = start + PAGE_SIZE;
+ unsigned long end = start + PAGE_SIZE * numpages;
struct pageattr_masks masks = {
.set_mask = PAGE_KERNEL,
.clear_mask = __pgprot(0)
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 4352f08bfbb5..6224cb291f6c 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -80,8 +80,8 @@ int set_pages_wb(struct page *page, int numpages);
int set_pages_ro(struct page *page, int numpages);
int set_pages_rw(struct page *page, int numpages);

-int set_direct_map_invalid_noflush(struct page *page);
-int set_direct_map_default_noflush(struct page *page);
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
bool kernel_page_present(struct page *page);

extern int kernel_set_to_readonly;
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..d157fd617c99 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2184,14 +2184,14 @@ static int __set_pages_np(struct page *page, int numpages)
return __change_page_attr_set_clr(&cpa, 0);
}

-int set_direct_map_invalid_noflush(struct page *page)
+int set_direct_map_invalid_noflush(struct page *page, int numpages)
{
- return __set_pages_np(page, 1);
+ return __set_pages_np(page, numpages);
}

-int set_direct_map_default_noflush(struct page *page)
+int set_direct_map_default_noflush(struct page *page, int numpages)
{
- return __set_pages_p(page, 1);
+ return __set_pages_p(page, numpages);
}

#ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index fe1aa4e54680..c650f82db813 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -15,11 +15,11 @@ static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
#endif

#ifndef CONFIG_ARCH_HAS_SET_DIRECT_MAP
-static inline int set_direct_map_invalid_noflush(struct page *page)
+static inline int set_direct_map_invalid_noflush(struct page *page, int numpages)
{
return 0;
}
-static inline int set_direct_map_default_noflush(struct page *page)
+static inline int set_direct_map_default_noflush(struct page *page, int numpages)
{
return 0;
}
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index d63560e1cf87..64b7aab9aee4 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -86,7 +86,7 @@ static inline void hibernate_restore_unprotect_page(void *page_address) {}
static inline void hibernate_map_page(struct page *page)
{
if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
- int ret = set_direct_map_default_noflush(page);
+ int ret = set_direct_map_default_noflush(page, 1);

if (ret)
pr_warn_once("Failed to remap page\n");
@@ -99,7 +99,7 @@ static inline void hibernate_unmap_page(struct page *page)
{
if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
unsigned long addr = (unsigned long)page_address(page);
- int ret = set_direct_map_invalid_noflush(page);
+ int ret = set_direct_map_invalid_noflush(page, 1);

if (ret)
pr_warn_once("Failed to remap page\n");
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d5f2a84e488a..1da9cd1d0758 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2195,13 +2195,14 @@ struct vm_struct *remove_vm_area(const void *addr)
}

static inline void set_area_direct_map(const struct vm_struct *area,
- int (*set_direct_map)(struct page *page))
+ int (*set_direct_map)(struct page *page,
+ int numpages))
{
int i;

for (i = 0; i < area->nr_pages; i++)
if (page_address(area->pages[i]))
- set_direct_map(area->pages[i]);
+ set_direct_map(area->pages[i], 1);
}

/* Handle removing and resetting vm mappings related to the vm_struct. */
--
2.28.0
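
As a rough illustration of the interface change in patch 04/10 above, the
userspace model below treats the direct map as an array of per-page
"present" flags and lets one call update a run of contiguous pages, which
is what the added numpages argument expresses. Everything prefixed with
model_ or MODEL_ is hypothetical and exists only for this sketch. The real
functions take a struct page pointer and edit kernel page tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_PAGES 16	/* hypothetical size of the modelled direct map */

/* One flag per page: true means the page is mapped in the direct map. */
static bool model_present[MODEL_NR_PAGES];

/* Update numpages contiguous entries starting at index first. */
static int model_set_range(size_t first, int numpages, bool present)
{
	size_t i;

	if (numpages < 0 || first > MODEL_NR_PAGES ||
	    (size_t)numpages > MODEL_NR_PAGES - first)
		return -1;	/* range falls outside the modelled map */

	for (i = 0; i < (size_t)numpages; i++)
		model_present[first + i] = present;
	return 0;
}

/* Model of set_direct_map_invalid_noflush(page, numpages). */
int model_set_direct_map_invalid_noflush(size_t first, int numpages)
{
	return model_set_range(first, numpages, false);
}

/* Model of set_direct_map_default_noflush(page, numpages). */
int model_set_direct_map_default_noflush(size_t first, int numpages)
{
	return model_set_range(first, numpages, true);
}

bool model_page_present(size_t idx)
{
	assert(idx < MODEL_NR_PAGES);
	return model_present[idx];
}
```

Existing single-page callers in the patch, such as hibernate_map_page(),
keep their behaviour by passing 1, while a caller that owns a contiguous
run can cover it with one call instead of a loop.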

2021-02-08 09:01:19

by Mike Rapoport

Subject: [PATCH v17 05/10] set_memory: allow querying whether set_direct_map_*() is actually enabled

From: Mike Rapoport <[email protected]>

On arm64, set_direct_map_*() functions may return 0 without actually
changing the linear map. This behaviour can be controlled using kernel
parameters, so we need a way to determine at runtime whether calls to
set_direct_map_invalid_noflush() and set_direct_map_default_noflush() have
any effect.

Extend the set_memory API with a can_set_direct_map() function that allows
checking whether calling set_direct_map_*() will actually change the page
table. Replace several occurrences of open-coded checks in arm64 with the
new function and provide a generic stub for architectures that always
modify page tables upon calls to set_direct_map APIs.

Signed-off-by: Mike Rapoport <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Elena Reshetova <[email protected]>
Cc: Hagen Paul Pfeifer <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Will Deacon <[email protected]>
---
arch/arm64/include/asm/Kbuild | 1 -
arch/arm64/include/asm/cacheflush.h | 6 ------
arch/arm64/include/asm/set_memory.h | 17 +++++++++++++++++
arch/arm64/kernel/machine_kexec.c | 1 +
arch/arm64/mm/mmu.c | 6 +++---
arch/arm64/mm/pageattr.c | 13 +++++++++----
include/linux/set_memory.h | 12 ++++++++++++
7 files changed, 42 insertions(+), 14 deletions(-)
create mode 100644 arch/arm64/include/asm/set_memory.h

diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index 07ac208edc89..73aa25843f65 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -3,5 +3,4 @@ generic-y += early_ioremap.h
generic-y += mcs_spinlock.h
generic-y += qrwlock.h
generic-y += qspinlock.h
-generic-y += set_memory.h
generic-y += user.h
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index d3598419a284..b1bdf83a73db 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -136,12 +136,6 @@ static __always_inline void __flush_icache_all(void)
dsb(ish);
}

-int set_memory_valid(unsigned long addr, int numpages, int enable);
-
-int set_direct_map_invalid_noflush(struct page *page, int numpages);
-int set_direct_map_default_noflush(struct page *page, int numpages);
-bool kernel_page_present(struct page *page);
-
#include <asm-generic/cacheflush.h>

#endif /* __ASM_CACHEFLUSH_H */
diff --git a/arch/arm64/include/asm/set_memory.h b/arch/arm64/include/asm/set_memory.h
new file mode 100644
index 000000000000..ecb6b0f449ab
--- /dev/null
+++ b/arch/arm64/include/asm/set_memory.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef _ASM_ARM64_SET_MEMORY_H
+#define _ASM_ARM64_SET_MEMORY_H
+
+#include <asm-generic/set_memory.h>
+
+bool can_set_direct_map(void);
+#define can_set_direct_map can_set_direct_map
+
+int set_memory_valid(unsigned long addr, int numpages, int enable);
+
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
+bool kernel_page_present(struct page *page);
+
+#endif /* _ASM_ARM64_SET_MEMORY_H */
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index a0b144cfaea7..0cbc50c4fa5a 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -11,6 +11,7 @@
#include <linux/kernel.h>
#include <linux/kexec.h>
#include <linux/page-flags.h>
+#include <linux/set_memory.h>
#include <linux/smp.h>

#include <asm/cacheflush.h>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9445eb77e3da..bd8521637120 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,7 @@
#include <linux/io.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
+#include <linux/set_memory.h>

#include <asm/barrier.h>
#include <asm/cputype.h>
@@ -492,7 +493,7 @@ static void __init map_mem(pgd_t *pgdp)
int flags = 0;
u64 i;

- if (rodata_full || crash_mem_map || debug_pagealloc_enabled())
+ if (can_set_direct_map() || crash_mem_map)
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;

/*
@@ -1470,8 +1471,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
* KFENCE requires linear map to be mapped at page granularity, so that
* it is possible to protect/unprotect single pages in the KFENCE pool.
*/
- if (rodata_full || debug_pagealloc_enabled() ||
- IS_ENABLED(CONFIG_KFENCE))
+ if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;

__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index b53ef37bf95a..d505172265b0 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -19,6 +19,11 @@ struct page_change_data {

bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);

+bool can_set_direct_map(void)
+{
+ return rodata_full || debug_pagealloc_enabled();
+}
+
static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
{
struct page_change_data *cdata = data;
@@ -156,7 +161,7 @@ int set_direct_map_invalid_noflush(struct page *page, int numpages)
};
unsigned long size = PAGE_SIZE * numpages;

- if (!debug_pagealloc_enabled() && !rodata_full)
+ if (!can_set_direct_map())
return 0;

return apply_to_page_range(&init_mm,
@@ -172,7 +177,7 @@ int set_direct_map_default_noflush(struct page *page, int numpages)
};
unsigned long size = PAGE_SIZE * numpages;

- if (!debug_pagealloc_enabled() && !rodata_full)
+ if (!can_set_direct_map())
return 0;

return apply_to_page_range(&init_mm,
@@ -183,7 +188,7 @@ int set_direct_map_default_noflush(struct page *page, int numpages)
#ifdef CONFIG_DEBUG_PAGEALLOC
void __kernel_map_pages(struct page *page, int numpages, int enable)
{
- if (!debug_pagealloc_enabled() && !rodata_full)
+ if (!can_set_direct_map())
return;

set_memory_valid((unsigned long)page_address(page), numpages, enable);
@@ -208,7 +213,7 @@ bool kernel_page_present(struct page *page)
pte_t *ptep;
unsigned long addr = (unsigned long)page_address(page);

- if (!debug_pagealloc_enabled() && !rodata_full)
+ if (!can_set_direct_map())
return true;

pgdp = pgd_offset_k(addr);
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index c650f82db813..7b4b6626032d 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -28,7 +28,19 @@ static inline bool kernel_page_present(struct page *page)
{
return true;
}
+#else /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
+/*
+ * Some architectures, e.g. ARM64 can disable direct map modifications at
+ * boot time. Let them override this query.
+ */
+#ifndef can_set_direct_map
+static inline bool can_set_direct_map(void)
+{
+ return true;
+}
+#define can_set_direct_map can_set_direct_map
#endif
+#endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */

#ifndef set_mce_nospec
static inline int set_mce_nospec(unsigned long pfn, bool unmap)
--
2.28.0
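
The override idiom used in patch 05/10 above (an architecture defines the
function and a macro of the same name, and the generic header supplies a
stub only when that macro is absent) can be tried out in plain userspace C.
The sketch below is a model only: MODEL_ARCH_OVERRIDE and the two model_
knobs are hypothetical stand-ins for the arm64 boot-time settings, and both
"headers" are folded into one file.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical switch: remove this line to fall back to the generic stub. */
#define MODEL_ARCH_OVERRIDE 1

/* ---- stand-in for an architecture header ---- */
#ifdef MODEL_ARCH_OVERRIDE
/* Stand-ins for rodata_full and debug_pagealloc_enabled(). */
static bool model_rodata_full;
static bool model_debug_pagealloc;

bool can_set_direct_map(void)
{
	return model_rodata_full || model_debug_pagealloc;
}
/* Defining the macro tells the generic header not to add its stub. */
#define can_set_direct_map can_set_direct_map
#endif

/* ---- stand-in for the generic header ---- */
#ifndef can_set_direct_map
static inline bool can_set_direct_map(void)
{
	return true;	/* architectures without a switch always modify the map */
}
#define can_set_direct_map can_set_direct_map
#endif

/* Small walk-through of how a caller sees the runtime switch. */
void model_demo(void)
{
#ifdef MODEL_ARCH_OVERRIDE
	model_rodata_full = false;
	model_debug_pagealloc = false;
	assert(!can_set_direct_map());

	model_debug_pagealloc = true;
	assert(can_set_direct_map());
#else
	assert(can_set_direct_map());
#endif
}
```

Because the macro expands to the same identifier, callers simply write
can_set_direct_map() and get whichever definition was seen first, with no
#ifdef at the call site.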

2021-02-08 09:01:35

by Mike Rapoport

Subject: [PATCH v17 06/10] arm64: kfence: fix header inclusion

From: Arnd Bergmann <[email protected]>

Randconfig builds started warning about a missing function declaration
after set_memory_valid() is moved to a new file:

In file included from mm/kfence/core.c:26:
arch/arm64/include/asm/kfence.h:17:2: error: implicit declaration of function 'set_memory_valid' [-Werror,-Wimplicit-function-declaration]

Include the correct header again.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9e18ec3cfabd ("set_memory: allow querying whether set_direct_map_*() is actually enabled")
Fixes: 204555ff8bd6 ("arm64, kfence: enable KFENCE for ARM64")
Signed-off-by: Arnd Bergmann <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
arch/arm64/include/asm/kfence.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
index d061176d57ea..aa855c6a0ae6 100644
--- a/arch/arm64/include/asm/kfence.h
+++ b/arch/arm64/include/asm/kfence.h
@@ -8,7 +8,7 @@
#ifndef __ASM_KFENCE_H
#define __ASM_KFENCE_H

-#include <asm/cacheflush.h>
+#include <asm/set_memory.h>

static inline bool arch_kfence_init_pool(void) { return true; }

--
2.28.0

2021-02-08 09:03:23

by Mike Rapoport

Subject: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users

From: Mike Rapoport <[email protected]>

It is unsafe to allow saving of secretmem areas to the hibernation
snapshot as they would be visible after the resume and this essentially
will defeat the purpose of secret memory mappings.

Prevent hibernation whenever there are active secret memory users.

Signed-off-by: Mike Rapoport <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Elena Reshetova <[email protected]>
Cc: Hagen Paul Pfeifer <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Will Deacon <[email protected]>
---
include/linux/secretmem.h | 6 ++++++
kernel/power/hibernate.c | 5 ++++-
mm/secretmem.c | 15 +++++++++++++++
3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
index 70e7db9f94fe..907a6734059c 100644
--- a/include/linux/secretmem.h
+++ b/include/linux/secretmem.h
@@ -6,6 +6,7 @@

bool vma_is_secretmem(struct vm_area_struct *vma);
bool page_is_secretmem(struct page *page);
+bool secretmem_active(void);

#else

@@ -19,6 +20,11 @@ static inline bool page_is_secretmem(struct page *page)
return false;
}

+static inline bool secretmem_active(void)
+{
+ return false;
+}
+
#endif /* CONFIG_SECRETMEM */

#endif /* _LINUX_SECRETMEM_H */
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index da0b41914177..559acef3fddb 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -31,6 +31,7 @@
#include <linux/genhd.h>
#include <linux/ktime.h>
#include <linux/security.h>
+#include <linux/secretmem.h>
#include <trace/events/power.h>

#include "power.h"
@@ -81,7 +82,9 @@ void hibernate_release(void)

bool hibernation_available(void)
{
- return nohibernate == 0 && !security_locked_down(LOCKDOWN_HIBERNATION);
+ return nohibernate == 0 &&
+ !security_locked_down(LOCKDOWN_HIBERNATION) &&
+ !secretmem_active();
}

/**
diff --git a/mm/secretmem.c b/mm/secretmem.c
index fa6738e860c2..f2ae3f32a193 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -40,6 +40,13 @@ module_param_named(enable, secretmem_enable, bool, 0400);
MODULE_PARM_DESC(secretmem_enable,
"Enable secretmem and memfd_secret(2) system call");

+static atomic_t secretmem_users;
+
+bool secretmem_active(void)
+{
+ return !!atomic_read(&secretmem_users);
+}
+
static vm_fault_t secretmem_fault(struct vm_fault *vmf)
{
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
@@ -94,6 +101,12 @@ static const struct vm_operations_struct secretmem_vm_ops = {
.fault = secretmem_fault,
};

+static int secretmem_release(struct inode *inode, struct file *file)
+{
+ atomic_dec(&secretmem_users);
+ return 0;
+}
+
static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
{
unsigned long len = vma->vm_end - vma->vm_start;
@@ -116,6 +129,7 @@ bool vma_is_secretmem(struct vm_area_struct *vma)
}

static const struct file_operations secretmem_fops = {
+ .release = secretmem_release,
.mmap = secretmem_mmap,
};

@@ -212,6 +226,7 @@ SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
file->f_flags |= O_LARGEFILE;

fd_install(fd, file);
+ atomic_inc(&secretmem_users);
return fd;

err_put_fd:
--
2.28.0

2021-02-08 09:08:25

by Mike Rapoport


Subject: [PATCH v17 09/10] arch, mm: wire up memfd_secret system call where relevant

From: Mike Rapoport <[email protected]>

Wire up the memfd_secret system call on architectures that define
ARCH_HAS_SET_DIRECT_MAP, namely arm64, RISC-V and x86.

Signed-off-by: Mike Rapoport <[email protected]>
Acked-by: Palmer Dabbelt <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Elena Reshetova <[email protected]>
Cc: Hagen Paul Pfeifer <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Will Deacon <[email protected]>
---
arch/arm64/include/uapi/asm/unistd.h | 1 +
arch/riscv/include/asm/unistd.h | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 6 +++++-
scripts/checksyscalls.sh | 4 ++++
7 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/uapi/asm/unistd.h b/arch/arm64/include/uapi/asm/unistd.h
index f83a70e07df8..ce2ee8f1e361 100644
--- a/arch/arm64/include/uapi/asm/unistd.h
+++ b/arch/arm64/include/uapi/asm/unistd.h
@@ -20,5 +20,6 @@
#define __ARCH_WANT_SET_GET_RLIMIT
#define __ARCH_WANT_TIME32_SYSCALLS
#define __ARCH_WANT_SYS_CLONE3
+#define __ARCH_WANT_MEMFD_SECRET

#include <asm-generic/unistd.h>
diff --git a/arch/riscv/include/asm/unistd.h b/arch/riscv/include/asm/unistd.h
index 977ee6181dab..6c316093a1e5 100644
--- a/arch/riscv/include/asm/unistd.h
+++ b/arch/riscv/include/asm/unistd.h
@@ -9,6 +9,7 @@
*/

#define __ARCH_WANT_SYS_CLONE
+#define __ARCH_WANT_MEMFD_SECRET

#include <uapi/asm/unistd.h>

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index a1c9f496fca6..34f04076a140 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -447,3 +447,4 @@
440 i386 process_madvise sys_process_madvise
441 i386 epoll_pwait2 sys_epoll_pwait2 compat_sys_epoll_pwait2
442 i386 mount_setattr sys_mount_setattr
+443 i386 memfd_secret sys_memfd_secret
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7bf01cbe582f..bd3783edf27f 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -364,6 +364,7 @@
440 common process_madvise sys_process_madvise
441 common epoll_pwait2 sys_epoll_pwait2
442 common mount_setattr sys_mount_setattr
+443 common memfd_secret sys_memfd_secret

#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index cd7b5c817ba2..ad7ac9717884 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1041,6 +1041,7 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
siginfo_t __user *info,
unsigned int flags);
asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
+asmlinkage long sys_memfd_secret(unsigned long flags);

/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index ce58cff99b66..7ac0732dbaa4 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,9 +863,13 @@ __SYSCALL(__NR_process_madvise, sys_process_madvise)
__SC_COMP(__NR_epoll_pwait2, sys_epoll_pwait2, compat_sys_epoll_pwait2)
#define __NR_mount_setattr 442
__SYSCALL(__NR_mount_setattr, sys_mount_setattr)
+#ifdef __ARCH_WANT_MEMFD_SECRET
+#define __NR_memfd_secret 443
+__SYSCALL(__NR_memfd_secret, sys_memfd_secret)
+#endif

#undef __NR_syscalls
-#define __NR_syscalls 443
+#define __NR_syscalls 444

/*
* 32 bit systems traditionally used different
diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh
index a18b47695f55..b7609958ee36 100755
--- a/scripts/checksyscalls.sh
+++ b/scripts/checksyscalls.sh
@@ -40,6 +40,10 @@ cat << EOF
#define __IGNORE_setrlimit /* setrlimit */
#endif

+#ifndef __ARCH_WANT_MEMFD_SECRET
+#define __IGNORE_memfd_secret
+#endif
+
/* Missing flags argument */
#define __IGNORE_renameat /* renameat2 */

--
2.28.0

2021-02-08 09:10:27

by Mike Rapoport


Subject: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas

From: Mike Rapoport <[email protected]>

Introduce the "memfd_secret" system call, which creates memory areas that
are visible only in the context of the owning process: they are mapped
neither into other processes nor into the kernel page tables.

The secretmem feature is off by default and the user must explicitly enable
it at boot time.

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call. The memory areas created
by mmap() calls from this file descriptor will be unmapped from the kernel
direct map and they will be only mapped in the page table of the owning mm.

File descriptor based memory has several advantages over the
"traditional" mm interfaces such as mlock(), mprotect() and madvise(). It
paves the way for VMMs to remove the secret memory range from the process;
there may also be situations where sharing is useful, and the file
descriptor based approach allows the operations to be sealed.

As secret memory implementation is not an extension of tmpfs or hugetlbfs,
usage of a dedicated system call rather than hooking new functionality into
memfd_create(2) emphasises that memfd_secret(2) has different semantics and
allows better upwards compatibility.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

If a use case arises that requires exposing secretmem to the kernel, it
will be an opt-in request in the system call flags, so that the user
decides what data can be exposed to the kernel.

Removing pages from the direct map may fragment it on architectures that
use large pages to map physical memory, which affects system
performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to keep
secretmem disabled by default and let a system administrator enable it at
boot time.

The secretmem mappings are locked in memory, so they cannot exceed
RLIMIT_MEMLOCK. Since these mappings are already locked, an attempt to
mlock() a secretmem range will fail and mlockall() will ignore secretmem
mappings.
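
As a rough illustration, a caller could compare the requested size with
the soft limit before mapping. This is a conservative sketch, not part of
the patch: fits_memlock_limit() is a hypothetical helper, and it ignores
both memory the process has already locked and any CAP_IPC_LOCK
exemption, so the kernel's own check remains authoritative:

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

/*
 * Hypothetical pre-check: true when len alone does not exceed the soft
 * RLIMIT_MEMLOCK of the calling process.
 */
static bool fits_memlock_limit(size_t len)
{
	struct rlimit rlim;

	if (getrlimit(RLIMIT_MEMLOCK, &rlim) != 0)
		return false;

	return rlim.rlim_cur == RLIM_INFINITY || len <= rlim.rlim_cur;
}
```

A mapping that passes this test can still be refused by mmap(), which in
this patch returns EAGAIN when the limit would be exceeded.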

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
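
With error handling added, the same sequence might look like the sketch
below. secret_alloc() is a hypothetical helper name; the system call is
invoked through syscall(2) because a libc wrapper is not assumed to
exist, and the code falls back to ENOSYS when the headers do not define
__NR_memfd_secret:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map len bytes of secret memory.
 * Returns the mapping, or NULL with errno set on failure.
 */
static void *secret_alloc(size_t len)
{
#ifdef __NR_memfd_secret
	void *ptr;
	int saved;
	int fd = (int)syscall(__NR_memfd_secret, 0UL);

	if (fd < 0)
		return NULL;	/* e.g. ENOSYS when secretmem is not enabled */

	if (ftruncate(fd, (off_t)len) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return NULL;
	}

	ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	saved = errno;
	close(fd);		/* an established mapping keeps the file alive */
	errno = saved;

	return ptr == MAP_FAILED ? NULL : ptr;
#else
	errno = ENOSYS;		/* headers predate memfd_secret */
	return NULL;
#endif
}
```

The caller releases the area with munmap(ptr, len), and a NULL return
with errno set to ENOSYS can be treated as "feature unavailable".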

[1] https://lore.kernel.org/linux-mm/[email protected]/

Signed-off-by: Mike Rapoport <[email protected]>
Acked-by: Hagen Paul Pfeifer <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Elena Reshetova <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Will Deacon <[email protected]>
---
include/linux/secretmem.h | 24 ++++
include/uapi/linux/magic.h | 1 +
kernel/sys_ni.c | 2 +
mm/Kconfig | 3 +
mm/Makefile | 1 +
mm/gup.c | 10 ++
mm/mlock.c | 3 +-
mm/secretmem.c | 246 +++++++++++++++++++++++++++++++++++++
8 files changed, 289 insertions(+), 1 deletion(-)
create mode 100644 include/linux/secretmem.h
create mode 100644 mm/secretmem.c

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
new file mode 100644
index 000000000000..70e7db9f94fe
--- /dev/null
+++ b/include/linux/secretmem.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_SECRETMEM_H
+#define _LINUX_SECRETMEM_H
+
+#ifdef CONFIG_SECRETMEM
+
+bool vma_is_secretmem(struct vm_area_struct *vma);
+bool page_is_secretmem(struct page *page);
+
+#else
+
+static inline bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+ return false;
+}
+
+static inline bool page_is_secretmem(struct page *page)
+{
+ return false;
+}
+
+#endif /* CONFIG_SECRETMEM */
+
+#endif /* _LINUX_SECRETMEM_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
#define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
#define Z3FOLD_MAGIC 0x33
#define PPC_CMM_MAGIC 0xc7571590
+#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */

#endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 19aa806890d5..e9a2011ee4a2 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -352,6 +352,8 @@ COND_SYSCALL(pkey_mprotect);
COND_SYSCALL(pkey_alloc);
COND_SYSCALL(pkey_free);

+/* memfd_secret */
+COND_SYSCALL(memfd_secret);

/*
* Architecture specific weak syscall entries.
diff --git a/mm/Kconfig b/mm/Kconfig
index 24c045b24b95..5f8243442f66 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,7 @@ config MAPPING_DIRTY_HELPERS
config KMAP_LOCAL
bool

+config SECRETMEM
+ def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
+
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 72227b24a616..b2a564eec27f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,3 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_SECRETMEM) += secretmem.o
diff --git a/mm/gup.c b/mm/gup.c
index e4c224cd9661..3e086b073624 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,7 @@
#include <linux/rmap.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/secretmem.h>

#include <linux/sched/signal.h>
#include <linux/rwsem.h>
@@ -759,6 +760,9 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
struct follow_page_context ctx = { NULL };
struct page *page;

+ if (vma_is_secretmem(vma))
+ return NULL;
+
page = follow_page_mask(vma, address, foll_flags, &ctx);
if (ctx.pgmap)
put_dev_pagemap(ctx.pgmap);
@@ -892,6 +896,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
return -EOPNOTSUPP;

+ if (vma_is_secretmem(vma))
+ return -EFAULT;
+
if (write) {
if (!(vm_flags & VM_WRITE)) {
if (!(gup_flags & FOLL_FORCE))
@@ -2031,6 +2038,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);

+ if (page_is_secretmem(page))
+ goto pte_unmap;
+
head = try_grab_compound_head(page, 1, flags);
if (!head)
goto pte_unmap;
diff --git a/mm/mlock.c b/mm/mlock.c
index 73960bb3464d..127e72dcac3d 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -23,6 +23,7 @@
#include <linux/hugetlb.h>
#include <linux/memcontrol.h>
#include <linux/mm_inline.h>
+#include <linux/secretmem.h>

#include "internal.h"

@@ -503,7 +504,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,

if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
- vma_is_dax(vma))
+ vma_is_dax(vma) || vma_is_secretmem(vma))
/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
goto out;

diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..fa6738e860c2
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,246 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2021
+ *
+ * Author: Mike Rapoport <[email protected]>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/swap.h>
+#include <linux/mount.h>
+#include <linux/memfd.h>
+#include <linux/bitops.h>
+#include <linux/printk.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/pseudo_fs.h>
+#include <linux/secretmem.h>
+#include <linux/set_memory.h>
+#include <linux/sched/signal.h>
+
+#include <uapi/linux/magic.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+/*
+ * Define mode and flag masks to allow validation of the system call
+ * parameters.
+ */
+#define SECRETMEM_MODE_MASK (0x0)
+#define SECRETMEM_FLAGS_MASK SECRETMEM_MODE_MASK
+
+static bool secretmem_enable __ro_after_init;
+module_param_named(enable, secretmem_enable, bool, 0400);
+MODULE_PARM_DESC(secretmem_enable,
+ "Enable secretmem and memfd_secret(2) system call");
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ pgoff_t offset = vmf->pgoff;
+ gfp_t gfp = vmf->gfp_mask;
+ unsigned long addr;
+ struct page *page;
+ int err;
+
+ if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+ return vmf_error(-EINVAL);
+
+retry:
+ page = find_lock_page(mapping, offset);
+ if (!page) {
+ page = alloc_page(gfp | __GFP_ZERO);
+ if (!page)
+ return VM_FAULT_OOM;
+
+ err = set_direct_map_invalid_noflush(page, 1);
+ if (err) {
+ put_page(page);
+ return vmf_error(err);
+ }
+
+ __SetPageUptodate(page);
+ err = add_to_page_cache_lru(page, mapping, offset, gfp);
+ if (unlikely(err)) {
+ put_page(page);
+ /*
+ * If a split of large page was required, it
+ * already happened when we marked the page invalid
+ * which guarantees that this call won't fail
+ */
+ set_direct_map_default_noflush(page, 1);
+ if (err == -EEXIST)
+ goto retry;
+
+ return vmf_error(err);
+ }
+
+ addr = (unsigned long)page_address(page);
+ flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+ }
+
+ vmf->page = page;
+ return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+ .fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ unsigned long len = vma->vm_end - vma->vm_start;
+
+ if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+ return -EINVAL;
+
+ if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+ return -EAGAIN;
+
+ vma->vm_flags |= VM_LOCKED | VM_DONTDUMP;
+ vma->vm_ops = &secretmem_vm_ops;
+
+ return 0;
+}
+
+bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+ return vma->vm_ops == &secretmem_vm_ops;
+}
+
+static const struct file_operations secretmem_fops = {
+ .mmap = secretmem_mmap,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+ return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode)
+{
+ return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+ set_direct_map_default_noflush(page, 1);
+ clear_highpage(page);
+}
+
+static const struct address_space_operations secretmem_aops = {
+ .freepage = secretmem_freepage,
+ .migratepage = secretmem_migratepage,
+ .isolate_page = secretmem_isolate_page,
+};
+
+bool page_is_secretmem(struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);
+
+ if (!mapping)
+ return false;
+
+ return mapping->a_ops == &secretmem_aops;
+}
+
+static struct vfsmount *secretmem_mnt;
+
+static struct file *secretmem_file_create(unsigned long flags)
+{
+ struct file *file = ERR_PTR(-ENOMEM);
+ struct inode *inode;
+
+ inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+ if (IS_ERR(inode))
+ return ERR_CAST(inode);
+
+ file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+ O_RDWR, &secretmem_fops);
+ if (IS_ERR(file))
+ goto err_free_inode;
+
+ mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+ mapping_set_unevictable(inode->i_mapping);
+
+ inode->i_mapping->a_ops = &secretmem_aops;
+
+ /* pretend we are a normal file with zero size */
+ inode->i_mode |= S_IFREG;
+ inode->i_size = 0;
+
+ return file;
+
+err_free_inode:
+ iput(inode);
+ return file;
+}
+
+SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
+{
+ struct file *file;
+ int fd, err;
+
+ /* make sure local flags do not conflict with global fcntl.h */
+ BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC);
+
+ if (!secretmem_enable)
+ return -ENOSYS;
+
+ if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
+ return -EINVAL;
+
+ fd = get_unused_fd_flags(flags & O_CLOEXEC);
+ if (fd < 0)
+ return fd;
+
+ file = secretmem_file_create(flags);
+ if (IS_ERR(file)) {
+ err = PTR_ERR(file);
+ goto err_put_fd;
+ }
+
+ file->f_flags |= O_LARGEFILE;
+
+ fd_install(fd, file);
+ return fd;
+
+err_put_fd:
+ put_unused_fd(fd);
+ return err;
+}
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+ return init_pseudo(fc, SECRETMEM_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type secretmem_fs = {
+ .name = "secretmem",
+ .init_fs_context = secretmem_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+ int ret = 0;
+
+ if (!secretmem_enable)
+ return ret;
+
+ secretmem_mnt = kern_mount(&secretmem_fs);
+ if (IS_ERR(secretmem_mnt))
+ ret = PTR_ERR(secretmem_mnt);
+
+ return ret;
+}
+fs_initcall(secretmem_init);
--
2.28.0

2021-02-08 09:11:07

by Mike Rapoport


Subject: [PATCH v17 10/10] secretmem: test: add basic selftest for memfd_secret(2)

From: Mike Rapoport <[email protected]>

The test verifies that a file descriptor created with memfd_secret() does
not allow read/write operations, that secret memory mappings respect
RLIMIT_MEMLOCK, and that remote accesses to the secret memory with
process_vm_readv() and ptrace() fail.
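
The remote read that the test attempts can be sketched as a small
wrapper around process_vm_readv(2). remote_read() is a hypothetical name
used only for illustration; it is not the selftest's own function:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>	/* getpid()/getppid() for callers */

/*
 * Hypothetical helper: copy len bytes from address addr in process pid
 * into buf. Returns the number of bytes copied, or -1 with errno set.
 */
static ssize_t remote_read(pid_t pid, void *addr, void *buf, size_t len)
{
	struct iovec local = { .iov_base = buf, .iov_len = len };
	struct iovec remote = { .iov_base = addr, .iov_len = len };

	return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}
```

For an ordinary buffer, a call such as remote_read(getpid(), src, dst, n)
normally succeeds when the environment permits the system call. For an
address inside a secret mapping, the test expects the call to fail.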

Signed-off-by: Mike Rapoport <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Elena Reshetova <[email protected]>
Cc: Hagen Paul Pfeifer <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rick Edgecombe <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Will Deacon <[email protected]>
---
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 3 +-
tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests | 17 ++
4 files changed, 316 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/vm/memfd_secret.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 9a35c3f6a557..c8deddc81e7a 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -21,4 +21,5 @@ va_128TBswitch
map_fixed_noreplace
write_to_hugetlbfs
hmm-tests
+memfd_secret
local_config.*
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index d42115e4284d..0200fb61646c 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -34,6 +34,7 @@ TEST_GEN_FILES += khugepaged
TEST_GEN_FILES += map_fixed_noreplace
TEST_GEN_FILES += map_hugetlb
TEST_GEN_FILES += map_populate
+TEST_GEN_FILES += memfd_secret
TEST_GEN_FILES += mlock-random-test
TEST_GEN_FILES += mlock2-tests
TEST_GEN_FILES += mremap_dontunmap
@@ -133,7 +134,7 @@ warn_32bit_failure:
endif
endif

-$(OUTPUT)/mlock-random-test: LDLIBS += -lcap
+$(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap

$(OUTPUT)/gup_test: ../../../../mm/gup_test.h

diff --git a/tools/testing/selftests/vm/memfd_secret.c b/tools/testing/selftests/vm/memfd_secret.c
new file mode 100644
index 000000000000..c878c2b841fc
--- /dev/null
+++ b/tools/testing/selftests/vm/memfd_secret.c
@@ -0,0 +1,296 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2020
+ *
+ * Author: Mike Rapoport <[email protected]>
+ */
+
+#define _GNU_SOURCE
+#include <sys/uio.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <sys/types.h>
+#include <sys/ptrace.h>
+#include <sys/syscall.h>
+#include <sys/resource.h>
+#include <sys/capability.h>
+
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+
+#include "../kselftest.h"
+
+#define fail(fmt, ...) ksft_test_result_fail(fmt, ##__VA_ARGS__)
+#define pass(fmt, ...) ksft_test_result_pass(fmt, ##__VA_ARGS__)
+#define skip(fmt, ...) ksft_test_result_skip(fmt, ##__VA_ARGS__)
+
+#ifdef __NR_memfd_secret
+
+#define PATTERN 0x55
+
+static const int prot = PROT_READ | PROT_WRITE;
+static const int mode = MAP_SHARED;
+
+static unsigned long page_size;
+static unsigned long mlock_limit_cur;
+static unsigned long mlock_limit_max;
+
+static int memfd_secret(unsigned long flags)
+{
+ return syscall(__NR_memfd_secret, flags);
+}
+
+static void test_file_apis(int fd)
+{
+ char buf[64];
+
+ if ((read(fd, buf, sizeof(buf)) >= 0) ||
+ (write(fd, buf, sizeof(buf)) >= 0) ||
+ (pread(fd, buf, sizeof(buf), 0) >= 0) ||
+ (pwrite(fd, buf, sizeof(buf), 0) >= 0))
+ fail("unexpected file IO\n");
+ else
+ pass("file IO is blocked as expected\n");
+}
+
+static void test_mlock_limit(int fd)
+{
+ size_t len;
+ char *mem;
+
+ len = mlock_limit_cur;
+ mem = mmap(NULL, len, prot, mode, fd, 0);
+ if (mem == MAP_FAILED) {
+ fail("unable to mmap secret memory\n");
+ return;
+ }
+ munmap(mem, len);
+
+ len = mlock_limit_max * 2;
+ mem = mmap(NULL, len, prot, mode, fd, 0);
+ if (mem != MAP_FAILED) {
+ fail("unexpected mlock limit violation\n");
+ munmap(mem, len);
+ return;
+ }
+
+ pass("mlock limit is respected\n");
+}
+
+static void try_process_vm_read(int fd, int pipefd[2])
+{
+ struct iovec liov, riov;
+ char buf[64];
+ char *mem;
+
+ if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+ fail("pipe read: %s\n", strerror(errno));
+ exit(KSFT_FAIL);
+ }
+
+ liov.iov_len = riov.iov_len = sizeof(buf);
+ liov.iov_base = buf;
+ riov.iov_base = mem;
+
+ if (process_vm_readv(getppid(), &liov, 1, &riov, 1, 0) < 0) {
+ if (errno == ENOSYS)
+ exit(KSFT_SKIP);
+ exit(KSFT_PASS);
+ }
+
+ exit(KSFT_FAIL);
+}
+
+static void try_ptrace(int fd, int pipefd[2])
+{
+ pid_t ppid = getppid();
+ int status;
+ char *mem;
+ long ret;
+
+ if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+ perror("pipe read");
+ exit(KSFT_FAIL);
+ }
+
+ ret = ptrace(PTRACE_ATTACH, ppid, 0, 0);
+ if (ret) {
+ perror("ptrace_attach");
+ exit(KSFT_FAIL);
+ }
+
+ ret = waitpid(ppid, &status, WUNTRACED);
+ if ((ret != ppid) || !(WIFSTOPPED(status))) {
+ fprintf(stderr, "weird waitpid result %ld stat %x\n",
+ ret, status);
+ exit(KSFT_FAIL);
+ }
+
+ if (ptrace(PTRACE_PEEKDATA, ppid, mem, 0))
+ exit(KSFT_PASS);
+
+ exit(KSFT_FAIL);
+}
+
+static void check_child_status(pid_t pid, const char *name)
+{
+ int status;
+
+ waitpid(pid, &status, 0);
+
+ if (WIFEXITED(status) && WEXITSTATUS(status) == KSFT_SKIP) {
+ skip("%s is not supported\n", name);
+ return;
+ }
+
+ if ((WIFEXITED(status) && WEXITSTATUS(status) == KSFT_PASS) ||
+ WIFSIGNALED(status)) {
+ pass("%s is blocked as expected\n", name);
+ return;
+ }
+
+ fail("%s: unexpected memory access\n", name);
+}
+
+static void test_remote_access(int fd, const char *name,
+ void (*func)(int fd, int pipefd[2]))
+{
+ int pipefd[2];
+ pid_t pid;
+ char *mem;
+
+ if (pipe(pipefd)) {
+ fail("pipe failed: %s\n", strerror(errno));
+ return;
+ }
+
+ pid = fork();
+ if (pid < 0) {
+ fail("fork failed: %s\n", strerror(errno));
+ return;
+ }
+
+ if (pid == 0) {
+ func(fd, pipefd);
+ return;
+ }
+
+ mem = mmap(NULL, page_size, prot, mode, fd, 0);
+ if (mem == MAP_FAILED) {
+ fail("Unable to mmap secret memory\n");
+ return;
+ }
+
+ ftruncate(fd, page_size);
+ memset(mem, PATTERN, page_size);
+
+ if (write(pipefd[1], &mem, sizeof(mem)) < 0) {
+ fail("pipe write: %s\n", strerror(errno));
+ return;
+ }
+
+ check_child_status(pid, name);
+}
+
+static void test_process_vm_read(int fd)
+{
+ test_remote_access(fd, "process_vm_read", try_process_vm_read);
+}
+
+static void test_ptrace(int fd)
+{
+ test_remote_access(fd, "ptrace", try_ptrace);
+}
+
+static int set_cap_limits(rlim_t max)
+{
+ struct rlimit new;
+ cap_t cap = cap_init();
+
+ new.rlim_cur = max;
+ new.rlim_max = max;
+ if (setrlimit(RLIMIT_MEMLOCK, &new)) {
+ perror("setrlimit() returns error");
+ return -1;
+ }
+
+ /* drop capabilities including CAP_IPC_LOCK */
+ if (cap_set_proc(cap)) {
+ perror("cap_set_proc() returns error");
+ return -2;
+ }
+
+ return 0;
+}
+
+static void prepare(void)
+{
+ struct rlimit rlim;
+
+ page_size = sysconf(_SC_PAGE_SIZE);
+ if (!page_size)
+ ksft_exit_fail_msg("Failed to get page size %s\n",
+ strerror(errno));
+
+ if (getrlimit(RLIMIT_MEMLOCK, &rlim))
+ ksft_exit_fail_msg("Unable to detect mlock limit: %s\n",
+ strerror(errno));
+
+ mlock_limit_cur = rlim.rlim_cur;
+ mlock_limit_max = rlim.rlim_max;
+
+ printf("page_size: %ld, mlock.soft: %ld, mlock.hard: %ld\n",
+ page_size, mlock_limit_cur, mlock_limit_max);
+
+ if (page_size > mlock_limit_cur)
+ mlock_limit_cur = page_size;
+ if (page_size > mlock_limit_max)
+ mlock_limit_max = page_size;
+
+ if (set_cap_limits(mlock_limit_max))
+ ksft_exit_fail_msg("Unable to set mlock limit: %s\n",
+ strerror(errno));
+}
+
+#define NUM_TESTS 4
+
+int main(int argc, char *argv[])
+{
+ int fd;
+
+ prepare();
+
+ ksft_print_header();
+ ksft_set_plan(NUM_TESTS);
+
+ fd = memfd_secret(0);
+ if (fd < 0) {
+ if (errno == ENOSYS)
+ ksft_exit_skip("memfd_secret is not supported\n");
+ else
+ ksft_exit_fail_msg("memfd_secret failed: %s\n",
+ strerror(errno));
+ }
+
+ test_mlock_limit(fd);
+ test_file_apis(fd);
+ test_process_vm_read(fd);
+ test_ptrace(fd);
+
+ close(fd);
+
+ ksft_exit(!ksft_get_fail_cnt());
+}
+
+#else /* __NR_memfd_secret */
+
+int main(int argc, char *argv[])
+{
+ printf("skip: skipping memfd_secret test (missing __NR_memfd_secret)\n");
+ return KSFT_SKIP;
+}
+
+#endif /* __NR_memfd_secret */
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index e953f3cd9664..95a67382f132 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -346,4 +346,21 @@ else
exitcode=1
fi

+echo "running memfd_secret test"
+echo "------------------------------------"
+./memfd_secret
+ret_val=$?
+
+if [ $ret_val -eq 0 ]; then
+ echo "[PASS]"
+elif [ $ret_val -eq $ksft_skip ]; then
+ echo "[SKIP]"
+ exitcode=$ksft_skip
+else
+ echo "[FAIL]"
+ exitcode=1
+fi
+
+exit $exitcode
+
exit $exitcode
--
2.28.0

2021-02-08 09:45:39

Subject: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas

Subject: [PATCH v17 02/10] mmap: make mlock_future_check() global

Subject: [PATCH v17 03/10] riscv/Kconfig: make direct map manipulation options depend on MMU

Subject: [PATCH v17 04/10] set_memory: allow set_direct_map_*_noflush() for multiple pages

Subject: [PATCH v17 05/10] set_memory: allow querying whether set_direct_map_*() is actually enabled

Subject: [PATCH v17 06/10] arm64: kfence: fix header inclusion

Subject: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users

Subject: [PATCH v17 09/10] arch, mm: wire up memfd_secret system call where relevant

Subject: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas

Subject: [PATCH v17 10/10] secretmem: test: add basic selftest for memfd_secret(2)

Subject: Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas

Subject: Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users

Subject: Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
