2022-01-20 10:38:27

by Chao Peng

Subject: [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private memory

This is v4 of the series that implements fd-based KVM guest private
memory. The patches are based on the latest kvm/queue branch commit:

fea31d169094 KVM: x86/pmu: Fix available_event_types check for
REF_CPU_CYCLES event

Introduction
------------
In general this patch series introduces an fd-based memslot which provides
guest memory through a memory file descriptor fd[offset,size] instead of
hva/size. The fd can be created from a supported memory filesystem like
tmpfs/hugetlbfs etc., which we refer to as the memory backing store. KVM
and the memory backing store exchange callbacks when such a memslot gets
created. At runtime KVM calls into the callbacks provided by the backing
store to get the pfn for a given fd+offset. The memory backing store in
turn calls into KVM callbacks when userspace fallocate()s or punches holes
on the fd, to notify KVM to map/unmap secondary MMU page tables.

Compared to the existing hva-based memslot, this new type of memslot
allows guest memory to be unmapped from host userspace (e.g. QEMU) and
even from the kernel itself, therefore reducing the attack surface and
preventing bugs.

Based on this fd-based memslot, we can build guest private memory that
is going to be used in confidential computing environments such as Intel
TDX and AMD SEV. When supported, the memory backing store can provide
more enforcement on the fd and KVM can use a single memslot to hold both
the private and shared part of the guest memory.

mm extension
------------
This series introduces a new F_SEAL_INACCESSIBLE seal for shmem and a new
MFD_INACCESSIBLE flag for memfd_create(). A file created with these flags
cannot be read(), written or mmap()ed via normal MMU operations; its
content can only be used through the newly introduced memfile_notifier
extension.

The memfile_notifier extension provides two sets of callbacks for KVM to
interact with the memory backing store (declarations sketched after this
list):
- memfile_notifier_ops: callbacks for memory backing store to notify
KVM when memory gets allocated/invalidated.
- memfile_pfn_ops: callbacks for KVM to call into memory backing store
to request memory pages for guest private memory.
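
For reference, these are the two callback interfaces as introduced in
patch 03 of this series:

	struct memfile_notifier_ops {
		void (*invalidate)(struct memfile_notifier *notifier,
				   pgoff_t start, pgoff_t end);
		void (*fallocate)(struct memfile_notifier *notifier,
				  pgoff_t start, pgoff_t end);
	};

	struct memfile_pfn_ops {
		long (*get_lock_pfn)(struct inode *inode, pgoff_t offset,
				     int *order);
		void (*put_unlock_pfn)(unsigned long pfn);
	};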

memslot extension
-----------------
Add a private fd and an fd offset to the existing 'shared' memslot so
that both private and shared guest memory can live in one single memslot.
A page in the memslot is either private or shared. A page is private only
when it is already allocated in the backing store fd; in all other cases
it is treated as shared, which includes pages already mapped as shared as
well as pages that have not been mapped yet. In other words, the memory
backing store is the source of truth for which pages are private.

Private memory map/unmap and conversion
---------------------------------------
Userspace's map/unmap operations are done via fallocate() on the backing
store fd.
- map: default fallocate() with mode=0.
- unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
The map/unmap operations trigger the memfile_notifier_ops above so that
KVM can map/unmap the secondary MMU page tables accordingly.
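
A minimal userspace sketch of the conversion flow described above
(illustrative only; 'memfd' stands for the backing store fd, 'offset' and
'len' for the page-aligned range to convert, and error handling is
omitted):

	/* Convert [offset, offset + len) to private: allocating backing
	 * pages triggers the fallocate callback, letting KVM zap any
	 * existing shared mapping for the range. */
	fallocate(memfd, 0, offset, len);

	/* Convert the range back to shared: punching a hole triggers the
	 * invalidate callback, letting KVM zap the private mapping. */
	fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  offset, len);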

Test
----
Testing the new functionality of this series requires the TDX patchset.
Since the TDX patchset has not been merged yet, I did two kinds of tests:

- Regression test on kvm/queue (this series)
Most of the new code is not covered; I only tested building and booting.

- New functional test on the latest TDX code
The series was rebased onto the latest TDX code and the new
functionality was tested there.

For the TDX tests please see the repos below:
Linux: https://github.com/chao-p/linux/tree/privmem-v4.3
QEMU: https://github.com/chao-p/qemu/tree/privmem-v4

And an example QEMU command line:
-object tdx-guest,id=tdx \
-object memory-backend-memfd-private,id=ram1,size=2G \
-machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1

Changelog
----------
v4:
- Decoupled the callbacks between KVM/mm from memfd and used the new
name 'memfile_notifier'.
- Supported registering multiple memslots to the same backing store.
- Added per-memslot pfn_ops instead of per-system.
- Reworked the invalidation part.
- Improved new KVM uAPIs (private memslot extension and memory
error) per Sean's suggestions.
- Addressed many other minor fixes for comments from v3.
v3:
- Added locking protection when calling
invalidate_page_range/fallocate callbacks.
- Changed the memslot structure to keep using useraddr for shared memory.
- Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
- Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
- Commit message improvement.
- Many small fixes for comments from the last version.

Links of previous discussions
-----------------------------
[1] Original design proposal:
https://lkml.kernel.org/kvm/[email protected]/
[2] Updated proposal and RFC patch v1:
https://lkml.kernel.org/linux-fsdevel/[email protected]/
[3] Patch v3: https://lkml.org/lkml/2021/12/23/283

Chao Peng (11):
mm/memfd: Introduce MFD_INACCESSIBLE flag
mm: Introduce memfile_notifier
mm/shmem: Support memfile_notifier
KVM: Extend the memslot to support fd-based private memory
KVM: Use kvm_userspace_memory_region_ext
KVM: Add KVM_EXIT_MEMORY_ERROR exit
KVM: Use memfile_pfn_ops to obtain pfn for private pages
KVM: Handle page fault for private memory
KVM: Register private memslot to memory backing store
KVM: Zap existing KVM mappings when pages changed in the private fd
KVM: Expose KVM_MEM_PRIVATE

Kirill A. Shutemov (1):
mm/shmem: Introduce F_SEAL_INACCESSIBLE

arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/mmu/mmu.c | 73 +++++++++++-
arch/x86/kvm/mmu/paging_tmpl.h | 11 +-
arch/x86/kvm/x86.c | 12 +-
include/linux/kvm_host.h | 49 +++++++-
include/linux/memfile_notifier.h | 53 +++++++++
include/linux/shmem_fs.h | 4 +
include/uapi/linux/fcntl.h | 1 +
include/uapi/linux/kvm.h | 17 +++
include/uapi/linux/memfd.h | 1 +
mm/Kconfig | 4 +
mm/Makefile | 1 +
mm/memfd.c | 20 +++-
mm/memfile_notifier.c | 99 ++++++++++++++++
mm/shmem.c | 121 +++++++++++++++++++-
virt/kvm/kvm_main.c | 188 +++++++++++++++++++++++++++----
16 files changed, 614 insertions(+), 41 deletions(-)
create mode 100644 include/linux/memfile_notifier.h
create mode 100644 mm/memfile_notifier.c

--
2.17.1


2022-01-20 10:38:49

by Chao Peng

Subject: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE

From: "Kirill A. Shutemov" <[email protected]>

Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
the file is inaccessible from userspace through ordinary MMU access
(e.g., read/write/mmap). However, the file content can be accessed
via a different mechanism (e.g. KVM MMU) indirectly.

It provides the semantics required for KVM guest private memory support:
a file descriptor with this seal set can be used as the source of guest
memory in confidential computing environments such as Intel TDX/AMD SEV,
while not being accessible from host userspace.

At this time only shmem implements this seal.
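
For illustration, the userspace-visible effect once the seal is set (how
the seal gets applied to a memfd is handled by the next patch, which adds
MFD_INACCESSIBLE):

	/* fd: a shmem-backed file with F_SEAL_INACCESSIBLE already set. */
	char buf[4096];

	read(fd, buf, sizeof(buf));			/* fails with EPERM */
	write(fd, buf, sizeof(buf));			/* fails with EPERM */
	mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);	/* fails with EPERM */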

Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/uapi/linux/fcntl.h | 1 +
mm/shmem.c | 40 ++++++++++++++++++++++++++++++++++++--
2 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..09ef34754dfa 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,7 @@
#define F_SEAL_GROW 0x0004 /* prevent file from growing */
#define F_SEAL_WRITE 0x0008 /* prevent writes */
#define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */
+#define F_SEAL_INACCESSIBLE 0x0020 /* prevent ordinary MMU access (e.g. read/write/mmap) to file content */
/* (1U << 31) is reserved for signed error codes */

/*
diff --git a/mm/shmem.c b/mm/shmem.c
index 18f93c2d68f1..72185630e7c4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1098,6 +1098,13 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
(newsize > oldsize && (info->seals & F_SEAL_GROW)))
return -EPERM;

+ if (info->seals & F_SEAL_INACCESSIBLE) {
+ if (i_size_read(inode))
+ return -EPERM;
+ if (newsize & ~PAGE_MASK)
+ return -EINVAL;
+ }
+
if (newsize != oldsize) {
error = shmem_reacct_size(SHMEM_I(inode)->flags,
oldsize, newsize);
@@ -1364,6 +1371,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
goto redirty;
if (!total_swap_pages)
goto redirty;
+ if (info->seals & F_SEAL_INACCESSIBLE)
+ goto redirty;

/*
* Our capabilities prevent regular writeback or sync from ever calling
@@ -2262,6 +2271,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
if (ret)
return ret;

+ if (info->seals & F_SEAL_INACCESSIBLE)
+ return -EPERM;
+
/* arm64 - allow memory tagging on RAM-based files */
vma->vm_flags |= VM_MTE_ALLOWED;

@@ -2459,12 +2471,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
pgoff_t index = pos >> PAGE_SHIFT;

/* i_rwsem is held by caller */
- if (unlikely(info->seals & (F_SEAL_GROW |
- F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) {
+ if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE |
+ F_SEAL_FUTURE_WRITE |
+ F_SEAL_INACCESSIBLE))) {
if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))
return -EPERM;
if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
return -EPERM;
+ if (info->seals & F_SEAL_INACCESSIBLE)
+ return -EPERM;
}

return shmem_getpage(inode, index, pagep, SGP_WRITE);
@@ -2538,6 +2553,21 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
end_index = i_size >> PAGE_SHIFT;
if (index > end_index)
break;
+
+ /*
+ * inode_lock protects setting up seals as well as write to
+ * i_size. Setting F_SEAL_INACCESSIBLE only allowed with
+ * i_size == 0.
+ *
+ * Check F_SEAL_INACCESSIBLE after i_size. It effectively
+ * serializes read vs. setting F_SEAL_INACCESSIBLE without
+ * taking inode_lock in the read path.
+ */
+ if (SHMEM_I(inode)->seals & F_SEAL_INACCESSIBLE) {
+ error = -EPERM;
+ break;
+ }
+
if (index == end_index) {
nr = i_size & ~PAGE_MASK;
if (nr <= offset)
@@ -2663,6 +2693,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
goto out;
}

+ if ((info->seals & F_SEAL_INACCESSIBLE) &&
+ (offset & ~PAGE_MASK || len & ~PAGE_MASK)) {
+ error = -EINVAL;
+ goto out;
+ }
+
shmem_falloc.waitq = &shmem_falloc_waitq;
shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
--
2.17.1

2022-01-20 10:39:54

by Chao Peng

Subject: [PATCH v4 02/12] mm/memfd: Introduce MFD_INACCESSIBLE flag

Introduce a new memfd_create() flag indicating the content of the
created memfd is inaccessible from userspace. It does this by force
setting the F_SEAL_INACCESSIBLE seal when the file is created. It also
sets F_SEAL_SEAL to prevent future sealing, which means it cannot coexist
with MFD_ALLOW_SEALING.

The pages backing such a memfd will be used as guest private memory in
confidential computing environments such as Intel TDX/AMD SEV. Since
page migration/swapping is not yet supported for such usage, these
pages are currently marked as UNMOVABLE and UNEVICTABLE, which makes
them behave like long-term pinned pages.
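
A minimal sketch of the intended usage (illustrative only; the fd is
later handed to KVM as guest private memory via the private_fd memslot
field added in this series):

	/* Create a memfd whose content userspace can never read(),
	 * write() or mmap(); it is born with F_SEAL_INACCESSIBLE and
	 * F_SEAL_SEAL, so MFD_ALLOW_SEALING must not be passed. */
	int fd = memfd_create("guest-private-mem", MFD_INACCESSIBLE);

	/* Size it; with the seal set, only page-aligned sizes are
	 * accepted. Pages are then allocated with fallocate(). */
	ftruncate(fd, 2UL << 30);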

Signed-off-by: Chao Peng <[email protected]>
---
include/uapi/linux/memfd.h | 1 +
mm/memfd.c | 20 +++++++++++++++++++-
2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
#define MFD_CLOEXEC 0x0001U
#define MFD_ALLOW_SEALING 0x0002U
#define MFD_HUGETLB 0x0004U
+#define MFD_INACCESSIBLE 0x0008U

/*
* Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/memfd.c b/mm/memfd.c
index 9f80f162791a..26998d96dc11 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -245,16 +245,19 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)

-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+ MFD_INACCESSIBLE)

SYSCALL_DEFINE2(memfd_create,
const char __user *, uname,
unsigned int, flags)
{
+ struct address_space *mapping;
unsigned int *file_seals;
struct file *file;
int fd, error;
char *name;
+ gfp_t gfp;
long len;

if (!(flags & MFD_HUGETLB)) {
@@ -267,6 +270,10 @@ SYSCALL_DEFINE2(memfd_create,
return -EINVAL;
}

+ /* Disallow sealing when MFD_INACCESSIBLE is set. */
+ if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
+ return -EINVAL;
+
/* length includes terminating zero */
len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
if (len <= 0)
@@ -315,6 +322,17 @@ SYSCALL_DEFINE2(memfd_create,
*file_seals &= ~F_SEAL_SEAL;
}

+ if (flags & MFD_INACCESSIBLE) {
+ mapping = file_inode(file)->i_mapping;
+ gfp = mapping_gfp_mask(mapping);
+ gfp &= ~__GFP_MOVABLE;
+ mapping_set_gfp_mask(mapping, gfp);
+ mapping_set_unevictable(mapping);
+
+ file_seals = memfd_file_seals_ptr(file);
+ *file_seals &= F_SEAL_SEAL | F_SEAL_INACCESSIBLE;
+ }
+
fd_install(fd, file);
kfree(name);
return fd;
--
2.17.1

2022-01-20 10:41:01

by Chao Peng

Subject: [PATCH v4 03/12] mm: Introduce memfile_notifier

This patch introduces the memfile_notifier facility so that existing
memory file subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to
a third kernel component, which can make use of the memory bookmarked in
the memory file and get notified when pages in the memory file get
allocated/invalidated.

It will be used by KVM to make a file descriptor the guest memory backing
store; KVM will use this memfile_notifier interface to interact with the
memory file subsystems. In the future there might be other consumers
(e.g. VFIO with encrypted device memory).

It consists of two sets of callbacks:
- memfile_notifier_ops: callbacks for memory backing store to notify
KVM when memory gets allocated/invalidated.
- memfile_pfn_ops: callbacks for KVM to call into memory backing store
to request memory pages for guest private memory.

Userspace is in charge of the guest memory lifecycle: it first allocates
pages in the memory backing store, then passes the fd to KVM and lets KVM
register each memory slot to the memory backing store via
memfile_register_notifier.

A supporting memory backing store should maintain a memfile_notifier list
and provide a routine for memfile_register_notifier to look up the list
head address and the memfile_pfn_ops callbacks. It should also call
memfile_notifier_fallocate/memfile_notifier_invalidate when the
bookmarked memory gets allocated/invalidated.
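
A consumer-side sketch of the registration flow described above
(illustrative only; 'my_notifier_ops', 'file' and 'offset' are
placeholders, not symbols added by this patch):

	static int consumer_attach(struct file *file,
				   struct memfile_notifier *notifier)
	{
		struct memfile_pfn_ops *pfn_ops;
		int order, ret;
		long pfn;

		/* Callbacks invoked on fallocate()/hole-punch of the fd. */
		notifier->ops = &my_notifier_ops;

		/* Register against the inode backing the fd; on success the
		 * backing store also hands back its pfn_ops. */
		ret = memfile_register_notifier(file_inode(file), notifier,
						&pfn_ops);
		if (ret)
			return ret;

		/* Later: look up (and lock) the page at a given offset, use
		 * its pfn, then unlock it again. */
		pfn = pfn_ops->get_lock_pfn(file_inode(file), offset, &order);
		if (pfn >= 0)
			pfn_ops->put_unlock_pfn(pfn);

		/* Tear down when the consumer goes away. */
		memfile_unregister_notifier(file_inode(file), notifier);
		return 0;
	}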

Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/memfile_notifier.h | 53 +++++++++++++++++++
mm/Kconfig | 4 ++
mm/Makefile | 1 +
mm/memfile_notifier.c | 89 ++++++++++++++++++++++++++++++++
4 files changed, 147 insertions(+)
create mode 100644 include/linux/memfile_notifier.h
create mode 100644 mm/memfile_notifier.c

diff --git a/include/linux/memfile_notifier.h b/include/linux/memfile_notifier.h
new file mode 100644
index 000000000000..a03bebdd1322
--- /dev/null
+++ b/include/linux/memfile_notifier.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMFILE_NOTIFIER_H
+#define _LINUX_MEMFILE_NOTIFIER_H
+
+#include <linux/rculist.h>
+#include <linux/spinlock.h>
+#include <linux/srcu.h>
+#include <linux/fs.h>
+
+struct memfile_notifier;
+
+struct memfile_notifier_ops {
+ void (*invalidate)(struct memfile_notifier *notifier,
+ pgoff_t start, pgoff_t end);
+ void (*fallocate)(struct memfile_notifier *notifier,
+ pgoff_t start, pgoff_t end);
+};
+
+struct memfile_pfn_ops {
+ long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order);
+ void (*put_unlock_pfn)(unsigned long pfn);
+};
+
+struct memfile_notifier {
+ struct list_head list;
+ struct memfile_notifier_ops *ops;
+};
+
+struct memfile_notifier_list {
+ struct list_head head;
+ spinlock_t lock;
+};
+
+#ifdef CONFIG_MEMFILE_NOTIFIER
+static inline void memfile_notifier_list_init(struct memfile_notifier_list *list)
+{
+ INIT_LIST_HEAD(&list->head);
+ spin_lock_init(&list->lock);
+}
+
+extern void memfile_notifier_invalidate(struct memfile_notifier_list *list,
+ pgoff_t start, pgoff_t end);
+extern void memfile_notifier_fallocate(struct memfile_notifier_list *list,
+ pgoff_t start, pgoff_t end);
+extern int memfile_register_notifier(struct inode *inode,
+ struct memfile_notifier *notifier,
+ struct memfile_pfn_ops **pfn_ops);
+extern void memfile_unregister_notifier(struct inode *inode,
+ struct memfile_notifier *notifier);
+
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
+#endif /* _LINUX_MEMFILE_NOTIFIER_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 28edafc820ad..fa31eda3c895 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -900,6 +900,10 @@ config IO_MAPPING
config SECRETMEM
def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED

+config MEMFILE_NOTIFIER
+ bool
+ select SRCU
+
source "mm/damon/Kconfig"

endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d6c0042e3aa0..80588f7c3bc2 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -130,3 +130,4 @@ obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
obj-$(CONFIG_IO_MAPPING) += io-mapping.o
obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
+obj-$(CONFIG_MEMFILE_NOTIFIER) += memfile_notifier.o
diff --git a/mm/memfile_notifier.c b/mm/memfile_notifier.c
new file mode 100644
index 000000000000..8171d4601a04
--- /dev/null
+++ b/mm/memfile_notifier.c
@@ -0,0 +1,89 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * linux/mm/memfile_notifier.c
+ *
+ * Copyright (C) 2022 Intel Corporation.
+ * Chao Peng <[email protected]>
+ */
+
+#include <linux/memfile_notifier.h>
+#include <linux/srcu.h>
+
+DEFINE_STATIC_SRCU(srcu);
+
+void memfile_notifier_invalidate(struct memfile_notifier_list *list,
+ pgoff_t start, pgoff_t end)
+{
+ struct memfile_notifier *notifier;
+ int id;
+
+ id = srcu_read_lock(&srcu);
+ list_for_each_entry_srcu(notifier, &list->head, list,
+ srcu_read_lock_held(&srcu)) {
+ if (notifier->ops && notifier->ops->invalidate)
+ notifier->ops->invalidate(notifier, start, end);
+ }
+ srcu_read_unlock(&srcu, id);
+}
+
+void memfile_notifier_fallocate(struct memfile_notifier_list *list,
+ pgoff_t start, pgoff_t end)
+{
+ struct memfile_notifier *notifier;
+ int id;
+
+ id = srcu_read_lock(&srcu);
+ list_for_each_entry_srcu(notifier, &list->head, list,
+ srcu_read_lock_held(&srcu)) {
+ if (notifier->ops && notifier->ops->fallocate)
+ notifier->ops->fallocate(notifier, start, end);
+ }
+ srcu_read_unlock(&srcu, id);
+}
+
+static int memfile_get_notifier_info(struct inode *inode,
+ struct memfile_notifier_list **list,
+ struct memfile_pfn_ops **ops)
+{
+ return -EOPNOTSUPP;
+}
+
+int memfile_register_notifier(struct inode *inode,
+ struct memfile_notifier *notifier,
+ struct memfile_pfn_ops **pfn_ops)
+{
+ struct memfile_notifier_list *list;
+ int ret;
+
+ if (!inode || !notifier || !pfn_ops)
+ return -EINVAL;
+
+ ret = memfile_get_notifier_info(inode, &list, pfn_ops);
+ if (ret)
+ return ret;
+
+ spin_lock(&list->lock);
+ list_add_rcu(&notifier->list, &list->head);
+ spin_unlock(&list->lock);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(memfile_register_notifier);
+
+void memfile_unregister_notifier(struct inode *inode,
+ struct memfile_notifier *notifier)
+{
+ struct memfile_notifier_list *list;
+
+ if (!inode || !notifier)
+ return;
+
+ BUG_ON(memfile_get_notifier_info(inode, &list, NULL));
+
+ spin_lock(&list->lock);
+ list_del_rcu(&notifier->list);
+ spin_unlock(&list->lock);
+
+ synchronize_srcu(&srcu);
+}
+EXPORT_SYMBOL_GPL(memfile_unregister_notifier);
--
2.17.1

2022-01-20 10:41:03

by Chao Peng

Subject: [PATCH v4 05/12] KVM: Extend the memslot to support fd-based private memory

Extend the memslot definition to provide fd-based private memory support
by adding two new fields (private_fd/private_offset). The memslot can
then maintain both shared pages and private pages in a single memslot.
Shared pages are provided through the existing userspace_addr (hva) field
and private pages are provided through the new private_fd/private_offset
fields.

Since there is no 'hva' concept for private memory, we cannot rely on
get_user_pages() to get a pfn; instead we use the newly added
memfile_notifier to do the same job.

This new extension is indicated by a new flag KVM_MEM_PRIVATE.
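
A userspace sketch of configuring such a memslot with the new fields
(this relies on the uAPI copying and KVM_MEM_PRIVATE exposure added by
later patches in this series; 'vm_fd', 'memfd' and 'shared_hva' are
illustrative):

	struct kvm_userspace_memory_region_ext region = {
		.region = {
			.slot            = 0,
			.flags           = KVM_MEM_PRIVATE,
			.guest_phys_addr = 0x0,
			.memory_size     = 2UL << 30,
			/* shared pages still come from an hva, as before */
			.userspace_addr  = (__u64)shared_hva,
		},
		/* private pages come from an fd created with
		 * MFD_INACCESSIBLE, at the given offset */
		.private_fd     = memfd,
		.private_offset = 0,
	};

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);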

Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 7 +++++++
include/uapi/linux/kvm.h | 8 ++++++++
2 files changed, 15 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f079820f52b5..5011ac35bc50 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -458,8 +458,15 @@ struct kvm_memory_slot {
u32 flags;
short id;
u16 as_id;
+ struct file *private_file;
+ loff_t private_offset;
};

+static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
+{
+ return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
{
return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index fbfd70d965c6..5d6dceb1b93e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,13 @@ struct kvm_userspace_memory_region {
__u64 userspace_addr; /* start of the userspace allocated memory */
};

+struct kvm_userspace_memory_region_ext {
+ struct kvm_userspace_memory_region region;
+ __u64 private_offset;
+ __u32 private_fd;
+ __u32 padding[5];
+};
+
/*
* The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
* other bits are reserved for kvm internal use which are defined in
@@ -110,6 +117,7 @@ struct kvm_userspace_memory_region {
*/
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
+#define KVM_MEM_PRIVATE (1UL << 2)

/* for KVM_IRQ_LINE */
struct kvm_irq_level {
--
2.17.1

2022-01-20 10:41:32

by Chao Peng

Subject: [PATCH v4 06/12] KVM: Use kvm_userspace_memory_region_ext

Use the new extended memslot structure kvm_userspace_memory_region_ext.
The extended part (private_fd/private_offset) is copied from userspace
only when KVM_MEM_PRIVATE is set. Internally the old
kvm_userspace_memory_region is still used where the extended fields are
not needed.

Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/x86.c | 12 ++++++------
include/linux/kvm_host.h | 4 ++--
virt/kvm/kvm_main.c | 30 ++++++++++++++++++++----------
3 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c194a8cbd25f..7f8d87463391 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11572,13 +11572,13 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
}

for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
- struct kvm_userspace_memory_region m;
+ struct kvm_userspace_memory_region_ext m;

- m.slot = id | (i << 16);
- m.flags = 0;
- m.guest_phys_addr = gpa;
- m.userspace_addr = hva;
- m.memory_size = size;
+ m.region.slot = id | (i << 16);
+ m.region.flags = 0;
+ m.region.guest_phys_addr = gpa;
+ m.region.userspace_addr = hva;
+ m.region.memory_size = size;
r = __kvm_set_memory_region(kvm, &m);
if (r < 0)
return ERR_PTR_USR(r);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5011ac35bc50..26118a45f0bb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -977,9 +977,9 @@ enum kvm_mr_change {
};

int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
+ const struct kvm_userspace_memory_region_ext *region_ext);
int __kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
+ const struct kvm_userspace_memory_region_ext *region_ext);
void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 168d0ab93c88..ecf94e2548f7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1815,8 +1815,9 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
* Must be called holding kvm->slots_lock for write.
*/
int __kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem)
+ const struct kvm_userspace_memory_region_ext *region_ext)
{
+ const struct kvm_userspace_memory_region *mem = &region_ext->region;
struct kvm_memory_slot *old, *new;
struct kvm_memslots *slots;
enum kvm_mr_change change;
@@ -1919,24 +1920,24 @@ int __kvm_set_memory_region(struct kvm *kvm,
EXPORT_SYMBOL_GPL(__kvm_set_memory_region);

int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem)
+ const struct kvm_userspace_memory_region_ext *region_ext)
{
int r;

mutex_lock(&kvm->slots_lock);
- r = __kvm_set_memory_region(kvm, mem);
+ r = __kvm_set_memory_region(kvm, region_ext);
mutex_unlock(&kvm->slots_lock);
return r;
}
EXPORT_SYMBOL_GPL(kvm_set_memory_region);

static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
- struct kvm_userspace_memory_region *mem)
+ struct kvm_userspace_memory_region_ext *region_ext)
{
- if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
+ if ((u16)region_ext->region.slot >= KVM_USER_MEM_SLOTS)
return -EINVAL;

- return kvm_set_memory_region(kvm, mem);
+ return kvm_set_memory_region(kvm, region_ext);
}

#ifndef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
@@ -4482,14 +4483,23 @@ static long kvm_vm_ioctl(struct file *filp,
break;
}
case KVM_SET_USER_MEMORY_REGION: {
- struct kvm_userspace_memory_region kvm_userspace_mem;
+ struct kvm_userspace_memory_region_ext region_ext;

r = -EFAULT;
- if (copy_from_user(&kvm_userspace_mem, argp,
- sizeof(kvm_userspace_mem)))
+ if (copy_from_user(&region_ext, argp,
+ sizeof(struct kvm_userspace_memory_region)))
goto out;
+ if (region_ext.region.flags & KVM_MEM_PRIVATE) {
+ int offset = offsetof(
+ struct kvm_userspace_memory_region_ext,
+ private_offset);
+ if (copy_from_user(&region_ext.private_offset,
+ argp + offset,
+ sizeof(region_ext) - offset))
+ goto out;
+ }

- r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+ r = kvm_vm_ioctl_set_memory_region(kvm, &region_ext);
break;
}
case KVM_GET_DIRTY_LOG: {
--
2.17.1

2022-01-20 10:41:47

by Chao Peng

Subject: [PATCH v4 08/12] KVM: Use memfile_pfn_ops to obtain pfn for private pages

Private pages are not mmap()ed into userspace, so we cannot rely on
get_user_pages() to obtain the pfn. Instead we add a memfile_pfn_ops
pointer pfn_ops to each private memslot and use it to obtain the pfn
for a gfn. To do that, KVM converts the gfn to an offset into the fd
and then calls the get_lock_pfn callback. Once KVM completes its job
it calls put_unlock_pfn to unlock the pfn. Note the pfn (page) stays
locked between get_lock_pfn/put_unlock_pfn to ensure the pfn remains
valid while KVM uses it to establish the mapping in the secondary MMU
page table.

The pfn_ops is initialized via memfile_register_notifier from the memory
backing store that provides the private_fd.

Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/Kconfig | 1 +
include/linux/kvm_host.h | 33 +++++++++++++++++++++++++++++++++
2 files changed, 34 insertions(+)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ebc8ce9ec917..5d5bebaad9e7 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -47,6 +47,7 @@ config KVM
select SRCU
select INTERVAL_TREE
select HAVE_KVM_PM_NOTIFIER if PM
+ select MEMFILE_NOTIFIER
help
Support hosting fully virtualized guest machines using hardware
virtualization extensions. You will need a fairly recent
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 26118a45f0bb..927e7f44a02a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -42,6 +42,7 @@

#include <asm/kvm_host.h>
#include <linux/kvm_dirty_ring.h>
+#include <linux/memfile_notifier.h>

#ifndef KVM_MAX_VCPU_IDS
#define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
@@ -460,6 +461,7 @@ struct kvm_memory_slot {
u16 as_id;
struct file *private_file;
loff_t private_offset;
+ struct memfile_pfn_ops *pfn_ops;
};

static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
@@ -810,6 +812,7 @@ static inline void kvm_irqfd_exit(void)
{
}
#endif
+
int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
struct module *module);
void kvm_exit(void);
@@ -2103,4 +2106,34 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
/* Max number of entries allowed for each kvm dirty ring */
#define KVM_DIRTY_RING_MAX_ENTRIES 65536

+#ifdef CONFIG_MEMFILE_NOTIFIER
+static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
+ int *order)
+{
+ pgoff_t index = gfn - slot->base_gfn +
+ (slot->private_offset >> PAGE_SHIFT);
+
+ return slot->pfn_ops->get_lock_pfn(file_inode(slot->private_file),
+ index, order);
+}
+
+static inline void kvm_memfile_put_pfn(struct kvm_memory_slot *slot,
+ kvm_pfn_t pfn)
+{
+ slot->pfn_ops->put_unlock_pfn(pfn);
+}
+
+#else
+static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
+ int *order)
+{
+ return -1;
+}
+
+static inline void kvm_memfile_put_pfn(struct kvm_memory_slot *slot,
+ kvm_pfn_t pfn)
+{
+}
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
#endif
--
2.17.1

2022-01-20 10:42:07

by Chao Peng

Subject: [PATCH v4 04/12] mm/shmem: Support memfile_notifier

This patch maintains a memfile_notifier list in the shmem_inode_info
structure and implements the memfile_pfn_ops callbacks defined by
memfile_notifier. It then exposes them to memfile_notifier via
shmem_get_memfile_notifier_info.

We use SGP_NOALLOC in shmem_get_lock_pfn since, for private memory, the
pages should already have been allocated by userspace. If no page is
allocated at the offset then an error is returned, so KVM knows that
the memory is not private memory.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/shmem_fs.h | 4 ++
mm/memfile_notifier.c | 12 +++++-
mm/shmem.c | 81 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 166158b6e917..461633587eaf 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -9,6 +9,7 @@
#include <linux/percpu_counter.h>
#include <linux/xattr.h>
#include <linux/fs_parser.h>
+#include <linux/memfile_notifier.h>

/* inode in-kernel data */

@@ -24,6 +25,9 @@ struct shmem_inode_info {
struct shared_policy policy; /* NUMA memory alloc policy */
struct simple_xattrs xattrs; /* list of xattrs */
atomic_t stop_eviction; /* hold when working on inode */
+#ifdef CONFIG_MEMFILE_NOTIFIER
+ struct memfile_notifier_list memfile_notifiers;
+#endif
struct inode vfs_inode;
};

diff --git a/mm/memfile_notifier.c b/mm/memfile_notifier.c
index 8171d4601a04..b4699cbf629e 100644
--- a/mm/memfile_notifier.c
+++ b/mm/memfile_notifier.c
@@ -41,11 +41,21 @@ void memfile_notifier_fallocate(struct memfile_notifier_list *list,
srcu_read_unlock(&srcu, id);
}

+#ifdef CONFIG_SHMEM
+extern int shmem_get_memfile_notifier_info(struct inode *inode,
+ struct memfile_notifier_list **list,
+ struct memfile_pfn_ops **ops);
+#endif
+
static int memfile_get_notifier_info(struct inode *inode,
struct memfile_notifier_list **list,
struct memfile_pfn_ops **ops)
{
- return -EOPNOTSUPP;
+ int ret = -EOPNOTSUPP;
+#ifdef CONFIG_SHMEM
+ ret = shmem_get_memfile_notifier_info(inode, list, ops);
+#endif
+ return ret;
}

int memfile_register_notifier(struct inode *inode,
diff --git a/mm/shmem.c b/mm/shmem.c
index 72185630e7c4..00af869d26ce 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -906,6 +906,28 @@ static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
return split_huge_page(page) >= 0;
}

+static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+#ifdef CONFIG_MEMFILE_NOTIFIER
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ memfile_notifier_fallocate(&info->memfile_notifiers, start, end);
+#endif
+}
+
+static void notify_invalidate_page(struct inode *inode, struct page *page,
+ pgoff_t start, pgoff_t end)
+{
+#ifdef CONFIG_MEMFILE_NOTIFIER
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ start = max(start, page->index);
+ end = min(end, page->index + thp_nr_pages(page));
+
+ memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
+#endif
+}
+
/*
* Remove range of pages and swap entries from page cache, and free them.
* If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -949,6 +971,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
}
index += thp_nr_pages(page) - 1;

+ notify_invalidate_page(inode, page, start, end);
+
if (!unfalloc || !PageUptodate(page))
truncate_inode_page(mapping, page);
unlock_page(page);
@@ -1025,6 +1049,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
index--;
break;
}
+
+ notify_invalidate_page(inode, page, start, end);
+
VM_BUG_ON_PAGE(PageWriteback(page), page);
if (shmem_punch_compound(page, start, end))
truncate_inode_page(mapping, page);
@@ -2313,6 +2340,9 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
info->flags = flags & VM_NORESERVE;
INIT_LIST_HEAD(&info->shrinklist);
INIT_LIST_HEAD(&info->swaplist);
+#ifdef CONFIG_MEMFILE_NOTIFIER
+ memfile_notifier_list_init(&info->memfile_notifiers);
+#endif
simple_xattrs_init(&info->xattrs);
cache_no_acl(inode);
mapping_set_large_folios(inode->i_mapping);
@@ -2818,6 +2848,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
i_size_write(inode, offset + len);
inode->i_ctime = current_time(inode);
+ notify_fallocate(inode, start, end);
undone:
spin_lock(&inode->i_lock);
inode->i_private = NULL;
@@ -4002,6 +4033,56 @@ struct kobj_attribute shmem_enabled_attr =
__ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */

+#ifdef CONFIG_MEMFILE_NOTIFIER
+static long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset, int *order)
+{
+ struct page *page;
+ int ret;
+
+ ret = shmem_getpage(inode, offset, &page, SGP_NOALLOC);
+ if (ret)
+ return ret;
+
+ *order = thp_order(compound_head(page));
+
+ return page_to_pfn(page);
+}
+
+static void shmem_put_unlock_pfn(unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+ set_page_dirty(page);
+ unlock_page(page);
+ put_page(page);
+}
+
+static struct memfile_pfn_ops shmem_pfn_ops = {
+ .get_lock_pfn = shmem_get_lock_pfn,
+ .put_unlock_pfn = shmem_put_unlock_pfn,
+};
+
+int shmem_get_memfile_notifier_info(struct inode *inode,
+ struct memfile_notifier_list **list,
+ struct memfile_pfn_ops **ops)
+{
+ struct shmem_inode_info *info;
+
+ if (!shmem_mapping(inode->i_mapping))
+ return -EINVAL;
+
+ info = SHMEM_I(inode);
+ *list = &info->memfile_notifiers;
+ if (ops)
+ *ops = &shmem_pfn_ops;
+
+ return 0;
+}
+
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
#else /* !CONFIG_SHMEM */

/*
--
2.17.1

2022-01-20 10:43:19

by Chao Peng

Subject: [PATCH v4 09/12] KVM: Handle page fault for private memory

When a page fault happens for a memslot with KVM_MEM_PRIVATE, we use
kvm_memfile_get_pfn(), which calls into the memfile_pfn_ops callbacks
defined for each memslot, to request the pfn from the memory backing
store.

One assumption is that private pages are persistent and pre-allocated in
the private memory fd (backing store), so KVM uses this information as
the indicator of whether a page is private or shared (i.e. the private fd
is the final source of truth as to whether or not a GPA is private).

Depending on whether the access is private or shared, we take different
paths (summarized in the sketch after the list below):
- For a private access, KVM checks if the page is already allocated in
the memory backing store; if yes, KVM establishes the mapping, otherwise
it exits to userspace to convert a shared page to a private one.

- For a shared access, KVM also checks if the page is already allocated
in the memory backing store; if yes, it exits to userspace to convert a
private page to a shared one, otherwise the page is treated as
traditional hva-based shared memory and KVM lets the existing code obtain
a pfn with get_user_pages() and establish the mapping.
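
Summarized as a sketch (illustrative only; in the patch below this logic
lives in kvm_faultin_pfn_private(), where "allocated in the backing
store" corresponds to get_lock_pfn() succeeding):

	enum fault_action {
		MAP_PRIVATE_PFN,		/* map pfn from private fd */
		MAP_SHARED_VIA_GUP,		/* existing hva-based path */
		EXIT_CONVERT_TO_PRIVATE,	/* shared -> private request */
		EXIT_CONVERT_TO_SHARED,		/* private -> shared request */
	};

	static enum fault_action decide(bool access_is_private,
					bool allocated_in_fd)
	{
		if (access_is_private)
			return allocated_in_fd ? MAP_PRIVATE_PFN
					       : EXIT_CONVERT_TO_PRIVATE;
		return allocated_in_fd ? EXIT_CONVERT_TO_SHARED
				       : MAP_SHARED_VIA_GUP;
	}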

Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 73 ++++++++++++++++++++++++++++++++--
arch/x86/kvm/mmu/paging_tmpl.h | 11 +++--
2 files changed, 77 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1d275e9d76b5..df526ab7e657 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2873,6 +2873,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;

+ if (kvm_slot_is_private(slot))
+ return max_level;
+
host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
return min(host_level, max_level);
}
@@ -3903,7 +3906,59 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
}

-static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
+static bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+ /*
+ * At this time private gfn has not been supported yet. Other patch
+ * that enables it should change this.
+ */
+ return false;
+}
+
+static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault,
+ bool *is_private_pfn, int *r)
+{
+ int order;
+ unsigned int flags = 0;
+ struct kvm_memory_slot *slot = fault->slot;
+ long pfn = kvm_memfile_get_pfn(slot, fault->gfn, &order);
+
+ if (kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT)) {
+ if (pfn < 0)
+ flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
+ else {
+ fault->pfn = pfn;
+ if (slot->flags & KVM_MEM_READONLY)
+ fault->map_writable = false;
+ else
+ fault->map_writable = true;
+
+ if (order == 0)
+ fault->max_level = PG_LEVEL_4K;
+ *is_private_pfn = true;
+ *r = RET_PF_FIXED;
+ return true;
+ }
+ } else {
+ if (pfn < 0)
+ return false;
+
+ kvm_memfile_put_pfn(slot, pfn);
+ }
+
+ vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR;
+ vcpu->run->memory.flags = flags;
+ vcpu->run->memory.padding = 0;
+ vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+ vcpu->run->memory.size = PAGE_SIZE;
+ fault->pfn = -1;
+ *r = -1;
+ return true;
+}
+
+static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+ bool *is_private_pfn, int *r)
{
struct kvm_memory_slot *slot = fault->slot;
bool async;
@@ -3937,6 +3992,10 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
}
}

+ if (kvm_slot_is_private(slot) &&
+ kvm_faultin_pfn_private(vcpu, fault, is_private_pfn, r))
+ return *r == RET_PF_FIXED ? false : true;
+
async = false;
fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
fault->write, &fault->map_writable,
@@ -3997,6 +4056,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);

unsigned long mmu_seq;
+ bool is_private_pfn = false;
int r;

fault->gfn = fault->addr >> PAGE_SHIFT;
@@ -4016,7 +4076,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();

- if (kvm_faultin_pfn(vcpu, fault, &r))
+ if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r))
return r;

if (handle_abnormal_pfn(vcpu, fault, ACC_ALL, &r))
@@ -4029,7 +4089,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
else
write_lock(&vcpu->kvm->mmu_lock);

- if (is_page_fault_stale(vcpu, fault, mmu_seq))
+ if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq))
goto out_unlock;

r = make_mmu_pages_available(vcpu);
@@ -4046,7 +4106,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
read_unlock(&vcpu->kvm->mmu_lock);
else
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+
+ if (is_private_pfn)
+ kvm_memfile_put_pfn(fault->slot, fault->pfn);
+ else
+ kvm_release_pfn_clean(fault->pfn);
+
return r;
}

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 5b5bdac97c7b..a1d26b50a5ec 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -825,6 +825,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
int r;
unsigned long mmu_seq;
bool is_self_change_mapping;
+ bool is_private_pfn = false;
+

pgprintk("%s: addr %lx err %x\n", __func__, fault->addr, fault->error_code);
WARN_ON_ONCE(fault->is_tdp);
@@ -873,7 +875,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();

- if (kvm_faultin_pfn(vcpu, fault, &r))
+ if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r))
return r;

if (handle_abnormal_pfn(vcpu, fault, walker.pte_access, &r))
@@ -901,7 +903,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
r = RET_PF_RETRY;
write_lock(&vcpu->kvm->mmu_lock);

- if (is_page_fault_stale(vcpu, fault, mmu_seq))
+ if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq))
goto out_unlock;

kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
@@ -913,7 +915,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault

out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ if (is_private_pfn)
+ kvm_memfile_put_pfn(fault->slot, fault->pfn);
+ else
+ kvm_release_pfn_clean(fault->pfn);
return r;
}

--
2.17.1

2022-01-20 10:43:40

by Chao Peng

Subject: [PATCH v4 07/12] KVM: Add KVM_EXIT_MEMORY_ERROR exit

This new KVM exit allows userspace to handle memory-related errors. It
indicates that an error happened in KVM at guest memory range
[gpa, gpa+size). The flags field carries additional information for
userspace to handle the error. Currently bit 0 is defined as 'private
memory': '1' indicates the error happened due to a private memory access
and '0' indicates it happened due to a shared memory access.

After private memory is enabled, this new exit will be used by KVM to
exit to userspace for shared memory <-> private memory conversion in
memory encryption usage.

In such usage, there are typically two kinds of memory conversion (a
VMM-side handling sketch follows this list):
- explicit conversion: happens when the guest explicitly calls into KVM
to map a range (as private or shared); KVM then exits to userspace to
do the map/unmap operations.
- implicit conversion: happens in the KVM page fault handler.
* If the fault is due to a private memory access, it causes a
userspace exit with a shared->private conversion request when the
page has not been allocated in the private memory backend.
* If the fault is due to a shared memory access, it causes a
userspace exit with a private->shared conversion request when the
page has already been allocated in the private memory backend.
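
A VMM-side sketch of handling this exit (illustrative only; 'run' is the
vCPU's mmap()ed struct kvm_run, 'private_fd' the backing store fd and
gpa_to_private_offset() a hypothetical VMM helper that maps a gpa to its
offset within that fd):

	static void handle_memory_error(struct kvm_run *run, int private_fd)
	{
		__u64 gpa = run->memory.gpa;
		__u64 size = run->memory.size;
		off_t offset = gpa_to_private_offset(gpa);

		if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
			/* Fault on a private access: allocate backing pages,
			 * i.e. convert shared -> private. */
			fallocate(private_fd, 0, offset, size);
		else
			/* Fault on a shared access: punch a hole,
			 * i.e. convert private -> shared. */
			fallocate(private_fd, FALLOC_FL_PUNCH_HOLE |
				  FALLOC_FL_KEEP_SIZE, offset, size);
	}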

Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/uapi/linux/kvm.h | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5d6dceb1b93e..52d8938a4ba1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -278,6 +278,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_X86_BUS_LOCK 33
#define KVM_EXIT_XEN 34
#define KVM_EXIT_RISCV_SBI 35
+#define KVM_EXIT_MEMORY_ERROR 36

/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -495,6 +496,14 @@ struct kvm_run {
unsigned long args[6];
unsigned long ret[2];
} riscv_sbi;
+ /* KVM_EXIT_MEMORY_ERROR */
+ struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
+ __u32 flags;
+ __u32 padding;
+ __u64 gpa;
+ __u64 size;
+ } memory;
/* Fix the size of the union. */
char padding[256];
};
--
2.17.1

2022-01-20 10:44:21

by Chao Peng

Subject: [PATCH v4 10/12] KVM: Register private memslot to memory backing store

Add a 'notifier' field to the memslot to make it a memfile_notifier node,
and register it to the memory backing store via
memfile_register_notifier() when the memslot gets created. When the
memslot is deleted, do the reverse with memfile_unregister_notifier().
Note each KVM memslot can be registered to a different memory backing
store (or to the same backing store but at a different offset)
independently.

Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 75 ++++++++++++++++++++++++++++++++++++----
2 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 927e7f44a02a..667efe839767 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -462,6 +462,7 @@ struct kvm_memory_slot {
struct file *private_file;
loff_t private_offset;
struct memfile_pfn_ops *pfn_ops;
+ struct memfile_notifier notifier;
};

static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ecf94e2548f7..6b78ddef7880 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -846,6 +846,37 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)

#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */

+#ifdef CONFIG_MEMFILE_NOTIFIER
+static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
+{
+ return memfile_register_notifier(file_inode(slot->private_file),
+ &slot->notifier,
+ &slot->pfn_ops);
+}
+
+static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot)
+{
+ if (slot->private_file) {
+ memfile_unregister_notifier(file_inode(slot->private_file),
+ &slot->notifier);
+ fput(slot->private_file);
+ slot->private_file = NULL;
+ }
+}
+
+#else /* !CONFIG_MEMFILE_NOTIFIER */
+
+static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot)
+{
+}
+
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
static int kvm_pm_notifier_call(struct notifier_block *bl,
unsigned long state,
@@ -890,6 +921,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
/* This does not remove the slot from struct kvm_memslots data structures */
static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
{
+ if (slot->flags & KVM_MEM_PRIVATE)
+ kvm_memfile_unregister(slot);
+
kvm_destroy_dirty_bitmap(slot);

kvm_arch_free_memslot(kvm, slot);
@@ -1744,6 +1778,12 @@ static int kvm_set_memslot(struct kvm *kvm,
kvm_invalidate_memslot(kvm, old, invalid_slot);
}

+ if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE) {
+ r = kvm_memfile_register(new);
+ if (r)
+ return r;
+ }
+
r = kvm_prepare_memory_region(kvm, old, new, change);
if (r) {
/*
@@ -1758,6 +1798,10 @@ static int kvm_set_memslot(struct kvm *kvm,
} else {
mutex_unlock(&kvm->slots_arch_lock);
}
+
+ if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE)
+ kvm_memfile_unregister(new);
+
return r;
}

@@ -1823,6 +1867,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
enum kvm_mr_change change;
unsigned long npages;
gfn_t base_gfn;
+ struct file *file = NULL;
int as_id, id;
int r;

@@ -1896,14 +1941,24 @@ int __kvm_set_memory_region(struct kvm *kvm,
return 0;
}

+ if (mem->flags & KVM_MEM_PRIVATE) {
+ file = fdget(region_ext->private_fd).file;
+ if (!file)
+ return -EINVAL;
+ }
+
if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
- kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages))
- return -EEXIST;
+ kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages)) {
+ r = -EEXIST;
+ goto out;
+ }

/* Allocate a slot that will persist in the memslot. */
new = kzalloc(sizeof(*new), GFP_KERNEL_ACCOUNT);
- if (!new)
- return -ENOMEM;
+ if (!new) {
+ r = -ENOMEM;
+ goto out;
+ }

new->as_id = as_id;
new->id = id;
@@ -1911,10 +1966,18 @@ int __kvm_set_memory_region(struct kvm *kvm,
new->npages = npages;
new->flags = mem->flags;
new->userspace_addr = mem->userspace_addr;
+ new->private_file = file;
+ new->private_offset = mem->flags & KVM_MEM_PRIVATE ?
+ region_ext->private_offset : 0;

r = kvm_set_memslot(kvm, old, new, change);
- if (r)
- kfree(new);
+ if (!r)
+ return r;
+
+ kfree(new);
+out:
+ if (file)
+ fput(file);
return r;
}
EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
--
2.17.1

2022-01-20 10:45:00

by Chao Peng

Subject: [PATCH v4 11/12] KVM: Zap existing KVM mappings when pages changed in the private fd

KVM gets notified when memory pages change in the memory backing store.
When userspace allocates memory with fallocate() or frees memory with
fallocate(FALLOC_FL_PUNCH_HOLE), the memory backing store calls into
KVM's fallocate/invalidate callbacks respectively. To ensure KVM never
maps both the private and shared variants of a GPA into the guest, the
fallocate callback should zap the existing shared mapping and the
invalidate callback should zap the existing private mapping.

In the callbacks, KVM first converts the offset range into a gfn_range
and then calls the existing kvm_unmap_gfn_range(), which zaps the shared
or private mapping. Both callbacks pass in a memslot reference, but we
also need the 'kvm' pointer, so add a back-reference to it in the memslot
structure.

Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 3 ++-
virt/kvm/kvm_main.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 667efe839767..117cf0da9c5e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -235,7 +235,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
#endif

-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_MEMFILE_NOTIFIER)
struct kvm_gfn_range {
struct kvm_memory_slot *slot;
gfn_t start;
@@ -463,6 +463,7 @@ struct kvm_memory_slot {
loff_t private_offset;
struct memfile_pfn_ops *pfn_ops;
struct memfile_notifier notifier;
+ struct kvm *kvm;
};

static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6b78ddef7880..10e553215618 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -847,8 +847,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */

#ifdef CONFIG_MEMFILE_NOTIFIER
+static void kvm_memfile_notifier_handler(struct memfile_notifier *notifier,
+ pgoff_t start, pgoff_t end)
+{
+ int idx;
+ struct kvm_memory_slot *slot = container_of(notifier,
+ struct kvm_memory_slot,
+ notifier);
+ struct kvm_gfn_range gfn_range = {
+ .slot = slot,
+ .start = start - (slot->private_offset >> PAGE_SHIFT),
+ .end = end - (slot->private_offset >> PAGE_SHIFT),
+ .may_block = true,
+ };
+ struct kvm *kvm = slot->kvm;
+
+ gfn_range.start = max(gfn_range.start, slot->base_gfn);
+ gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
+
+ if (gfn_range.start >= gfn_range.end)
+ return;
+
+ idx = srcu_read_lock(&kvm->srcu);
+ KVM_MMU_LOCK(kvm);
+ kvm_unmap_gfn_range(kvm, &gfn_range);
+ kvm_flush_remote_tlbs(kvm);
+ KVM_MMU_UNLOCK(kvm);
+ srcu_read_unlock(&kvm->srcu, idx);
+}
+
+static struct memfile_notifier_ops kvm_memfile_notifier_ops = {
+ .invalidate = kvm_memfile_notifier_handler,
+ .fallocate = kvm_memfile_notifier_handler,
+};
+
static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
{
+ slot->notifier.ops = &kvm_memfile_notifier_ops;
return memfile_register_notifier(file_inode(slot->private_file),
&slot->notifier,
&slot->pfn_ops);
@@ -1969,6 +2004,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
new->private_file = file;
new->private_offset = mem->flags & KVM_MEM_PRIVATE ?
region_ext->private_offset : 0;
+ new->kvm = kvm;

r = kvm_set_memslot(kvm, old, new, change);
if (!r)
--
2.17.1

2022-01-20 10:45:12

by Chao Peng

Subject: [PATCH v4 12/12] KVM: Expose KVM_MEM_PRIVATE

KVM_MEM_PRIVATE is not exposed by default, but architecture code can turn
it on by implementing kvm_arch_private_memory_supported().

Also, a private memslot cannot be movable and the same file+offset cannot
be mapped into different GFNs.
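
A hypothetical architecture-side opt-in would then just override the weak
helper, e.g. (illustrative only; the predicate is a placeholder and no
architecture is converted in this series):

	bool kvm_arch_private_memory_supported(struct kvm *kvm)
	{
		/* e.g. only allow it for VM types that actually provide
		 * memory encryption, such as a TDX guest. */
		return vm_is_protected(kvm);	/* placeholder check */
	}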

Signed-off-by: Yu Zhang <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 49 ++++++++++++++++++++++++++++++++++------
2 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 117cf0da9c5e..444b390261c0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1328,6 +1328,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
int kvm_arch_post_init_vm(struct kvm *kvm);
void kvm_arch_pre_destroy_vm(struct kvm *kvm);
int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_private_memory_supported(struct kvm *kvm);

#ifndef __KVM_HAVE_ARCH_VM_ALLOC
/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 10e553215618..51d0f08a8601 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1491,10 +1491,19 @@ static void kvm_replace_memslot(struct kvm *kvm,
}
}

-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+bool __weak kvm_arch_private_memory_supported(struct kvm *kvm)
+{
+ return false;
+}
+
+static int check_memory_region_flags(struct kvm *kvm,
+ const struct kvm_userspace_memory_region *mem)
{
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

+ if (kvm_arch_private_memory_supported(kvm))
+ valid_flags |= KVM_MEM_PRIVATE;
+
#ifdef __KVM_HAVE_READONLY_MEM
valid_flags |= KVM_MEM_READONLY;
#endif
@@ -1873,15 +1882,32 @@ static int kvm_set_memslot(struct kvm *kvm,
}

static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
- gfn_t start, gfn_t end)
+ struct file *file,
+ gfn_t start, gfn_t end,
+ loff_t start_off, loff_t end_off)
{
struct kvm_memslot_iter iter;
+ struct kvm_memory_slot *slot;
+ struct inode *inode;
+ int bkt;

kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
if (iter.slot->id != id)
return true;
}

+ /* Disallow mapping the same file+offset into multiple gfns. */
+ if (file) {
+ inode = file_inode(file);
+ kvm_for_each_memslot(slot, bkt, slots) {
+ if (slot->private_file &&
+ file_inode(slot->private_file) == inode &&
+ !(end_off <= slot->private_offset ||
+ start_off >= slot->private_offset
+ + (slot->npages << PAGE_SHIFT)))
+ return true;
+ }
+ }
return false;
}

@@ -1906,7 +1932,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
int as_id, id;
int r;

- r = check_memory_region_flags(mem);
+ r = check_memory_region_flags(kvm, mem);
if (r)
return r;

@@ -1919,10 +1945,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
return -EINVAL;
if (mem->guest_phys_addr & (PAGE_SIZE - 1))
return -EINVAL;
- /* We can read the guest memory with __xxx_user() later on. */
if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
- (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
- !access_ok((void __user *)(unsigned long)mem->userspace_addr,
+ (mem->userspace_addr != untagged_addr(mem->userspace_addr)))
+ return -EINVAL;
+ /* We can read the guest memory with __xxx_user() later on. */
+ if (!(mem->flags & KVM_MEM_PRIVATE) &&
+ !access_ok((void __user *)(unsigned long)mem->userspace_addr,
mem->memory_size))
return -EINVAL;
if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
@@ -1963,6 +1991,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
return -EINVAL;
} else { /* Modify an existing slot. */
+ /* Private memslots are immutable, they can only be deleted. */
+ if (mem->flags & KVM_MEM_PRIVATE)
+ return -EINVAL;
if ((mem->userspace_addr != old->userspace_addr) ||
(npages != old->npages) ||
((mem->flags ^ old->flags) & KVM_MEM_READONLY))
@@ -1983,7 +2014,11 @@ int __kvm_set_memory_region(struct kvm *kvm,
}

if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
- kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages)) {
+ kvm_check_memslot_overlap(slots, id, file,
+ base_gfn, base_gfn + npages,
+ region_ext->private_offset,
+ region_ext->private_offset +
+ mem->memory_size)) {
r = -EEXIST;
goto out;
}
--
2.17.1

2022-01-22 01:55:42

by Steven Price

[permalink] [raw]
Subject: Re: [PATCH v4 02/12] mm/memfd: Introduce MFD_INACCESSIBLE flag

On 18/01/2022 13:21, Chao Peng wrote:
> Introduce a new memfd_create() flag indicating the content of the
> created memfd is inaccessible from userspace. It does this by force
> setting F_SEAL_INACCESSIBLE seal when the file is created. It also set
> F_SEAL_SEAL to prevent future sealing, which means, it can not coexist
> with MFD_ALLOW_SEALING.
>
> The pages backed by such memfd will be used as guest private memory in
> confidential computing environments such as Intel TDX/AMD SEV. Since
> page migration/swapping is not yet supported for such usages so these
> pages are currently marked as UNMOVABLE and UNEVICTABLE which makes
> them behave like long-term pinned pages.
>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> include/uapi/linux/memfd.h | 1 +
> mm/memfd.c | 20 +++++++++++++++++++-
> 2 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 7a8a26751c23..48750474b904 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -8,6 +8,7 @@
> #define MFD_CLOEXEC 0x0001U
> #define MFD_ALLOW_SEALING 0x0002U
> #define MFD_HUGETLB 0x0004U
> +#define MFD_INACCESSIBLE 0x0008U
>
> /*
> * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 9f80f162791a..26998d96dc11 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -245,16 +245,19 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
> #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> + MFD_INACCESSIBLE)
>
> SYSCALL_DEFINE2(memfd_create,
> const char __user *, uname,
> unsigned int, flags)
> {
> + struct address_space *mapping;
> unsigned int *file_seals;
> struct file *file;
> int fd, error;
> char *name;
> + gfp_t gfp;
> long len;
>
> if (!(flags & MFD_HUGETLB)) {
> @@ -267,6 +270,10 @@ SYSCALL_DEFINE2(memfd_create,
> return -EINVAL;
> }
>
> + /* Disallow sealing when MFD_INACCESSIBLE is set. */
> + if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
> + return -EINVAL;
> +
> /* length includes terminating zero */
> len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
> if (len <= 0)
> @@ -315,6 +322,17 @@ SYSCALL_DEFINE2(memfd_create,
> *file_seals &= ~F_SEAL_SEAL;
> }
>
> + if (flags & MFD_INACCESSIBLE) {
> + mapping = file_inode(file)->i_mapping;
> + gfp = mapping_gfp_mask(mapping);
> + gfp &= ~__GFP_MOVABLE;
> + mapping_set_gfp_mask(mapping, gfp);
> + mapping_set_unevictable(mapping);
> +
> + file_seals = memfd_file_seals_ptr(file);
> + *file_seals &= F_SEAL_SEAL | F_SEAL_INACCESSIBLE;

This looks backwards - the flags should be set on *file_seals, but here
you are unsetting all other flags.

Steve

> + }
> +
> fd_install(fd, file);
> kfree(name);
> return fd;
>

2022-01-24 19:18:34

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v4 02/12] mm/memfd: Introduce MFD_INACCESSIBLE flag

On Fri, Jan 21, 2022 at 03:50:55PM +0000, Steven Price wrote:
> On 18/01/2022 13:21, Chao Peng wrote:
> > Introduce a new memfd_create() flag indicating the content of the
> > created memfd is inaccessible from userspace. It does this by force
> > setting F_SEAL_INACCESSIBLE seal when the file is created. It also set
> > F_SEAL_SEAL to prevent future sealing, which means, it can not coexist
> > with MFD_ALLOW_SEALING.
> >
> > The pages backed by such memfd will be used as guest private memory in
> > confidential computing environments such as Intel TDX/AMD SEV. Since
> > page migration/swapping is not yet supported for such usages so these
> > pages are currently marked as UNMOVABLE and UNEVICTABLE which makes
> > them behave like long-term pinned pages.
> >
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> > include/uapi/linux/memfd.h | 1 +
> > mm/memfd.c | 20 +++++++++++++++++++-
> > 2 files changed, 20 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> > index 7a8a26751c23..48750474b904 100644
> > --- a/include/uapi/linux/memfd.h
> > +++ b/include/uapi/linux/memfd.h
> > @@ -8,6 +8,7 @@
> > #define MFD_CLOEXEC 0x0001U
> > #define MFD_ALLOW_SEALING 0x0002U
> > #define MFD_HUGETLB 0x0004U
> > +#define MFD_INACCESSIBLE 0x0008U
> >
> > /*
> > * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 9f80f162791a..26998d96dc11 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -245,16 +245,19 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
> > #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> > #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
> >
> > -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> > +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> > + MFD_INACCESSIBLE)
> >
> > SYSCALL_DEFINE2(memfd_create,
> > const char __user *, uname,
> > unsigned int, flags)
> > {
> > + struct address_space *mapping;
> > unsigned int *file_seals;
> > struct file *file;
> > int fd, error;
> > char *name;
> > + gfp_t gfp;
> > long len;
> >
> > if (!(flags & MFD_HUGETLB)) {
> > @@ -267,6 +270,10 @@ SYSCALL_DEFINE2(memfd_create,
> > return -EINVAL;
> > }
> >
> > + /* Disallow sealing when MFD_INACCESSIBLE is set. */
> > + if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
> > + return -EINVAL;
> > +
> > /* length includes terminating zero */
> > len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
> > if (len <= 0)
> > @@ -315,6 +322,17 @@ SYSCALL_DEFINE2(memfd_create,
> > *file_seals &= ~F_SEAL_SEAL;
> > }
> >
> > + if (flags & MFD_INACCESSIBLE) {
> > + mapping = file_inode(file)->i_mapping;
> > + gfp = mapping_gfp_mask(mapping);
> > + gfp &= ~__GFP_MOVABLE;
> > + mapping_set_gfp_mask(mapping, gfp);
> > + mapping_set_unevictable(mapping);
> > +
> > + file_seals = memfd_file_seals_ptr(file);
> > + *file_seals &= F_SEAL_SEAL | F_SEAL_INACCESSIBLE;
>
> This looks backwards - the flags should be set on *file_seals, but here
> you are unsetting all other flags.

Thanks Steve. '|=' actually should be used here.
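
I.e. the hunk would then read:

	file_seals = memfd_file_seals_ptr(file);
	*file_seals |= F_SEAL_SEAL | F_SEAL_INACCESSIBLE;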

Chao
>
> Steve
>
> > + }
> > +
> > fd_install(fd, file);
> > kfree(name);
> > return fd;
> >

2022-01-26 09:27:12

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v4 12/12] KVM: Expose KVM_MEM_PRIVATE

On 18.01.2022 14:21, Chao Peng wrote:
> KVM_MEM_PRIVATE is not exposed by default but architecture code can turn
> on it by implementing kvm_arch_private_memory_supported().
>
> Also private memslot cannot be movable and the same file+offset can not
> be mapped into different GFNs.
>
> Signed-off-by: Yu Zhang <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
(..)
>
> static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> - gfn_t start, gfn_t end)
> + struct file *file,
> + gfn_t start, gfn_t end,
> + loff_t start_off, loff_t end_off)
> {
> struct kvm_memslot_iter iter;
> + struct kvm_memory_slot *slot;
> + struct inode *inode;
> + int bkt;
>
> kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> if (iter.slot->id != id)
> return true;
> }
>
> + /* Disallow mapping the same file+offset into multiple gfns. */
> + if (file) {
> + inode = file_inode(file);
> + kvm_for_each_memslot(slot, bkt, slots) {
> + if (slot->private_file &&
> + file_inode(slot->private_file) == inode &&
> + !(end_off <= slot->private_offset ||
> + start_off >= slot->private_offset
> + + (slot->npages >> PAGE_SHIFT)))
> + return true;
> + }
> + }

That's a linear scan of all memslots on each CREATE (and MOVE) operation
with a fd - we just spent more than a year rewriting similar linear scans
into more efficient operations in KVM.

Thanks,
Maciej

2022-01-31 11:21:18

by Steven Price

[permalink] [raw]
Subject: Re: [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private memory

On 18/01/2022 13:21, Chao Peng wrote:
> This is the v4 of this series which try to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
>
> fea31d169094 KVM: x86/pmu: Fix available_event_types check for
> REF_CPU_CYCLES event
>
> Introduction
> ------------
> In general this patch series introduce fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> and the the memory backing store exchange callbacks when such memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace fallocate/punch hole
> on the fd to notify KVM to map/unmap secondary MMU page tables.
>
> Comparing to existing hva-based memslot, this new type of memslot allows
> guest memory unmapped from host userspace like QEMU and even the kernel
> itself, therefore reduce attack surface and prevent bugs.
>
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory.

This looks like it will be useful for Arm's Confidential Compute
Architecture (CCA) too - in particular we need a way of ensuring that
user space cannot 'trick' the kernel into accessing memory which has
been delegated to a realm (i.e. protected guest), and a memfd seems like
a good match.

Some comments below.

> mm extension
> ---------------------
> Introduces new F_SEAL_INACCESSIBLE for shmem and new MFD_INACCESSIBLE
> flag for memfd_create(), the file created with these flags cannot read(),
> write() or mmap() etc via normal MMU operations. The file content can
> only be used with the newly introduced memfile_notifier extension.

For Arm CCA we are expecting to seed the realm with an initial memory
contents (e.g. kernel and initrd) which will then be measured before
execution starts. The 'obvious' way of doing this with a memfd would be
to populate parts of the memfd then seal it with F_SEAL_INACCESSIBLE.

However as things stand it's not possible to set the INACCESSIBLE seal
after creating a memfd (F_ALL_SEALS hasn't been updated to include it).

One potential workaround would be for arm64 to provide a custom KVM
ioctl to effectively memcpy() into the guest's protected memory which
would only be accessible before the guest has started. The drawback is
that it requires two copies of the data during guest setup.

Do you think things could be relaxed so the F_SEAL_INACCESSIBLE flag
could be set after a memfd has been created (and partially populated)?
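
For illustration, the flow we would like is roughly the following (a pure
sketch: it assumes the restriction were relaxed so that the new
F_SEAL_INACCESSIBLE seal from this series could be added via F_ADD_SEALS
after population, which is exactly what is not possible today):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* Hypothetical: assumes F_SEAL_INACCESSIBLE may be set late. */
	static int make_realm_image_fd(const void *image, size_t size)
	{
		int fd = memfd_create("realm-image", MFD_ALLOW_SEALING);

		if (fd < 0)
			return -1;

		ftruncate(fd, size);			/* size the file */
		pwrite(fd, image, size, 0);		/* seed kernel/initrd */
		fcntl(fd, F_ADD_SEALS, F_SEAL_INACCESSIBLE); /* then seal */
		return fd;
	}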

Thanks,

Steve

> The memfile_notifier extension provides two sets of callbacks for KVM to
> interact with the memory backing store:
> - memfile_notifier_ops: callbacks for memory backing store to notify
> KVM when memory gets allocated/invalidated.
> - memfile_pfn_ops: callbacks for KVM to call into memory backing store
> to request memory pages for guest private memory.
>
> memslot extension
> -----------------
> Add the private fd and the fd offset to existing 'shared' memslot so that
> both private/shared guest memory can live in one single memslot. A page in
> the memslot is either private or shared. A page is private only when it's
> already allocated in the backing store fd, all the other cases it's treated
> as shared, this includes those already mapped as shared as well as those
> having not been mapped. This means the memory backing store is the place
> which tells the truth of which page is private.
>
> Private memory map/unmap and conversion
> ---------------------------------------
> Userspace's map/unmap operations are done by fallocate() ioctl on the
> backing store fd.
> - map: default fallocate() with mode=0.
> - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
> The map/unmap will trigger above memfile_notifier_ops to let KVM map/unmap
> secondary MMU page tables.
>
> Test
> ----
> To test the new functionalities of this patch TDX patchset is needed.
> Since TDX patchset has not been merged so I did two kinds of test:
>
> - Regresion test on kvm/queue (this patch)
> Most new code are not covered. I only tested building and booting.
>
> - New Funational test on latest TDX code
> The patch is rebased to latest TDX code and tested the new
> funcationalities.
>
> For TDX test please see below repos:
> Linux: https://github.com/chao-p/linux/tree/privmem-v4.3
> QEMU: https://github.com/chao-p/qemu/tree/privmem-v4
>
> And an example QEMU command line:
> -object tdx-guest,id=tdx \
> -object memory-backend-memfd-private,id=ram1,size=2G \
> -machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1
>
> Changelog
> ----------
> v4:
> - Decoupled the callbacks between KVM/mm from memfd and use new
> name 'memfile_notifier'.
> - Supported register multiple memslots to the same backing store.
> - Added per-memslot pfn_ops instead of per-system.
> - Reworked the invalidation part.
> - Improved new KVM uAPIs (private memslot extension and memory
> error) per Sean's suggestions.
> - Addressed many other minor fixes for comments from v3.
> v3:
> - Added locking protection when calling
> invalidate_page_range/fallocate callbacks.
> - Changed memslot structure to keep use useraddr for shared memory.
> - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
> - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
> - Commit message improvement.
> - Many small fixes for comments from the last version.
>
> Links of previous discussions
> -----------------------------
> [1] Original design proposal:
> https://lkml.kernel.org/kvm/[email protected]/
> [2] Updated proposal and RFC patch v1:
> https://lkml.kernel.org/linux-fsdevel/[email protected]/
> [3] Patch v3: https://lkml.org/lkml/2021/12/23/283
>
> Chao Peng (11):
> mm/memfd: Introduce MFD_INACCESSIBLE flag
> mm: Introduce memfile_notifier
> mm/shmem: Support memfile_notifier
> KVM: Extend the memslot to support fd-based private memory
> KVM: Use kvm_userspace_memory_region_ext
> KVM: Add KVM_EXIT_MEMORY_ERROR exit
> KVM: Use memfile_pfn_ops to obtain pfn for private pages
> KVM: Handle page fault for private memory
> KVM: Register private memslot to memory backing store
> KVM: Zap existing KVM mappings when pages changed in the private fd
> KVM: Expose KVM_MEM_PRIVATE
>
> Kirill A. Shutemov (1):
> mm/shmem: Introduce F_SEAL_INACCESSIBLE
>
> arch/x86/kvm/Kconfig | 1 +
> arch/x86/kvm/mmu/mmu.c | 73 +++++++++++-
> arch/x86/kvm/mmu/paging_tmpl.h | 11 +-
> arch/x86/kvm/x86.c | 12 +-
> include/linux/kvm_host.h | 49 +++++++-
> include/linux/memfile_notifier.h | 53 +++++++++
> include/linux/shmem_fs.h | 4 +
> include/uapi/linux/fcntl.h | 1 +
> include/uapi/linux/kvm.h | 17 +++
> include/uapi/linux/memfd.h | 1 +
> mm/Kconfig | 4 +
> mm/Makefile | 1 +
> mm/memfd.c | 20 +++-
> mm/memfile_notifier.c | 99 ++++++++++++++++
> mm/shmem.c | 121 +++++++++++++++++++-
> virt/kvm/kvm_main.c | 188 +++++++++++++++++++++++++++----
> 16 files changed, 614 insertions(+), 41 deletions(-)
> create mode 100644 include/linux/memfile_notifier.h
> create mode 100644 mm/memfile_notifier.c
>

2022-02-02 10:26:56

by Nakajima, Jun

[permalink] [raw]
Subject: Re: [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private memory


> On Jan 28, 2022, at 8:47 AM, Steven Price <[email protected]> wrote:
>
> On 18/01/2022 13:21, Chao Peng wrote:
>> This is the v4 of this series which try to implement the fd-based KVM
>> guest private memory. The patches are based on latest kvm/queue branch
>> commit:
>>
>> fea31d169094 KVM: x86/pmu: Fix available_event_types check for
>> REF_CPU_CYCLES event
>>
>> Introduction
>> ------------
>> In general this patch series introduce fd-based memslot which provides
>> guest memory through memory file descriptor fd[offset,size] instead of
>> hva/size. The fd can be created from a supported memory filesystem
>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>> and the the memory backing store exchange callbacks when such memslot
>> gets created. At runtime KVM will call into callbacks provided by the
>> backing store to get the pfn with the fd+offset. Memory backing store
>> will also call into KVM callbacks when userspace fallocate/punch hole
>> on the fd to notify KVM to map/unmap secondary MMU page tables.
>>
>> Comparing to existing hva-based memslot, this new type of memslot allows
>> guest memory unmapped from host userspace like QEMU and even the kernel
>> itself, therefore reduce attack surface and prevent bugs.
>>
>> Based on this fd-based memslot, we can build guest private memory that
>> is going to be used in confidential computing environments such as Intel
>> TDX and AMD SEV. When supported, the memory backing store can provide
>> more enforcement on the fd and KVM can use a single memslot to hold both
>> the private and shared part of the guest memory.
>
> This looks like it will be useful for Arm's Confidential Compute
> Architecture (CCA) too - in particular we need a way of ensuring that
> user space cannot 'trick' the kernel into accessing memory which has
> been delegated to a realm (i.e. protected guest), and a memfd seems like
> a good match.

Good to hear that it will be useful for ARM’s CCA as well.

>
> Some comments below.
>
>> mm extension
>> ---------------------
>> Introduces new F_SEAL_INACCESSIBLE for shmem and new MFD_INACCESSIBLE
>> flag for memfd_create(), the file created with these flags cannot read(),
>> write() or mmap() etc via normal MMU operations. The file content can
>> only be used with the newly introduced memfile_notifier extension.
>
> For Arm CCA we are expecting to seed the realm with an initial memory
> contents (e.g. kernel and initrd) which will then be measured before
> execution starts. The 'obvious' way of doing this with a memfd would be
> to populate parts of the memfd then seal it with F_SEAL_INACCESSIBLE.

As far as I understand, we have the same problem with TDX, where a guest TD (Trust Domain) starts in private memory. We typically seed the private memory with guest firmware, and the initial image (plaintext) is copied somewhere into QEMU memory (from disk, for example) for that purpose; this location is not associated with the target GPA.

Upon a (new) ioctl from QEMU, KVM requests the TDX Module to copy the pages to private memory (encrypting them in the process), specifying the target GPA, using a TDX interface function (TDH.MEM.PAGE.ADD). The actual pages for the private memory are allocated by the callbacks provided by the backing store during the “copy” operation.

We extended the existing KVM_MEMORY_ENCRYPT_OP (ioctl) for the above.

>
> However as things stand it's not possible to set the INACCESSIBLE seal
> after creating a memfd (F_ALL_SEALS hasn't been updated to include it).
>
> One potential workaround would be for arm64 to provide a custom KVM
> ioctl to effectively memcpy() into the guest's protected memory which
> would only be accessible before the guest has started. The drawback is
> that it requires two copies of the data during guest setup.

So, the guest pages are not encrypted in the realm?

I think you could do the same thing, i.e. KVM copies the pages to the realm, where the pages are allocated by the backing store. But, yes, it will have two copies of the data at that time unless encrypted.

>
> Do you think things could be relaxed so the F_SEAL_INACCESSIBLE flag
> could be set after a memfd has been created (and partially populated)?
>

I think F_SEAL_INACCESSIBLE could be deferred to the point where measurement of the initial image is done (which we call “build-time” measurement in TDX). For example, if we add a callback to activate F_SEAL_INACCESSIBLE and KVM calls it before that measurement, does that work for you?

---
Jun



2022-02-04 04:26:11

by Steven Price

[permalink] [raw]
Subject: Re: [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private memory

Hi Jun,

On 02/02/2022 02:28, Nakajima, Jun wrote:
>
>> On Jan 28, 2022, at 8:47 AM, Steven Price <[email protected]> wrote:
>>
>> On 18/01/2022 13:21, Chao Peng wrote:
>>> This is the v4 of this series which try to implement the fd-based KVM
>>> guest private memory. The patches are based on latest kvm/queue branch
>>> commit:
>>>
>>> fea31d169094 KVM: x86/pmu: Fix available_event_types check for
>>> REF_CPU_CYCLES event
>>>
>>> Introduction
>>> ------------
>>> In general this patch series introduce fd-based memslot which provides
>>> guest memory through memory file descriptor fd[offset,size] instead of
>>> hva/size. The fd can be created from a supported memory filesystem
>>> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>> and the the memory backing store exchange callbacks when such memslot
>>> gets created. At runtime KVM will call into callbacks provided by the
>>> backing store to get the pfn with the fd+offset. Memory backing store
>>> will also call into KVM callbacks when userspace fallocate/punch hole
>>> on the fd to notify KVM to map/unmap secondary MMU page tables.
>>>
>>> Comparing to existing hva-based memslot, this new type of memslot allows
>>> guest memory unmapped from host userspace like QEMU and even the kernel
>>> itself, therefore reduce attack surface and prevent bugs.
>>>
>>> Based on this fd-based memslot, we can build guest private memory that
>>> is going to be used in confidential computing environments such as Intel
>>> TDX and AMD SEV. When supported, the memory backing store can provide
>>> more enforcement on the fd and KVM can use a single memslot to hold both
>>> the private and shared part of the guest memory.
>>
>> This looks like it will be useful for Arm's Confidential Compute
>> Architecture (CCA) too - in particular we need a way of ensuring that
>> user space cannot 'trick' the kernel into accessing memory which has
>> been delegated to a realm (i.e. protected guest), and a memfd seems like
>> a good match.
>
> Good to hear that it will be useful for ARM’s CCA as well.
>
>>
>> Some comments below.
>>
>>> mm extension
>>> ---------------------
>>> Introduces new F_SEAL_INACCESSIBLE for shmem and new MFD_INACCESSIBLE
>>> flag for memfd_create(), the file created with these flags cannot read(),
>>> write() or mmap() etc via normal MMU operations. The file content can
>>> only be used with the newly introduced memfile_notifier extension.
>>
>> For Arm CCA we are expecting to seed the realm with an initial memory
>> contents (e.g. kernel and initrd) which will then be measured before
>> execution starts. The 'obvious' way of doing this with a memfd would be
>> to populate parts of the memfd then seal it with F_SEAL_INACCESSIBLE.
>
> As far as I understand, we have the same problem with TDX, where a guest TD (Trust Domain) starts in private memory. We seed the private memory typically with a guest firmware, and the initial image (plaintext) is copied to somewhere in QEMU memory (from disk, for example) for that purpose; this location is not associated with the target GPA.
>
> Upon a (new) ioctl from QEMU, KVM requests the TDX Module to copy the pages to private memory (by encrypting) specifying the target GPA, using a TDX interface function (TDH.MEM.PAGE.ADD). The actual pages for the private memory is allocated by the callbacks provided by the backing store during the “copy” operation.
>
> We extended the existing KVM_MEMORY_ENCRYPT_OP (ioctl) for the above.

Ok, so if I understand correctly QEMU would do something along the lines of:

1. Use memfd_create(...MFD_INACCESSIBLE) to allocate private memory for
the guest.

2. ftruncate()/fallocate() the memfd to back the appropriate areas.

3. Create a memslot in KVM pointing to the memfd.

4. Load the 'guest firmware' (kernel/initrd or similar) into VMM memory

5. Use the KVM_MEMORY_ENCRYPT_OP to request the 'guest firmware' be
copied into the private memory. The ioctl would temporarily pin the
pages and ask the TDX module to copy (& encrypt) the data into the
private memory, unpinning after the copy.

6. QEMU can then free the unencrypted copy of the guest firmware.
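
A rough C sketch of those steps (purely illustrative: MFD_INACCESSIBLE,
KVM_MEM_PRIVATE and the kvm_userspace_memory_region_ext layout are assumed
from this series' uAPI patches, the encrypt-op payload is architecture
specific and only indicated as a comment, and error handling is omitted):

	#define _GNU_SOURCE
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/kvm.h>

	static void seed_private_guest(int vm_fd, const void *image, size_t size)
	{
		/* 1. Private guest memory backed by an inaccessible memfd. */
		int memfd = memfd_create("guest-private", MFD_INACCESSIBLE);

		/* 2. Back the area that will hold the initial image. */
		fallocate(memfd, 0, 0, size);

		/* 3. Private memslot pointing at the memfd (layout per this series). */
		struct kvm_userspace_memory_region_ext region = {
			.region = {
				.slot            = 0,
				.flags           = KVM_MEM_PRIVATE,
				.guest_phys_addr = 0,
				.memory_size     = size,
			},
			.private_fd     = memfd,
			.private_offset = 0,
		};
		ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

		/* 4. Stage the 'guest firmware' in ordinary VMM memory. */
		void *staging = mmap(NULL, size, PROT_READ | PROT_WRITE,
				     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		memcpy(staging, image, size);

		/*
		 * 5. Ask KVM to copy (and encrypt/protect) the staged image
		 *    into private memory; the command passed here is TDX/CCA
		 *    specific and not defined by this series, so only indicated:
		 *
		 *	ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &arch_specific_copy_cmd);
		 */

		/* 6. The clear-text staging copy can now be freed. */
		munmap(staging, size);
	}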

>>
>> However as things stand it's not possible to set the INACCESSIBLE seal
>> after creating a memfd (F_ALL_SEALS hasn't been updated to include it).
>>
>> One potential workaround would be for arm64 to provide a custom KVM
>> ioctl to effectively memcpy() into the guest's protected memory which
>> would only be accessible before the guest has started. The drawback is
>> that it requires two copies of the data during guest setup.
>
> So, the guest pages are not encrypted in the realm?

The pages are likely to be encrypted, but architecturally it doesn't
matter - the hardware prevents the 'Normal World' from accessing the pages
when they are assigned to the realm. Encryption is only necessary to
protect against hardware attacks (e.g. bus snooping).

> I think you could do the same thing, i.e. KVM copies the pages to the realm, where pages are allocated by the backing store. But, yes, it will have two copies of the data at that time unless encrypted. .

I'm not sure I follow the "unless encrypted" part of that.

>>
>> Do you think things could be relaxed so the F_SEAL_INACCESSIBLE flag
>> could be set after a memfd has been created (and partially populated)?
>>
>
> I think F_SEAL_INACCESSIBLE could be deferred to the point where measurement of the initial image is done (we call “build-time” measurement in TDX). For example, if we add a callback to activate F_SEAL_INACCESSIBLE and KVM calls it before such the measurement time, does that work for you?

Yes, if it's possible to defer setting the F_SEAL_INACCESSIBLE then it
should be possible for QEMU to load the initial 'guest firmware'
straight into the memfd. Then to launch the guest the trusted component
only needs to protect and measure the populated pages - there's no need
to copy the data from one set of pages to another. This removes the need
to have two copies of the initial image in memory at the point of measuring.

Thanks,

Steve

2022-02-08 15:32:43

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/12] mm/memfd: Introduce MFD_INACCESSIBLE flag

On 07.02.22 19:51, Vlastimil Babka wrote:
> On 1/18/22 14:21, Chao Peng wrote:
>> Introduce a new memfd_create() flag indicating the content of the
>> created memfd is inaccessible from userspace. It does this by force
>> setting F_SEAL_INACCESSIBLE seal when the file is created. It also set
>> F_SEAL_SEAL to prevent future sealing, which means, it can not coexist
>> with MFD_ALLOW_SEALING.
>>
>> The pages backed by such memfd will be used as guest private memory in
>> confidential computing environments such as Intel TDX/AMD SEV. Since
>> page migration/swapping is not yet supported for such usages so these
>> pages are currently marked as UNMOVABLE and UNEVICTABLE which makes
>> them behave like long-term pinned pages.
>
> Shouldn't the amount of such memory allocations be restricted? E.g. similar
> to secretmem_mmap() doing mlock_future_check().

I've raised this already in the past and Kirill wanted to look into it [1].

We'll most certainly need a way to limit/control the amount of
unswappable + unmovable ("worse than mlock" memory) a user/process can
consume via this mechanism.


[1]
https://lkml.kernel.org/r/[email protected]


--
Thanks,

David / dhildenb


2022-02-09 07:37:13

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v4 02/12] mm/memfd: Introduce MFD_INACCESSIBLE flag

On 1/18/22 14:21, Chao Peng wrote:
> Introduce a new memfd_create() flag indicating the content of the
> created memfd is inaccessible from userspace. It does this by force
> setting F_SEAL_INACCESSIBLE seal when the file is created. It also set
> F_SEAL_SEAL to prevent future sealing, which means, it can not coexist
> with MFD_ALLOW_SEALING.
>
> The pages backed by such memfd will be used as guest private memory in
> confidential computing environments such as Intel TDX/AMD SEV. Since
> page migration/swapping is not yet supported for such usages so these
> pages are currently marked as UNMOVABLE and UNEVICTABLE which makes
> them behave like long-term pinned pages.

Shouldn't the amount of such memory allocations be restricted? E.g. similar
to secretmem_mmap() doing mlock_future_check().

> Signed-off-by: Chao Peng <[email protected]>
> ---
> include/uapi/linux/memfd.h | 1 +
> mm/memfd.c | 20 +++++++++++++++++++-
> 2 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 7a8a26751c23..48750474b904 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -8,6 +8,7 @@
> #define MFD_CLOEXEC 0x0001U
> #define MFD_ALLOW_SEALING 0x0002U
> #define MFD_HUGETLB 0x0004U
> +#define MFD_INACCESSIBLE 0x0008U
>
> /*
> * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 9f80f162791a..26998d96dc11 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -245,16 +245,19 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
> #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> + MFD_INACCESSIBLE)
>
> SYSCALL_DEFINE2(memfd_create,
> const char __user *, uname,
> unsigned int, flags)
> {
> + struct address_space *mapping;
> unsigned int *file_seals;
> struct file *file;
> int fd, error;
> char *name;
> + gfp_t gfp;
> long len;
>
> if (!(flags & MFD_HUGETLB)) {
> @@ -267,6 +270,10 @@ SYSCALL_DEFINE2(memfd_create,
> return -EINVAL;
> }
>
> + /* Disallow sealing when MFD_INACCESSIBLE is set. */
> + if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
> + return -EINVAL;
> +
> /* length includes terminating zero */
> len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
> if (len <= 0)
> @@ -315,6 +322,17 @@ SYSCALL_DEFINE2(memfd_create,
> *file_seals &= ~F_SEAL_SEAL;
> }
>
> + if (flags & MFD_INACCESSIBLE) {
> + mapping = file_inode(file)->i_mapping;
> + gfp = mapping_gfp_mask(mapping);
> + gfp &= ~__GFP_MOVABLE;
> + mapping_set_gfp_mask(mapping, gfp);
> + mapping_set_unevictable(mapping);
> +
> + file_seals = memfd_file_seals_ptr(file);
> + *file_seals &= F_SEAL_SEAL | F_SEAL_INACCESSIBLE;
> + }
> +
> fd_install(fd, file);
> kfree(name);
> return fd;


2022-02-09 09:23:08

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE

On 1/18/22 14:21, Chao Peng wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
> the file is inaccessible from userspace through ordinary MMU access
> (e.g., read/write/mmap). However, the file content can be accessed
> via a different mechanism (e.g. KVM MMU) indirectly.
>
> It provides semantics required for KVM guest private memory support
> that a file descriptor with this seal set is going to be used as the
> source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV but may not be accessible from host userspace.
>
> At this time only shmem implements this seal.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> include/uapi/linux/fcntl.h | 1 +
> mm/shmem.c | 40 ++++++++++++++++++++++++++++++++++++--
> 2 files changed, 39 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 2f86b2ad6d7e..09ef34754dfa 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -43,6 +43,7 @@
> #define F_SEAL_GROW 0x0004 /* prevent file from growing */
> #define F_SEAL_WRITE 0x0008 /* prevent writes */
> #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */
> +#define F_SEAL_INACCESSIBLE 0x0020 /* prevent ordinary MMU access (e.g. read/write/mmap) to file content */
> /* (1U << 31) is reserved for signed error codes */
>
> /*
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 18f93c2d68f1..72185630e7c4 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1098,6 +1098,13 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
> (newsize > oldsize && (info->seals & F_SEAL_GROW)))
> return -EPERM;
>
> + if (info->seals & F_SEAL_INACCESSIBLE) {
> + if(i_size_read(inode))

Is this needed? The rest of the function seems to trust oldsize obtained by
plain reading inode->i_size well enough, so why be suddenly paranoid here?

> + return -EPERM;
> + if (newsize & ~PAGE_MASK)
> + return -EINVAL;
> + }
> +
> if (newsize != oldsize) {
> error = shmem_reacct_size(SHMEM_I(inode)->flags,
> oldsize, newsize);
> @@ -1364,6 +1371,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
> goto redirty;
> if (!total_swap_pages)
> goto redirty;
> + if (info->seals & F_SEAL_INACCESSIBLE)
> + goto redirty;
>
> /*
> * Our capabilities prevent regular writeback or sync from ever calling
> @@ -2262,6 +2271,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
> if (ret)
> return ret;
>
> + if (info->seals & F_SEAL_INACCESSIBLE)
> + return -EPERM;
> +
> /* arm64 - allow memory tagging on RAM-based files */
> vma->vm_flags |= VM_MTE_ALLOWED;
>
> @@ -2459,12 +2471,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> pgoff_t index = pos >> PAGE_SHIFT;
>
> /* i_rwsem is held by caller */
> - if (unlikely(info->seals & (F_SEAL_GROW |
> - F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) {
> + if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE |
> + F_SEAL_FUTURE_WRITE |
> + F_SEAL_INACCESSIBLE))) {
> if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))
> return -EPERM;
> if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
> return -EPERM;
> + if (info->seals & F_SEAL_INACCESSIBLE)
> + return -EPERM;
> }
>
> return shmem_getpage(inode, index, pagep, SGP_WRITE);
> @@ -2538,6 +2553,21 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> end_index = i_size >> PAGE_SHIFT;
> if (index > end_index)
> break;
> +
> + /*
> + * inode_lock protects setting up seals as well as write to
> + * i_size. Setting F_SEAL_INACCESSIBLE only allowed with
> + * i_size == 0.
> + *
> + * Check F_SEAL_INACCESSIBLE after i_size. It effectively
> + * serialize read vs. setting F_SEAL_INACCESSIBLE without
> + * taking inode_lock in read path.
> + */
> + if (SHMEM_I(inode)->seals & F_SEAL_INACCESSIBLE) {
> + error = -EPERM;
> + break;
> + }
> +
> if (index == end_index) {
> nr = i_size & ~PAGE_MASK;
> if (nr <= offset)
> @@ -2663,6 +2693,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> goto out;
> }
>
> + if ((info->seals & F_SEAL_INACCESSIBLE) &&
> + (offset & ~PAGE_MASK || len & ~PAGE_MASK)) {

Could we use PAGE_ALIGNED()?

> + error = -EINVAL;
> + goto out;
> + }
> +
> shmem_falloc.waitq = &shmem_falloc_waitq;
> shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
> shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;


2022-02-09 09:53:36

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private memory

(addded linux-api)

On Tue, Jan 18, 2022 at 09:21:09PM +0800, Chao Peng wrote:
> This is the v4 of this series which try to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
>
> fea31d169094 KVM: x86/pmu: Fix available_event_types check for
> REF_CPU_CYCLES event
>
> Introduction
> ------------
> In general this patch series introduce fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> and the the memory backing store exchange callbacks when such memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace fallocate/punch hole
> on the fd to notify KVM to map/unmap secondary MMU page tables.
>
> Comparing to existing hva-based memslot, this new type of memslot allows
> guest memory unmapped from host userspace like QEMU and even the kernel
> itself, therefore reduce attack surface and prevent bugs.
>
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory.
>
> mm extension
> ---------------------
> Introduces new F_SEAL_INACCESSIBLE for shmem and new MFD_INACCESSIBLE
> flag for memfd_create(), the file created with these flags cannot read(),
> write() or mmap() etc via normal MMU operations. The file content can
> only be used with the newly introduced memfile_notifier extension.

It would be great to see a man page draft for the new ABI flags.

> The memfile_notifier extension provides two sets of callbacks for KVM to
> interact with the memory backing store:
> - memfile_notifier_ops: callbacks for memory backing store to notify
> KVM when memory gets allocated/invalidated.
> - memfile_pfn_ops: callbacks for KVM to call into memory backing store
> to request memory pages for guest private memory.
>
> memslot extension
> -----------------
> Add the private fd and the fd offset to existing 'shared' memslot so that
> both private/shared guest memory can live in one single memslot. A page in
> the memslot is either private or shared. A page is private only when it's
> already allocated in the backing store fd, all the other cases it's treated
> as shared, this includes those already mapped as shared as well as those
> having not been mapped. This means the memory backing store is the place
> which tells the truth of which page is private.
>
> Private memory map/unmap and conversion
> ---------------------------------------
> Userspace's map/unmap operations are done by fallocate() ioctl on the
> backing store fd.
> - map: default fallocate() with mode=0.
> - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
> The map/unmap will trigger above memfile_notifier_ops to let KVM map/unmap
> secondary MMU page tables.
>
> Test
> ----
> To test the new functionalities of this patch TDX patchset is needed.
> Since TDX patchset has not been merged so I did two kinds of test:
>
> - Regresion test on kvm/queue (this patch)
> Most new code are not covered. I only tested building and booting.
>
> - New Funational test on latest TDX code
> The patch is rebased to latest TDX code and tested the new
> funcationalities.
>
> For TDX test please see below repos:
> Linux: https://github.com/chao-p/linux/tree/privmem-v4.3
> QEMU: https://github.com/chao-p/qemu/tree/privmem-v4
>
> And an example QEMU command line:
> -object tdx-guest,id=tdx \
> -object memory-backend-memfd-private,id=ram1,size=2G \
> -machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1
>
> Changelog
> ----------
> v4:
> - Decoupled the callbacks between KVM/mm from memfd and use new
> name 'memfile_notifier'.
> - Supported register multiple memslots to the same backing store.
> - Added per-memslot pfn_ops instead of per-system.
> - Reworked the invalidation part.
> - Improved new KVM uAPIs (private memslot extension and memory
> error) per Sean's suggestions.
> - Addressed many other minor fixes for comments from v3.
> v3:
> - Added locking protection when calling
> invalidate_page_range/fallocate callbacks.
> - Changed memslot structure to keep use useraddr for shared memory.
> - Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
> - Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
> - Commit message improvement.
> - Many small fixes for comments from the last version.
>
> Links of previous discussions
> -----------------------------
> [1] Original design proposal:
> https://lkml.kernel.org/kvm/[email protected]/
> [2] Updated proposal and RFC patch v1:
> https://lkml.kernel.org/linux-fsdevel/[email protected]/
> [3] Patch v3: https://lkml.org/lkml/2021/12/23/283
>
> Chao Peng (11):
> mm/memfd: Introduce MFD_INACCESSIBLE flag
> mm: Introduce memfile_notifier
> mm/shmem: Support memfile_notifier
> KVM: Extend the memslot to support fd-based private memory
> KVM: Use kvm_userspace_memory_region_ext
> KVM: Add KVM_EXIT_MEMORY_ERROR exit
> KVM: Use memfile_pfn_ops to obtain pfn for private pages
> KVM: Handle page fault for private memory
> KVM: Register private memslot to memory backing store
> KVM: Zap existing KVM mappings when pages changed in the private fd
> KVM: Expose KVM_MEM_PRIVATE
>
> Kirill A. Shutemov (1):
> mm/shmem: Introduce F_SEAL_INACCESSIBLE
>
> arch/x86/kvm/Kconfig | 1 +
> arch/x86/kvm/mmu/mmu.c | 73 +++++++++++-
> arch/x86/kvm/mmu/paging_tmpl.h | 11 +-
> arch/x86/kvm/x86.c | 12 +-
> include/linux/kvm_host.h | 49 +++++++-
> include/linux/memfile_notifier.h | 53 +++++++++
> include/linux/shmem_fs.h | 4 +
> include/uapi/linux/fcntl.h | 1 +
> include/uapi/linux/kvm.h | 17 +++
> include/uapi/linux/memfd.h | 1 +
> mm/Kconfig | 4 +
> mm/Makefile | 1 +
> mm/memfd.c | 20 +++-
> mm/memfile_notifier.c | 99 ++++++++++++++++
> mm/shmem.c | 121 +++++++++++++++++++-
> virt/kvm/kvm_main.c | 188 +++++++++++++++++++++++++++----
> 16 files changed, 614 insertions(+), 41 deletions(-)
> create mode 100644 include/linux/memfile_notifier.h
> create mode 100644 mm/memfile_notifier.c
>
> --
> 2.17.1
>
>

--
Sincerely yours,
Mike.

2022-02-09 10:34:58

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v4 02/12] mm/memfd: Introduce MFD_INACCESSIBLE flag

On Tue, Feb 08, 2022 at 09:49:35AM +0100, David Hildenbrand wrote:
> On 07.02.22 19:51, Vlastimil Babka wrote:
> > On 1/18/22 14:21, Chao Peng wrote:
> >> Introduce a new memfd_create() flag indicating the content of the
> >> created memfd is inaccessible from userspace. It does this by force
> >> setting F_SEAL_INACCESSIBLE seal when the file is created. It also set
> >> F_SEAL_SEAL to prevent future sealing, which means, it can not coexist
> >> with MFD_ALLOW_SEALING.
> >>
> >> The pages backed by such memfd will be used as guest private memory in
> >> confidential computing environments such as Intel TDX/AMD SEV. Since
> >> page migration/swapping is not yet supported for such usages so these
> >> pages are currently marked as UNMOVABLE and UNEVICTABLE which makes
> >> them behave like long-term pinned pages.
> >
> > Shouldn't the amount of such memory allocations be restricted? E.g. similar
> > to secretmem_mmap() doing mlock_future_check().

Heh, for me it was easy, I had the VMA :)

> I've raised this already in the past and Kirill wanted to look into it [1].
>
> We'll most certainly need a way to limit/control the amount of
> unswappable + unmovable ("worse than mlock" memory) a user/process can
> consume via this mechanism.

I think the accounting can be handled in notify_fallocate() and
notify_invalidate_page().
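
A very rough sketch of what that accounting could look like (all names
below are hypothetical, and a real implementation would presumably account
per user or per mm rather than globally):

	#include <linux/atomic.h>
	#include <linux/errno.h>
	#include <linux/mm.h>
	#include <linux/resource.h>
	#include <linux/sched/signal.h>

	/* hypothetical counter, charged from the fallocate notifier */
	static atomic_long_t memfile_unmovable_pages;

	static int memfile_charge_pages(unsigned long nr_pages)
	{
		unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

		if (atomic_long_add_return(nr_pages, &memfile_unmovable_pages) > limit) {
			atomic_long_sub(nr_pages, &memfile_unmovable_pages);
			return -ENOMEM;
		}
		return 0;
	}

	/* called from the invalidate/punch-hole notifier */
	static void memfile_uncharge_pages(unsigned long nr_pages)
	{
		atomic_long_sub(nr_pages, &memfile_unmovable_pages);
	}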

> [1] https://lkml.kernel.org/r/[email protected]
>
> --
> Thanks,
>
> David / dhildenb

--
Sincerely yours,
Mike.

2022-02-09 13:27:53

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v4 04/12] mm/shmem: Support memfile_notifier

Hi,

On Tue, Jan 18, 2022 at 09:21:13PM +0800, Chao Peng wrote:
> It maintains a memfile_notifier list in shmem_inode_info structure and
> implements memfile_pfn_ops callbacks defined by memfile_notifier. It
> then exposes them to memfile_notifier via
> shmem_get_memfile_notifier_info.
>
> We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be
> allocated by userspace for private memory. If there is no pages
> allocated at the offset then error should be returned so KVM knows that
> the memory is not private memory.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>
> ---
> include/linux/shmem_fs.h | 4 ++
> mm/memfile_notifier.c | 12 +++++-
> mm/shmem.c | 81 ++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 96 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 166158b6e917..461633587eaf 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -9,6 +9,7 @@
> #include <linux/percpu_counter.h>
> #include <linux/xattr.h>
> #include <linux/fs_parser.h>
> +#include <linux/memfile_notifier.h>
>
> /* inode in-kernel data */
>
> @@ -24,6 +25,9 @@ struct shmem_inode_info {
> struct shared_policy policy; /* NUMA memory alloc policy */
> struct simple_xattrs xattrs; /* list of xattrs */
> atomic_t stop_eviction; /* hold when working on inode */
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> + struct memfile_notifier_list memfile_notifiers;
> +#endif
> struct inode vfs_inode;
> };
>
> diff --git a/mm/memfile_notifier.c b/mm/memfile_notifier.c
> index 8171d4601a04..b4699cbf629e 100644
> --- a/mm/memfile_notifier.c
> +++ b/mm/memfile_notifier.c
> @@ -41,11 +41,21 @@ void memfile_notifier_fallocate(struct memfile_notifier_list *list,
> srcu_read_unlock(&srcu, id);
> }
>
> +#ifdef CONFIG_SHMEM
> +extern int shmem_get_memfile_notifier_info(struct inode *inode,
> + struct memfile_notifier_list **list,
> + struct memfile_pfn_ops **ops);
> +#endif
> +
> static int memfile_get_notifier_info(struct inode *inode,
> struct memfile_notifier_list **list,
> struct memfile_pfn_ops **ops)
> {
> - return -EOPNOTSUPP;
> + int ret = -EOPNOTSUPP;
> +#ifdef CONFIG_SHMEM
> + ret = shmem_get_memfile_notifier_info(inode, list, ops);
> +#endif

This looks backwards. Can we have a registration method for the memory
backing store and call it from shmem.c?

> + return ret;
> }
>
> int memfile_register_notifier(struct inode *inode,

--
Sincerely yours,
Mike.

2022-02-12 11:26:33

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 04/12] mm/shmem: Support memfile_notifier

On 1/18/22 05:21, Chao Peng wrote:
> It maintains a memfile_notifier list in shmem_inode_info structure and
> implements memfile_pfn_ops callbacks defined by memfile_notifier. It
> then exposes them to memfile_notifier via
> shmem_get_memfile_notifier_info.
>
> We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be
> allocated by userspace for private memory. If there is no pages
> allocated at the offset then error should be returned so KVM knows that
> the memory is not private memory.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Chao Peng <[email protected]>

> static int memfile_get_notifier_info(struct inode *inode,
> struct memfile_notifier_list **list,
> struct memfile_pfn_ops **ops)
> {
> - return -EOPNOTSUPP;
> + int ret = -EOPNOTSUPP;
> +#ifdef CONFIG_SHMEM
> + ret = shmem_get_memfile_notifier_info(inode, list, ops);
> +#endif
> + return ret;
> }

> +int shmem_get_memfile_notifier_info(struct inode *inode,
> + struct memfile_notifier_list **list,
> + struct memfile_pfn_ops **ops)
> +{
> + struct shmem_inode_info *info;
> +
> + if (!shmem_mapping(inode->i_mapping))
> + return -EINVAL;
> +
> + info = SHMEM_I(inode);
> + *list = &info->memfile_notifiers;
> + if (ops)
> + *ops = &shmem_pfn_ops;
> +
> + return 0;

I can't wrap my head around exactly who is supposed to call these
functions and when, but there appears to be a missing check that the
inode is actually a shmem inode.

What is this code trying to do? It's very abstract.

2022-02-12 13:09:03

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE

On 1/18/22 05:21, Chao Peng wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
> the file is inaccessible from userspace through ordinary MMU access
> (e.g., read/write/mmap). However, the file content can be accessed
> via a different mechanism (e.g. KVM MMU) indirectly.
>
> It provides semantics required for KVM guest private memory support
> that a file descriptor with this seal set is going to be used as the
> source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV but may not be accessible from host userspace.
>
> At this time only shmem implements this seal.
>

I don't dislike this *that* much, but I do dislike this.
F_SEAL_INACCESSIBLE essentially transmutes a memfd into a different type
of object. While this can apparently be done successfully and without
races (as in this code), it's at least awkward. I think that either
creating a special inaccessible memfd should be a single operation that
create the correct type of object or there should be a clear
justification for why it's a two-step process.

(Imagine if the way to create an eventfd would be to call
timerfd_create() and then do a special fcntl to turn it into an eventfd
but only if it's not currently armed. This would be weird.)

2022-02-17 13:18:33

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE

On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
> On 1/18/22 05:21, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
> > the file is inaccessible from userspace through ordinary MMU access
> > (e.g., read/write/mmap). However, the file content can be accessed
> > via a different mechanism (e.g. KVM MMU) indirectly.
> >
> > It provides semantics required for KVM guest private memory support
> > that a file descriptor with this seal set is going to be used as the
> > source of guest memory in confidential computing environments such
> > as Intel TDX/AMD SEV but may not be accessible from host userspace.
> >
> > At this time only shmem implements this seal.
> >
>
> I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE
> essentially transmutes a memfd into a different type of object. While this
> can apparently be done successfully and without races (as in this code),
> it's at least awkward. I think that either creating a special inaccessible
> memfd should be a single operation that create the correct type of object or
> there should be a clear justification for why it's a two-step process.

Now one justification may be from Steven's comment on patch-00: for ARM
usage one can create a normal memfd, (partially) populate it with the
initial guest memory content (e.g. firmware), and then apply
F_SEAL_INACCESSIBLE just before the first launch of the guest in KVM
(definitely the current code needs to be changed to support that).

Thanks,
Chao
>
> (Imagine if the way to create an eventfd would be to call timerfd_create()
> and then do a special fcntl to turn it into an eventfd but only if it's not
> currently armed. This would be weird.)

2022-02-17 18:25:12

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v4 12/12] KVM: Expose KVM_MEM_PRIVATE

On Tue, Jan 25, 2022 at 09:20:39PM +0100, Maciej S. Szmigiero wrote:
> On 18.01.2022 14:21, Chao Peng wrote:
> > KVM_MEM_PRIVATE is not exposed by default but architecture code can turn
> > on it by implementing kvm_arch_private_memory_supported().
> >
> > Also private memslot cannot be movable and the same file+offset can not
> > be mapped into different GFNs.
> >
> > Signed-off-by: Yu Zhang <[email protected]>
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> (..)
> > static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> > - gfn_t start, gfn_t end)
> > + struct file *file,
> > + gfn_t start, gfn_t end,
> > + loff_t start_off, loff_t end_off)
> > {
> > struct kvm_memslot_iter iter;
> > + struct kvm_memory_slot *slot;
> > + struct inode *inode;
> > + int bkt;
> > kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > if (iter.slot->id != id)
> > return true;
> > }
> > + /* Disallow mapping the same file+offset into multiple gfns. */
> > + if (file) {
> > + inode = file_inode(file);
> > + kvm_for_each_memslot(slot, bkt, slots) {
> > + if (slot->private_file &&
> > + file_inode(slot->private_file) == inode &&
> > + !(end_off <= slot->private_offset ||
> > + start_off >= slot->private_offset
> > + + (slot->npages >> PAGE_SHIFT)))
> > + return true;
> > + }
> > + }
>
> That's a linear scan of all memslots on each CREATE (and MOVE) operation
> with a fd - we just spent more than a year rewriting similar linear scans
> into more efficient operations in KVM.

In the last version I tried to solve this problem by using an interval
tree (just like the existing hva_tree), but finally we realized that in
one VM we can have multiple fds with overlapping offsets, so that
approach is incorrect. See https://lkml.org/lkml/2021/12/28/480 for the
discussion.

So a linear scan is used until I can find a better way.

Chao

2022-02-17 21:26:33

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE

On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
> On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
>> On 1/18/22 05:21, Chao Peng wrote:
>> > From: "Kirill A. Shutemov" <[email protected]>
>> >
>> > Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
>> > the file is inaccessible from userspace through ordinary MMU access
>> > (e.g., read/write/mmap). However, the file content can be accessed
>> > via a different mechanism (e.g. KVM MMU) indirectly.
>> >
>> > It provides semantics required for KVM guest private memory support
>> > that a file descriptor with this seal set is going to be used as the
>> > source of guest memory in confidential computing environments such
>> > as Intel TDX/AMD SEV but may not be accessible from host userspace.
>> >
>> > At this time only shmem implements this seal.
>> >
>>
>> I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE
>> essentially transmutes a memfd into a different type of object. While this
>> can apparently be done successfully and without races (as in this code),
>> it's at least awkward. I think that either creating a special inaccessible
>> memfd should be a single operation that create the correct type of object or
>> there should be a clear justification for why it's a two-step process.
>
> Now one justification maybe from Stever's comment to patch-00: for ARM
> usage it can be used with creating a normal memfd, (partially)populate
> it with initial guest memory content (e.g. firmware), and then
> F_SEAL_INACCESSIBLE it just before the first time lunch of the guest in
> KVM (definitely the current code needs to be changed to support that).

Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right? So this won't work.

In any case, the whole confidential VM initialization story is a bit muddy. From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it. From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM. These are fundamentally not the same thing even if they accomplish the same end goal. For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism.

Also, if we ever get fancy and teach the page allocator about memory with reduced directmap permissions, it may well be more efficient for userspace to shove data into a memfd via ioctl than it is to mmap it and write the data.
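
Purely as an illustration of that idea (nothing like this exists or is
proposed in this series), such an interface might look like:

	/* Hypothetical uAPI, for illustration only. */
	struct memfd_populate {
		__u64 offset;		/* destination offset in the memfd */
		__u64 len;		/* number of bytes to copy */
		__u64 user_addr;	/* source buffer in the caller */
	};
	#define MEMFD_IOC_POPULATE	_IOW('m', 0x40, struct memfd_populate)

	/*
	 * ioctl(memfd, MEMFD_IOC_POPULATE, &args) would copy the data into
	 * the (possibly non-mappable) file without ever mmap()ing it.
	 */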

2022-02-17 22:55:19

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE

On Mon, Feb 07, 2022 at 01:24:42PM +0100, Vlastimil Babka wrote:
> On 1/18/22 14:21, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > /*
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 18f93c2d68f1..72185630e7c4 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1098,6 +1098,13 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
> > (newsize > oldsize && (info->seals & F_SEAL_GROW)))
> > return -EPERM;
> >
> > + if (info->seals & F_SEAL_INACCESSIBLE) {
> > + if(i_size_read(inode))
>
> Is this needed? The rest of the function seems to trust oldsize obtained by
> plain reading inode->i_size well enough, so why be suddenly paranoid here?

oldsize sounds enough here, unless Kirill has a different opinion.

>
> > + return -EPERM;
> > + if (newsize & ~PAGE_MASK)
> > + return -EINVAL;
> > + }
> > +
> > if (newsize != oldsize) {
> > error = shmem_reacct_size(SHMEM_I(inode)->flags,
> > + if ((info->seals & F_SEAL_INACCESSIBLE) &&
> > + (offset & ~PAGE_MASK || len & ~PAGE_MASK)) {
>
> Could we use PAGE_ALIGNED()?

Yes, definitely, thanks.
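
i.e. something like (untested):

	if ((info->seals & F_SEAL_INACCESSIBLE) &&
	    (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))) {
		error = -EINVAL;
		goto out;
	}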

Chao
>
> > + error = -EINVAL;
> > + goto out;
> > + }
> > +
> > shmem_falloc.waitq = &shmem_falloc_waitq;
> > shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
> > shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;

2022-02-17 23:27:43

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v4 04/12] mm/shmem: Support memfile_notifier

On Tue, Feb 08, 2022 at 08:29:56PM +0200, Mike Rapoport wrote:
> Hi,
>
> On Tue, Jan 18, 2022 at 09:21:13PM +0800, Chao Peng wrote:
> > It maintains a memfile_notifier list in shmem_inode_info structure and
> > implements memfile_pfn_ops callbacks defined by memfile_notifier. It
> > then exposes them to memfile_notifier via
> > shmem_get_memfile_notifier_info.
> >
> > We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be
> > allocated by userspace for private memory. If there is no pages
> > allocated at the offset then error should be returned so KVM knows that
> > the memory is not private memory.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Signed-off-by: Chao Peng <[email protected]>
> > ---
> > include/linux/shmem_fs.h | 4 ++
> > mm/memfile_notifier.c | 12 +++++-
> > mm/shmem.c | 81 ++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 96 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> > index 166158b6e917..461633587eaf 100644
> > --- a/include/linux/shmem_fs.h
> > +++ b/include/linux/shmem_fs.h
> > @@ -9,6 +9,7 @@
> > #include <linux/percpu_counter.h>
> > #include <linux/xattr.h>
> > #include <linux/fs_parser.h>
> > +#include <linux/memfile_notifier.h>
> >
> > /* inode in-kernel data */
> >
> > @@ -24,6 +25,9 @@ struct shmem_inode_info {
> > struct shared_policy policy; /* NUMA memory alloc policy */
> > struct simple_xattrs xattrs; /* list of xattrs */
> > atomic_t stop_eviction; /* hold when working on inode */
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > + struct memfile_notifier_list memfile_notifiers;
> > +#endif
> > struct inode vfs_inode;
> > };
> >
> > diff --git a/mm/memfile_notifier.c b/mm/memfile_notifier.c
> > index 8171d4601a04..b4699cbf629e 100644
> > --- a/mm/memfile_notifier.c
> > +++ b/mm/memfile_notifier.c
> > @@ -41,11 +41,21 @@ void memfile_notifier_fallocate(struct memfile_notifier_list *list,
> > srcu_read_unlock(&srcu, id);
> > }
> >
> > +#ifdef CONFIG_SHMEM
> > +extern int shmem_get_memfile_notifier_info(struct inode *inode,
> > + struct memfile_notifier_list **list,
> > + struct memfile_pfn_ops **ops);
> > +#endif
> > +
> > static int memfile_get_notifier_info(struct inode *inode,
> > struct memfile_notifier_list **list,
> > struct memfile_pfn_ops **ops)
> > {
> > - return -EOPNOTSUPP;
> > + int ret = -EOPNOTSUPP;
> > +#ifdef CONFIG_SHMEM
> > + ret = shmem_get_memfile_notifier_info(inode, list, ops);
> > +#endif
>
> This looks backwards. Can we have some register method for memory backing
> store and call it from shmem.c?

Agreed. That would be clearer.

Chao
>
> > + return ret;
> > }
> >
> > int memfile_register_notifier(struct inode *inode,
>
> --
> Sincerely yours,
> Mike.

2022-02-18 00:07:08

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private memory

On Tue, Feb 08, 2022 at 08:33:18PM +0200, Mike Rapoport wrote:
> (addded linux-api)
>
> On Tue, Jan 18, 2022 at 09:21:09PM +0800, Chao Peng wrote:
> > This is the v4 of this series which try to implement the fd-based KVM
> > guest private memory. The patches are based on latest kvm/queue branch
> > commit:
> >
> > fea31d169094 KVM: x86/pmu: Fix available_event_types check for
> > REF_CPU_CYCLES event
> >
> > Introduction
> > ------------
> > In general this patch series introduce fd-based memslot which provides
> > guest memory through memory file descriptor fd[offset,size] instead of
> > hva/size. The fd can be created from a supported memory filesystem
> > like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> > and the the memory backing store exchange callbacks when such memslot
> > gets created. At runtime KVM will call into callbacks provided by the
> > backing store to get the pfn with the fd+offset. Memory backing store
> > will also call into KVM callbacks when userspace fallocate/punch hole
> > on the fd to notify KVM to map/unmap secondary MMU page tables.
> >
> > Comparing to existing hva-based memslot, this new type of memslot allows
> > guest memory unmapped from host userspace like QEMU and even the kernel
> > itself, therefore reduce attack surface and prevent bugs.
> >
> > Based on this fd-based memslot, we can build guest private memory that
> > is going to be used in confidential computing environments such as Intel
> > TDX and AMD SEV. When supported, the memory backing store can provide
> > more enforcement on the fd and KVM can use a single memslot to hold both
> > the private and shared part of the guest memory.
> >
> > mm extension
> > ---------------------
> > Introduces new F_SEAL_INACCESSIBLE for shmem and new MFD_INACCESSIBLE
> > flag for memfd_create(), the file created with these flags cannot read(),
> > write() or mmap() etc via normal MMU operations. The file content can
> > only be used with the newly introduced memfile_notifier extension.
>
> It would be great to see man page draft for new ABI flags

Yes, I can provide the man page.

Thanks,
Chao

2022-02-18 00:19:37

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v4 04/12] mm/shmem: Support memfile_notifier

On Fri, Feb 11, 2022 at 03:40:09PM -0800, Andy Lutomirski wrote:
> On 1/18/22 05:21, Chao Peng wrote:
> > It maintains a memfile_notifier list in shmem_inode_info structure and
> > implements memfile_pfn_ops callbacks defined by memfile_notifier. It
> > then exposes them to memfile_notifier via
> > shmem_get_memfile_notifier_info.
> >
> > We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be
> > allocated by userspace for private memory. If there is no pages
> > allocated at the offset then error should be returned so KVM knows that
> > the memory is not private memory.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Signed-off-by: Chao Peng <[email protected]>
>
> > static int memfile_get_notifier_info(struct inode *inode,
> > struct memfile_notifier_list **list,
> > struct memfile_pfn_ops **ops)
> > {
> > - return -EOPNOTSUPP;
> > + int ret = -EOPNOTSUPP;
> > +#ifdef CONFIG_SHMEM
> > + ret = shmem_get_memfile_notifier_info(inode, list, ops);
> > +#endif
> > + return ret;
> > }
>
> > +int shmem_get_memfile_notifier_info(struct inode *inode,
> > + struct memfile_notifier_list **list,
> > + struct memfile_pfn_ops **ops)
> > +{
> > + struct shmem_inode_info *info;
> > +
> > + if (!shmem_mapping(inode->i_mapping))
> > + return -EINVAL;
> > +
> > + info = SHMEM_I(inode);
> > + *list = &info->memfile_notifiers;
> > + if (ops)
> > + *ops = &shmem_pfn_ops;
> > +
> > + return 0;
>
> I can't wrap my head around exactly who is supposed to call these functions
> and when, but there appears to be a missing check that the inode is actually
> a shmem inode.
>
> What is this code trying to do? It's very abstract.

This is to be called by memfile_(un)register_notifier in patch-03 to
allow shmem to be connected to memfile_notifier. But as Mike pointed
out, introducing a memfile_notifier_register_backing_store() probably
sounds better, so the backing store (e.g. shmem) can register itself
to memfile_notifier.
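
Roughly (the names here are tentative, just to sketch the direction):

struct memfile_backing_store {
	struct list_head list;
	/* return the notifier list and pfn_ops for an inode this store owns */
	int (*get_notifier_info)(struct inode *inode,
				 struct memfile_notifier_list **list,
				 struct memfile_pfn_ops **ops);
};

/* called by the backing store (e.g. shmem) at init time, instead of
 * memfile_notifier.c reaching into shmem via an #ifdef */
void memfile_register_backing_store(struct memfile_backing_store *bs);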

Chao

2022-02-22 05:19:19

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v4 12/12] KVM: Expose KVM_MEM_PRIVATE

On 17.02.2022 14:45, Chao Peng wrote:
> On Tue, Jan 25, 2022 at 09:20:39PM +0100, Maciej S. Szmigiero wrote:
>> On 18.01.2022 14:21, Chao Peng wrote:
>>> KVM_MEM_PRIVATE is not exposed by default but architecture code can turn
>>> on it by implementing kvm_arch_private_memory_supported().
>>>
>>> Also private memslot cannot be movable and the same file+offset can not
>>> be mapped into different GFNs.
>>>
>>> Signed-off-by: Yu Zhang <[email protected]>
>>> Signed-off-by: Chao Peng <[email protected]>
>>> ---
>> (..)
>>> static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
>>> - gfn_t start, gfn_t end)
>>> + struct file *file,
>>> + gfn_t start, gfn_t end,
>>> + loff_t start_off, loff_t end_off)
>>> {
>>> struct kvm_memslot_iter iter;
>>> + struct kvm_memory_slot *slot;
>>> + struct inode *inode;
>>> + int bkt;
>>> kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
>>> if (iter.slot->id != id)
>>> return true;
>>> }
>>> + /* Disallow mapping the same file+offset into multiple gfns. */
>>> + if (file) {
>>> + inode = file_inode(file);
>>> + kvm_for_each_memslot(slot, bkt, slots) {
>>> + if (slot->private_file &&
>>> + file_inode(slot->private_file) == inode &&
>>> + !(end_off <= slot->private_offset ||
>>> + start_off >= slot->private_offset
>>> + + (slot->npages >> PAGE_SHIFT)))
>>> + return true;
>>> + }
>>> + }
>>
>> That's a linear scan of all memslots on each CREATE (and MOVE) operation
>> with a fd - we just spent more than a year rewriting similar linear scans
>> into more efficient operations in KVM.
>
> In the last version I tried to solve this problem by using interval tree
> (just like existing hva_tree), but finally we realized that in one VM we
> can have multiple fds with overlapped offsets so that approach is
> incorrect. See https://lkml.org/lkml/2021/12/28/480 for the discussion.

That's right, in this case a two-level structure would be necessary:
the first level matching a file, then the second level matching ranges
within that file.
However, if such data is going to be used just for checking possible
overlap at memslot add or move time, it is almost certainly overkill.

> So linear scan is used before I can find a better way.

Another option would be to simply not check for overlap at add or move
time, declare such configuration undefined behavior under KVM API and
make sure in MMU notifiers that nothing bad happens to the host kernel
if it turns out somebody actually set up a VM this way (it could be
inefficient in this case, since it's not supposed to ever happen
unless there is a bug somewhere in the userspace part).

> Chao

Thanks,
Maciej

2022-02-23 13:44:49

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE

On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote:
> On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
> > On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
> >> On 1/18/22 05:21, Chao Peng wrote:
> >> > From: "Kirill A. Shutemov" <[email protected]>
> >> >
> >> > Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
> >> > the file is inaccessible from userspace through ordinary MMU access
> >> > (e.g., read/write/mmap). However, the file content can be accessed
> >> > via a different mechanism (e.g. KVM MMU) indirectly.
> >> >
> >> > It provides semantics required for KVM guest private memory support
> >> > that a file descriptor with this seal set is going to be used as the
> >> > source of guest memory in confidential computing environments such
> >> > as Intel TDX/AMD SEV but may not be accessible from host userspace.
> >> >
> >> > At this time only shmem implements this seal.
> >> >
> >>
> >> I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE
> >> essentially transmutes a memfd into a different type of object. While this
> >> can apparently be done successfully and without races (as in this code),
> >> it's at least awkward. I think that either creating a special inaccessible
> >> memfd should be a single operation that create the correct type of object or
> >> there should be a clear justification for why it's a two-step process.
> >
> > Now one justification maybe from Stever's comment to patch-00: for ARM
> > usage it can be used with creating a normal memfd, (partially)populate
> > it with initial guest memory content (e.g. firmware), and then
> > F_SEAL_INACCESSIBLE it just before the first time lunch of the guest in
> > KVM (definitely the current code needs to be changed to support that).
>
> Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right? So this won't work.

Hmm, right, if we set F_SEAL_INACCESSIBLE on a non-empty file, we would
need to make sure access to any existing mmap-ed areas is prevented,
but that is hard.

>
> In any case, the whole confidential VM initialization story is a bit buddy. From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it. From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM. These are fundamentally not the same thing even if they accomplish the same end goal. For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism.

Yes, TDX requires an ioctl. Steven may comment on the ARM part.

Chao
>
> Also, if we ever get fancy and teach the page allocator about memory with reduced directmap permissions, it may well be more efficient for userspace to shove data into a memfd via ioctl than it is to mmap it and write the data.



2022-02-23 15:45:29

by Steven Price

[permalink] [raw]
Subject: Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE

On 23/02/2022 11:49, Chao Peng wrote:
> On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote:
>> On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
>>> On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
>>>> On 1/18/22 05:21, Chao Peng wrote:
>>>>> From: "Kirill A. Shutemov" <[email protected]>
>>>>>
>>>>> Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
>>>>> the file is inaccessible from userspace through ordinary MMU access
>>>>> (e.g., read/write/mmap). However, the file content can be accessed
>>>>> via a different mechanism (e.g. KVM MMU) indirectly.
>>>>>
>>>>> It provides semantics required for KVM guest private memory support
>>>>> that a file descriptor with this seal set is going to be used as the
>>>>> source of guest memory in confidential computing environments such
>>>>> as Intel TDX/AMD SEV but may not be accessible from host userspace.
>>>>>
>>>>> At this time only shmem implements this seal.
>>>>>
>>>>
>>>> I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE
>>>> essentially transmutes a memfd into a different type of object. While this
>>>> can apparently be done successfully and without races (as in this code),
>>>> it's at least awkward. I think that either creating a special inaccessible
>>>> memfd should be a single operation that create the correct type of object or
>>>> there should be a clear justification for why it's a two-step process.
>>>
>>> Now one justification maybe from Stever's comment to patch-00: for ARM
>>> usage it can be used with creating a normal memfd, (partially)populate
>>> it with initial guest memory content (e.g. firmware), and then
>>> F_SEAL_INACCESSIBLE it just before the first time lunch of the guest in
>>> KVM (definitely the current code needs to be changed to support that).
>>
>> Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right? So this won't work.
>
> Hmm, right, if we set F_SEAL_INACCESSIBLE on a non-empty file, we will
> need to make sure access to existing mmap-ed area should be prevented,
> but that is hard.
>
>>
>> In any case, the whole confidential VM initialization story is a bit buddy. From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it. From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM. These are fundamentally not the same thing even if they accomplish the same end goal. For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism.
>
> Yes, TDX requires a ioctl. Steven may comment on the ARM part.

The Arm story is evolving so I can't give a definite answer yet. Our
current prototyping works by creating the initial VM content in a
memslot as with a normal VM and then calling an ioctl which throws the
big switch and converts all the (populated) pages to be protected. At
this point the RMM performs a measurement of the data that the VM is
being populated with.

The above (in our prototype) suffers from all the expected problems with
a malicious VMM being able to trick the host kernel into accessing those
pages after they have been protected (causing a fault detected by the
hardware).

The ideal (from our perspective) approach would be to follow the same
flow but where the VMM populates a memfd rather than normal anonymous
pages. The memfd could then be sealed and the pages converted to
protected ones (with the RMM measuring them in the process).

The question becomes how is that memfd populated? It would be nice if
that could be done using normal operations on a memfd (i.e. using
mmap()) and therefore this code could be (relatively) portable. This
would mean that any pages mapped from the memfd would either need to
block the sealing or be revoked at the time of sealing.
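
In other words, the VMM flow we would like is roughly the below (the
final seal/convert step is exactly the open question; the ioctl name is
made up):

	int fd = memfd_create("guest_mem", MFD_ALLOW_SEALING);
	ftruncate(fd, mem_size);

	/* populate initial content (e.g. firmware) with normal operations */
	void *p = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	memcpy(p, firmware, firmware_size);
	munmap(p, mem_size);	/* mappings must be gone (or revoked) before sealing */

	fcntl(fd, F_ADD_SEALS, F_SEAL_INACCESSIBLE);
	/* convert the pages to protected, with the RMM measuring them */
	ioctl(vm_fd, KVM_ARM_PROTECT_MEMORY, fd);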

The other approach is we could of course implement a special ioctl which
effectively does a memcpy into the (created empty and sealed) memfd and
does the necessary dance with the RMM to measure the contents. This
would match the "transcript of the series of operations" described above
- but seems much less ideal from the viewpoint of the VMM.

Steve

> Chao
>>
>> Also, if we ever get fancy and teach the page allocator about memory with reduced directmap permissions, it may well be more efficient for userspace to shove data into a memfd via ioctl than it is to mmap it and write the data.
>
>
>

2022-02-23 20:13:52

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v4 12/12] KVM: Expose KVM_MEM_PRIVATE

On Tue, Feb 22, 2022 at 02:16:46AM +0100, Maciej S. Szmigiero wrote:
> On 17.02.2022 14:45, Chao Peng wrote:
> > On Tue, Jan 25, 2022 at 09:20:39PM +0100, Maciej S. Szmigiero wrote:
> > > On 18.01.2022 14:21, Chao Peng wrote:
> > > > KVM_MEM_PRIVATE is not exposed by default but architecture code can turn
> > > > on it by implementing kvm_arch_private_memory_supported().
> > > >
> > > > Also private memslot cannot be movable and the same file+offset can not
> > > > be mapped into different GFNs.
> > > >
> > > > Signed-off-by: Yu Zhang <[email protected]>
> > > > Signed-off-by: Chao Peng <[email protected]>
> > > > ---
> > > (..)
> > > > static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> > > > - gfn_t start, gfn_t end)
> > > > + struct file *file,
> > > > + gfn_t start, gfn_t end,
> > > > + loff_t start_off, loff_t end_off)
> > > > {
> > > > struct kvm_memslot_iter iter;
> > > > + struct kvm_memory_slot *slot;
> > > > + struct inode *inode;
> > > > + int bkt;
> > > > kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > > > if (iter.slot->id != id)
> > > > return true;
> > > > }
> > > > + /* Disallow mapping the same file+offset into multiple gfns. */
> > > > + if (file) {
> > > > + inode = file_inode(file);
> > > > + kvm_for_each_memslot(slot, bkt, slots) {
> > > > + if (slot->private_file &&
> > > > + file_inode(slot->private_file) == inode &&
> > > > + !(end_off <= slot->private_offset ||
> > > > + start_off >= slot->private_offset
> > > > + + (slot->npages >> PAGE_SHIFT)))
> > > > + return true;
> > > > + }
> > > > + }
> > >
> > > That's a linear scan of all memslots on each CREATE (and MOVE) operation
> > > with a fd - we just spent more than a year rewriting similar linear scans
> > > into more efficient operations in KVM.
> >
> > In the last version I tried to solve this problem by using interval tree
> > (just like existing hva_tree), but finally we realized that in one VM we
> > can have multiple fds with overlapped offsets so that approach is
> > incorrect. See https://lkml.org/lkml/2021/12/28/480 for the discussion.
>
> That's right, in this case a two-level structure would be necessary:
> the first level matching a file, then the second level matching that
> file ranges.
> However, if such data is going to be used just for checking possible
> overlap at memslot add or move time it is almost certainly an overkill.

Yes, that is also what I'm seeing.

>
> > So linear scan is used before I can find a better way.
>
> Another option would be to simply not check for overlap at add or move
> time, declare such configuration undefined behavior under KVM API and
> make sure in MMU notifiers that nothing bad happens to the host kernel
> if it turns out somebody actually set up a VM this way (it could be
> inefficient in this case, since it's not supposed to ever happen
> unless there is a bug somewhere in the userspace part).

Specific to the TDX case, the SEAM module will fail the overlapping case
and then KVM prints a message to the kernel log. It will not cause any
other side effect, though it does look weird. Yes, warning about that in
the API documentation can help to some extent.

Thanks,
Chao
>
> > Chao
>
> Thanks,
> Maciej

2022-02-23 22:42:49

by Maciej S. Szmigiero

[permalink] [raw]
Subject: Re: [PATCH v4 12/12] KVM: Expose KVM_MEM_PRIVATE

On 23.02.2022 13:00, Chao Peng wrote:
> On Tue, Feb 22, 2022 at 02:16:46AM +0100, Maciej S. Szmigiero wrote:
>> On 17.02.2022 14:45, Chao Peng wrote:
>>> On Tue, Jan 25, 2022 at 09:20:39PM +0100, Maciej S. Szmigiero wrote:
>>>> On 18.01.2022 14:21, Chao Peng wrote:
>>>>> KVM_MEM_PRIVATE is not exposed by default but architecture code can turn
>>>>> on it by implementing kvm_arch_private_memory_supported().
>>>>>
>>>>> Also private memslot cannot be movable and the same file+offset can not
>>>>> be mapped into different GFNs.
>>>>>
>>>>> Signed-off-by: Yu Zhang <[email protected]>
>>>>> Signed-off-by: Chao Peng <[email protected]>
>>>>> ---
>>>> (..)
>>>>> static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
>>>>> - gfn_t start, gfn_t end)
>>>>> + struct file *file,
>>>>> + gfn_t start, gfn_t end,
>>>>> + loff_t start_off, loff_t end_off)
>>>>> {
>>>>> struct kvm_memslot_iter iter;
>>>>> + struct kvm_memory_slot *slot;
>>>>> + struct inode *inode;
>>>>> + int bkt;
>>>>> kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
>>>>> if (iter.slot->id != id)
>>>>> return true;
>>>>> }
>>>>> + /* Disallow mapping the same file+offset into multiple gfns. */
>>>>> + if (file) {
>>>>> + inode = file_inode(file);
>>>>> + kvm_for_each_memslot(slot, bkt, slots) {
>>>>> + if (slot->private_file &&
>>>>> + file_inode(slot->private_file) == inode &&
>>>>> + !(end_off <= slot->private_offset ||
>>>>> + start_off >= slot->private_offset
>>>>> + + (slot->npages >> PAGE_SHIFT)))
>>>>> + return true;
>>>>> + }
>>>>> + }
>>>>
>>>> That's a linear scan of all memslots on each CREATE (and MOVE) operation
>>>> with a fd - we just spent more than a year rewriting similar linear scans
>>>> into more efficient operations in KVM.
>>>
(..)
>>> So linear scan is used before I can find a better way.
>>
>> Another option would be to simply not check for overlap at add or move
>> time, declare such configuration undefined behavior under KVM API and
>> make sure in MMU notifiers that nothing bad happens to the host kernel
>> if it turns out somebody actually set up a VM this way (it could be
>> inefficient in this case, since it's not supposed to ever happen
>> unless there is a bug somewhere in the userspace part).
>
> Specific to TDX case, SEAMMODULE will fail the overlapping case and then
> KVM prints a message to the kernel log. It will not cause any other side
> effect, it does look weird however. Yes warn that in the API document
> can help to some extent.

So for the functionality you are adding this code for (TDX) this scan
isn't necessary and the overlapping case (not supported anyway) is safely
handled by the hardware (or firmware)?
Then I would simply remove the scan and, maybe, add a comment instead
that the overlap check is done by the hardware.

By the way, if a kernel log message could be triggered by (misbehaving)
userspace then it should be rate limited (if it isn't already).

> Thanks,
> Chao

Thanks,
Maciej

2022-02-24 08:21:49

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v4 12/12] KVM: Expose KVM_MEM_PRIVATE

On Wed, Feb 23, 2022 at 07:32:37PM +0100, Maciej S. Szmigiero wrote:
> On 23.02.2022 13:00, Chao Peng wrote:
> > On Tue, Feb 22, 2022 at 02:16:46AM +0100, Maciej S. Szmigiero wrote:
> > > On 17.02.2022 14:45, Chao Peng wrote:
> > > > On Tue, Jan 25, 2022 at 09:20:39PM +0100, Maciej S. Szmigiero wrote:
> > > > > On 18.01.2022 14:21, Chao Peng wrote:
> > > > > > KVM_MEM_PRIVATE is not exposed by default but architecture code can turn
> > > > > > on it by implementing kvm_arch_private_memory_supported().
> > > > > >
> > > > > > Also private memslot cannot be movable and the same file+offset can not
> > > > > > be mapped into different GFNs.
> > > > > >
> > > > > > Signed-off-by: Yu Zhang <[email protected]>
> > > > > > Signed-off-by: Chao Peng <[email protected]>
> > > > > > ---
> > > > > (..)
> > > > > > static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> > > > > > - gfn_t start, gfn_t end)
> > > > > > + struct file *file,
> > > > > > + gfn_t start, gfn_t end,
> > > > > > + loff_t start_off, loff_t end_off)
> > > > > > {
> > > > > > struct kvm_memslot_iter iter;
> > > > > > + struct kvm_memory_slot *slot;
> > > > > > + struct inode *inode;
> > > > > > + int bkt;
> > > > > > kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> > > > > > if (iter.slot->id != id)
> > > > > > return true;
> > > > > > }
> > > > > > + /* Disallow mapping the same file+offset into multiple gfns. */
> > > > > > + if (file) {
> > > > > > + inode = file_inode(file);
> > > > > > + kvm_for_each_memslot(slot, bkt, slots) {
> > > > > > + if (slot->private_file &&
> > > > > > + file_inode(slot->private_file) == inode &&
> > > > > > + !(end_off <= slot->private_offset ||
> > > > > > + start_off >= slot->private_offset
> > > > > > + + (slot->npages >> PAGE_SHIFT)))
> > > > > > + return true;
> > > > > > + }
> > > > > > + }
> > > > >
> > > > > That's a linear scan of all memslots on each CREATE (and MOVE) operation
> > > > > with a fd - we just spent more than a year rewriting similar linear scans
> > > > > into more efficient operations in KVM.
> > > >
> (..)
> > > > So linear scan is used before I can find a better way.
> > >
> > > Another option would be to simply not check for overlap at add or move
> > > time, declare such configuration undefined behavior under KVM API and
> > > make sure in MMU notifiers that nothing bad happens to the host kernel
> > > if it turns out somebody actually set up a VM this way (it could be
> > > inefficient in this case, since it's not supposed to ever happen
> > > unless there is a bug somewhere in the userspace part).
> >
> > Specific to TDX case, SEAMMODULE will fail the overlapping case and then
> > KVM prints a message to the kernel log. It will not cause any other side
> > effect, it does look weird however. Yes warn that in the API document
> > can help to some extent.
>
> So for the functionality you are adding this code for (TDX) this scan
> isn't necessary and the overlapping case (not supported anyway) is safely
> handled by the hardware (or firmware)?

Yes, it will be handled by the firmware.

> Then I would simply remove the scan and, maybe, add a comment instead
> that the overlap check is done by the hardware.

Sure.
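
Something like (sketch):

	/*
	 * No check here that the same private_file+offset is mapped into
	 * multiple GFNs: such a configuration is unsupported and, for TDX,
	 * is rejected by the firmware when the private page gets mapped.
	 */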

>
> By the way, if a kernel log message could be triggered by (misbehaving)
> userspace then it should be rate limited (if it isn't already).

Thanks for mentioning that.

Chao
>
> > Thanks,
> > Chao
>
> Thanks,
> Maciej

2022-03-04 20:57:41

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE

On 2/23/22 04:05, Steven Price wrote:
> On 23/02/2022 11:49, Chao Peng wrote:
>> On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote:
>>> On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
>>>> On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
>>>>> On 1/18/22 05:21, Chao Peng wrote:
>>>>>> From: "Kirill A. Shutemov" <[email protected]>
>>>>>>
>>>>>> Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
>>>>>> the file is inaccessible from userspace through ordinary MMU access
>>>>>> (e.g., read/write/mmap). However, the file content can be accessed
>>>>>> via a different mechanism (e.g. KVM MMU) indirectly.
>>>>>>
>>>>>> It provides semantics required for KVM guest private memory support
>>>>>> that a file descriptor with this seal set is going to be used as the
>>>>>> source of guest memory in confidential computing environments such
>>>>>> as Intel TDX/AMD SEV but may not be accessible from host userspace.
>>>>>>
>>>>>> At this time only shmem implements this seal.
>>>>>>
>>>>>
>>>>> I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE
>>>>> essentially transmutes a memfd into a different type of object. While this
>>>>> can apparently be done successfully and without races (as in this code),
>>>>> it's at least awkward. I think that either creating a special inaccessible
>>>>> memfd should be a single operation that create the correct type of object or
>>>>> there should be a clear justification for why it's a two-step process.
>>>>
>>>> Now one justification maybe from Stever's comment to patch-00: for ARM
>>>> usage it can be used with creating a normal memfd, (partially)populate
>>>> it with initial guest memory content (e.g. firmware), and then
>>>> F_SEAL_INACCESSIBLE it just before the first time lunch of the guest in
>>>> KVM (definitely the current code needs to be changed to support that).
>>>
>>> Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right? So this won't work.
>>
>> Hmm, right, if we set F_SEAL_INACCESSIBLE on a non-empty file, we will
>> need to make sure access to existing mmap-ed area should be prevented,
>> but that is hard.
>>
>>>
>>> In any case, the whole confidential VM initialization story is a bit buddy. From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it. From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM. These are fundamentally not the same thing even if they accomplish the same end goal. For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism.
>>
>> Yes, TDX requires a ioctl. Steven may comment on the ARM part.
>
> The Arm story is evolving so I can't give a definite answer yet. Our
> current prototyping works by creating the initial VM content in a
> memslot as with a normal VM and then calling an ioctl which throws the
> big switch and converts all the (populated) pages to be protected. At
> this point the RMM performs a measurement of the data that the VM is
> being populated with.
>
> The above (in our prototype) suffers from all the expected problems with
> a malicious VMM being able to trick the host kernel into accessing those
> pages after they have been protected (causing a fault detected by the
> hardware).
>
> The ideal (from our perspective) approach would be to follow the same
> flow but where the VMM populates a memfd rather than normal anonymous
> pages. The memfd could then be sealed and the pages converted to
> protected ones (with the RMM measuring them in the process).
>
> The question becomes how is that memfd populated? It would be nice if
> that could be done using normal operations on a memfd (i.e. using
> mmap()) and therefore this code could be (relatively) portable. This
> would mean that any pages mapped from the memfd would either need to
> block the sealing or be revoked at the time of sealing.
>
> The other approach is we could of course implement a special ioctl which
> effectively does a memcpy into the (created empty and sealed) memfd and
> does the necessary dance with the RMM to measure the contents. This
> would match the "transcript of the series of operations" described above
> - but seems much less ideal from the viewpoint of the VMM.

A VMM that supports Other Vendors will need to understand this sort of
model regardless.

I don't particularly mind the idea of having the kernel consume a normal
memfd and spit out a new object, but I find the concept of changing the
type of the object in place, even if it has other references, and trying
to control all the resulting races to be somewhat alarming.

In pseudo-Rust, this is the difference between:

fn convert_to_private(in: &mut Memfd)

and

fn convert_to_private(in: Memfd) -> PrivateMemoryFd

This doesn't map particularly nicely to the kernel, though.

--Andy

2022-03-07 22:14:33

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v4 03/12] mm: Introduce memfile_notifier

On 1/18/22 14:21, Chao Peng wrote:
> This patch introduces memfile_notifier facility so existing memory file
> subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to allow a
> third kernel component to make use of memory bookmarked in the memory
> file and gets notified when the pages in the memory file become
> allocated/invalidated.
>
> It will be used for KVM to use a file descriptor as the guest memory
> backing store and KVM will use this memfile_notifier interface to
> interact with memory file subsystems. In the future there might be other
> consumers (e.g. VFIO with encrypted device memory).
>
> It consists two sets of callbacks:
> - memfile_notifier_ops: callbacks for memory backing store to notify
> KVM when memory gets allocated/invalidated.
> - memfile_pfn_ops: callbacks for KVM to call into memory backing store
> to request memory pages for guest private memory.
>
> Userspace is in charge of guest memory lifecycle: it first allocates
> pages in memory backing store and then passes the fd to KVM and lets KVM
> register each memory slot to memory backing store via
> memfile_register_notifier.
>
> The supported memory backing store should maintain a memfile_notifier list
> and provide routine for memfile_notifier to get the list head address and
> memfile_pfn_ops callbacks for memfile_register_notifier. It also should call
> memfile_notifier_fallocate/memfile_notifier_invalidate when the bookmarked
> memory gets allocated/invalidated.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

Process nitpick:
Here and in patch 4/12 you have Kirill's S-o-b, so there should probably
also be a "From: Kirill ..." as there was in v3? Or, in case you modified
the original patches so much that you became the primary author, you
should add "Co-developed-by: Kirill ..." here before his S-o-b.

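I.e. the tag block would then look like:

  Co-developed-by: Kirill A. Shutemov <[email protected]>
  Signed-off-by: Kirill A. Shutemov <[email protected]>
  Signed-off-by: Chao Peng <[email protected]>
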
> Signed-off-by: Chao Peng <[email protected]>
> ---
> include/linux/memfile_notifier.h | 53 +++++++++++++++++++
> mm/Kconfig | 4 ++
> mm/Makefile | 1 +
> mm/memfile_notifier.c | 89 ++++++++++++++++++++++++++++++++
> 4 files changed, 147 insertions(+)
> create mode 100644 include/linux/memfile_notifier.h
> create mode 100644 mm/memfile_notifier.c
>
> diff --git a/include/linux/memfile_notifier.h b/include/linux/memfile_notifier.h
> new file mode 100644
> index 000000000000..a03bebdd1322
> --- /dev/null
> +++ b/include/linux/memfile_notifier.h
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMFILE_NOTIFIER_H
> +#define _LINUX_MEMFILE_NOTIFIER_H
> +
> +#include <linux/rculist.h>
> +#include <linux/spinlock.h>
> +#include <linux/srcu.h>
> +#include <linux/fs.h>
> +
> +struct memfile_notifier;
> +
> +struct memfile_notifier_ops {
> + void (*invalidate)(struct memfile_notifier *notifier,
> + pgoff_t start, pgoff_t end);
> + void (*fallocate)(struct memfile_notifier *notifier,
> + pgoff_t start, pgoff_t end);
> +};
> +
> +struct memfile_pfn_ops {
> + long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order);
> + void (*put_unlock_pfn)(unsigned long pfn);
> +};
> +
> +struct memfile_notifier {
> + struct list_head list;
> + struct memfile_notifier_ops *ops;
> +};
> +
> +struct memfile_notifier_list {
> + struct list_head head;
> + spinlock_t lock;
> +};
> +
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +static inline void memfile_notifier_list_init(struct memfile_notifier_list *list)
> +{
> + INIT_LIST_HEAD(&list->head);
> + spin_lock_init(&list->lock);
> +}
> +
> +extern void memfile_notifier_invalidate(struct memfile_notifier_list *list,
> + pgoff_t start, pgoff_t end);
> +extern void memfile_notifier_fallocate(struct memfile_notifier_list *list,
> + pgoff_t start, pgoff_t end);
> +extern int memfile_register_notifier(struct inode *inode,
> + struct memfile_notifier *notifier,
> + struct memfile_pfn_ops **pfn_ops);
> +extern void memfile_unregister_notifier(struct inode *inode,
> + struct memfile_notifier *notifier);
> +
> +#endif /* CONFIG_MEMFILE_NOTIFIER */
> +
> +#endif /* _LINUX_MEMFILE_NOTIFIER_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 28edafc820ad..fa31eda3c895 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -900,6 +900,10 @@ config IO_MAPPING
> config SECRETMEM
> def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
>
> +config MEMFILE_NOTIFIER
> + bool
> + select SRCU
> +
> source "mm/damon/Kconfig"
>
> endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index d6c0042e3aa0..80588f7c3bc2 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -130,3 +130,4 @@ obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
> obj-$(CONFIG_IO_MAPPING) += io-mapping.o
> obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
> obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
> +obj-$(CONFIG_MEMFILE_NOTIFIER) += memfile_notifier.o
> diff --git a/mm/memfile_notifier.c b/mm/memfile_notifier.c
> new file mode 100644
> index 000000000000..8171d4601a04
> --- /dev/null
> +++ b/mm/memfile_notifier.c
> @@ -0,0 +1,89 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * linux/mm/memfile_notifier.c
> + *
> + * Copyright (C) 2022 Intel Corporation.
> + * Chao Peng <[email protected]>
> + */
> +
> +#include <linux/memfile_notifier.h>
> +#include <linux/srcu.h>
> +
> +DEFINE_STATIC_SRCU(srcu);
> +
> +void memfile_notifier_invalidate(struct memfile_notifier_list *list,
> + pgoff_t start, pgoff_t end)
> +{
> + struct memfile_notifier *notifier;
> + int id;
> +
> + id = srcu_read_lock(&srcu);
> + list_for_each_entry_srcu(notifier, &list->head, list,
> + srcu_read_lock_held(&srcu)) {
> + if (notifier->ops && notifier->ops->invalidate)
> + notifier->ops->invalidate(notifier, start, end);
> + }
> + srcu_read_unlock(&srcu, id);
> +}
> +
> +void memfile_notifier_fallocate(struct memfile_notifier_list *list,
> + pgoff_t start, pgoff_t end)
> +{
> + struct memfile_notifier *notifier;
> + int id;
> +
> + id = srcu_read_lock(&srcu);
> + list_for_each_entry_srcu(notifier, &list->head, list,
> + srcu_read_lock_held(&srcu)) {
> + if (notifier->ops && notifier->ops->fallocate)
> + notifier->ops->fallocate(notifier, start, end);
> + }
> + srcu_read_unlock(&srcu, id);
> +}
> +
> +static int memfile_get_notifier_info(struct inode *inode,
> + struct memfile_notifier_list **list,
> + struct memfile_pfn_ops **ops)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +int memfile_register_notifier(struct inode *inode,
> + struct memfile_notifier *notifier,
> + struct memfile_pfn_ops **pfn_ops)
> +{
> + struct memfile_notifier_list *list;
> + int ret;
> +
> > + if (!inode || !notifier || !pfn_ops)
> + return -EINVAL;
> +
> + ret = memfile_get_notifier_info(inode, &list, pfn_ops);
> + if (ret)
> + return ret;
> +
> + spin_lock(&list->lock);
> + list_add_rcu(&notifier->list, &list->head);
> + spin_unlock(&list->lock);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(memfile_register_notifier);
> +
> +void memfile_unregister_notifier(struct inode *inode,
> + struct memfile_notifier *notifier)
> +{
> + struct memfile_notifier_list *list;
> +
> + if (!inode || !notifier)
> + return;
> +
> + BUG_ON(memfile_get_notifier_info(inode, &list, NULL));
> +
> + spin_lock(&list->lock);
> + list_del_rcu(&notifier->list);
> + spin_unlock(&list->lock);
> +
> + synchronize_srcu(&srcu);
> +}
> +EXPORT_SYMBOL_GPL(memfile_unregister_notifier);

2022-03-08 06:17:49

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v4 03/12] mm: Introduce memfile_notifier

On Mon, Mar 07, 2022 at 04:42:08PM +0100, Vlastimil Babka wrote:
> On 1/18/22 14:21, Chao Peng wrote:
> > This patch introduces memfile_notifier facility so existing memory file
> > subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to allow a
> > third kernel component to make use of memory bookmarked in the memory
> > file and gets notified when the pages in the memory file become
> > allocated/invalidated.
> >
> > It will be used for KVM to use a file descriptor as the guest memory
> > backing store and KVM will use this memfile_notifier interface to
> > interact with memory file subsystems. In the future there might be other
> > consumers (e.g. VFIO with encrypted device memory).
> >
> > It consists two sets of callbacks:
> > - memfile_notifier_ops: callbacks for memory backing store to notify
> > KVM when memory gets allocated/invalidated.
> > - memfile_pfn_ops: callbacks for KVM to call into memory backing store
> > to request memory pages for guest private memory.
> >
> > Userspace is in charge of guest memory lifecycle: it first allocates
> > pages in memory backing store and then passes the fd to KVM and lets KVM
> > register each memory slot to memory backing store via
> > memfile_register_notifier.
> >
> > The supported memory backing store should maintain a memfile_notifier list
> > and provide routine for memfile_notifier to get the list head address and
> > memfile_pfn_ops callbacks for memfile_register_notifier. It also should call
> > memfile_notifier_fallocate/memfile_notifier_invalidate when the bookmarked
> > memory gets allocated/invalidated.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
>
> Process nitpick:
> Here and in patch 4/12 you have Kirill's S-o-b so there should probably be
> also "From: Kirill ..." as was in v3? Or in case you modified the original
> patches so much to become the primary author, you should add
> "Co-developed-by: Kirill ..." here before his S-o-b.

Thanks. 3/12 is vastly rewritten, so the latter option applies.
4/12 should keep Kirill as the primary author.

Chao

2022-03-08 14:15:35

by Chao Peng

[permalink] [raw]
Subject: Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE

On Fri, Mar 04, 2022 at 11:24:30AM -0800, Andy Lutomirski wrote:
> On 2/23/22 04:05, Steven Price wrote:
> > On 23/02/2022 11:49, Chao Peng wrote:
> > > On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote:
> > > > On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
> > > > > On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
> > > > > > On 1/18/22 05:21, Chao Peng wrote:
> > > > > > > From: "Kirill A. Shutemov" <[email protected]>
> > > > > > >
> > > > > > > Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
> > > > > > > the file is inaccessible from userspace through ordinary MMU access
> > > > > > > (e.g., read/write/mmap). However, the file content can be accessed
> > > > > > > via a different mechanism (e.g. KVM MMU) indirectly.
> > > > > > >
> > > > > > > It provides semantics required for KVM guest private memory support
> > > > > > > that a file descriptor with this seal set is going to be used as the
> > > > > > > source of guest memory in confidential computing environments such
> > > > > > > as Intel TDX/AMD SEV but may not be accessible from host userspace.
> > > > > > >
> > > > > > > At this time only shmem implements this seal.
> > > > > > >
> > > > > >
> > > > > > I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE
> > > > > > essentially transmutes a memfd into a different type of object. While this
> > > > > > can apparently be done successfully and without races (as in this code),
> > > > > > it's at least awkward. I think that either creating a special inaccessible
> > > > > > memfd should be a single operation that create the correct type of object or
> > > > > > there should be a clear justification for why it's a two-step process.
> > > > >
> > > > > Now one justification maybe from Stever's comment to patch-00: for ARM
> > > > > usage it can be used with creating a normal memfd, (partially)populate
> > > > > it with initial guest memory content (e.g. firmware), and then
> > > > > F_SEAL_INACCESSIBLE it just before the first time lunch of the guest in
> > > > > KVM (definitely the current code needs to be changed to support that).
> > > >
> > > > Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right? So this won't work.
> > >
> > > Hmm, right, if we set F_SEAL_INACCESSIBLE on a non-empty file, we will
> > > need to make sure access to existing mmap-ed area should be prevented,
> > > but that is hard.
> > >
> > > >
> > > > In any case, the whole confidential VM initialization story is a bit buddy. From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it. From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM. These are fundamentally not the same thing even if they accomplish the same end goal. For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism.
> > >
> > > Yes, TDX requires a ioctl. Steven may comment on the ARM part.
> >
> > The Arm story is evolving so I can't give a definite answer yet. Our
> > current prototyping works by creating the initial VM content in a
> > memslot as with a normal VM and then calling an ioctl which throws the
> > big switch and converts all the (populated) pages to be protected. At
> > this point the RMM performs a measurement of the data that the VM is
> > being populated with.
> >
> > The above (in our prototype) suffers from all the expected problems with
> > a malicious VMM being able to trick the host kernel into accessing those
> > pages after they have been protected (causing a fault detected by the
> > hardware).
> >
> > The ideal (from our perspective) approach would be to follow the same
> > flow but where the VMM populates a memfd rather than normal anonymous
> > pages. The memfd could then be sealed and the pages converted to
> > protected ones (with the RMM measuring them in the process).
> >
> > The question becomes how is that memfd populated? It would be nice if
> > that could be done using normal operations on a memfd (i.e. using
> > mmap()) and therefore this code could be (relatively) portable. This
> > would mean that any pages mapped from the memfd would either need to
> > block the sealing or be revoked at the time of sealing.
> >
> > The other approach is we could of course implement a special ioctl which
> > effectively does a memcpy into the (created empty and sealed) memfd and
> > does the necessary dance with the RMM to measure the contents. This
> > would match the "transcript of the series of operations" described above
> > - but seems much less ideal from the viewpoint of the VMM.
>
> A VMM that supports Other Vendors will need to understand this sort of model
> regardless.
>
> I don't particularly mind the idea of having the kernel consume a normal
> memfd and spit out a new object, but I find the concept of changing the type
> of the object in place, even if it has other references, and trying to
> control all the resulting races to be somewhat alarming.
>
> In pseudo-Rust, this is the difference between:
>
> fn convert_to_private(in: &mut Memfd)
>
> and
>
> fn convert_to_private(in: Memfd) -> PrivateMemoryFd
>
> This doesn't map particularly nicely to the kernel, though.

I understand the Rust semantics and the difficulty of handling the races.
Probably we should not expose F_SEAL_INACCESSIBLE to userspace; instead
we can use a new in-kernel flag to indicate the same thing. That flag
would be set only when the memfd is created with MFD_INACCESSIBLE.
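
E.g. (rough sketch; the in-kernel flag name is made up):

	/* userspace: the fd is inaccessible from the moment it exists */
	int fd = memfd_create("guest_private", MFD_INACCESSIBLE);

	/* kernel, in memfd_create(): record it on the inode rather than
	 * exposing a new F_SEAL_* through fcntl() */
	if (flags & MFD_INACCESSIBLE)
		SHMEM_I(file_inode(file))->flags |= SHM_F_INACCESSIBLE;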

Chao
>
> --Andy\

2022-03-08 15:38:32

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE

On 3/7/22 14:26, Chao Peng wrote:
>> In pseudo-Rust, this is the difference between:
>>
>> fn convert_to_private(in: &mut Memfd)
>>
>> and
>>
>> fn convert_to_private(in: Memfd) -> PrivateMemoryFd
>>
>> This doesn't map particularly nicely to the kernel, though.
> I understand this Rust semantics and the difficulty to handle races.
> Probably we should not expose F_SEAL_INACCESSIBLE to userspace, instead
> we can use a new in-kernel flag to indicate the same thing. That flag
> should be set only when the memfd is created with MFD_INACCESSIBLE.

Yes, I like this.

Paolo