2024-04-04 18:50:50

by Paolo Bonzini

Subject: [PATCH 00/11] KVM: guest_memfd: New hooks and functionality for SEV-SNP and TDX

[Matthew, you're Cc'd here for patches 1 and 3 which touch
the mm/filemap code. Since in the meanwhile the KVM side has
taken a more definitive shape, this time through review/ack is
welcome! And there is a proper commit message too. - Paolo]

This is the next version of the gmem common API patches,
adding target-independent functionality and hooks that are
needed by SEV-SNP and TDX.

The code in here is mostly taken from two series:

- [PATCH 00/21] TDX/SNP part 1 of n, for 6.9
https://lore.kernel.org/kvm/[email protected]/

- [PATCH gmem 0/6] gmem fix-ups and interfaces for populating gmem pages
https://lore.kernel.org/kvm/[email protected]/

1-2: This introduces an AS_INACCESSIBLE flag that prevents unexpected
accesses to hole-punched gmem pages before invalidation hooks have had
a chance to make them safely accessible to the host again.

3-9: This introduces an interface for preparing gmem pages either on first
use or by populating them with user data.

The latter interface, kvm_gmem_populate(), alternates calls
to __kvm_gmem_get_pfn() with calls to a user-provided callback.
This implementation simplifies the handling of races and errors
by confining filemap rollback and locking to kvm_gmem_populate().
The function's tasks are otherwise kept to the minimum so that
it can be used by both SNP and TDX.

10-11: This introduces other hooks needed by SEV-SNP, and is unchanged
from "[PATCH 00/21] TDX/SNP part 1 of n, for 6.9".

The main changes compared to the previous posting are in patch 9;
both the locking of kvm_gmem_populate() (which now takes the
filemap's invalidate_lock) and the operation of the function
(which now looks up the memslot, but OTOH does not do copy_from_user()
anymore) are pretty new. I lightly tested the logic by adding a call
to it for sw-protected VMs.

Shout or post fixups if it breaks something for you.

Current state:

- kvm/queue has the SEV_INIT2 and some easy refactorings from
the TDX series. Both are expected to move to kvm/next soon.

- I have already pushed this to kvm-coco-queue, but I haven't
finished the #VE series yet so tomorrow I'll post it and
update kvm-coco-queue again.

Paolo


Michael Roth (4):
mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory
KVM: guest_memfd: Use AS_INACCESSIBLE when creating guest_memfd inode
KVM: guest_memfd: Add hook for invalidating memory
KVM: x86: Add gmem hook for determining max NPT mapping level

Paolo Bonzini (7):
KVM: guest_memfd: pass error up from filemap_grab_folio
filemap: add FGP_CREAT_ONLY
KVM: guest_memfd: limit overzealous WARN
KVM: guest_memfd: Add hook for initializing memory
KVM: guest_memfd: extract __kvm_gmem_get_pfn()
KVM: guest_memfd: extract __kvm_gmem_punch_hole()
KVM: guest_memfd: Add interface for populating gmem pages with user
data

arch/x86/include/asm/kvm-x86-ops.h | 3 +
arch/x86/include/asm/kvm_host.h | 4 +
arch/x86/kvm/mmu/mmu.c | 8 +
arch/x86/kvm/x86.c | 13 ++
include/linux/kvm_host.h | 35 +++++
include/linux/pagemap.h | 3 +
mm/filemap.c | 4 +
mm/truncate.c | 3 +-
virt/kvm/Kconfig | 8 +
virt/kvm/guest_memfd.c | 230 ++++++++++++++++++++++++-----
10 files changed, 277 insertions(+), 34 deletions(-)

--
2.43.0



2024-04-04 18:51:27

by Paolo Bonzini

Subject: [PATCH 05/11] KVM: guest_memfd: limit overzealous WARN

Because kvm_gmem_get_pfn() is called from the page fault path without
any of the slots_lock, filemap lock or mmu_lock taken, it is
possible for it to race with kvm_gmem_unbind(). This is not a
problem, as any PTE that is installed temporarily will be zapped
before the guest has a chance to run.

However, it is not possible to have a complete unbind+bind
racing with the page fault, because deleting the memslot
will call synchronize_srcu_expedited() and wait for the
page fault to be resolved. Thus, we can still warn if
the file is there and is not the one we expect.

Signed-off-by: Paolo Bonzini <[email protected]>
---
virt/kvm/guest_memfd.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 409cf9b51313..e5b3cd02b651 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -499,7 +499,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,

gmem = file->private_data;

- if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) {
+ if (xa_load(&gmem->bindings, index) != slot) {
+ WARN_ON_ONCE(xa_load(&gmem->bindings, index));
r = -EIO;
goto out_fput;
}
--
2.43.0



2024-04-04 18:51:39

by Paolo Bonzini

Subject: [PATCH 02/11] KVM: guest_memfd: Use AS_INACCESSIBLE when creating guest_memfd inode

From: Michael Roth <[email protected]>

truncate_inode_pages_range() may attempt to zero pages before truncating
them, and this will occur before arch-specific invalidations can be
triggered via the .invalidate_folio/.free_folio hooks in kvm_gmem_aops. For
AMD SEV-SNP this would result in an RMP #PF being generated by the
hardware, which is currently treated as fatal (and even if specifically
allowed for, would not result in anything other than garbage being
written to guest pages due to encryption). On Intel TDX this would also
result in undesirable behavior.

Set the AS_INACCESSIBLE flag to prevent the MM from attempting
unexpected accesses of this sort during operations like truncation.

This may also in some cases yield a decent performance improvement for
guest_memfd userspace implementations that hole-punch ranges immediately
after private->shared conversions via KVM_SET_MEMORY_ATTRIBUTES, since
the current implementation of truncate_inode_pages_range() always ends
up zeroing an entire 4K range if it is backed by a 2M folio.

Link: https://lore.kernel.org/lkml/[email protected]/
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
virt/kvm/guest_memfd.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 0f4e0cf4f158..5a929536ecf2 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -357,6 +357,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
inode->i_private = (void *)(unsigned long)flags;
inode->i_op = &kvm_gmem_iops;
inode->i_mapping->a_ops = &kvm_gmem_aops;
+ inode->i_mapping->flags |= AS_INACCESSIBLE;
inode->i_mode |= S_IFREG;
inode->i_size = size;
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
--
2.43.0



2024-04-04 18:52:34

by Paolo Bonzini

Subject: [PATCH 10/11] KVM: guest_memfd: Add hook for invalidating memory

From: Michael Roth <[email protected]>

In some cases, like with SEV-SNP, guest memory needs to be updated in a
platform-specific manner before it can be safely freed back to the host.
Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
allow for special handling of this sort when freeing memory in response
to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode. Go
ahead and define an arch-specific hook for x86, since it will be needed
for handling memory used for SEV-SNP guests.

Signed-off-by: Michael Roth <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 7 +++++++
include/linux/kvm_host.h | 4 ++++
virt/kvm/Kconfig | 4 ++++
virt/kvm/guest_memfd.c | 14 ++++++++++++++
6 files changed, 31 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index d26fcad13e36..c81990937ab4 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -140,6 +140,7 @@ KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL(get_untagged_addr)
KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
+KVM_X86_OP_OPTIONAL(gmem_invalidate)

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f101fab0040e..59c7b95034fc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1810,6 +1810,7 @@ struct kvm_x86_ops {
gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
+ void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 972524ddcfdb..83b8260443a3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13605,6 +13605,13 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
}
#endif

+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
+{
+ static_call_cond(kvm_x86_gmem_invalidate)(start, end);
+}
+#endif
+
int kvm_spec_ctrl_test_value(u64 value)
{
/*
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 97d57ec59789..ab439706ea2f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2476,4 +2476,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages
void __user *src, int order, void *opaque),
void *opaque);

+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
+#endif
+
#endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index ca870157b2ed..754c6c923427 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -113,3 +113,7 @@ config KVM_GENERIC_PRIVATE_MEM
config HAVE_KVM_GMEM_PREPARE
bool
depends on KVM_PRIVATE_MEM
+
+config HAVE_KVM_GMEM_INVALIDATE
+ bool
+ depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index e7de97382a67..d6b92d9b935a 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -360,10 +360,24 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
return MF_DELAYED;
}

+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+static void kvm_gmem_free_folio(struct folio *folio)
+{
+ struct page *page = folio_page(folio, 0);
+ kvm_pfn_t pfn = page_to_pfn(page);
+ int order = folio_order(folio);
+
+ kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
+}
+#endif
+
static const struct address_space_operations kvm_gmem_aops = {
.dirty_folio = noop_dirty_folio,
.migrate_folio = kvm_gmem_migrate_folio,
.error_remove_folio = kvm_gmem_error_folio,
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+ .free_folio = kvm_gmem_free_folio,
+#endif
};

static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
--
2.43.0



2024-04-04 18:53:22

by Paolo Bonzini

Subject: [PATCH 08/11] KVM: guest_memfd: extract __kvm_gmem_punch_hole()

Extract a version of kvm_gmem_punch_hole() that expects the
caller to take the filemap invalidate_lock.

Signed-off-by: Paolo Bonzini <[email protected]>
---
virt/kvm/guest_memfd.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index a537a7e63ab5..51c99667690a 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -152,19 +152,12 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
}
}

-static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
+static long __kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
{
struct list_head *gmem_list = &inode->i_mapping->i_private_list;
pgoff_t start = offset >> PAGE_SHIFT;
pgoff_t end = (offset + len) >> PAGE_SHIFT;
struct kvm_gmem *gmem;
-
- /*
- * Bindings must be stable across invalidation to ensure the start+end
- * are balanced.
- */
- filemap_invalidate_lock(inode->i_mapping);
-
list_for_each_entry(gmem, gmem_list, entry)
kvm_gmem_invalidate_begin(gmem, start, end);

@@ -173,11 +166,23 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
list_for_each_entry(gmem, gmem_list, entry)
kvm_gmem_invalidate_end(gmem, start, end);

- filemap_invalidate_unlock(inode->i_mapping);
-
return 0;
}

+static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
+{
+ int r;
+
+ /*
+ * Bindings must be stable across invalidation to ensure the start+end
+ * are balanced.
+ */
+ filemap_invalidate_lock(inode->i_mapping);
+ r = __kvm_gmem_punch_hole(inode, offset, len);
+ filemap_invalidate_unlock(inode->i_mapping);
+ return r;
+}
+
static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
{
struct address_space *mapping = inode->i_mapping;
--
2.43.0



2024-04-04 18:53:32

by Paolo Bonzini

Subject: [PATCH 03/11] KVM: guest_memfd: pass error up from filemap_grab_folio

Some SNP ioctls will require the page not to be in the pagecache, and as such they
will want to return EEXIST to userspace. Start by passing the error up from
filemap_grab_folio.
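[Editor's illustration, not part of the patch: the error convention being adopted here can be modeled in userspace. The ERR_PTR/PTR_ERR/IS_ERR_OR_NULL helpers below are minimal stand-ins for the kernel's include/linux/err.h, and folio_errno() is a made-up name showing the caller-side pattern from kvm_gmem_allocate() once errors are passed up instead of being flattened to NULL/-ENOMEM.]

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-ins for the kernel's ERR_PTR machinery. */
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR_OR_NULL(const void *p)
{
	/* Errno-encoded pointers live in the top 4095 bytes of the address space. */
	return !p || (uintptr_t)p >= (uintptr_t)-4095;
}

/* Caller-side pattern: extract the specific error, e.g. -EEXIST vs -ENOMEM. */
static long folio_errno(void *folio)
{
	if (IS_ERR_OR_NULL(folio))
		return folio ? PTR_ERR(folio) : -ENOMEM;
	return 0;	/* a real folio: no error */
}
```

With the old behavior every failure collapsed to -ENOMEM; with the new one an -EEXIST from the pagecache survives to the caller.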

Signed-off-by: Paolo Bonzini <[email protected]>
---
virt/kvm/guest_memfd.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 5a929536ecf2..409cf9b51313 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -20,7 +20,7 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
/* TODO: Support huge pages. */
folio = filemap_grab_folio(inode->i_mapping, index);
if (IS_ERR_OR_NULL(folio))
- return NULL;
+ return folio;

/*
* Use the up-to-date flag to track whether or not the memory has been
@@ -146,8 +146,8 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
}

folio = kvm_gmem_get_folio(inode, index);
- if (!folio) {
- r = -ENOMEM;
+ if (IS_ERR_OR_NULL(folio)) {
+ r = folio ? PTR_ERR(folio) : -ENOMEM;
break;
}

--
2.43.0



2024-04-04 18:53:55

by Paolo Bonzini

Subject: [PATCH 04/11] filemap: add FGP_CREAT_ONLY

KVM would like to add an ioctl to encrypt and install a page into private
memory (i.e. into a guest_memfd), in preparation for launching an
encrypted guest.

This API should be used only once per page (unless there are failures),
so we want to rule out the possibility of operating on a page that is
already in the guest_memfd's filemap. Overwriting the page is almost
certainly a sign of a bug, so we might as well forbid it.

Therefore, introduce a new flag for __filemap_get_folio (to be passed
together with FGP_CREAT) that allows *adding* a new page to the filemap
but not returning an existing one.

An alternative possibility would be to force KVM users to initialize
the whole filemap in one go, but that is complicated by the fact that
the filemap includes pages of different kinds, including some that are
per-vCPU rather than per-VM. Basically the result would be closer to
a system call that multiplexes multiple ioctls than to something
cleaner like readv/writev.

Races between callers that pass FGP_CREAT_ONLY are uninteresting to
the filemap code: one of the racers wins and one fails with EEXIST,
similar to calling open(2) with O_CREAT|O_EXCL. It doesn't matter to
filemap.c if the missing synchronization is in the kernel or in userspace,
and in fact it could even be intentional. (In the case of KVM it turns
out that a mutex is taken around these calls for unrelated reasons,
so there can be no races.)
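[Editor's illustration, not part of the patch: the open(2) analogy can be made concrete. The helper name below is made up; it demonstrates in plain userspace the O_CREAT|O_EXCL semantics that FGP_CREAT_ONLY mirrors: the first creator wins, a racing second creator fails with EEXIST.]

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * create_exclusive(): create @path only if it does not already exist,
 * mirroring the FGP_CREAT_ONLY contract.  Returns a new fd on success,
 * or a negative errno (-EEXIST if another creator got there first).
 */
static int create_exclusive(const char *path)
{
	int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);

	if (fd < 0)
		return -errno;
	return fd;
}
```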

Cc: Matthew Wilcox <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
include/linux/pagemap.h | 2 ++
mm/filemap.c | 4 ++++
2 files changed, 6 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f879c1d54da7..a8c0685e8c08 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -587,6 +587,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
* * %FGP_CREAT - If no folio is present then a new folio is allocated,
* added to the page cache and the VM's LRU list. The folio is
* returned locked.
+ * * %FGP_CREAT_ONLY - Fail if a folio is present
* * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
* folio is already in cache. If the folio was allocated, unlock it
* before returning so the caller can do the same dance.
@@ -607,6 +608,7 @@ typedef unsigned int __bitwise fgf_t;
#define FGP_NOWAIT ((__force fgf_t)0x00000020)
#define FGP_FOR_MMAP ((__force fgf_t)0x00000040)
#define FGP_STABLE ((__force fgf_t)0x00000080)
+#define FGP_CREAT_ONLY ((__force fgf_t)0x00000100)
#define FGF_GET_ORDER(fgf) (((__force unsigned)fgf) >> 26) /* top 6 bits */

#define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
diff --git a/mm/filemap.c b/mm/filemap.c
index 7437b2bd75c1..e7440e189ebd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1863,6 +1863,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
folio = NULL;
if (!folio)
goto no_page;
+ if (fgp_flags & FGP_CREAT_ONLY) {
+ folio_put(folio);
+ return ERR_PTR(-EEXIST);
+ }

if (fgp_flags & FGP_LOCK) {
if (fgp_flags & FGP_NOWAIT) {
--
2.43.0



2024-04-04 18:53:55

by Paolo Bonzini

Subject: [PATCH 11/11] KVM: x86: Add gmem hook for determining max NPT mapping level

From: Michael Roth <[email protected]>

In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
2MB mapping in the guest's nested page table depends on whether or not
any subpages within the range have already been initialized as private
in the RMP table. The existing mixed-attribute tracking in KVM is
insufficient here, for instance:

- gmem allocates 2MB page
- guest issues PVALIDATE on 2MB page
- guest later converts a subpage to shared
- SNP host code issues PSMASH to split 2MB RMP mapping to 4K
- KVM MMU splits NPT mapping to 4K

At this point there are no mixed attributes, and KVM would normally
allow for 2MB NPT mappings again, but this is actually not allowed
because the RMP table mappings are 4K and cannot be promoted on the
hypervisor side, so the NPT mappings must still be limited to 4K to
match this.

Add a hook to determine the max NPT mapping size in situations like
this.
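[Editor's illustration, not part of the patch: a toy userspace model of the constraint the hook enforces. All names are hypothetical; the point is that once a 2MB RMP entry has been PSMASHed to 4K, the NPT mapping level must stay clamped to 4K even when KVM's mixed-attribute tracking would allow 2MB again, because RMP entries cannot be promoted back.]

```c
#include <assert.h>
#include <stdbool.h>

#define PG_LEVEL_4K 1
#define PG_LEVEL_2M 2

struct rmp_2m_region {
	bool smashed;	/* 2MB RMP entry was split to 4K by PSMASH */
};

/* Model of a gmem_validate_fault-style hook clamping fault->max_level. */
static void clamp_max_level(const struct rmp_2m_region *rmp, int *max_level)
{
	if (rmp->smashed && *max_level > PG_LEVEL_4K)
		*max_level = PG_LEVEL_4K;
}
```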

Signed-off-by: Michael Roth <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 8 ++++++++
3 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index c81990937ab4..2db87a6fd52a 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -140,6 +140,7 @@ KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL(get_untagged_addr)
KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
+KVM_X86_OP_OPTIONAL_RET0(gmem_validate_fault)
KVM_X86_OP_OPTIONAL(gmem_invalidate)

#undef KVM_X86_OP
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 59c7b95034fc..67dc108dd366 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1811,6 +1811,8 @@ struct kvm_x86_ops {
void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
+ int (*gmem_validate_fault)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, bool is_private,
+ u8 *max_level);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 992e651540e8..13dd367b8af1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4338,6 +4338,14 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
fault->max_level);
fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);

+ r = static_call(kvm_x86_gmem_validate_fault)(vcpu->kvm, fault->pfn,
+ fault->gfn, fault->is_private,
+ &fault->max_level);
+ if (r) {
+ kvm_release_pfn_clean(fault->pfn);
+ return r;
+ }
+
return RET_PF_CONTINUE;
}

--
2.43.0


2024-04-04 19:13:18

by Paolo Bonzini

Subject: [PATCH 07/11] KVM: guest_memfd: extract __kvm_gmem_get_pfn()

In preparation for adding a function that walks a set of pages
provided by userspace and populates them in a guest_memfd,
add a version of kvm_gmem_get_pfn() that has a "bool prepare"
argument and passes it down to kvm_gmem_get_folio().

Populating guest memory has to call __kvm_gmem_get_pfn() repeatedly
on the same file, so make the new function take a struct file *.
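[Editor's illustration, not part of the patch: the refactoring pattern is generic. In this userspace model (all names hypothetical), the inner helper borrows a reference supplied by the caller, so a bulk caller pays for one get/put pair across the whole walk instead of one per page, while the single-page wrapper keeps the old behavior.]

```c
#include <assert.h>
#include <stddef.h>

struct file_model { int refcount; };

static void get_file(struct file_model *f) { f->refcount++; }
static void put_file(struct file_model *f) { f->refcount--; }

/* Inner helper: borrows the caller's reference, takes no new one. */
static int get_pfn_locked(struct file_model *f, long idx)
{
	(void)f; (void)idx;
	return 0;
}

/* Single-page wrapper: one get/put per call, like kvm_gmem_get_pfn(). */
static int get_pfn(struct file_model *f, long idx)
{
	int r;

	get_file(f);
	r = get_pfn_locked(f, idx);
	put_file(f);
	return r;
}

/* Bulk caller: one reference held across the whole walk. */
static int get_pfn_range(struct file_model *f, long npages)
{
	long i;

	get_file(f);
	for (i = 0; i < npages; i++)
		get_pfn_locked(f, i);
	put_file(f);
	return 0;
}
```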

Signed-off-by: Paolo Bonzini <[email protected]>
---
virt/kvm/guest_memfd.c | 38 +++++++++++++++++++++++---------------
1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 486748e65f36..a537a7e63ab5 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -540,33 +540,29 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
fput(file);
}

-int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
- gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prepare)
{
pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
- struct kvm_gmem *gmem;
+ struct kvm_gmem *gmem = file->private_data;
struct folio *folio;
struct page *page;
- struct file *file;
int r;

- file = kvm_gmem_get_file(slot);
- if (!file)
+ if (file != slot->gmem.file) {
+ WARN_ON_ONCE(slot->gmem.file);
return -EFAULT;
+ }

gmem = file->private_data;
-
if (xa_load(&gmem->bindings, index) != slot) {
WARN_ON_ONCE(xa_load(&gmem->bindings, index));
- r = -EIO;
- goto out_fput;
+ return -EIO;
}

folio = kvm_gmem_get_folio(file_inode(file), index, true);
- if (!folio) {
- r = -ENOMEM;
- goto out_fput;
- }
+ if (!folio)
+ return -ENOMEM;

if (folio_test_hwpoison(folio)) {
r = -EHWPOISON;
@@ -583,9 +579,21 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,

out_unlock:
folio_unlock(folio);
-out_fput:
- fput(file);

return r;
}
+
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+{
+ struct file *file = kvm_gmem_get_file(slot);
+ int r;
+
+ if (!file)
+ return -EFAULT;
+
+ r = __kvm_gmem_get_pfn(file, slot, gfn, pfn, max_order, true);
+ fput(file);
+ return r;
+}
EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
--
2.43.0



2024-04-04 19:13:26

by Paolo Bonzini

Subject: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

During guest run-time, kvm_arch_gmem_prepare() is issued as needed to
prepare newly-allocated gmem pages prior to mapping them into the guest.
In the case of SEV-SNP, this mainly involves setting the pages to
private in the RMP table.

However, for the GPA ranges comprising the initial guest payload, which
are encrypted/measured prior to starting the guest, the gmem pages need
to be accessed prior to setting them to private in the RMP table so they
can be initialized with the userspace-provided data. Additionally, an
SNP firmware call is needed afterward to encrypt them in-place and
measure the contents into the guest's launch digest.

While it is possible to bypass the kvm_arch_gmem_prepare() hooks so that
this handling can be done in an open-coded/vendor-specific manner, this
may expose more gmem-internal state/dependencies to external callers
than necessary. Try to avoid this by implementing an interface that
handles as much of the common functionality inside gmem as possible,
while also making it generic enough to potentially be usable/extensible
for TDX as well.
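[Editor's illustration, not part of the patch: a userspace model of the loop shape the interface uses, with hypothetical names. It walks the range page by page, hands each page to a caller-supplied callback, rolls the page back on failure (modeling the punch-hole that lets FGP_CREAT_ONLY succeed on retry), and returns the number of pages populated, or the error if nothing was, matching the patch's `ret && !i ? ret : i` convention.]

```c
#include <assert.h>
#include <stddef.h>

typedef int (*post_populate_fn)(long idx, void *opaque);

static long populate_range(long npages, post_populate_fn cb, void *opaque,
			   long *rolled_back)
{
	long i;
	int ret = 0;

	for (i = 0; i < npages; i++) {
		ret = cb(i, opaque);
		if (ret) {
			(*rolled_back)++;	/* model of the punch-hole rollback */
			break;
		}
	}
	/* Error only if no page at all was populated. */
	return ret && !i ? ret : i;
}

/* Sample callback: succeed for the first *(long *)opaque pages, then fail. */
static int fail_after(long idx, void *opaque)
{
	return idx < *(long *)opaque ? 0 : -5;
}
```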

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Co-developed-by: Michael Roth <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
include/linux/kvm_host.h | 26 ++++++++++++++
virt/kvm/guest_memfd.c | 78 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 104 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 33ed3b884a6b..97d57ec59789 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2450,4 +2450,30 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
bool kvm_arch_gmem_prepare_needed(struct kvm *kvm);
#endif

+/**
+ * kvm_gmem_populate() - Populate/prepare a GPA range with guest data
+ *
+ * @kvm: KVM instance
+ * @gfn: starting GFN to be populated
+ * @src: userspace-provided buffer containing data to copy into GFN range
+ * (passed to @post_populate, and incremented on each iteration
+ * if not NULL)
+ * @npages: number of pages to copy from userspace-buffer
+ * @post_populate: callback to issue for each gmem page that backs the GPA
+ * range
+ * @opaque: opaque data to pass to @post_populate callback
+ *
+ * This is primarily intended for cases where a gmem-backed GPA range needs
+ * to be initialized with userspace-provided data prior to being mapped into
+ * the guest as a private page. This should be called with the slots->lock
+ * held so that caller-enforced invariants regarding the expected memory
+ * attributes of the GPA range do not race with KVM_SET_MEMORY_ATTRIBUTES.
+ *
+ * Returns the number of pages that were populated.
+ */
+long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
+ int (*post_populate)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
+ void __user *src, int order, void *opaque),
+ void *opaque);
+
#endif
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 51c99667690a..e7de97382a67 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -602,3 +602,81 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
return r;
}
EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
+
+static int kvm_gmem_undo_get_pfn(struct file *file, struct kvm_memory_slot *slot,
+ gfn_t gfn, int order)
+{
+ pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
+ struct kvm_gmem *gmem = file->private_data;
+
+ /*
+ * Races with kvm_gmem_unbind() must have been detected by
+ * __kvm_gmem_get_pfn(), because the invalidate_lock is
+ * taken between __kvm_gmem_get_pfn() and kvm_gmem_undo_get_pfn().
+ */
+ if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot))
+ return -EIO;
+
+ return __kvm_gmem_punch_hole(file_inode(file), index << PAGE_SHIFT, PAGE_SIZE << order);
+}
+
+long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
+ int (*post_populate)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
+ void __user *src, int order, void *opaque),
+ void *opaque)
+{
+ struct file *file;
+ struct kvm_memory_slot *slot;
+
+ int ret = 0, max_order;
+ long i;
+
+ lockdep_assert_held(&kvm->slots_lock);
+ if (npages < 0)
+ return -EINVAL;
+
+ slot = gfn_to_memslot(kvm, gfn);
+ if (!kvm_slot_can_be_private(slot))
+ return -EINVAL;
+
+ file = kvm_gmem_get_file(slot);
+ if (!file)
+ return -EFAULT;
+
+ filemap_invalidate_lock(file->f_mapping);
+
+ npages = min_t(ulong, slot->npages - (gfn - slot->base_gfn), npages);
+ for (i = 0; i < npages; i += (1 << max_order)) {
+ gfn_t this_gfn = gfn + i;
+ kvm_pfn_t pfn;
+
+ ret = __kvm_gmem_get_pfn(file, slot, this_gfn, &pfn, &max_order, false);
+ if (ret)
+ break;
+
+ if (!IS_ALIGNED(this_gfn, (1 << max_order)) ||
+ (npages - i) < (1 << max_order))
+ max_order = 0;
+
+ if (post_populate) {
+ void __user *p = src ? src + i * PAGE_SIZE : NULL;
+ ret = post_populate(kvm, this_gfn, pfn, p, max_order, opaque);
+ }
+
+ put_page(pfn_to_page(pfn));
+ if (ret) {
+ /*
+ * Punch a hole so that FGP_CREAT_ONLY can succeed
+ * again.
+ */
+ kvm_gmem_undo_get_pfn(file, slot, this_gfn, max_order);
+ break;
+ }
+ }
+
+ filemap_invalidate_unlock(file->f_mapping);
+
+ fput(file);
+ return ret && !i ? ret : i;
+}
+EXPORT_SYMBOL_GPL(kvm_gmem_populate);
--
2.43.0



2024-04-04 19:19:36

by Paolo Bonzini

Subject: [PATCH 01/11] mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory

From: Michael Roth <[email protected]>

filemap users like guest_memfd may use page cache pages to
allocate/manage memory that is only intended to be accessed by guests
via hardware protections like encryption. Writes to memory of this sort
in common paths like truncation may cause unexpected behavior, such as
writing garbage instead of zeros when attempting to zero pages, or
worse, triggering hardware protections that are considered fatal as far
as the kernel is concerned.

Introduce a new address_space flag, AS_INACCESSIBLE, and use it
initially to prevent zeroing of pages during truncation, with the
understanding that it is up to the owner of the mapping to handle this
specially if needed.

Link: https://lore.kernel.org/lkml/[email protected]/
Cc: Matthew Wilcox <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
include/linux/pagemap.h | 1 +
mm/truncate.c | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2df35e65557d..f879c1d54da7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -207,6 +207,7 @@ enum mapping_flags {
AS_STABLE_WRITES, /* must wait for writeback before modifying
folio contents */
AS_UNMOVABLE, /* The mapping cannot be moved, ever */
+ AS_INACCESSIBLE, /* Do not attempt direct R/W access to the mapping */
};

/**
diff --git a/mm/truncate.c b/mm/truncate.c
index 725b150e47ac..c501338c7ebd 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -233,7 +233,8 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
* doing a complex calculation here, and then doing the zeroing
* anyway if the page split fails.
*/
- folio_zero_range(folio, offset, length);
+ if (!(folio->mapping->flags & AS_INACCESSIBLE))
+ folio_zero_range(folio, offset, length);

if (folio_has_private(folio))
folio_invalidate(folio, offset, length);
--
2.43.0



2024-04-04 19:19:44

by Paolo Bonzini

Subject: [PATCH 06/11] KVM: guest_memfd: Add hook for initializing memory

guest_memfd pages are generally expected to be in some arch-defined
initial state prior to using them for guest memory. For SEV-SNP this
initial state is 'private', or 'guest-owned', and requires additional
operations to move the pages into that state by updating the
corresponding entries in the RMP table.

Allow for an arch-defined hook to handle updates of this sort, and go
ahead and implement one for x86 so KVM implementations like AMD SVM can
register a kvm_x86_ops callback to handle these updates for SEV-SNP
guests.

The preparation callback is always called when allocating/grabbing
folios via gmem, and it is up to the architecture to keep track of
whether or not the pages are already in the expected state (e.g. the RMP
table in the case of SEV-SNP).

In some cases, it is necessary to defer the preparation of the pages to
handle things like in-place encryption of initial guest memory payloads
before marking these pages as 'private'/'guest-owned'. Add an argument
(always true for now) to kvm_gmem_get_folio() that allows for the
preparation callback to be bypassed. To detect possible issues in
the way userspace initializes memory, it is only possible to add an
unprepared page if it is not already included in the filemap.
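[Editor's illustration, not part of the patch: a userspace model of the prepare-on-first-use logic, with hypothetical names. The real code keys off the folio's up-to-date flag: the arch prepare hook runs only the first time a folio is grabbed, and a populate-style caller can pass prepare=false to take the page unprepared and prepare it itself after copying in the initial payload.]

```c
#include <assert.h>
#include <stdbool.h>

struct folio_model {
	bool uptodate;	/* set once the page has been prepared */
};

static int prepare_calls;

static void arch_prepare(struct folio_model *f)
{
	prepare_calls++;
	f->uptodate = true;
}

/* Model of kvm_gmem_get_folio(inode, index, prepare). */
static struct folio_model *get_folio(struct folio_model *f, bool prepare)
{
	if (prepare && !f->uptodate)
		arch_prepare(f);	/* first use only */
	return f;
}
```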

Link: https://lore.kernel.org/lkml/[email protected]/
Co-developed-by: Michael Roth <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 6 +++
include/linux/kvm_host.h | 5 +++
virt/kvm/Kconfig | 4 ++
virt/kvm/guest_memfd.c | 65 ++++++++++++++++++++++++++++--
6 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 5187fcf4b610..d26fcad13e36 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -139,6 +139,7 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL(get_untagged_addr)
KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
+KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 01c69840647e..f101fab0040e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1809,6 +1809,7 @@ struct kvm_x86_ops {

gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
+ int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2d2619d3eee4..972524ddcfdb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13598,6 +13598,12 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_arch_no_poll);

+#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order)
+{
+ return static_call(kvm_x86_gmem_prepare)(kvm, pfn, gfn, max_order);
+}
+#endif

int kvm_spec_ctrl_test_value(u64 value)
{
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 48f31dcd318a..33ed3b884a6b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2445,4 +2445,9 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
}
#endif /* CONFIG_KVM_PRIVATE_MEM */

+#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
+bool kvm_arch_gmem_prepare_needed(struct kvm *kvm);
+#endif
+
#endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 29b73eedfe74..ca870157b2ed 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -109,3 +109,7 @@ config KVM_GENERIC_PRIVATE_MEM
select KVM_GENERIC_MEMORY_ATTRIBUTES
select KVM_PRIVATE_MEM
bool
+
+config HAVE_KVM_GMEM_PREPARE
+ bool
+ depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index e5b3cd02b651..486748e65f36 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -13,12 +13,60 @@ struct kvm_gmem {
struct list_head entry;
};

-static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
+#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+bool __weak kvm_arch_gmem_prepare_needed(struct kvm *kvm)
+{
+ return false;
+}
+#endif
+
+static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct folio *folio)
+{
+#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+ struct list_head *gmem_list = &inode->i_mapping->i_private_list;
+ struct kvm_gmem *gmem;
+
+ list_for_each_entry(gmem, gmem_list, entry) {
+ struct kvm_memory_slot *slot;
+ struct kvm *kvm = gmem->kvm;
+ struct page *page;
+ kvm_pfn_t pfn;
+ gfn_t gfn;
+ int rc;
+
+ if (!kvm_arch_gmem_prepare_needed(kvm))
+ continue;
+
+ slot = xa_load(&gmem->bindings, index);
+ if (!slot)
+ continue;
+
+ page = folio_file_page(folio, index);
+ pfn = page_to_pfn(page);
+ gfn = slot->base_gfn + index - slot->gmem.pgoff;
+ rc = kvm_arch_gmem_prepare(kvm, gfn, pfn, compound_order(compound_head(page)));
+ if (rc) {
+ pr_warn_ratelimited("gmem: Failed to prepare folio for index %lx, error %d.\n",
+ index, rc);
+ return rc;
+ }
+ }
+
+#endif
+ return 0;
+}
+
+static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare)
{
struct folio *folio;
+ fgf_t fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
+
+ if (!prepare)
+ fgp_flags |= FGP_CREAT_ONLY;

/* TODO: Support huge pages. */
- folio = filemap_grab_folio(inode->i_mapping, index);
+ folio = __filemap_get_folio(inode->i_mapping, index, fgp_flags,
+ mapping_gfp_mask(inode->i_mapping));
if (IS_ERR_OR_NULL(folio))
return folio;

@@ -41,6 +89,15 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
folio_mark_uptodate(folio);
}

+ if (prepare) {
+ int r = kvm_gmem_prepare_folio(inode, index, folio);
+ if (r < 0) {
+ folio_unlock(folio);
+ folio_put(folio);
+ return ERR_PTR(r);
+ }
+ }
+
/*
* Ignore accessed, referenced, and dirty flags. The memory is
* unevictable and there is no storage to write back to.
@@ -145,7 +202,7 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
break;
}

- folio = kvm_gmem_get_folio(inode, index);
+ folio = kvm_gmem_get_folio(inode, index, true);
if (IS_ERR_OR_NULL(folio)) {
r = folio ? PTR_ERR(folio) : -ENOMEM;
break;
@@ -505,7 +562,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
goto out_fput;
}

- folio = kvm_gmem_get_folio(file_inode(file), index);
+ folio = kvm_gmem_get_folio(file_inode(file), index, true);
if (!folio) {
r = -ENOMEM;
goto out_fput;
--
2.43.0



2024-04-09 23:47:28

by Michael Roth

[permalink] [raw]
Subject: Re: [PATCH 11/11] KVM: x86: Add gmem hook for determining max NPT mapping level

On Thu, Apr 04, 2024 at 02:50:33PM -0400, Paolo Bonzini wrote:
> From: Michael Roth <[email protected]>
>
> In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
> 2MB mapping in the guest's nested page table depends on whether or not
> any subpages within the range have already been initialized as private
> in the RMP table. The existing mixed-attribute tracking in KVM is
> insufficient here, for instance:
>
> - gmem allocates 2MB page
> - guest issues PVALIDATE on 2MB page
> - guest later converts a subpage to shared
> - SNP host code issues PSMASH to split 2MB RMP mapping to 4K
> - KVM MMU splits NPT mapping to 4K

Binbin caught that I'd neglected to document the last step in the
theoretical sequence here. It should state something to the effect
of:

- guest later converts that shared page back to private

-Mike

>
> At this point there are no mixed attributes, and KVM would normally
> allow for 2MB NPT mappings again, but this is actually not allowed
> because the RMP table mappings are 4K and cannot be promoted on the
> hypervisor side, so the NPT mappings must still be limited to 4K to
> match this.
>
> Add a hook to determine the max NPT mapping size in situations like
> this.
>
> Signed-off-by: Michael Roth <[email protected]>
> Message-Id: <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 8 ++++++++
> 3 files changed, 11 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index c81990937ab4..2db87a6fd52a 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -140,6 +140,7 @@ KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> KVM_X86_OP_OPTIONAL(get_untagged_addr)
> KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
> KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
> +KVM_X86_OP_OPTIONAL_RET0(gmem_validate_fault)
> KVM_X86_OP_OPTIONAL(gmem_invalidate)
>
> #undef KVM_X86_OP
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 59c7b95034fc..67dc108dd366 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1811,6 +1811,8 @@ struct kvm_x86_ops {
> void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
> int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
> + int (*gmem_validate_fault)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, bool is_private,
> + u8 *max_level);
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 992e651540e8..13dd367b8af1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4338,6 +4338,14 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> fault->max_level);
> fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
>
> + r = static_call(kvm_x86_gmem_validate_fault)(vcpu->kvm, fault->pfn,
> + fault->gfn, fault->is_private,
> + &fault->max_level);
> + if (r) {
> + kvm_release_pfn_clean(fault->pfn);
> + return r;
> + }
> +
> return RET_PF_CONTINUE;
> }
>
> --
> 2.43.0
>

2024-04-09 23:47:58

by Michael Roth

[permalink] [raw]
Subject: Re: [PATCH 07/11] KVM: guest_memfd: extract __kvm_gmem_get_pfn()

On Thu, Apr 04, 2024 at 02:50:29PM -0400, Paolo Bonzini wrote:
> In preparation for adding a function that walks a set of pages
> provided by userspace and populates them in a guest_memfd,
> add a version of kvm_gmem_get_pfn() that has a "bool prepare"
> argument and passes it down to kvm_gmem_get_folio().
>
> Populating guest memory has to call repeatedly __kvm_gmem_get_pfn()
> on the same file, so make the new function take struct file*.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> virt/kvm/guest_memfd.c | 38 +++++++++++++++++++++++---------------
> 1 file changed, 23 insertions(+), 15 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 486748e65f36..a537a7e63ab5 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -540,33 +540,29 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
> fput(file);
> }
>
> -int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> - gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
> +static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prepare)
> {
> pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
> - struct kvm_gmem *gmem;
> + struct kvm_gmem *gmem = file->private_data;
> struct folio *folio;
> struct page *page;
> - struct file *file;
> int r;
>
> - file = kvm_gmem_get_file(slot);
> - if (!file)
> + if (file != slot->gmem.file) {
> + WARN_ON_ONCE(slot->gmem.file);
> return -EFAULT;
> + }
>
> gmem = file->private_data;
> -
> if (xa_load(&gmem->bindings, index) != slot) {
> WARN_ON_ONCE(xa_load(&gmem->bindings, index));
> - r = -EIO;
> - goto out_fput;
> + return -EIO;
> }
>
> folio = kvm_gmem_get_folio(file_inode(file), index, true);

This should be:

folio = kvm_gmem_get_folio(file_inode(file), index, prepare);

Otherwise:

Reviewed-by: Michael Roth <[email protected]>

-Mike

> - if (!folio) {
> - r = -ENOMEM;
> - goto out_fput;
> - }
> + if (!folio)
> + return -ENOMEM;
>
> if (folio_test_hwpoison(folio)) {
> r = -EHWPOISON;
> @@ -583,9 +579,21 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>
> out_unlock:
> folio_unlock(folio);
> -out_fput:
> - fput(file);
>
> return r;
> }
> +
> +int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
> +{
> + struct file *file = kvm_gmem_get_file(slot);
> + int r;
> +
> + if (!file)
> + return -EFAULT;
> +
> + r = __kvm_gmem_get_pfn(file, slot, gfn, pfn, max_order, true);
> + fput(file);
> + return r;
> +}
> EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
> --
> 2.43.0
>
>

2024-04-19 18:26:28

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH 11/11] KVM: x86: Add gmem hook for determining max NPT mapping level

On Tue, Apr 09, 2024 at 06:46:32PM -0500,
Michael Roth <[email protected]> wrote:

> On Thu, Apr 04, 2024 at 02:50:33PM -0400, Paolo Bonzini wrote:
> > From: Michael Roth <[email protected]>
> >
> > In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
> > 2MB mapping in the guest's nested page table depends on whether or not
> > any subpages within the range have already been initialized as private
> > in the RMP table. The existing mixed-attribute tracking in KVM is
> > insufficient here, for instance:
> >
> > - gmem allocates 2MB page
> > - guest issues PVALIDATE on 2MB page
> > - guest later converts a subpage to shared
> > - SNP host code issues PSMASH to split 2MB RMP mapping to 4K
> > - KVM MMU splits NPT mapping to 4K
>
> Binbin caught that I'd neglected to document the last step in the
> theoretical sequence here. It should state something to the effect
> of:
>
> - guest later converts that shared page back to private
>
> -Mike
>
> >
> > At this point there are no mixed attributes, and KVM would normally
> > allow for 2MB NPT mappings again, but this is actually not allowed
> > because the RMP table mappings are 4K and cannot be promoted on the
> > hypervisor side, so the NPT mappings must still be limited to 4K to
> > match this.
> >
> > Add a hook to determine the max NPT mapping size in situations like
> > this.
> >
> > Signed-off-by: Michael Roth <[email protected]>
> > Message-Id: <[email protected]>
> > Signed-off-by: Paolo Bonzini <[email protected]>
> > ---
> > arch/x86/include/asm/kvm-x86-ops.h | 1 +
> > arch/x86/include/asm/kvm_host.h | 2 ++
> > arch/x86/kvm/mmu/mmu.c | 8 ++++++++
> > 3 files changed, 11 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > index c81990937ab4..2db87a6fd52a 100644
> > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > @@ -140,6 +140,7 @@ KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> > KVM_X86_OP_OPTIONAL(get_untagged_addr)
> > KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
> > KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
> > +KVM_X86_OP_OPTIONAL_RET0(gmem_validate_fault)
> > KVM_X86_OP_OPTIONAL(gmem_invalidate)
> >
> > #undef KVM_X86_OP
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 59c7b95034fc..67dc108dd366 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1811,6 +1811,8 @@ struct kvm_x86_ops {
> > void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
> > int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> > void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
> > + int (*gmem_validate_fault)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, bool is_private,
> > + u8 *max_level);
> > };

I think you added is_private due to the TDX patches. As Yan pointed out at
https://lore.kernel.org/kvm/[email protected]/

It's guaranteed that is_private is always true because the caller checks it:

	if (fault->is_private)
		kvm_faultin_pfn_private()

So we can drop the is_private parameter.
--
Isaku Yamahata <[email protected]>

2024-04-22 16:06:54

by Xu Yilun

[permalink] [raw]
Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On Thu, Apr 04, 2024 at 02:50:31PM -0400, Paolo Bonzini wrote:
> During guest run-time, kvm_arch_gmem_prepare() is issued as needed to
> prepare newly-allocated gmem pages prior to mapping them into the guest.
> In the case of SEV-SNP, this mainly involves setting the pages to
> private in the RMP table.
>
> However, for the GPA ranges comprising the initial guest payload, which
> are encrypted/measured prior to starting the guest, the gmem pages need
> to be accessed prior to setting them to private in the RMP table so they
> can be initialized with the userspace-provided data. Additionally, an
> SNP firmware call is needed afterward to encrypt them in-place and
> measure the contents into the guest's launch digest.
>
> While it is possible to bypass the kvm_arch_gmem_prepare() hooks so that
> this handling can be done in an open-coded/vendor-specific manner, this
> may expose more gmem-internal state/dependencies to external callers
> than necessary. Try to avoid this by implementing an interface that
> tries to handle as much of the common functionality inside gmem as
> possible, while also making it generic enough to potentially be
> usable/extensible for TDX as well.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> Co-developed-by: Michael Roth <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> include/linux/kvm_host.h | 26 ++++++++++++++
> virt/kvm/guest_memfd.c | 78 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 104 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 33ed3b884a6b..97d57ec59789 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2450,4 +2450,30 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
> bool kvm_arch_gmem_prepare_needed(struct kvm *kvm);
> #endif
>
> +/**
> + * kvm_gmem_populate() - Populate/prepare a GPA range with guest data
> + *
> + * @kvm: KVM instance
> + * @gfn: starting GFN to be populated
> + * @src: userspace-provided buffer containing data to copy into GFN range
> + * (passed to @post_populate, and incremented on each iteration
> + * if not NULL)
> + * @npages: number of pages to copy from userspace-buffer
> + * @post_populate: callback to issue for each gmem page that backs the GPA
> + * range
> + * @opaque: opaque data to pass to @post_populate callback
> + *
> + * This is primarily intended for cases where a gmem-backed GPA range needs
> + * to be initialized with userspace-provided data prior to being mapped into
> + * the guest as a private page. This should be called with the slots->lock
> + * held so that caller-enforced invariants regarding the expected memory
> + * attributes of the GPA range do not race with KVM_SET_MEMORY_ATTRIBUTES.
> + *
> + * Returns the number of pages that were populated.
> + */
> +long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
> + int (*post_populate)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> + void __user *src, int order, void *opaque),
> + void *opaque);
> +
> #endif
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 51c99667690a..e7de97382a67 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -602,3 +602,81 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> return r;
> }
> EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
> +
> +static int kvm_gmem_undo_get_pfn(struct file *file, struct kvm_memory_slot *slot,
> + gfn_t gfn, int order)
> +{
> + pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
> + struct kvm_gmem *gmem = file->private_data;
> +
> + /*
> + * Races with kvm_gmem_unbind() must have been detected by
> + * __kvm_gmem_get_gfn(), because the invalidate_lock is
> + * taken between __kvm_gmem_get_gfn() and kvm_gmem_undo_get_pfn().
> + */
> + if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot))
> + return -EIO;
> +
> + return __kvm_gmem_punch_hole(file_inode(file), index << PAGE_SHIFT, PAGE_SIZE << order);
> +}
> +
> +long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
> + int (*post_populate)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> + void __user *src, int order, void *opaque),
> + void *opaque)
> +{
> + struct file *file;
> + struct kvm_memory_slot *slot;
> +
> + int ret = 0, max_order;
> + long i;
> +
> + lockdep_assert_held(&kvm->slots_lock);
> + if (npages < 0)
> + return -EINVAL;
> +
> + slot = gfn_to_memslot(kvm, gfn);
> + if (!kvm_slot_can_be_private(slot))
> + return -EINVAL;
> +
> + file = kvm_gmem_get_file(slot);
> + if (!file)
> + return -EFAULT;
> +
> + filemap_invalidate_lock(file->f_mapping);
> +
> + npages = min_t(ulong, slot->npages - (gfn - slot->base_gfn), npages);
> + for (i = 0; i < npages; i += (1 << max_order)) {
> + gfn_t this_gfn = gfn + i;
> + kvm_pfn_t pfn;
> +
> + ret = __kvm_gmem_get_pfn(file, slot, this_gfn, &pfn, &max_order, false);
> + if (ret)
> + break;
> +
> + if (!IS_ALIGNED(this_gfn, (1 << max_order)) ||
> + (npages - i) < (1 << max_order))
> + max_order = 0;
> +
> + if (post_populate) {
> + void __user *p = src ? src + i * PAGE_SIZE : NULL;
> + ret = post_populate(kvm, this_gfn, pfn, p, max_order, opaque);

I don't see the main difference between gmem_prepare() and post_populate()
from gmem's point of view. They are both vendor callbacks invoked after
__filemap_get_folio(). Would it be possible for gmem to call gmem_prepare()
or post_populate() outside __kvm_gmem_get_pfn()? Or even to pass all
parameters to a single gmem_prepare() and let the vendor code decide what
to do.

> + }
> +
> + put_page(pfn_to_page(pfn));
> + if (ret) {
> + /*
> + * Punch a hole so that FGP_CREAT_ONLY can succeed
> + * again.
> + */
> + kvm_gmem_undo_get_pfn(file, slot, this_gfn, max_order);
> + break;
> + }
> + }
> +
> + filemap_invalidate_unlock(file->f_mapping);
> +
> + fput(file);
> + return ret && !i ? ret : i;
> +}
> +EXPORT_SYMBOL_GPL(kvm_gmem_populate);
> --
> 2.43.0
>
>
>

2024-04-22 16:56:57

by Xu Yilun

[permalink] [raw]
Subject: Re: [PATCH 06/11] KVM: guest_memfd: Add hook for initializing memory

On Thu, Apr 04, 2024 at 02:50:28PM -0400, Paolo Bonzini wrote:
> guest_memfd pages are generally expected to be in some arch-defined
> initial state prior to using them for guest memory. For SEV-SNP this
> initial state is 'private', or 'guest-owned', and requires additional
> operations to move these pages into a 'private' state by updating the
> corresponding entries in the RMP table.
>
> Allow for an arch-defined hook to handle updates of this sort, and go
> ahead and implement one for x86 so KVM implementations like AMD SVM can
> register a kvm_x86_ops callback to handle these updates for SEV-SNP
> guests.
>
> The preparation callback is always called when allocating/grabbing
> folios via gmem, and it is up to the architecture to keep track of
> whether or not the pages are already in the expected state (e.g. the RMP
> table in the case of SEV-SNP).
>
> In some cases, it is necessary to defer the preparation of the pages to
> handle things like in-place encryption of initial guest memory payloads
> before marking these pages as 'private'/'guest-owned'. Add an argument
> (always true for now) to kvm_gmem_get_folio() that allows for the
> preparation callback to be bypassed. To detect possible issues in

IIUC, we have 2 dedicated flows.
1 kvm_gmem_get_pfn() or kvm_gmem_allocate()
a. kvm_gmem_get_folio()
b. gmem_prepare() for RMP

2 in-place encryption or whatever
a. kvm_gmem_get_folio(FGP_CREAT_ONLY)
b. in-place encryption
c. gmem_prepare() for RMP

Could we move gmem_prepare() out of kvm_gmem_get_folio()? Then we would
have a straightforward flow for each case, and would not need an
argument to postpone gmem_prepare().

> the way userspace initializes memory, it is only possible to add an
> unprepared page if it is not already included in the filemap.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Co-developed-by: Michael Roth <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> Message-Id: <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/x86.c | 6 +++
> include/linux/kvm_host.h | 5 +++
> virt/kvm/Kconfig | 4 ++
> virt/kvm/guest_memfd.c | 65 ++++++++++++++++++++++++++++--
> 6 files changed, 78 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 5187fcf4b610..d26fcad13e36 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -139,6 +139,7 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> KVM_X86_OP_OPTIONAL(get_untagged_addr)
> KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
> +KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
>
> #undef KVM_X86_OP
> #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 01c69840647e..f101fab0040e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1809,6 +1809,7 @@ struct kvm_x86_ops {
>
> gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
> void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
> + int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2d2619d3eee4..972524ddcfdb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13598,6 +13598,12 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
> }
> EXPORT_SYMBOL_GPL(kvm_arch_no_poll);
>
> +#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
> +int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order)
> +{
> + return static_call(kvm_x86_gmem_prepare)(kvm, pfn, gfn, max_order);
> +}
> +#endif
>
> int kvm_spec_ctrl_test_value(u64 value)
> {
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 48f31dcd318a..33ed3b884a6b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2445,4 +2445,9 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
> }
> #endif /* CONFIG_KVM_PRIVATE_MEM */
>
> +#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
> +int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
> +bool kvm_arch_gmem_prepare_needed(struct kvm *kvm);
> +#endif
> +
> #endif
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 29b73eedfe74..ca870157b2ed 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -109,3 +109,7 @@ config KVM_GENERIC_PRIVATE_MEM
> select KVM_GENERIC_MEMORY_ATTRIBUTES
> select KVM_PRIVATE_MEM
> bool
> +
> +config HAVE_KVM_GMEM_PREPARE
> + bool
> + depends on KVM_PRIVATE_MEM
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index e5b3cd02b651..486748e65f36 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -13,12 +13,60 @@ struct kvm_gmem {
> struct list_head entry;
> };
>
> -static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
> +#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
> +bool __weak kvm_arch_gmem_prepare_needed(struct kvm *kvm)
> +{
> + return false;
> +}
> +#endif

In which case is HAVE_KVM_GMEM_PREPARE selected but
kvm_arch_gmem_prepare_needed() never implemented? In that case all the
gmem_prepare code is actually dead code. Maybe we don't need this weak stub?

> +
> +static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct folio *folio)
> +{
> +#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
> + struct list_head *gmem_list = &inode->i_mapping->i_private_list;
> + struct kvm_gmem *gmem;
> +
> + list_for_each_entry(gmem, gmem_list, entry) {
> + struct kvm_memory_slot *slot;
> + struct kvm *kvm = gmem->kvm;
> + struct page *page;
> + kvm_pfn_t pfn;
> + gfn_t gfn;
> + int rc;
> +
> + if (!kvm_arch_gmem_prepare_needed(kvm))
> + continue;
> +
> + slot = xa_load(&gmem->bindings, index);
> + if (!slot)
> + continue;
> +
> + page = folio_file_page(folio, index);
> + pfn = page_to_pfn(page);
> + gfn = slot->base_gfn + index - slot->gmem.pgoff;
> + rc = kvm_arch_gmem_prepare(kvm, gfn, pfn, compound_order(compound_head(page)));
> + if (rc) {
> + pr_warn_ratelimited("gmem: Failed to prepare folio for index %lx, error %d.\n",
> + index, rc);
> + return rc;
> + }
> + }
> +
> +#endif
> + return 0;
> +}
> +
> +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare)
> {
> struct folio *folio;
> + fgf_t fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
> +
> + if (!prepare)
> + fgp_flags |= FGP_CREAT_ONLY;
>
> /* TODO: Support huge pages. */
> - folio = filemap_grab_folio(inode->i_mapping, index);
> + folio = __filemap_get_folio(inode->i_mapping, index, fgp_flags,
> + mapping_gfp_mask(inode->i_mapping));
> if (IS_ERR_OR_NULL(folio))
> return folio;
>
> @@ -41,6 +89,15 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
> folio_mark_uptodate(folio);
> }
>
> + if (prepare) {
> + int r = kvm_gmem_prepare_folio(inode, index, folio);
> + if (r < 0) {
> + folio_unlock(folio);
> + folio_put(folio);
> + return ERR_PTR(r);
> + }
> + }
> +

Do we still need to prepare the page if it is hwpoisoned? I see the
hwpoisoned check is outside, in kvm_gmem_get_pfn().

Thanks,
Yilun

> /*
> * Ignore accessed, referenced, and dirty flags. The memory is
> * unevictable and there is no storage to write back to.
> @@ -145,7 +202,7 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
> break;
> }
>
> - folio = kvm_gmem_get_folio(inode, index);
> + folio = kvm_gmem_get_folio(inode, index, true);
> if (IS_ERR_OR_NULL(folio)) {
> r = folio ? PTR_ERR(folio) : -ENOMEM;
> break;
> @@ -505,7 +562,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> goto out_fput;
> }
>
> - folio = kvm_gmem_get_folio(file_inode(file), index);
> + folio = kvm_gmem_get_folio(file_inode(file), index, true);
> if (!folio) {
> r = -ENOMEM;
> goto out_fput;
> --
> 2.43.0
>
>
>

2024-04-22 20:02:12

by Xu Yilun

[permalink] [raw]
Subject: Re: [PATCH 11/11] KVM: x86: Add gmem hook for determining max NPT mapping level

On Thu, Apr 04, 2024 at 02:50:33PM -0400, Paolo Bonzini wrote:
> From: Michael Roth <[email protected]>
>
> In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
> 2MB mapping in the guest's nested page table depends on whether or not
> any subpages within the range have already been initialized as private
> in the RMP table. The existing mixed-attribute tracking in KVM is
> insufficient here, for instance:
>
> - gmem allocates 2MB page
> - guest issues PVALIDATE on 2MB page
> - guest later converts a subpage to shared
> - SNP host code issues PSMASH to split 2MB RMP mapping to 4K
> - KVM MMU splits NPT mapping to 4K
>
> At this point there are no mixed attributes, and KVM would normally
> allow for 2MB NPT mappings again, but this is actually not allowed
> because the RMP table mappings are 4K and cannot be promoted on the
> hypervisor side, so the NPT mappings must still be limited to 4K to
> match this.

Just curious how the mapping could be promoted back in this case?

Thanks,
Yilun

>
> Add a hook to determine the max NPT mapping size in situations like
> this.
>
> Signed-off-by: Michael Roth <[email protected]>
> Message-Id: <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 8 ++++++++
> 3 files changed, 11 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index c81990937ab4..2db87a6fd52a 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -140,6 +140,7 @@ KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> KVM_X86_OP_OPTIONAL(get_untagged_addr)
> KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
> KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
> +KVM_X86_OP_OPTIONAL_RET0(gmem_validate_fault)
> KVM_X86_OP_OPTIONAL(gmem_invalidate)
>
> #undef KVM_X86_OP
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 59c7b95034fc..67dc108dd366 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1811,6 +1811,8 @@ struct kvm_x86_ops {
> void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
> int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
> + int (*gmem_validate_fault)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, bool is_private,
> + u8 *max_level);
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 992e651540e8..13dd367b8af1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4338,6 +4338,14 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> fault->max_level);
> fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
>
> + r = static_call(kvm_x86_gmem_validate_fault)(vcpu->kvm, fault->pfn,
> + fault->gfn, fault->is_private,
> + &fault->max_level);
> + if (r) {
> + kvm_release_pfn_clean(fault->pfn);
> + return r;
> + }
> +
> return RET_PF_CONTINUE;
> }
>
> --
> 2.43.0
>
>

2024-04-23 23:50:28

by Isaku Yamahata


On Thu, Apr 04, 2024 at 02:50:31PM -0400,
Paolo Bonzini <[email protected]> wrote:

> During guest run-time, kvm_arch_gmem_prepare() is issued as needed to
> prepare newly-allocated gmem pages prior to mapping them into the guest.
> In the case of SEV-SNP, this mainly involves setting the pages to
> private in the RMP table.
>
> However, for the GPA ranges comprising the initial guest payload, which
> are encrypted/measured prior to starting the guest, the gmem pages need
> to be accessed prior to setting them to private in the RMP table so they
> can be initialized with the userspace-provided data. Additionally, an
> SNP firmware call is needed afterward to encrypt them in-place and
> measure the contents into the guest's launch digest.
>
> While it is possible to bypass the kvm_arch_gmem_prepare() hooks so that
> this handling can be done in an open-coded/vendor-specific manner, this
> may expose more gmem-internal state/dependencies to external callers
> than necessary. Try to avoid this by implementing an interface that
> tries to handle as much of the common functionality inside gmem as
> possible, while also making it generic enough to potentially be
> usable/extensible for TDX as well.

I explored how TDX would use this hook. However, TDX ended up not using it,
instead relying on kvm_tdp_mmu_get_walk() with a twist. The patch is below.

Because SEV-SNP manages the RMP, which is not tied directly to the NPT,
SEV-SNP can ignore the TDP MMU page tables when updating the RMP.
TDX, on the other hand, essentially updates the Secure-EPT when it adds a
page to the guest via TDH.MEM.PAGE.ADD(). It needs to protect the KVM TDP
MMU page tables with mmu_lock, not the guest_memfd file mapping with
invalidate_lock. The hook therefore doesn't apply well to TDX. The
resulting KVM_TDX_INIT_MEM_REGION logic is as follows.

get_user_pages_fast(source addr)
read_lock(mmu_lock)
kvm_tdp_mmu_get_walk_private_pfn(vcpu, gpa, &pfn);
if the page table doesn't map gpa, error.
TDH.MEM.PAGE.ADD()
TDH.MR.EXTEND()
read_unlock(mmu_lock)
put_page()


From 7d4024049b51969a2431805c2117992fc7ec0981 Mon Sep 17 00:00:00 2001
Message-ID: <7d4024049b51969a2431805c2117992fc7ec0981.1713913379.git.isaku.yamahata@intel.com>
In-Reply-To: <[email protected]>
References: <[email protected]>
From: Isaku Yamahata <[email protected]>
Date: Tue, 23 Apr 2024 11:33:44 -0700
Subject: [PATCH] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU

KVM_TDX_INIT_MEM_REGION needs to check if the given GFN is already
populated. Add wrapping logic to kvm_tdp_mmu_get_walk() to export it.

Alternatives are as follows; the approach of this patch was chosen as the
least intrusive change.
- Refactor the KVM page fault handler into a populating part and an unlock
  function: have the page fault handler populate while keeping the lock,
  call TDH.MEM.PAGE.ADD(), then unlock.
- Add a callback function to struct kvm_page_fault and call it
after the page fault handler before unlocking mmu_lock and releasing PFN.

Based on the feedback of
https://lore.kernel.org/kvm/[email protected]/
https://lore.kernel.org/kvm/[email protected]/

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu.h | 3 +++
arch/x86/kvm/mmu/tdp_mmu.c | 44 ++++++++++++++++++++++++++++++++------
2 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 712e9408f634..4f61f4b9fd64 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -287,6 +287,9 @@ extern bool tdp_mmu_enabled;
#define tdp_mmu_enabled false
#endif

+int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa,
+ kvm_pfn_t *pfn);
+
static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
{
return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3592ae4e485f..bafcd8aeb3b3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -2035,14 +2035,25 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
*
* Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
*/
-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
- int *root_level)
+static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+ bool is_private)
{
struct tdp_iter iter;
struct kvm_mmu *mmu = vcpu->arch.mmu;
gfn_t gfn = addr >> PAGE_SHIFT;
int leaf = -1;

+ tdp_mmu_for_each_pte(iter, mmu, is_private, gfn, gfn + 1) {
+ leaf = iter.level;
+ sptes[leaf] = iter.old_spte;
+ }
+
+ return leaf;
+}
+
+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+ int *root_level)
+{
*root_level = vcpu->arch.mmu->root_role.level;

/*
@@ -2050,15 +2061,34 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
* instructions in protected guest memory can't be parsed by VMM.
*/
if (WARN_ON_ONCE(kvm_gfn_shared_mask(vcpu->kvm)))
- return leaf;
+ return -1;

- tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
- leaf = iter.level;
- sptes[leaf] = iter.old_spte;
+ return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, false);
+}
+
+int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa,
+ kvm_pfn_t *pfn)
+{
+ u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
+ int leaf;
+
+ lockdep_assert_held(&vcpu->kvm->mmu_lock);
+
+ kvm_tdp_mmu_walk_lockless_begin();
+ leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, true);
+ kvm_tdp_mmu_walk_lockless_end();
+ if (leaf < 0)
+ return -ENOENT;
+
+ spte = sptes[leaf];
+ if (is_shadow_present_pte(spte) && is_last_spte(spte, leaf)) {
+ *pfn = spte_to_pfn(spte);
+ return leaf;
}

- return leaf;
+ return -ENOENT;
}
+EXPORT_SYMBOL_GPL(kvm_tdp_mmu_get_walk_private_pfn);

/*
* Returns the last level spte pointer of the shadow page walk for the given
--
2.43.2

--
Isaku Yamahata <[email protected]>

2024-04-24 22:26:28

by Sean Christopherson

Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On Tue, Apr 23, 2024, Isaku Yamahata wrote:
> On Thu, Apr 04, 2024 at 02:50:31PM -0400,
> Paolo Bonzini <[email protected]> wrote:
>
> > During guest run-time, kvm_arch_gmem_prepare() is issued as needed to
> > prepare newly-allocated gmem pages prior to mapping them into the guest.
> > In the case of SEV-SNP, this mainly involves setting the pages to
> > private in the RMP table.
> >
> > However, for the GPA ranges comprising the initial guest payload, which
> > are encrypted/measured prior to starting the guest, the gmem pages need
> > to be accessed prior to setting them to private in the RMP table so they
> > can be initialized with the userspace-provided data. Additionally, an
> > SNP firmware call is needed afterward to encrypt them in-place and
> > measure the contents into the guest's launch digest.
> >
> > While it is possible to bypass the kvm_arch_gmem_prepare() hooks so that
> > this handling can be done in an open-coded/vendor-specific manner, this
> > may expose more gmem-internal state/dependencies to external callers
> > than necessary. Try to avoid this by implementing an interface that
> > tries to handle as much of the common functionality inside gmem as
> > possible, while also making it generic enough to potentially be
> > usable/extensible for TDX as well.
>
> I explored how TDX will use this hook. However, it resulted in not using this
> hook, and instead used kvm_tdp_mmu_get_walk() with a twist. The patch is below.
>
> Because SEV-SNP manages the RMP that is not tied to NPT directly, SEV-SNP can
> ignore TDP MMU page tables when updating RMP.
> On the other hand, TDX essentially updates Secure-EPT when it adds a page to
> the guest by TDH.MEM.PAGE.ADD(). It needs to protect KVM TDP MMU page tables
> with mmu_lock, not guest memfd file mapping with invalidate_lock. The hook
> doesn't apply to TDX well. The resulted KVM_TDX_INIT_MEM_REGION logic is as
> follows.
>
> get_user_pages_fast(source addr)
> read_lock(mmu_lock)
> kvm_tdp_mmu_get_walk_private_pfn(vcpu, gpa, &pfn);
> if the page table doesn't map gpa, error.
> TDH.MEM.PAGE.ADD()
> TDH.MR.EXTEND()
> read_unlock(mmu_lock)
> put_page()

Hmm, KVM doesn't _need_ to use invalidate_lock to protect against guest_memfd
invalidation, but I also don't see why it would cause problems. I.e. why not
take mmu_lock() in TDX's post_populate() implementation? That would allow having
a sanity check that the PFN that guest_memfd() has is indeed the PFN that KVM's
S-EPT mirror has, i.e. the PFN that KVM is going to PAGE.ADD.

> From 7d4024049b51969a2431805c2117992fc7ec0981 Mon Sep 17 00:00:00 2001
> Message-ID: <7d4024049b51969a2431805c2117992fc7ec0981.1713913379.git.isaku.yamahata@intel.com>
> In-Reply-To: <[email protected]>
> References: <[email protected]>
> From: Isaku Yamahata <[email protected]>
> Date: Tue, 23 Apr 2024 11:33:44 -0700
> Subject: [PATCH] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU
>
> KVM_TDX_INIT_MEM_REGION needs to check if the given GFN is already
> populated. Add wrapping logic to kvm_tdp_mmu_get_walk() to export it.

> +int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa,
> + kvm_pfn_t *pfn)
> +{
> + u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
> + int leaf;
> +
> + lockdep_assert_held(&vcpu->kvm->mmu_lock);
> +
> + kvm_tdp_mmu_walk_lockless_begin();

This is obviously not a lockless walk.

> + leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, true);
> + kvm_tdp_mmu_walk_lockless_end();
> + if (leaf < 0)
> + return -ENOENT;
> +
> + spte = sptes[leaf];
> + if (is_shadow_present_pte(spte) && is_last_spte(spte, leaf)) {
> + *pfn = spte_to_pfn(spte);
> + return leaf;
> }
>
> - return leaf;
> + return -ENOENT;
> }
> +EXPORT_SYMBOL_GPL(kvm_tdp_mmu_get_walk_private_pfn);
>
> /*
> * Returns the last level spte pointer of the shadow page walk for the given
> --
> 2.43.2
>
> --
> Isaku Yamahata <[email protected]>

2024-04-24 22:32:49

by Sean Christopherson

Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On Thu, Apr 04, 2024, Paolo Bonzini wrote:
> +long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
> + int (*post_populate)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> + void __user *src, int order, void *opaque),

Add a typedef for callback? If only to make this prototype readable.

> +long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
> + int (*post_populate)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> + void __user *src, int order, void *opaque),
> + void *opaque)
> +{
> + struct file *file;
> + struct kvm_memory_slot *slot;
> +
> + int ret = 0, max_order;
> + long i;
> +
> + lockdep_assert_held(&kvm->slots_lock);
> + if (npages < 0)
> + return -EINVAL;
> +
> + slot = gfn_to_memslot(kvm, gfn);
> + if (!kvm_slot_can_be_private(slot))
> + return -EINVAL;
> +
> + file = kvm_gmem_get_file(slot);
> + if (!file)
> + return -EFAULT;
> +
> + filemap_invalidate_lock(file->f_mapping);
> +
> + npages = min_t(ulong, slot->npages - (gfn - slot->base_gfn), npages);
> + for (i = 0; i < npages; i += (1 << max_order)) {
> + gfn_t this_gfn = gfn + i;

KVM usually does something like "start_gfn" or "base_gfn", and then uses "gfn"
for the one gfn that's being processed. And IMO that's much better because the
prototype for kvm_gmem_populate() does not make it obvious that @gfn is the base
of a range, not a singular gfn to process.


> + kvm_pfn_t pfn;
> +
> + ret = __kvm_gmem_get_pfn(file, slot, this_gfn, &pfn, &max_order, false);
> + if (ret)
> + break;
> +
> + if (!IS_ALIGNED(this_gfn, (1 << max_order)) ||
> + (npages - i) < (1 << max_order))
> + max_order = 0;
> +
> + if (post_populate) {

Is there any use for this without @post_populate? I.e. why make this optional?

> + void __user *p = src ? src + i * PAGE_SIZE : NULL;

Eh, I would vote to define "p" at the top of the loop.

> + ret = post_populate(kvm, this_gfn, pfn, p, max_order, opaque);
> + }
> +
> + put_page(pfn_to_page(pfn));
> + if (ret) {
> + /*
> + * Punch a hole so that FGP_CREAT_ONLY can succeed
> + * again.
> + */
> + kvm_gmem_undo_get_pfn(file, slot, this_gfn, max_order);
> + break;
> + }
> + }
> +
> + filemap_invalidate_unlock(file->f_mapping);
> +
> + fput(file);
> + return ret && !i ? ret : i;
> +}
> +EXPORT_SYMBOL_GPL(kvm_gmem_populate);
> --
> 2.43.0
>
>

2024-04-24 22:34:39

by Sean Christopherson

Subject: Re: [PATCH 07/11] KVM: guest_memfd: extract __kvm_gmem_get_pfn()

On Thu, Apr 04, 2024, Paolo Bonzini wrote:
> In preparation for adding a function that walks a set of pages
> provided by userspace and populates them in a guest_memfd,
> add a version of kvm_gmem_get_pfn() that has a "bool prepare"
> argument and passes it down to kvm_gmem_get_folio().
>
> Populating guest memory has to call repeatedly __kvm_gmem_get_pfn()
> on the same file, so make the new function take struct file*.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> virt/kvm/guest_memfd.c | 38 +++++++++++++++++++++++---------------
> 1 file changed, 23 insertions(+), 15 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 486748e65f36..a537a7e63ab5 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -540,33 +540,29 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
> fput(file);
> }
>
> -int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> - gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
> +static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prepare)

I genuinely don't know what it means to "prepare" a guest_memfd. I see it becomes

if (!prepare)
fgp_flags |= FGP_CREAT_ONLY;

but I find the name "prepare" to be extremely unhelpful.

2024-04-24 23:00:05

by Sean Christopherson

Subject: Re: [PATCH 07/11] KVM: guest_memfd: extract __kvm_gmem_get_pfn()

On Wed, Apr 24, 2024, Sean Christopherson wrote:
> On Thu, Apr 04, 2024, Paolo Bonzini wrote:
> > In preparation for adding a function that walks a set of pages
> > provided by userspace and populates them in a guest_memfd,
> > add a version of kvm_gmem_get_pfn() that has a "bool prepare"
> > argument and passes it down to kvm_gmem_get_folio().
> >
> > Populating guest memory has to call repeatedly __kvm_gmem_get_pfn()
> > on the same file, so make the new function take struct file*.
> >
> > Signed-off-by: Paolo Bonzini <[email protected]>
> > ---
> > virt/kvm/guest_memfd.c | 38 +++++++++++++++++++++++---------------
> > 1 file changed, 23 insertions(+), 15 deletions(-)
> >
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index 486748e65f36..a537a7e63ab5 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -540,33 +540,29 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
> > fput(file);
> > }
> >
> > -int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > - gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
> > +static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
> > + gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prepare)
>
> I genuinely don't know what it means to "prepare" a guest_memfd. I see it becomes
>
> if (!prepare)
> fgp_flags |= FGP_CREAT_ONLY;
>
> but I find the name "prepare" to be extremely unhelpful.

Ah, I'm blind. Maybe "do_prepare", or "do_arch_prepare"? To make it clear that
it's a command, not a description of the operation (which is how I first read it).

And I feel like overloading it to also mean FGP_CREAT_ONLY when _not_ preparing
the memory is odd.

if (prepare) {
int r = kvm_gmem_prepare_folio(inode, index, folio);
if (r < 0) {
folio_unlock(folio);
folio_put(folio);
return ERR_PTR(r);
}
}

Instead of "prepare" as a command, would it make sense to describe the "populating"
case? Because I think it's more intuitive that populating _needs_ to operate on
new, unprepared data.

2024-04-25 01:13:01

by Isaku Yamahata

Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On Wed, Apr 24, 2024 at 03:24:58PM -0700,
Sean Christopherson <[email protected]> wrote:

> On Tue, Apr 23, 2024, Isaku Yamahata wrote:
> > On Thu, Apr 04, 2024 at 02:50:31PM -0400,
> > Paolo Bonzini <[email protected]> wrote:
> >
> > > During guest run-time, kvm_arch_gmem_prepare() is issued as needed to
> > > prepare newly-allocated gmem pages prior to mapping them into the guest.
> > > In the case of SEV-SNP, this mainly involves setting the pages to
> > > private in the RMP table.
> > >
> > > However, for the GPA ranges comprising the initial guest payload, which
> > > are encrypted/measured prior to starting the guest, the gmem pages need
> > > to be accessed prior to setting them to private in the RMP table so they
> > > can be initialized with the userspace-provided data. Additionally, an
> > > SNP firmware call is needed afterward to encrypt them in-place and
> > > measure the contents into the guest's launch digest.
> > >
> > > While it is possible to bypass the kvm_arch_gmem_prepare() hooks so that
> > > this handling can be done in an open-coded/vendor-specific manner, this
> > > may expose more gmem-internal state/dependencies to external callers
> > > than necessary. Try to avoid this by implementing an interface that
> > > tries to handle as much of the common functionality inside gmem as
> > > possible, while also making it generic enough to potentially be
> > > usable/extensible for TDX as well.
> >
> > I explored how TDX will use this hook. However, it resulted in not using this
> > hook, and instead used kvm_tdp_mmu_get_walk() with a twist. The patch is below.
> >
> > Because SEV-SNP manages the RMP that is not tied to NPT directly, SEV-SNP can
> > ignore TDP MMU page tables when updating RMP.
> > On the other hand, TDX essentially updates Secure-EPT when it adds a page to
> > the guest by TDH.MEM.PAGE.ADD(). It needs to protect KVM TDP MMU page tables
> > with mmu_lock, not guest memfd file mapping with invalidate_lock. The hook
> > doesn't apply to TDX well. The resulted KVM_TDX_INIT_MEM_REGION logic is as
> > follows.
> >
> > get_user_pages_fast(source addr)
> > read_lock(mmu_lock)
> > kvm_tdp_mmu_get_walk_private_pfn(vcpu, gpa, &pfn);
> > if the page table doesn't map gpa, error.
> > TDH.MEM.PAGE.ADD()
> > TDH.MR.EXTEND()
> > read_unlock(mmu_lock)
> > put_page()
>
> Hmm, KVM doesn't _need_ to use invalidate_lock to protect against guest_memfd
> invalidation, but I also don't see why it would cause problems. I.e. why not
> take mmu_lock() in TDX's post_populate() implementation?

We can take the lock. Because we have already populated the GFN of guest_memfd,
we need to make kvm_gmem_populate() not pass FGP_CREAT_ONLY. Otherwise we'll
get -EEXIST.


> That would allow having
> a sanity check that the PFN that guest_memfd() has is indeed the PFN that KVM's
> S-EPT mirror has, i.e. the PFN that KVM is going to PAGE.ADD.

Because we already have the PFN from the mirrored EPT, I thought it would be
redundant to get the PFN again via guest_memfd. We can check whether the two
PFNs match.
--
Isaku Yamahata <[email protected]>

2024-04-25 05:53:18

by Paolo Bonzini

Subject: Re: [PATCH 04/11] filemap: add FGP_CREAT_ONLY

On 4/4/24 20:50, Paolo Bonzini wrote:
> KVM would like to add a ioctl to encrypt and install a page into private
> memory (i.e. into a guest_memfd), in preparation for launching an
> encrypted guest.
>
> This API should be used only once per page (unless there are failures),
> so we want to rule out the possibility of operating on a page that is
> already in the guest_memfd's filemap. Overwriting the page is almost
> certainly a sign of a bug, so we might as well forbid it.
>
> Therefore, introduce a new flag for __filemap_get_folio (to be passed
> together with FGP_CREAT) that allows *adding* a new page to the filemap
> but not returning an existing one.
>
> An alternative possibility would be to force KVM users to initialize
> the whole filemap in one go, but that is complicated by the fact that
> the filemap includes pages of different kinds, including some that are
> per-vCPU rather than per-VM. Basically the result would be closer to
> a system call that multiplexes multiple ioctls, than to something
> cleaner like readv/writev.
>
> Races between callers that pass FGP_CREAT_ONLY are uninteresting to
> the filemap code: one of the racers wins and one fails with EEXIST,
> similar to calling open(2) with O_CREAT|O_EXCL. It doesn't matter to
> filemap.c if the missing synchronization is in the kernel or in userspace,
> and in fact it could even be intentional. (In the case of KVM it turns
> out that a mutex is taken around these calls for unrelated reasons,
> so there can be no races.)
>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Yosry Ahmed <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>

Matthew, are your objections still valid or could I have your ack?

Thanks,

Paolo

> ---
> include/linux/pagemap.h | 2 ++
> mm/filemap.c | 4 ++++
> 2 files changed, 6 insertions(+)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index f879c1d54da7..a8c0685e8c08 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -587,6 +587,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
> * * %FGP_CREAT - If no folio is present then a new folio is allocated,
> * added to the page cache and the VM's LRU list. The folio is
> * returned locked.
> + * * %FGP_CREAT_ONLY - Fail if a folio is present
> * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
> * folio is already in cache. If the folio was allocated, unlock it
> * before returning so the caller can do the same dance.
> @@ -607,6 +608,7 @@ typedef unsigned int __bitwise fgf_t;
> #define FGP_NOWAIT ((__force fgf_t)0x00000020)
> #define FGP_FOR_MMAP ((__force fgf_t)0x00000040)
> #define FGP_STABLE ((__force fgf_t)0x00000080)
> +#define FGP_CREAT_ONLY ((__force fgf_t)0x00000100)
> #define FGF_GET_ORDER(fgf) (((__force unsigned)fgf) >> 26) /* top 6 bits */
>
> #define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7437b2bd75c1..e7440e189ebd 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1863,6 +1863,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> folio = NULL;
> if (!folio)
> goto no_page;
> + if (fgp_flags & FGP_CREAT_ONLY) {
> + folio_put(folio);
> + return ERR_PTR(-EEXIST);
> + }
>
> if (fgp_flags & FGP_LOCK) {
> if (fgp_flags & FGP_NOWAIT) {


2024-04-25 05:57:07

by Paolo Bonzini

Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On Thu, Apr 25, 2024 at 12:32 AM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Apr 04, 2024, Paolo Bonzini wrote:
> > +long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
> > + int (*post_populate)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > + void __user *src, int order, void *opaque),
>
> Add a typedef for callback? If only to make this prototype readable.
>
> > +long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
> > + int (*post_populate)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > + void __user *src, int order, void *opaque),
> > + void *opaque)
> > +{
> > + struct file *file;
> > + struct kvm_memory_slot *slot;
> > +
> > + int ret = 0, max_order;
> > + long i;
> > +
> > + lockdep_assert_held(&kvm->slots_lock);
> > + if (npages < 0)
> > + return -EINVAL;
> > +
> > + slot = gfn_to_memslot(kvm, gfn);
> > + if (!kvm_slot_can_be_private(slot))
> > + return -EINVAL;
> > +
> > + file = kvm_gmem_get_file(slot);
> > + if (!file)
> > + return -EFAULT;
> > +
> > + filemap_invalidate_lock(file->f_mapping);
> > +
> > + npages = min_t(ulong, slot->npages - (gfn - slot->base_gfn), npages);
> > + for (i = 0; i < npages; i += (1 << max_order)) {
> > + gfn_t this_gfn = gfn + i;
>
> KVM usually does something like "start_gfn" or "base_gfn", and then uses "gfn"
> for the one gfn that's being processed. And IMO that's much better because the
> prototype for kvm_gmem_populate() does not make it obvious that @gfn is the base
> of a range, not a singular gfn to process.
>
>
> > + kvm_pfn_t pfn;
> > +
> > + ret = __kvm_gmem_get_pfn(file, slot, this_gfn, &pfn, &max_order, false);
> > + if (ret)
> > + break;
> > +
> > + if (!IS_ALIGNED(this_gfn, (1 << max_order)) ||
> > + (npages - i) < (1 << max_order))
> > + max_order = 0;
> > +
> > + if (post_populate) {
>
> Is there any use for this without @post_populate? I.e. why make this optional?

Yeah, it probably does not need to be optional (before, the
copy_from_user was optionally done from kvm_gmem_populate, but not
anymore).

Paolo


2024-04-25 06:01:55

by Paolo Bonzini

Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On Thu, Apr 25, 2024 at 3:12 AM Isaku Yamahata <[email protected]> wrote:
> > > get_user_pages_fast(source addr)
> > > read_lock(mmu_lock)
> > > kvm_tdp_mmu_get_walk_private_pfn(vcpu, gpa, &pfn);
> > > if the page table doesn't map gpa, error.
> > > TDH.MEM.PAGE.ADD()
> > > TDH.MR.EXTEND()
> > > read_unlock(mmu_lock)
> > > put_page()
> >
> > Hmm, KVM doesn't _need_ to use invalidate_lock to protect against guest_memfd
> > invalidation, but I also don't see why it would cause problems.

The invalidate_lock is only needed to operate on the guest_memfd, but
it's a rwsem so there are no risks of lock inversion.

> > I.e. why not
> > take mmu_lock() in TDX's post_populate() implementation?
>
> We can take the lock. Because we have already populated the GFN of guest_memfd,
> we need to make kvm_gmem_populate() not pass FGP_CREAT_ONLY. Otherwise we'll
> get -EEXIST.

I don't understand why TDH.MEM.PAGE.ADD() cannot be called from the
post-populate hook. Can the code for TDH.MEM.PAGE.ADD be shared
between the memory initialization ioctl and the page fault hook in
kvm_x86_ops?

Paolo

>
> > That would allow having
> > a sanity check that the PFN that guest_memfd() has is indeed the PFN that KVM's
> > S-EPT mirror has, i.e. the PFN that KVM is going to PAGE.ADD.
>
> Because we have PFN from the mirrored EPT, I thought it's duplicate to get PFN
> again via guest memfd. We can check if two PFN matches.
> --
> Isaku Yamahata <[email protected]>
>


2024-04-25 16:25:12

by Sean Christopherson

Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On Thu, Apr 25, 2024, Paolo Bonzini wrote:
> On Thu, Apr 25, 2024 at 3:12 AM Isaku Yamahata <[email protected]> wrote:
> > > > get_user_pages_fast(source addr)
> > > > read_lock(mmu_lock)
> > > > kvm_tdp_mmu_get_walk_private_pfn(vcpu, gpa, &pfn);
> > > > if the page table doesn't map gpa, error.
> > > > TDH.MEM.PAGE.ADD()
> > > > TDH.MR.EXTEND()
> > > > read_unlock(mmu_lock)
> > > > put_page()
> > >
> > > Hmm, KVM doesn't _need_ to use invalidate_lock to protect against guest_memfd
> > > invalidation, but I also don't see why it would cause problems.
>
> The invalidate_lock is only needed to operate on the guest_memfd, but
> it's a rwsem so there are no risks of lock inversion.
>
> > > I.e. why not
> > > take mmu_lock() in TDX's post_populate() implementation?
> >
> > We can take the lock. Because we have already populated the GFN of guest_memfd,
> > we need to make kvm_gmem_populate() not pass FGP_CREAT_ONLY. Otherwise we'll
> > get -EEXIST.
>
> I don't understand why TDH.MEM.PAGE.ADD() cannot be called from the
> post-populate hook. Can the code for TDH.MEM.PAGE.ADD be shared
> between the memory initialization ioctl and the page fault hook in
> kvm_x86_ops?

Ah, because TDX is required to pre-fault the memory to establish the S-EPT walk,
and pre-faulting means guest_memfd()

Requiring that guest_memfd not have a page when initializing the guest image
seems wrong, i.e. I don't think we want FGP_CREAT_ONLY. And not just because I
am a fan of pre-faulting, I think the semantics are bad.

E.g. IIUC, doing fallocate() to ensure memory is available would cause LAUNCH_UPDATE
to fail. That's weird and has nothing to do with KVM_PRE_FAULT.

I don't understand why we want FGP_CREAT_ONLY semantics. Who cares if there's a
page allocated? KVM already checks that the page is unassigned in the RMP, so
why does guest_memfd care whether or not the page was _just_ allocated?

AFAIK, unwinding on failure is completely uninteresting, and arguably undesirable,
because undoing LAUNCH_UPDATE or PAGE.ADD will affect the measurement, i.e. there
is no scenario where deleting pages from guest_memfd would allow a restart/resume
of the build process to truly succeed.

2024-04-25 17:08:27

by Isaku Yamahata

Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On Thu, Apr 25, 2024 at 09:00:38AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Thu, Apr 25, 2024, Paolo Bonzini wrote:
> > On Thu, Apr 25, 2024 at 3:12 AM Isaku Yamahata <[email protected]> wrote:
> > > > > get_user_pages_fast(source addr)
> > > > > read_lock(mmu_lock)
> > > > > kvm_tdp_mmu_get_walk_private_pfn(vcpu, gpa, &pfn);
> > > > > if the page table doesn't map gpa, error.
> > > > > TDH.MEM.PAGE.ADD()
> > > > > TDH.MR.EXTEND()
> > > > > read_unlock(mmu_lock)
> > > > > put_page()
> > > >
> > > > Hmm, KVM doesn't _need_ to use invalidate_lock to protect against guest_memfd
> > > > invalidation, but I also don't see why it would cause problems.
> >
> > The invalidate_lock is only needed to operate on the guest_memfd, but
> > it's a rwsem so there are no risks of lock inversion.
> >
> > > > I.e. why not
> > > > take mmu_lock() in TDX's post_populate() implementation?
> > >
> > > We can take the lock. Because we have already populated the GFN of guest_memfd,
> > > we need to make kvm_gmem_populate() not pass FGP_CREAT_ONLY. Otherwise we'll
> > > get -EEXIST.
> >
> > I don't understand why TDH.MEM.PAGE.ADD() cannot be called from the
> > post-populate hook. Can the code for TDH.MEM.PAGE.ADD be shared
> > between the memory initialization ioctl and the page fault hook in
> > kvm_x86_ops?
>
> Ah, because TDX is required to pre-fault the memory to establish the S-EPT walk,
> and pre-faulting means guest_memfd()
>
> Requiring that guest_memfd not have a page when initializing the guest image
> seems wrong, i.e. I don't think we want FGP_CREAT_ONLY. And not just because I
> am a fan of pre-faulting, I think the semantics are bad.
>
> E.g. IIUC, doing fallocate() to ensure memory is available would cause LAUNCH_UPDATE
> to fail. That's weird and has nothing to do with KVM_PRE_FAULT.
>
> I don't understand why we want FGP_CREAT_ONLY semantics. Who cares if there's a
> page allocated? KVM already checks that the page is unassigned in the RMP, so
> why does guest_memfd care whether or not the page was _just_ allocated?
>
> AFAIK, unwinding on failure is completely uninteresting, and arguably undesirable,
> because undoing LAUNCH_UPDATE or PAGE.ADD will affect the measurement, i.e. there
> is no scenario where deleting pages from guest_memfd would allow a restart/resume
> of the build process to truly succeed.


Just for the record: with the following twist to kvm_gmem_populate(),
KVM_TDX_INIT_MEM_REGION can use it. For those who are curious, I also
append the callback implementation at the end.

--

 include/linux/kvm_host.h | 2 ++
 virt/kvm/guest_memfd.c   | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index df957c9f9115..7c86b77f8895 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2460,6 +2460,7 @@ bool kvm_arch_gmem_prepare_needed(struct kvm *kvm);
  *            (passed to @post_populate, and incremented on each iteration
  *            if not NULL)
  * @npages: number of pages to copy from userspace-buffer
+ * @prepare: Allow page allocation to invoke gmem_prepare hook
  * @post_populate: callback to issue for each gmem page that backs the GPA
  *                 range
  * @opaque: opaque data to pass to @post_populate callback
@@ -2473,6 +2474,7 @@ bool kvm_arch_gmem_prepare_needed(struct kvm *kvm);
  * Returns the number of pages that were populated.
  */
 long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
+                       bool prepare,
                        int (*post_populate)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
                                             void __user *src, int order, void *opaque),
                        void *opaque);
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 3195ceefe915..18809e6dea8a 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -638,6 +638,7 @@ static int kvm_gmem_undo_get_pfn(struct file *file, struct kvm_memory_slot *slot
 }
 
 long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
+                       bool prepare,
                        int (*post_populate)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
                                             void __user *src, int order, void *opaque),
                        void *opaque)
@@ -667,7 +668,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages
 		gfn_t this_gfn = gfn + i;
 		kvm_pfn_t pfn;
 
-		ret = __kvm_gmem_get_pfn(file, slot, this_gfn, &pfn, &max_order, false);
+		ret = __kvm_gmem_get_pfn(file, slot, this_gfn, &pfn, &max_order, prepare);
 		if (ret)
 			break;
 
--
2.43.2


Here is the callback for KVM_TDX_INIT_MEM_REGION.
Note: the caller of kvm_gmem_populate() acquires mutex_lock(&kvm->slots_lock)
and idx = srcu_read_lock(&kvm->srcu).


struct tdx_gmem_post_populate_arg {
        struct kvm_vcpu *vcpu;
        __u32 flags;
};

static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
                                  void __user *src, int order, void *_arg)
{
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        struct tdx_gmem_post_populate_arg *arg = _arg;
        struct kvm_vcpu *vcpu = arg->vcpu;
        struct kvm_memory_slot *slot;
        gpa_t gpa = gfn_to_gpa(gfn);
        struct page *page;
        kvm_pfn_t mmu_pfn;
        int ret, i;
        u64 err;

        /* Pin the source page. */
        ret = get_user_pages_fast((unsigned long)src, 1, 0, &page);
        if (ret < 0)
                return ret;
        if (ret != 1)
                return -ENOMEM;

        slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
        if (!kvm_slot_can_be_private(slot) || !kvm_mem_is_private(kvm, gfn)) {
                ret = -EFAULT;
                goto out_put_page;
        }

        read_lock(&kvm->mmu_lock);

        ret = kvm_tdp_mmu_get_walk_private_pfn(vcpu, gpa, &mmu_pfn);
        if (ret < 0)
                goto out;
        if (ret > PG_LEVEL_4K) {
                ret = -EINVAL;
                goto out;
        }
        if (mmu_pfn != pfn) {
                ret = -EAGAIN;
                goto out;
        }

        ret = 0;
        do {
                err = tdh_mem_page_add(kvm_tdx, gpa, pfn_to_hpa(pfn),
                                       pfn_to_hpa(page_to_pfn(page)), NULL);
        } while (err == TDX_ERROR_SEPT_BUSY);
        if (err) {
                ret = -EIO;
                goto out;
        }

        WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped));
        atomic64_dec(&kvm_tdx->nr_premapped);
        tdx_account_td_pages(vcpu->kvm, PG_LEVEL_4K);

        if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
                for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
                        err = tdh_mr_extend(kvm_tdx, gpa + i, NULL);
                        if (err) {
                                ret = -EIO;
                                break;
                        }
                }
        }

out:
        read_unlock(&kvm->mmu_lock);
out_put_page:
        put_page(page);
        return ret;
}

--
Isaku Yamahata <[email protected]>

2024-04-26 05:42:12

by Paolo Bonzini

Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On Thu, Apr 25, 2024 at 6:00 PM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Apr 25, 2024, Paolo Bonzini wrote:
> > On Thu, Apr 25, 2024 at 3:12 AM Isaku Yamahata <[email protected]> wrote:
> > > > > get_user_pages_fast(source addr)
> > > > > read_lock(mmu_lock)
> > > > > kvm_tdp_mmu_get_walk_private_pfn(vcpu, gpa, &pfn);
> > > > > if the page table doesn't map gpa, error.
> > > > > TDH.MEM.PAGE.ADD()
> > > > > TDH.MR.EXTEND()
> > > > > read_unlock(mmu_lock)
> > > > > put_page()
> > > >
> > > > Hmm, KVM doesn't _need_ to use invalidate_lock to protect against guest_memfd
> > > > invalidation, but I also don't see why it would cause problems.
> >
> > The invalidate_lock is only needed to operate on the guest_memfd, but
> > it's a rwsem so there are no risks of lock inversion.
> >
> > > > I.e. why not
> > > > take mmu_lock() in TDX's post_populate() implementation?
> > >
> > > We can take the lock. Because we have already populated the GFN of guest_memfd,
> > > we need to make kvm_gmem_populate() not pass FGP_CREAT_ONLY. Otherwise we'll
> > > get -EEXIST.
> >
> > I don't understand why TDH.MEM.PAGE.ADD() cannot be called from the
> > post-populate hook. Can the code for TDH.MEM.PAGE.ADD be shared
> > between the memory initialization ioctl and the page fault hook in
> > kvm_x86_ops?
>
> Ah, because TDX is required to pre-fault the memory to establish the S-EPT walk,
> and pre-faulting means guest_memfd()
>
> Requiring that guest_memfd not have a page when initializing the guest image
> seems wrong, i.e. I don't think we want FGP_CREAT_ONLY. And not just because I
> am a fan of pre-faulting, I think the semantics are bad.

Ok, fair enough. I wanted to do the once-only test in common code but
since SEV code checks for the RMP I can remove that. One less
headache.

Paolo


2024-04-26 05:45:12

by Paolo Bonzini

Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On Thu, Apr 25, 2024 at 6:51 PM Isaku Yamahata <[email protected]> wrote:
> > AFAIK, unwinding on failure is completely uninteresting, and arguably undesirable,
> > because undoing LAUNCH_UPDATE or PAGE.ADD will affect the measurement, i.e. there
> > is no scenario where deleting pages from guest_memfd would allow a restart/resume
> > of the build process to truly succeed.
>
>
> > Just for the record: with the following twist to kvm_gmem_populate(),
> > KVM_TDX_INIT_MEM_REGION can use kvm_gmem_populate(). For those who are curious,
> > I also append the callback implementation at the end.

Nice, thank you very much. Since TDX does not need
HAVE_KVM_GMEM_PREPARE, if I get rid of FGP_CREAT_ONLY it will work for
you, right?

Paolo

>
> --
>
>  include/linux/kvm_host.h | 2 ++
>  virt/kvm/guest_memfd.c   | 3 ++-
>  2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index df957c9f9115..7c86b77f8895 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2460,6 +2460,7 @@ bool kvm_arch_gmem_prepare_needed(struct kvm *kvm);
>   *            (passed to @post_populate, and incremented on each iteration
>   *            if not NULL)
>   * @npages: number of pages to copy from userspace-buffer
> + * @prepare: Allow page allocation to invoke gmem_prepare hook
>   * @post_populate: callback to issue for each gmem page that backs the GPA
>   *                 range
>   * @opaque: opaque data to pass to @post_populate callback
> @@ -2473,6 +2474,7 @@ bool kvm_arch_gmem_prepare_needed(struct kvm *kvm);
>   * Returns the number of pages that were populated.
>   */
>  long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
> +                       bool prepare,
>                         int (*post_populate)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>                                              void __user *src, int order, void *opaque),
>                         void *opaque);
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 3195ceefe915..18809e6dea8a 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -638,6 +638,7 @@ static int kvm_gmem_undo_get_pfn(struct file *file, struct kvm_memory_slot *slot
>  }
> 
>  long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
> +                       bool prepare,
>                         int (*post_populate)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>                                              void __user *src, int order, void *opaque),
>                         void *opaque)
> @@ -667,7 +668,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages
>  		gfn_t this_gfn = gfn + i;
>  		kvm_pfn_t pfn;
> 
> -		ret = __kvm_gmem_get_pfn(file, slot, this_gfn, &pfn, &max_order, false);
> +		ret = __kvm_gmem_get_pfn(file, slot, this_gfn, &pfn, &max_order, prepare);
>  		if (ret)
>  			break;
> 
> --
> 2.43.2
>
>
> Here is the callback for KVM_TDX_INIT_MEM_REGION.
> Note: the caller of kvm_gmem_populate() acquires mutex_lock(&kvm->slots_lock)
> and idx = srcu_read_lock(&kvm->srcu).
>
>
> struct tdx_gmem_post_populate_arg {
>         struct kvm_vcpu *vcpu;
>         __u32 flags;
> };
>
> static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>                                   void __user *src, int order, void *_arg)
> {
>         struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>         struct tdx_gmem_post_populate_arg *arg = _arg;
>         struct kvm_vcpu *vcpu = arg->vcpu;
>         struct kvm_memory_slot *slot;
>         gpa_t gpa = gfn_to_gpa(gfn);
>         struct page *page;
>         kvm_pfn_t mmu_pfn;
>         int ret, i;
>         u64 err;
>
>         /* Pin the source page. */
>         ret = get_user_pages_fast((unsigned long)src, 1, 0, &page);
>         if (ret < 0)
>                 return ret;
>         if (ret != 1)
>                 return -ENOMEM;
>
>         slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>         if (!kvm_slot_can_be_private(slot) || !kvm_mem_is_private(kvm, gfn)) {
>                 ret = -EFAULT;
>                 goto out_put_page;
>         }
>
>         read_lock(&kvm->mmu_lock);
>
>         ret = kvm_tdp_mmu_get_walk_private_pfn(vcpu, gpa, &mmu_pfn);
>         if (ret < 0)
>                 goto out;
>         if (ret > PG_LEVEL_4K) {
>                 ret = -EINVAL;
>                 goto out;
>         }
>         if (mmu_pfn != pfn) {
>                 ret = -EAGAIN;
>                 goto out;
>         }
>
>         ret = 0;
>         do {
>                 err = tdh_mem_page_add(kvm_tdx, gpa, pfn_to_hpa(pfn),
>                                        pfn_to_hpa(page_to_pfn(page)), NULL);
>         } while (err == TDX_ERROR_SEPT_BUSY);
>         if (err) {
>                 ret = -EIO;
>                 goto out;
>         }
>
>         WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped));
>         atomic64_dec(&kvm_tdx->nr_premapped);
>         tdx_account_td_pages(vcpu->kvm, PG_LEVEL_4K);
>
>         if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
>                 for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
>                         err = tdh_mr_extend(kvm_tdx, gpa + i, NULL);
>                         if (err) {
>                                 ret = -EIO;
>                                 break;
>                         }
>                 }
>         }
>
> out:
>         read_unlock(&kvm->mmu_lock);
> out_put_page:
>         put_page(page);
>         return ret;
> }
>
> --
> Isaku Yamahata <[email protected]>
>


2024-04-26 15:22:57

by Sean Christopherson

Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On Fri, Apr 26, 2024, Paolo Bonzini wrote:
> On Thu, Apr 25, 2024 at 6:00 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Apr 25, 2024, Paolo Bonzini wrote:
> > > On Thu, Apr 25, 2024 at 3:12 AM Isaku Yamahata <[email protected]> wrote:
> > > > > > get_user_pages_fast(source addr)
> > > > > > read_lock(mmu_lock)
> > > > > > kvm_tdp_mmu_get_walk_private_pfn(vcpu, gpa, &pfn);
> > > > > > if the page table doesn't map gpa, error.
> > > > > > TDH.MEM.PAGE.ADD()
> > > > > > TDH.MR.EXTEND()
> > > > > > read_unlock(mmu_lock)
> > > > > > put_page()
> > > > >
> > > > > Hmm, KVM doesn't _need_ to use invalidate_lock to protect against guest_memfd
> > > > > invalidation, but I also don't see why it would cause problems.
> > >
> > > The invalidate_lock is only needed to operate on the guest_memfd, but
> > > it's a rwsem so there are no risks of lock inversion.
> > >
> > > > > I.e. why not
> > > > > take mmu_lock() in TDX's post_populate() implementation?
> > > >
> > > > We can take the lock. Because we have already populated the GFN of guest_memfd,
> > > > we need to make kvm_gmem_populate() not pass FGP_CREAT_ONLY. Otherwise we'll
> > > > get -EEXIST.
> > >
> > > I don't understand why TDH.MEM.PAGE.ADD() cannot be called from the
> > > post-populate hook. Can the code for TDH.MEM.PAGE.ADD be shared
> > > between the memory initialization ioctl and the page fault hook in
> > > kvm_x86_ops?
> >
> > Ah, because TDX is required to pre-fault the memory to establish the S-EPT walk,
> > and pre-faulting means guest_memfd()
> >
> > Requiring that guest_memfd not have a page when initializing the guest image
> > seems wrong, i.e. I don't think we want FGP_CREAT_ONLY. And not just because I
> > am a fan of pre-faulting, I think the semantics are bad.
>
> Ok, fair enough. I wanted to do the once-only test in common code but since
> SEV code checks for the RMP I can remove that. One less headache.

I definitely don't object to having a check in common code, and I'd be in favor
of removing the RMP checks if possible, but tracking needs to be something more
explicit in guest_memfd.

*sigh*

I even left behind a TODO for this exact thing, and y'all didn't even wave at it
as you flew by :-)

	/*
	 * Use the up-to-date flag to track whether or not the memory has been
	 * zeroed before being handed off to the guest.  There is no backing
	 * storage for the memory, so the folio will remain up-to-date until
	 * it's removed.
	 *
	 * TODO: Skip clearing pages when trusted firmware will do it when  <==========================
	 * assigning memory to the guest.
	 */
	if (!folio_test_uptodate(folio)) {
		unsigned long nr_pages = folio_nr_pages(folio);
		unsigned long i;

		for (i = 0; i < nr_pages; i++)
			clear_highpage(folio_page(folio, i));

		folio_mark_uptodate(folio);
	}

	if (prepare) {
		int r = kvm_gmem_prepare_folio(inode, index, folio);
		if (r < 0) {
			folio_unlock(folio);
			folio_put(folio);
			return ERR_PTR(r);
		}
	}

Compile tested only (and not even fully as I didn't bother defining
CONFIG_HAVE_KVM_GMEM_INITIALIZE), but I think this is the basic gist.

8< --------------------------------

// SPDX-License-Identifier: GPL-2.0
#include <linux/backing-dev.h>
#include <linux/falloc.h>
#include <linux/kvm_host.h>
#include <linux/pagemap.h>
#include <linux/anon_inodes.h>

#include "kvm_mm.h"

struct kvm_gmem {
	struct kvm *kvm;
	struct xarray bindings;
	struct list_head entry;
};

static int kvm_gmem_initialize_folio(struct kvm *kvm, struct folio *folio,
				     pgoff_t index, void __user *src,
				     void *opaque)
{
#ifdef CONFIG_HAVE_KVM_GMEM_INITIALIZE
	return kvm_arch_gmem_initialize(kvm, folio, index, src, opaque);
#else
	unsigned long nr_pages = folio_nr_pages(folio);
	unsigned long i;

	if (WARN_ON_ONCE(src))
		return -EIO;

	for (i = 0; i < nr_pages; i++)
		clear_highpage(folio_file_page(folio, index + i));
#endif
	return 0;
}

static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
{
	return gfn - slot->base_gfn + slot->gmem.pgoff;
}


static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
{
	/*
	 * Do not return slot->gmem.file if it has already been closed;
	 * there might be some time between the last fput() and when
	 * kvm_gmem_release() clears slot->gmem.file, and you do not
	 * want to spin in the meanwhile.
	 */
	return get_file_active(&slot->gmem.file);
}

static struct folio *__kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
{
	fgf_t fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
	struct folio *folio;

	/*
	 * The caller is responsible for managing the up-to-date flag (or not),
	 * as the memory doesn't need to be initialized until it's actually
	 * mapped into the guest.  Waiting to initialize memory is necessary
	 * for VM types where the memory can be initialized exactly once.
	 *
	 * Ignore accessed, referenced, and dirty flags.  The memory is
	 * unevictable and there is no storage to write back to.
	 *
	 * TODO: Support huge pages.
	 */
	folio = __filemap_get_folio(inode->i_mapping, index, fgp_flags,
				    mapping_gfp_mask(inode->i_mapping));

	if (folio_test_hwpoison(folio)) {
		folio_unlock(folio);
		return ERR_PTR(-EHWPOISON);
	}
	return folio;
}

static struct folio *kvm_gmem_get_folio(struct file *file,
					struct kvm_memory_slot *slot,
					gfn_t gfn)
{
	pgoff_t index = kvm_gmem_get_index(slot, gfn);
	struct kvm_gmem *gmem = file->private_data;
	struct inode *inode;

	if (file != slot->gmem.file) {
		WARN_ON_ONCE(slot->gmem.file);
		return ERR_PTR(-EFAULT);
	}

	gmem = file->private_data;
	if (xa_load(&gmem->bindings, index) != slot) {
		WARN_ON_ONCE(xa_load(&gmem->bindings, index));
		return ERR_PTR(-EIO);
	}

	inode = file_inode(file);

	/*
	 * The caller is responsible for managing the up-to-date flag (or not),
	 * as the memory doesn't need to be initialized until it's actually
	 * mapped into the guest.  Waiting to initialize memory is necessary
	 * for VM types where the memory can be initialized exactly once.
	 *
	 * Ignore accessed, referenced, and dirty flags.  The memory is
	 * unevictable and there is no storage to write back to.
	 *
	 * TODO: Support huge pages.
	 */
	return __kvm_gmem_get_folio(inode, index);
}

int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
{
	pgoff_t index = kvm_gmem_get_index(slot, gfn);
	struct file *file = kvm_gmem_get_file(slot);
	struct folio *folio;
	struct page *page;

	if (!file)
		return -EFAULT;

	folio = kvm_gmem_get_folio(file, slot, gfn);
	if (IS_ERR(folio))
		goto out;

	if (!folio_test_uptodate(folio)) {
		kvm_gmem_initialize_folio(kvm, folio, index, NULL, NULL);
		folio_mark_uptodate(folio);
	}

	page = folio_file_page(folio, index);

	*pfn = page_to_pfn(page);
	if (max_order)
		*max_order = 0;

out:
	fput(file);
	return IS_ERR(folio) ? PTR_ERR(folio) : 0;
}
EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);

long kvm_gmem_populate(struct kvm *kvm, gfn_t base_gfn, void __user *base_src,
		       long npages, void *opaque)
{
	struct kvm_memory_slot *slot;
	struct file *file;
	int ret = 0, max_order;
	long i;

	lockdep_assert_held(&kvm->slots_lock);
	if (npages < 0)
		return -EINVAL;

	slot = gfn_to_memslot(kvm, base_gfn);
	if (!kvm_slot_can_be_private(slot))
		return -EINVAL;

	file = kvm_gmem_get_file(slot);
	if (!file)
		return -EFAULT;

	filemap_invalidate_lock(file->f_mapping);

	npages = min_t(ulong, slot->npages - (base_gfn - slot->base_gfn), npages);
	for (i = 0; i < npages; i += (1 << max_order)) {
		void __user *src = base_src + i * PAGE_SIZE;
		gfn_t gfn = base_gfn + i;
		pgoff_t index = kvm_gmem_get_index(slot, gfn);
		struct folio *folio;

		folio = kvm_gmem_get_folio(file, slot, gfn);
		if (IS_ERR(folio)) {
			ret = PTR_ERR(folio);
			break;
		}

		if (folio_test_uptodate(folio)) {
			folio_put(folio);
			ret = -EEXIST;
			break;
		}

		kvm_gmem_initialize_folio(kvm, folio, index, src, opaque);
		folio_unlock(folio);
		folio_put(folio);
	}

	filemap_invalidate_unlock(file->f_mapping);

	fput(file);
	return ret && !i ? ret : i;
}
EXPORT_SYMBOL_GPL(kvm_gmem_populate);

static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
				      pgoff_t end)
{
	bool flush = false, found_memslot = false;
	struct kvm_memory_slot *slot;
	struct kvm *kvm = gmem->kvm;
	unsigned long index;

	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
		pgoff_t pgoff = slot->gmem.pgoff;

		struct kvm_gfn_range gfn_range = {
			.start = slot->base_gfn + max(pgoff, start) - pgoff,
			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
			.slot = slot,
			.may_block = true,
		};

		if (!found_memslot) {
			found_memslot = true;

			KVM_MMU_LOCK(kvm);
			kvm_mmu_invalidate_begin(kvm);
		}

		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
	}

	if (flush)
		kvm_flush_remote_tlbs(kvm);

	if (found_memslot)
		KVM_MMU_UNLOCK(kvm);
}

static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
				    pgoff_t end)
{
	struct kvm *kvm = gmem->kvm;

	if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
		KVM_MMU_LOCK(kvm);
		kvm_mmu_invalidate_end(kvm);
		KVM_MMU_UNLOCK(kvm);
	}
}

static long __kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
{
	struct list_head *gmem_list = &inode->i_mapping->i_private_list;
	pgoff_t start = offset >> PAGE_SHIFT;
	pgoff_t end = (offset + len) >> PAGE_SHIFT;
	struct kvm_gmem *gmem;

	list_for_each_entry(gmem, gmem_list, entry)
		kvm_gmem_invalidate_begin(gmem, start, end);

	truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);

	list_for_each_entry(gmem, gmem_list, entry)
		kvm_gmem_invalidate_end(gmem, start, end);

	return 0;
}

static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
{
	int r;

	/*
	 * Bindings must be stable across invalidation to ensure the start+end
	 * are balanced.
	 */
	filemap_invalidate_lock(inode->i_mapping);
	r = __kvm_gmem_punch_hole(inode, offset, len);
	filemap_invalidate_unlock(inode->i_mapping);
	return r;
}

static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
{
	struct address_space *mapping = inode->i_mapping;
	pgoff_t start, index, end;
	int r;

	/* Dedicated guest is immutable by default. */
	if (offset + len > i_size_read(inode))
		return -EINVAL;

	filemap_invalidate_lock_shared(mapping);

	start = offset >> PAGE_SHIFT;
	end = (offset + len) >> PAGE_SHIFT;

	r = 0;
	for (index = start; index < end; ) {
		struct folio *folio;

		if (signal_pending(current)) {
			r = -EINTR;
			break;
		}

		folio = __kvm_gmem_get_folio(inode, index);
		if (IS_ERR(folio)) {
			r = PTR_ERR(folio);
			break;
		}

		index = folio_next_index(folio);

		folio_unlock(folio);
		folio_put(folio);

		/* 64-bit only, wrapping the index should be impossible. */
		if (WARN_ON_ONCE(!index))
			break;

		cond_resched();
	}

	filemap_invalidate_unlock_shared(mapping);

	return r;
}

static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
			       loff_t len)
{
	int ret;

	if (!(mode & FALLOC_FL_KEEP_SIZE))
		return -EOPNOTSUPP;

	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
		return -EOPNOTSUPP;

	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
		return -EINVAL;

	if (mode & FALLOC_FL_PUNCH_HOLE)
		ret = kvm_gmem_punch_hole(file_inode(file), offset, len);
	else
		ret = kvm_gmem_allocate(file_inode(file), offset, len);

	if (!ret)
		file_modified(file);
	return ret;
}

static int kvm_gmem_release(struct inode *inode, struct file *file)
{
	struct kvm_gmem *gmem = file->private_data;
	struct kvm_memory_slot *slot;
	struct kvm *kvm = gmem->kvm;
	unsigned long index;

	/*
	 * Prevent concurrent attempts to *unbind* a memslot.  This is the last
	 * reference to the file and thus no new bindings can be created, but
	 * dereferencing the slot for existing bindings needs to be protected
	 * against memslot updates, specifically so that unbind doesn't race
	 * and free the memslot (kvm_gmem_get_file() will return NULL).
	 */
	mutex_lock(&kvm->slots_lock);

	filemap_invalidate_lock(inode->i_mapping);

	xa_for_each(&gmem->bindings, index, slot)
		rcu_assign_pointer(slot->gmem.file, NULL);

	synchronize_rcu();

	/*
	 * All in-flight operations are gone and new bindings can be created.
	 * Zap all SPTEs pointed at by this file.  Do not free the backing
	 * memory, as its lifetime is associated with the inode, not the file.
	 */
	kvm_gmem_invalidate_begin(gmem, 0, -1ul);
	kvm_gmem_invalidate_end(gmem, 0, -1ul);

	list_del(&gmem->entry);

	filemap_invalidate_unlock(inode->i_mapping);

	mutex_unlock(&kvm->slots_lock);

	xa_destroy(&gmem->bindings);
	kfree(gmem);

	kvm_put_kvm(kvm);

	return 0;
}

static struct file_operations kvm_gmem_fops = {
	.open = generic_file_open,
	.release = kvm_gmem_release,
	.fallocate = kvm_gmem_fallocate,
};

void kvm_gmem_init(struct module *module)
{
	kvm_gmem_fops.owner = module;
}

static int kvm_gmem_migrate_folio(struct address_space *mapping,
				  struct folio *dst, struct folio *src,
				  enum migrate_mode mode)
{
	WARN_ON_ONCE(1);
	return -EINVAL;
}

static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *folio)
{
	struct list_head *gmem_list = &mapping->i_private_list;
	struct kvm_gmem *gmem;
	pgoff_t start, end;

	filemap_invalidate_lock_shared(mapping);

	start = folio->index;
	end = start + folio_nr_pages(folio);

	list_for_each_entry(gmem, gmem_list, entry)
		kvm_gmem_invalidate_begin(gmem, start, end);

	/*
	 * Do not truncate the range, what action is taken in response to the
	 * error is userspace's decision (assuming the architecture supports
	 * gracefully handling memory errors).  If/when the guest attempts to
	 * access a poisoned page, kvm_gmem_get_pfn() will return -EHWPOISON,
	 * at which point KVM can either terminate the VM or propagate the
	 * error to userspace.
	 */

	list_for_each_entry(gmem, gmem_list, entry)
		kvm_gmem_invalidate_end(gmem, start, end);

	filemap_invalidate_unlock_shared(mapping);

	return MF_DELAYED;
}

#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
static void kvm_gmem_free_folio(struct folio *folio)
{
	struct page *page = folio_page(folio, 0);
	kvm_pfn_t pfn = page_to_pfn(page);
	int order = folio_order(folio);

	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
}
#endif

static const struct address_space_operations kvm_gmem_aops = {
	.dirty_folio = noop_dirty_folio,
	.migrate_folio = kvm_gmem_migrate_folio,
	.error_remove_folio = kvm_gmem_error_folio,
#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
	.free_folio = kvm_gmem_free_folio,
#endif
};

static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
			    struct kstat *stat, u32 request_mask,
			    unsigned int query_flags)
{
	struct inode *inode = path->dentry->d_inode;

	generic_fillattr(idmap, request_mask, inode, stat);
	return 0;
}

static int kvm_gmem_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
			    struct iattr *attr)
{
	return -EINVAL;
}

static const struct inode_operations kvm_gmem_iops = {
	.getattr = kvm_gmem_getattr,
	.setattr = kvm_gmem_setattr,
};

static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
{
	const char *anon_name = "[kvm-gmem]";
	struct kvm_gmem *gmem;
	struct inode *inode;
	struct file *file;
	int fd, err;

	fd = get_unused_fd_flags(0);
	if (fd < 0)
		return fd;

	gmem = kzalloc(sizeof(*gmem), GFP_KERNEL);
	if (!gmem) {
		err = -ENOMEM;
		goto err_fd;
	}

	file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
					 O_RDWR, NULL);
	if (IS_ERR(file)) {
		err = PTR_ERR(file);
		goto err_gmem;
	}

	file->f_flags |= O_LARGEFILE;

	inode = file->f_inode;
	WARN_ON(file->f_mapping != inode->i_mapping);

	inode->i_private = (void *)(unsigned long)flags;
	inode->i_op = &kvm_gmem_iops;
	inode->i_mapping->a_ops = &kvm_gmem_aops;
	inode->i_mapping->flags |= AS_INACCESSIBLE;
	inode->i_mode |= S_IFREG;
	inode->i_size = size;
	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
	mapping_set_unmovable(inode->i_mapping);
	/* Unmovable mappings are supposed to be marked unevictable as well. */
	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));

	kvm_get_kvm(kvm);
	gmem->kvm = kvm;
	xa_init(&gmem->bindings);
	list_add(&gmem->entry, &inode->i_mapping->i_private_list);

	fd_install(fd, file);
	return fd;

err_gmem:
	kfree(gmem);
err_fd:
	put_unused_fd(fd);
	return err;
}

int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
{
	loff_t size = args->size;
	u64 flags = args->flags;
	u64 valid_flags = 0;

	if (flags & ~valid_flags)
		return -EINVAL;

	if (size <= 0 || !PAGE_ALIGNED(size))
		return -EINVAL;

	return __kvm_gmem_create(kvm, size, flags);
}

int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
		  unsigned int fd, loff_t offset)
{
	loff_t size = slot->npages << PAGE_SHIFT;
	unsigned long start, end;
	struct kvm_gmem *gmem;
	struct inode *inode;
	struct file *file;
	int r = -EINVAL;

	BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));

	file = fget(fd);
	if (!file)
		return -EBADF;

	if (file->f_op != &kvm_gmem_fops)
		goto err;

	gmem = file->private_data;
	if (gmem->kvm != kvm)
		goto err;

	inode = file_inode(file);

	if (offset < 0 || !PAGE_ALIGNED(offset) ||
	    offset + size > i_size_read(inode))
		goto err;

	filemap_invalidate_lock(inode->i_mapping);

	start = offset >> PAGE_SHIFT;
	end = start + slot->npages;

	if (!xa_empty(&gmem->bindings) &&
	    xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
		filemap_invalidate_unlock(inode->i_mapping);
		goto err;
	}

	/*
	 * No synchronize_rcu() needed, any in-flight readers are guaranteed to
	 * see either a NULL file or this new file; no need for them to go
	 * away.
	 */
	rcu_assign_pointer(slot->gmem.file, file);
	slot->gmem.pgoff = start;

	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
	filemap_invalidate_unlock(inode->i_mapping);

	/*
	 * Drop the reference to the file, even on success.  The file pins KVM,
	 * not the other way 'round.  Active bindings are invalidated if the
	 * file is closed before memslots are destroyed.
	 */
	r = 0;
err:
	fput(file);
	return r;
}

void kvm_gmem_unbind(struct kvm_memory_slot *slot)
{
	unsigned long start = slot->gmem.pgoff;
	unsigned long end = start + slot->npages;
	struct kvm_gmem *gmem;
	struct file *file;

	/*
	 * Nothing to do if the underlying file was already closed (or is being
	 * closed right now), kvm_gmem_release() invalidates all bindings.
	 */
	file = kvm_gmem_get_file(slot);
	if (!file)
		return;

	gmem = file->private_data;

	filemap_invalidate_lock(file->f_mapping);
	xa_store_range(&gmem->bindings, start, end - 1, NULL, GFP_KERNEL);
	rcu_assign_pointer(slot->gmem.file, NULL);
	synchronize_rcu();
	filemap_invalidate_unlock(file->f_mapping);

	fput(file);
}


2024-04-26 17:16:01

by Isaku Yamahata

Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On Fri, Apr 26, 2024 at 07:44:40AM +0200,
Paolo Bonzini <[email protected]> wrote:

> On Thu, Apr 25, 2024 at 6:51 PM Isaku Yamahata <[email protected]> wrote:
> > > AFAIK, unwinding on failure is completely uninteresting, and arguably undesirable,
> > > because undoing LAUNCH_UPDATE or PAGE.ADD will affect the measurement, i.e. there
> > > is no scenario where deleting pages from guest_memfd would allow a restart/resume
> > > of the build process to truly succeed.
> >
> >
> > Just for the record: with the following twist to kvm_gmem_populate(),
> > KVM_TDX_INIT_MEM_REGION can use kvm_gmem_populate(). For those who are curious,
> > I also append the callback implementation at the end.
>
> Nice, thank you very much. Since TDX does not need
> HAVE_KVM_GMEM_PREPARE, if I get rid of FGP_CREAT_ONLY it will work for
> you, right?

Yes, that's right.
--
Isaku Yamahata <[email protected]>

2024-04-29 13:22:59

by Vlastimil Babka

Subject: Re: [PATCH 01/11] mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory

On 4/4/24 8:50 PM, Paolo Bonzini wrote:
> From: Michael Roth <[email protected]>
>
> filemap users like guest_memfd may use page cache pages to
> allocate/manage memory that is only intended to be accessed by guests
> via hardware protections like encryption. Writes to memory of this sort
> in common paths like truncation may cause unexpected behavior, such as
> writing garbage instead of zeros when attempting to zero pages, or
> worse, triggering hardware protections that are considered fatal as far
> as the kernel is concerned.
>
> Introduce a new address_space flag, AS_INACCESSIBLE, and use this
> initially to prevent zero'ing of pages during truncation, with the
> understanding that it is up to the owner of the mapping to handle this
> specially if needed.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Cc: Matthew Wilcox <[email protected]>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> Message-ID: <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>

I've replied on Michael's original series thread but that was after this one
was already posted, and I missed it due to smaller Cc list, e.g. linux-mm
not being Cc... so let me repeat here:


Hm somehow it seems like a rather blunt solution to a fairly specific issue
on one hand, and on the other hand I'm not sure whether there are other
places (not yet triggered) that should now take into account the flag to
keep its promise. But as long as it gets the job done, and can be replaced
later with something better...

Acked-by: Vlastimil Babka <[email protected]>


> ---
> include/linux/pagemap.h | 1 +
> mm/truncate.c | 3 ++-
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 2df35e65557d..f879c1d54da7 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -207,6 +207,7 @@ enum mapping_flags {
> AS_STABLE_WRITES, /* must wait for writeback before modifying
> folio contents */
> AS_UNMOVABLE, /* The mapping cannot be moved, ever */
> + AS_INACCESSIBLE, /* Do not attempt direct R/W access to the mapping */
> };
>
> /**
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 725b150e47ac..c501338c7ebd 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -233,7 +233,8 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
> * doing a complex calculation here, and then doing the zeroing
> * anyway if the page split fails.
> */
> - folio_zero_range(folio, offset, length);
> + if (!test_bit(AS_INACCESSIBLE, &folio->mapping->flags))
> + folio_zero_range(folio, offset, length);
>
> if (folio_has_private(folio))
> folio_invalidate(folio, offset, length);


2024-04-29 13:23:16

by Vlastimil Babka

Subject: Re: [PATCH 02/11] KVM: guest_memfd: Use AS_INACCESSIBLE when creating guest_memfd inode

On 4/4/24 8:50 PM, Paolo Bonzini wrote:
> From: Michael Roth <[email protected]>
>
> truncate_inode_pages_range() may attempt to zero pages before truncating
> them, and this will occur before arch-specific invalidations can be
> triggered via .invalidate_folio/.free_folio hooks via kvm_gmem_aops. For
> AMD SEV-SNP this would result in an RMP #PF being generated by the
> hardware, which is currently treated as fatal (and even if specifically
> allowed for, would not result in anything other than garbage being
> written to guest pages due to encryption). On Intel TDX this would also
> result in undesirable behavior.
>
> Set the AS_INACCESSIBLE flag to prevent the MM from attempting
> unexpected accesses of this sort during operations like truncation.
>
> This may also in some cases yield a decent performance improvement for
> guest_memfd userspace implementations that hole-punch ranges immediately
> after private->shared conversions via KVM_SET_MEMORY_ATTRIBUTES, since
> the current implementation of truncate_inode_pages_range() always ends
> up zeroing an entire 4K range if it is backed by a 2M folio.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> Message-ID: <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> ---
> virt/kvm/guest_memfd.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 0f4e0cf4f158..5a929536ecf2 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -357,6 +357,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> inode->i_private = (void *)(unsigned long)flags;
> inode->i_op = &kvm_gmem_iops;
> inode->i_mapping->a_ops = &kvm_gmem_aops;
> + set_bit(AS_INACCESSIBLE, &inode->i_mapping->flags);
> inode->i_mode |= S_IFREG;
> inode->i_size = size;
> mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);


2024-04-29 13:34:21

by Vlastimil Babka

Subject: Re: [PATCH 04/11] filemap: add FGP_CREAT_ONLY

On 4/25/24 7:52 AM, Paolo Bonzini wrote:
> On 4/4/24 20:50, Paolo Bonzini wrote:
>> KVM would like to add a ioctl to encrypt and install a page into private
>> memory (i.e. into a guest_memfd), in preparation for launching an
>> encrypted guest.
>>
>> This API should be used only once per page (unless there are failures),
>> so we want to rule out the possibility of operating on a page that is
>> already in the guest_memfd's filemap. Overwriting the page is almost
>> certainly a sign of a bug, so we might as well forbid it.
>>
>> Therefore, introduce a new flag for __filemap_get_folio (to be passed
>> together with FGP_CREAT) that allows *adding* a new page to the filemap
>> but not returning an existing one.
>>
>> An alternative possibility would be to force KVM users to initialize
>> the whole filemap in one go, but that is complicated by the fact that
>> the filemap includes pages of different kinds, including some that are
>> per-vCPU rather than per-VM. Basically the result would be closer to
>> a system call that multiplexes multiple ioctls, than to something
>> cleaner like readv/writev.
>>
>> Races between callers that pass FGP_CREAT_ONLY are uninteresting to
>> the filemap code: one of the racers wins and one fails with EEXIST,
>> similar to calling open(2) with O_CREAT|O_EXCL. It doesn't matter to
>> filemap.c if the missing synchronization is in the kernel or in userspace,
>> and in fact it could even be intentional. (In the case of KVM it turns
>> out that a mutex is taken around these calls for unrelated reasons,
>> so there can be no races.)
>>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Yosry Ahmed <[email protected]>
>> Signed-off-by: Paolo Bonzini <[email protected]>
>
> Matthew, are your objections still valid or could I have your ack?

So per the sub-thread on PATCH 09/11, IIUC this is now moot, right?

Vlastimil

> Thanks,
>
> Paolo
>
>> ---
>> include/linux/pagemap.h | 2 ++
>> mm/filemap.c | 4 ++++
>> 2 files changed, 6 insertions(+)
>>
>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>> index f879c1d54da7..a8c0685e8c08 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -587,6 +587,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
>> * * %FGP_CREAT - If no folio is present then a new folio is allocated,
>> * added to the page cache and the VM's LRU list. The folio is
>> * returned locked.
>> + * * %FGP_CREAT_ONLY - Fail if a folio is present
>> * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
>> * folio is already in cache. If the folio was allocated, unlock it
>> * before returning so the caller can do the same dance.
>> @@ -607,6 +608,7 @@ typedef unsigned int __bitwise fgf_t;
>> #define FGP_NOWAIT ((__force fgf_t)0x00000020)
>> #define FGP_FOR_MMAP ((__force fgf_t)0x00000040)
>> #define FGP_STABLE ((__force fgf_t)0x00000080)
>> +#define FGP_CREAT_ONLY ((__force fgf_t)0x00000100)
>> #define FGF_GET_ORDER(fgf) (((__force unsigned)fgf) >> 26) /* top 6 bits */
>>
>> #define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 7437b2bd75c1..e7440e189ebd 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1863,6 +1863,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>> folio = NULL;
>> if (!folio)
>> goto no_page;
>> + if (fgp_flags & FGP_CREAT_ONLY) {
>> + folio_put(folio);
>> + return ERR_PTR(-EEXIST);
>> + }
>>
>> if (fgp_flags & FGP_LOCK) {
>> if (fgp_flags & FGP_NOWAIT) {
>


2024-05-07 16:44:58

by Paolo Bonzini

Subject: Re: [PATCH 06/11] KVM: guest_memfd: Add hook for initializing memory

On Mon, Apr 22, 2024 at 12:58 PM Xu Yilun <[email protected]> wrote:
> > In some cases, it is necessary to defer the preparation of the pages to
> > handle things like in-place encryption of initial guest memory payloads
> > before marking these pages as 'private'/'guest-owned'. Add an argument
> > (always true for now) to kvm_gmem_get_folio() that allows for the
> > preparation callback to be bypassed. To detect possible issues in
>
> IIUC, we have 2 dedicated flows.
> 1 kvm_gmem_get_pfn() or kvm_gmem_allocate()
> a. kvm_gmem_get_folio()
> b. gmem_prepare() for RMP
>
> 2 in-place encryption or whatever
> a. kvm_gmem_get_folio(FGP_CREAT_ONLY)
> b. in-place encryption
> c. gmem_prepare() for RMP
>
> Could we move gmem_prepare() out of kvm_gmem_get_folio()? Then we could
> have a straightforward flow for each case, and wouldn't need an extra
> argument to postpone gmem_prepare().

There are 3 flows as you note above - kvm_gmem_get_pfn() and
kvm_gmem_allocate() are different paths but they all need to call the
prepare hook. It is tempting to pull kvm_gmem_prepare_folio() up into
the two functions (get_pfn and allocate), but the resulting code is
really ugly due to the folio_unlock()/folio_put() handling.

> > -static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
> > +#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
> > +bool __weak kvm_arch_gmem_prepare_needed(struct kvm *kvm)
> > +{
> > + return false;
> > +}
> > +#endif
>
> Is there a case in which HAVE_KVM_GMEM_PREPARE is selected but
> kvm_arch_gmem_prepare_needed() is never implemented? Then all the
> gmem_prepare stuff is actually dead code. Maybe we don't need this weak stub?

It's not needed indeed.

> > + if (prepare) {
> > + int r = kvm_gmem_prepare_folio(inode, index, folio);
> > + if (r < 0) {
> > + folio_unlock(folio);
> > + folio_put(folio);
> > + return ERR_PTR(r);
> > + }
> > + }
> > +
>
> Do we still need to prepare the page if it is hwpoisoned? I see the
> hwpoisoned check is outside, in kvm_gmem_get_pfn().

Yep, it can be moved here.

Paolo


2024-06-07 23:03:23

by Sean Christopherson

Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

SNP folks and/or Paolo, what's the plan for this? I don't see how what's sitting
in kvm/next can possibly be correct without conditioning population on the folio
being !uptodate.

On Fri, Apr 26, 2024, Sean Christopherson wrote:
> On Fri, Apr 26, 2024, Paolo Bonzini wrote:
> > On Thu, Apr 25, 2024 at 6:00 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Thu, Apr 25, 2024, Paolo Bonzini wrote:
> > > > On Thu, Apr 25, 2024 at 3:12 AM Isaku Yamahata <[email protected]> wrote:
> > > > > > > get_user_pages_fast(source addr)
> > > > > > > read_lock(mmu_lock)
> > > > > > > kvm_tdp_mmu_get_walk_private_pfn(vcpu, gpa, &pfn);
> > > > > > > if the page table doesn't map gpa, error.
> > > > > > > TDH.MEM.PAGE.ADD()
> > > > > > > TDH.MR.EXTEND()
> > > > > > > read_unlock(mmu_lock)
> > > > > > > put_page()
> > > > > >
> > > > > > Hmm, KVM doesn't _need_ to use invalidate_lock to protect against guest_memfd
> > > > > > invalidation, but I also don't see why it would cause problems.
> > > >
> > > > The invalidate_lock is only needed to operate on the guest_memfd, but
> > > > it's a rwsem so there are no risks of lock inversion.
> > > >
> > > > > > I.e. why not
> > > > > > take mmu_lock() in TDX's post_populate() implementation?
> > > > >
> > > > > We can take the lock. Because we have already populated the GFN of guest_memfd,
> > > > > we need to make kvm_gmem_populate() not pass FGP_CREAT_ONLY. Otherwise we'll
> > > > > get -EEXIST.
> > > >
> > > > I don't understand why TDH.MEM.PAGE.ADD() cannot be called from the
> > > > post-populate hook. Can the code for TDH.MEM.PAGE.ADD be shared
> > > > between the memory initialization ioctl and the page fault hook in
> > > > kvm_x86_ops?
> > >
> > > Ah, because TDX is required to pre-fault the memory to establish the S-EPT walk,
> > > and pre-faulting means guest_memfd()
> > >
> > > Requiring that guest_memfd not have a page when initializing the guest image
> > > seems wrong, i.e. I don't think we want FGP_CREAT_ONLY. And not just because I
> > > am a fan of pre-faulting, I think the semantics are bad.
> >
> > Ok, fair enough. I wanted to do the once-only test in common code but since
> > SEV code checks for the RMP I can remove that. One less headache.
>
> I definitely don't object to having a check in common code, and I'd be in favor
> of removing the RMP checks if possible, but tracking needs to be something more
> explicit in guest_memfd.
>
> *sigh*
>
> I even left behind a TODO for this exact thing, and y'all didn't even wave at it
> as you flew by :-)
>
> /*
> * Use the up-to-date flag to track whether or not the memory has been
> * zeroed before being handed off to the guest. There is no backing
> * storage for the memory, so the folio will remain up-to-date until
> * it's removed.
> *
> * TODO: Skip clearing pages when trusted firmware will do it when <==========================
> * assigning memory to the guest.
> */
> if (!folio_test_uptodate(folio)) {
> unsigned long nr_pages = folio_nr_pages(folio);
> unsigned long i;
>
> for (i = 0; i < nr_pages; i++)
> clear_highpage(folio_page(folio, i));
>
> folio_mark_uptodate(folio);
> }
>
> if (prepare) {
> int r = kvm_gmem_prepare_folio(inode, index, folio);
> if (r < 0) {
> folio_unlock(folio);
> folio_put(folio);
> return ERR_PTR(r);
> }
> }
>
> Compile tested only (and not even fully as I didn't bother defining
> CONFIG_HAVE_KVM_GMEM_INITIALIZE), but I think this is the basic gist.
>
> 8< --------------------------------
>
> // SPDX-License-Identifier: GPL-2.0
> #include <linux/backing-dev.h>
> #include <linux/falloc.h>
> #include <linux/kvm_host.h>
> #include <linux/pagemap.h>
> #include <linux/anon_inodes.h>
>
> #include "kvm_mm.h"
>
> struct kvm_gmem {
> struct kvm *kvm;
> struct xarray bindings;
> struct list_head entry;
> };
>
> static int kvm_gmem_initialize_folio(struct kvm *kvm, struct folio *folio,
> pgoff_t index, void __user *src,
> void *opaque)
> {
> #ifdef CONFIG_HAVE_KVM_GMEM_INITIALIZE
> return kvm_arch_gmem_initialize(kvm, folio, index, src, opaque);
> #else
> unsigned long nr_pages = folio_nr_pages(folio);
> unsigned long i;
>
> if (WARN_ON_ONCE(src))
> return -EIO;
>
> for (i = 0; i < nr_pages; i++)
> clear_highpage(folio_file_page(folio, index + i));
> #endif
> return 0;
> }
>
>
> static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
> {
> return gfn - slot->base_gfn + slot->gmem.pgoff;
> }
>
>
> static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
> {
> /*
> * Do not return slot->gmem.file if it has already been closed;
> * there might be some time between the last fput() and when
> * kvm_gmem_release() clears slot->gmem.file, and you do not
> * want to spin in the meanwhile.
> */
> return get_file_active(&slot->gmem.file);
> }
>
> static struct folio *__kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
> {
> fgf_t fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
> struct folio *folio;
>
> /*
> * The caller is responsible for managing the up-to-date flag (or not),
> * as the memory doesn't need to be initialized until it's actually
> * mapped into the guest. Waiting to initialize memory is necessary
> * for VM types where the memory can be initialized exactly once.
> *
> * Ignore accessed, referenced, and dirty flags. The memory is
> * unevictable and there is no storage to write back to.
> *
> * TODO: Support huge pages.
> */
> folio = __filemap_get_folio(inode->i_mapping, index, fgp_flags,
> mapping_gfp_mask(inode->i_mapping));
>
> if (folio_test_hwpoison(folio)) {
> folio_unlock(folio);
> return ERR_PTR(-EHWPOISON);
> }
> return folio;
> }
>
> static struct folio *kvm_gmem_get_folio(struct file *file,
> struct kvm_memory_slot *slot,
> gfn_t gfn)
> {
> pgoff_t index = kvm_gmem_get_index(slot, gfn);
> struct kvm_gmem *gmem = file->private_data;
> struct inode *inode;
>
> if (file != slot->gmem.file) {
> WARN_ON_ONCE(slot->gmem.file);
> return ERR_PTR(-EFAULT);
> }
>
> if (xa_load(&gmem->bindings, index) != slot) {
> WARN_ON_ONCE(xa_load(&gmem->bindings, index));
> return ERR_PTR(-EIO);
> }
>
> inode = file_inode(file);
>
> return __kvm_gmem_get_folio(inode, index);
> }
>
> int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
> {
> pgoff_t index = kvm_gmem_get_index(slot, gfn);
> struct file *file = kvm_gmem_get_file(slot);
> struct folio *folio;
> struct page *page;
>
> if (!file)
> return -EFAULT;
>
> folio = kvm_gmem_get_folio(file, slot, gfn);
> if (IS_ERR(folio))
> goto out;
>
> if (!folio_test_uptodate(folio)) {
> kvm_gmem_initialize_folio(kvm, folio, index, NULL, NULL);
> folio_mark_uptodate(folio);
> }
>
> page = folio_file_page(folio, index);
>
> *pfn = page_to_pfn(page);
> if (max_order)
> *max_order = 0;
>
> out:
> fput(file);
> return IS_ERR(folio) ? PTR_ERR(folio) : 0;
> }
> EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
>
> long kvm_gmem_populate(struct kvm *kvm, gfn_t base_gfn, void __user *base_src,
> long npages, void *opaque)
> {
> struct kvm_memory_slot *slot;
> struct file *file;
> int ret = 0, max_order;
> long i;
>
> lockdep_assert_held(&kvm->slots_lock);
> if (npages < 0)
> return -EINVAL;
>
> slot = gfn_to_memslot(kvm, base_gfn);
> if (!kvm_slot_can_be_private(slot))
> return -EINVAL;
>
> file = kvm_gmem_get_file(slot);
> if (!file)
> return -EFAULT;
>
> filemap_invalidate_lock(file->f_mapping);
>
> npages = min_t(ulong, slot->npages - (base_gfn - slot->base_gfn), npages);
> for (i = 0; i < npages; i += (1 << max_order)) {
> void __user *src = base_src + i * PAGE_SIZE;
> gfn_t gfn = base_gfn + i;
> pgoff_t index = kvm_gmem_get_index(slot, gfn);
> struct folio *folio;
>
> folio = kvm_gmem_get_folio(file, slot, gfn);
> if (IS_ERR(folio)) {
> ret = PTR_ERR(folio);
> break;
> }
>
> if (folio_test_uptodate(folio)) {
> folio_put(folio);
> ret = -EEXIST;
> break;
> }
>
> kvm_gmem_initialize_folio(kvm, folio, index, src, opaque);
> folio_unlock(folio);
> folio_put(folio);
> }
>
> filemap_invalidate_unlock(file->f_mapping);
>
> fput(file);
> return ret && !i ? ret : i;
> }
> EXPORT_SYMBOL_GPL(kvm_gmem_populate);
>
> static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> pgoff_t end)
> {
> bool flush = false, found_memslot = false;
> struct kvm_memory_slot *slot;
> struct kvm *kvm = gmem->kvm;
> unsigned long index;
>
> xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> pgoff_t pgoff = slot->gmem.pgoff;
>
> struct kvm_gfn_range gfn_range = {
> .start = slot->base_gfn + max(pgoff, start) - pgoff,
> .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
> .slot = slot,
> .may_block = true,
> };
>
> if (!found_memslot) {
> found_memslot = true;
>
> KVM_MMU_LOCK(kvm);
> kvm_mmu_invalidate_begin(kvm);
> }
>
> flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
> }
>
> if (flush)
> kvm_flush_remote_tlbs(kvm);
>
> if (found_memslot)
> KVM_MMU_UNLOCK(kvm);
> }
>
> static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
> pgoff_t end)
> {
> struct kvm *kvm = gmem->kvm;
>
> if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
> KVM_MMU_LOCK(kvm);
> kvm_mmu_invalidate_end(kvm);
> KVM_MMU_UNLOCK(kvm);
> }
> }
>
> static long __kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> {
> struct list_head *gmem_list = &inode->i_mapping->i_private_list;
> pgoff_t start = offset >> PAGE_SHIFT;
> pgoff_t end = (offset + len) >> PAGE_SHIFT;
> struct kvm_gmem *gmem;
> list_for_each_entry(gmem, gmem_list, entry)
> kvm_gmem_invalidate_begin(gmem, start, end);
>
> truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
>
> list_for_each_entry(gmem, gmem_list, entry)
> kvm_gmem_invalidate_end(gmem, start, end);
>
> return 0;
> }
>
> static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> {
> int r;
>
> /*
> * Bindings must be stable across invalidation to ensure the start+end
> * are balanced.
> */
> filemap_invalidate_lock(inode->i_mapping);
> r = __kvm_gmem_punch_hole(inode, offset, len);
> filemap_invalidate_unlock(inode->i_mapping);
> return r;
> }
>
> static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
> {
> struct address_space *mapping = inode->i_mapping;
> pgoff_t start, index, end;
> int r;
>
> /* Dedicated guest is immutable by default. */
> if (offset + len > i_size_read(inode))
> return -EINVAL;
>
> filemap_invalidate_lock_shared(mapping);
>
> start = offset >> PAGE_SHIFT;
> end = (offset + len) >> PAGE_SHIFT;
>
> r = 0;
> for (index = start; index < end; ) {
> struct folio *folio;
>
> if (signal_pending(current)) {
> r = -EINTR;
> break;
> }
>
> folio = __kvm_gmem_get_folio(inode, index);
> if (IS_ERR(folio)) {
> r = PTR_ERR(folio);
> break;
> }
>
> index = folio_next_index(folio);
>
> folio_unlock(folio);
> folio_put(folio);
>
> /* 64-bit only, wrapping the index should be impossible. */
> if (WARN_ON_ONCE(!index))
> break;
>
> cond_resched();
> }
>
> filemap_invalidate_unlock_shared(mapping);
>
> return r;
> }
>
> static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
> loff_t len)
> {
> int ret;
>
> if (!(mode & FALLOC_FL_KEEP_SIZE))
> return -EOPNOTSUPP;
>
> if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> return -EOPNOTSUPP;
>
> if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> return -EINVAL;
>
> if (mode & FALLOC_FL_PUNCH_HOLE)
> ret = kvm_gmem_punch_hole(file_inode(file), offset, len);
> else
> ret = kvm_gmem_allocate(file_inode(file), offset, len);
>
> if (!ret)
> file_modified(file);
> return ret;
> }
>
> static int kvm_gmem_release(struct inode *inode, struct file *file)
> {
> struct kvm_gmem *gmem = file->private_data;
> struct kvm_memory_slot *slot;
> struct kvm *kvm = gmem->kvm;
> unsigned long index;
>
> /*
> * Prevent concurrent attempts to *unbind* a memslot. This is the last
> * reference to the file and thus no new bindings can be created, but
> * dereferencing the slot for existing bindings needs to be protected
> * against memslot updates, specifically so that unbind doesn't race
> * and free the memslot (kvm_gmem_get_file() will return NULL).
> */
> mutex_lock(&kvm->slots_lock);
>
> filemap_invalidate_lock(inode->i_mapping);
>
> xa_for_each(&gmem->bindings, index, slot)
> rcu_assign_pointer(slot->gmem.file, NULL);
>
> synchronize_rcu();
>
> /*
> * All in-flight operations are gone and new bindings can be created.
> * Zap all SPTEs pointed at by this file. Do not free the backing
> * memory, as its lifetime is associated with the inode, not the file.
> */
> kvm_gmem_invalidate_begin(gmem, 0, -1ul);
> kvm_gmem_invalidate_end(gmem, 0, -1ul);
>
> list_del(&gmem->entry);
>
> filemap_invalidate_unlock(inode->i_mapping);
>
> mutex_unlock(&kvm->slots_lock);
>
> xa_destroy(&gmem->bindings);
> kfree(gmem);
>
> kvm_put_kvm(kvm);
>
> return 0;
> }
>
> static struct file_operations kvm_gmem_fops = {
> .open = generic_file_open,
> .release = kvm_gmem_release,
> .fallocate = kvm_gmem_fallocate,
> };
>
> void kvm_gmem_init(struct module *module)
> {
> kvm_gmem_fops.owner = module;
> }
>
> static int kvm_gmem_migrate_folio(struct address_space *mapping,
> struct folio *dst, struct folio *src,
> enum migrate_mode mode)
> {
> WARN_ON_ONCE(1);
> return -EINVAL;
> }
>
> static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *folio)
> {
> struct list_head *gmem_list = &mapping->i_private_list;
> struct kvm_gmem *gmem;
> pgoff_t start, end;
>
> filemap_invalidate_lock_shared(mapping);
>
> start = folio->index;
> end = start + folio_nr_pages(folio);
>
> list_for_each_entry(gmem, gmem_list, entry)
> kvm_gmem_invalidate_begin(gmem, start, end);
>
> /*
> * Do not truncate the range, what action is taken in response to the
> * error is userspace's decision (assuming the architecture supports
> * gracefully handling memory errors). If/when the guest attempts to
> * access a poisoned page, kvm_gmem_get_pfn() will return -EHWPOISON,
> * at which point KVM can either terminate the VM or propagate the
> * error to userspace.
> */
>
> list_for_each_entry(gmem, gmem_list, entry)
> kvm_gmem_invalidate_end(gmem, start, end);
>
> filemap_invalidate_unlock_shared(mapping);
>
> return MF_DELAYED;
> }
>
> #ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> static void kvm_gmem_free_folio(struct folio *folio)
> {
> struct page *page = folio_page(folio, 0);
> kvm_pfn_t pfn = page_to_pfn(page);
> int order = folio_order(folio);
>
> kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
> }
> #endif
>
> static const struct address_space_operations kvm_gmem_aops = {
> .dirty_folio = noop_dirty_folio,
> .migrate_folio = kvm_gmem_migrate_folio,
> .error_remove_folio = kvm_gmem_error_folio,
> #ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> .free_folio = kvm_gmem_free_folio,
> #endif
> };
>
> static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
> struct kstat *stat, u32 request_mask,
> unsigned int query_flags)
> {
> struct inode *inode = path->dentry->d_inode;
>
> generic_fillattr(idmap, request_mask, inode, stat);
> return 0;
> }
>
> static int kvm_gmem_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
> struct iattr *attr)
> {
> return -EINVAL;
> }
> static const struct inode_operations kvm_gmem_iops = {
> .getattr = kvm_gmem_getattr,
> .setattr = kvm_gmem_setattr,
> };
>
> static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> {
> const char *anon_name = "[kvm-gmem]";
> struct kvm_gmem *gmem;
> struct inode *inode;
> struct file *file;
> int fd, err;
>
> fd = get_unused_fd_flags(0);
> if (fd < 0)
> return fd;
>
> gmem = kzalloc(sizeof(*gmem), GFP_KERNEL);
> if (!gmem) {
> err = -ENOMEM;
> goto err_fd;
> }
>
> file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
> O_RDWR, NULL);
> if (IS_ERR(file)) {
> err = PTR_ERR(file);
> goto err_gmem;
> }
>
> file->f_flags |= O_LARGEFILE;
>
> inode = file->f_inode;
> WARN_ON(file->f_mapping != inode->i_mapping);
>
> inode->i_private = (void *)(unsigned long)flags;
> inode->i_op = &kvm_gmem_iops;
> inode->i_mapping->a_ops = &kvm_gmem_aops;
> set_bit(AS_INACCESSIBLE, &inode->i_mapping->flags);
> inode->i_mode |= S_IFREG;
> inode->i_size = size;
> mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> mapping_set_unmovable(inode->i_mapping);
> /* Unmovable mappings are supposed to be marked unevictable as well. */
> WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>
> kvm_get_kvm(kvm);
> gmem->kvm = kvm;
> xa_init(&gmem->bindings);
> list_add(&gmem->entry, &inode->i_mapping->i_private_list);
>
> fd_install(fd, file);
> return fd;
>
> err_gmem:
> kfree(gmem);
> err_fd:
> put_unused_fd(fd);
> return err;
> }
>
> int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> {
> loff_t size = args->size;
> u64 flags = args->flags;
> u64 valid_flags = 0;
>
> if (flags & ~valid_flags)
> return -EINVAL;
>
> if (size <= 0 || !PAGE_ALIGNED(size))
> return -EINVAL;
>
> return __kvm_gmem_create(kvm, size, flags);
> }
>
> int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
> unsigned int fd, loff_t offset)
> {
> loff_t size = slot->npages << PAGE_SHIFT;
> unsigned long start, end;
> struct kvm_gmem *gmem;
> struct inode *inode;
> struct file *file;
> int r = -EINVAL;
>
> BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
>
> file = fget(fd);
> if (!file)
> return -EBADF;
>
> if (file->f_op != &kvm_gmem_fops)
> goto err;
>
> gmem = file->private_data;
> if (gmem->kvm != kvm)
> goto err;
>
> inode = file_inode(file);
>
> if (offset < 0 || !PAGE_ALIGNED(offset) ||
> offset + size > i_size_read(inode))
> goto err;
>
> filemap_invalidate_lock(inode->i_mapping);
>
> start = offset >> PAGE_SHIFT;
> end = start + slot->npages;
>
> if (!xa_empty(&gmem->bindings) &&
> xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
> filemap_invalidate_unlock(inode->i_mapping);
> goto err;
> }
>
> /*
> * No synchronize_rcu() needed, any in-flight readers are guaranteed to
> * see either a NULL file or this new file, no need for them to go
> * away.
> */
> rcu_assign_pointer(slot->gmem.file, file);
> slot->gmem.pgoff = start;
>
> xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
> filemap_invalidate_unlock(inode->i_mapping);
>
> /*
> * Drop the reference to the file, even on success. The file pins KVM,
> * not the other way 'round. Active bindings are invalidated if the
> * file is closed before memslots are destroyed.
> */
> r = 0;
> err:
> fput(file);
> return r;
> }
>
> void kvm_gmem_unbind(struct kvm_memory_slot *slot)
> {
> unsigned long start = slot->gmem.pgoff;
> unsigned long end = start + slot->npages;
> struct kvm_gmem *gmem;
> struct file *file;
>
> /*
> * Nothing to do if the underlying file was already closed (or is being
> * closed right now), kvm_gmem_release() invalidates all bindings.
> */
> file = kvm_gmem_get_file(slot);
> if (!file)
> return;
>
> gmem = file->private_data;
>
> filemap_invalidate_lock(file->f_mapping);
> xa_store_range(&gmem->bindings, start, end - 1, NULL, GFP_KERNEL);
> rcu_assign_pointer(slot->gmem.file, NULL);
> synchronize_rcu();
> filemap_invalidate_unlock(file->f_mapping);
>
> fput(file);
> }
>

2024-06-10 21:48:56

by Paolo Bonzini

Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On Sat, Jun 8, 2024 at 1:03 AM Sean Christopherson <[email protected]> wrote:
> SNP folks and/or Paolo, what's the plan for this? I don't see how what's sitting
> in kvm/next can possibly be correct without conditioning population on the folio
> being !uptodate.

I don't think I have time to look at it closely until Friday; but
thanks for reminding me.

Paolo


2024-06-10 22:28:40

by Paolo Bonzini

Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On 6/10/24 23:48, Paolo Bonzini wrote:
> On Sat, Jun 8, 2024 at 1:03 AM Sean Christopherson <[email protected]> wrote:
>> SNP folks and/or Paolo, what's the plan for this? I don't see how what's sitting
>> in kvm/next can possibly be correct without conditioning population on the folio
>> being !uptodate.
>
> I don't think I have time to look at it closely until Friday; but
> thanks for reminding me.

Ok, I'm officially confused. I think I understand what you did in your
suggested code. Limiting it to the bare minimum (keeping the callback
instead of CONFIG_HAVE_KVM_GMEM_INITIALIZE) it would be something
like what I include at the end of the message.

But the discussion upthread was about whether to do the check for
RMP state in sev.c, or do it in common code using folio_mark_uptodate().
I am not sure what you mean by "cannot possibly be correct", and
whether it's referring to kvm_gmem_populate() in general or the
callback in sev_gmem_post_populate().

The change below looks like just an optimization to me, which
suggests that I'm missing something glaring.

Paolo

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index d4206e53a9c81..a0417ef5b86eb 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -52,37 +52,39 @@ static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct fol
static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare)
{
struct folio *folio;
+ int r;

/* TODO: Support huge pages. */
folio = filemap_grab_folio(inode->i_mapping, index);
if (IS_ERR(folio))
return folio;

- /*
- * Use the up-to-date flag to track whether or not the memory has been
- * zeroed before being handed off to the guest. There is no backing
- * storage for the memory, so the folio will remain up-to-date until
- * it's removed.
- *
- * TODO: Skip clearing pages when trusted firmware will do it when
- * assigning memory to the guest.
- */
- if (!folio_test_uptodate(folio)) {
- unsigned long nr_pages = folio_nr_pages(folio);
- unsigned long i;
+ if (prepare) {
+ /*
+ * Use the up-to-date flag to track whether or not the memory has
+ * been handed off to the guest. There is no backing storage for
+ * the memory, so the folio will remain up-to-date until it's
+ * removed.
+ *
+ * Take the occasion of the first prepare operation to clear it.
+ */
+ if (!folio_test_uptodate(folio)) {
+ unsigned long nr_pages = folio_nr_pages(folio);
+ unsigned long i;

- for (i = 0; i < nr_pages; i++)
- clear_highpage(folio_page(folio, i));
+ for (i = 0; i < nr_pages; i++)
+ clear_highpage(folio_page(folio, i));
+ }
+
+ r = kvm_gmem_prepare_folio(inode, index, folio);
+ if (r < 0)
+ goto err_unlock_put;

folio_mark_uptodate(folio);
- }
-
- if (prepare) {
- int r = kvm_gmem_prepare_folio(inode, index, folio);
- if (r < 0) {
- folio_unlock(folio);
- folio_put(folio);
- return ERR_PTR(r);
+ } else {
+ if (folio_test_uptodate(folio)) {
+ r = -EEXIST;
+ goto err_unlock_put;
}
}

@@ -91,6 +93,11 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool
* unevictable and there is no storage to write back to.
*/
return folio;
+
+err_unlock_put:
+ folio_unlock(folio);
+ folio_put(folio);
+ return ERR_PTR(r);
}

static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
@@ -545,8 +552,16 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
	fput(file);
}

+/*
+ * If p_folio is NULL, the folio is cleared, prepared and marked up-to-date
+ * before returning.
+ *
+ * If p_folio is not NULL, this is left to the caller, who must call
+ * folio_mark_uptodate() once the page is ready for use by the guest.
+ */
static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
- gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prepare)
+ gfn_t gfn, kvm_pfn_t *pfn, int *max_order,
+ struct folio **p_folio)
{
pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
struct kvm_gmem *gmem = file->private_data;
@@ -565,7 +579,7 @@ static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
return -EIO;
}

- folio = kvm_gmem_get_folio(file_inode(file), index, prepare);
+ folio = kvm_gmem_get_folio(file_inode(file), index, !p_folio);
if (IS_ERR(folio))
return PTR_ERR(folio);

@@ -577,6 +591,8 @@ static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
page = folio_file_page(folio, index);

*pfn = page_to_pfn(page);
+ if (p_folio)
+ *p_folio = folio;
if (max_order)
*max_order = 0;

@@ -597,7 +613,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
if (!file)
return -EFAULT;

- r = __kvm_gmem_get_pfn(file, slot, gfn, pfn, max_order, true);
+ r = __kvm_gmem_get_pfn(file, slot, gfn, pfn, max_order, NULL);
fput(file);
return r;
}
@@ -629,10 +645,11 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long

npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages);
for (i = 0; i < npages; i += (1 << max_order)) {
+ struct folio *folio;
gfn_t gfn = start_gfn + i;
kvm_pfn_t pfn;

- ret = __kvm_gmem_get_pfn(file, slot, gfn, &pfn, &max_order, false);
+ ret = __kvm_gmem_get_pfn(file, slot, gfn, &pfn, &max_order, &folio);
if (ret)
break;

@@ -642,8 +659,10 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long

p = src ? src + i * PAGE_SIZE : NULL;
ret = post_populate(kvm, gfn, pfn, p, max_order, opaque);
+ if (!ret)
+ folio_mark_uptodate(folio);

- put_page(pfn_to_page(pfn));
+ folio_put(folio);
if (ret)
break;
}



2024-06-10 23:41:09

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On Tue, Jun 11, 2024, Paolo Bonzini wrote:
> On 6/10/24 23:48, Paolo Bonzini wrote:
> > On Sat, Jun 8, 2024 at 1:03 AM Sean Christopherson <[email protected]> wrote:
> > > SNP folks and/or Paolo, what's the plan for this? I don't see how what's sitting
> > > in kvm/next can possibly be correct without conditioning population on the folio
> > > being !uptodate.
> >
> > I don't think I have time to look at it closely until Friday; but
> > thanks for reminding me.
>
> Ok, I'm officially confused. I think I understand what you did in your
> suggested code. Limiting it to the bare minimum (keeping the callback
> instead of CONFIG_HAVE_KVM_GMEM_INITIALIZE) it would be something
> like what I include at the end of the message.
>
> But the discussion upthread was about whether to do the check for
> RMP state in sev.c, or do it in common code using folio_mark_uptodate().
> I am not sure what you mean by "cannot possibly be correct", and
> whether it's referring to kvm_gmem_populate() in general or the
> callback in sev_gmem_post_populate().

Doing fallocate() before KVM_SEV_SNP_LAUNCH_UPDATE will cause the latter to fail.
That likely works for QEMU, at least for now, but it's unnecessarily restrictive
and IMO incorrect/wrong.

E.g. a more convoluted fallocate() + PUNCH_HOLE + KVM_SEV_SNP_LAUNCH_UPDATE sequence
will work (I think? AFAICT adding and removing pages directly to/from the RMP doesn't
affect SNP's measurement, only pages that are added via SNP_LAUNCH_UPDATE affect
the measurement).

Punting the sanity check to vendor code is also gross and will make it harder to
provide a consistent, unified ABI for all architectures. E.g. SNP returns -EINVAL
if the page is already assigned, which is quite misleading.

> The change below looks like just an optimization to me, which
> suggests that I'm missing something glaring.

I really dislike @prepare. There are two paths that should actually initialize
the contents of the folio, and they are mutually exclusive and have meaningfully
different behavior. Faulting in memory via kvm_gmem_get_pfn() explicitly zeros
the folio _if necessary_, whereas kvm_gmem_populate() initializes the folio with
user-provided data _and_ requires that the folio be !uptodate.

If we fix the above oddity where fallocate() initializes memory, then there's
no need to try and handle the initialization in a common chokepoint as the two
relevant paths will naturally have unique code.

The below is also still suboptimal for TDX, as KVM will zero the memory and then
TDX-module will also zero memory on PAGE.AUG.

And I find SNP to be the odd one. IIUC, the ASP (the artist formerly known as
the PSP) doesn't provide any guarantees about the contents of a page that is
assigned to a guest without bouncing through SNP_LAUNCH_UPDATE. It'd be nice to
explicitly document that somewhere in the SNP code. E.g. if guest_memfd invokes
a common kvm_gmem_initialize_folio() or whatever, then SNP's implementation can
clearly capture that KVM zeros the page to protect the _host_ data.

> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index d4206e53a9c81..a0417ef5b86eb 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -52,37 +52,39 @@ static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct fol
> static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare)
> {
> struct folio *folio;
> + int r;
> /* TODO: Support huge pages. */
> folio = filemap_grab_folio(inode->i_mapping, index);
> if (IS_ERR(folio))
> return folio;
> - /*
> - * Use the up-to-date flag to track whether or not the memory has been
> - * zeroed before being handed off to the guest. There is no backing
> - * storage for the memory, so the folio will remain up-to-date until
> - * it's removed.
> - *
> - * TODO: Skip clearing pages when trusted firmware will do it when
> - * assigning memory to the guest.
> - */
> - if (!folio_test_uptodate(folio)) {
> - unsigned long nr_pages = folio_nr_pages(folio);
> - unsigned long i;
> + if (prepare) {
> + /*
> + * Use the up-to-date flag to track whether or not the memory has
> + * been handed off to the guest. There is no backing storage for
> + * the memory, so the folio will remain up-to-date until it's
> + * removed.
> + *
> + * Take the occasion of the first prepare operation to clear it.
> + */
> + if (!folio_test_uptodate(folio)) {
> + unsigned long nr_pages = folio_nr_pages(folio);
> + unsigned long i;
> - for (i = 0; i < nr_pages; i++)
> - clear_highpage(folio_page(folio, i));
> + for (i = 0; i < nr_pages; i++)
> + clear_highpage(folio_page(folio, i));
> + }
> +
> + r = kvm_gmem_prepare_folio(inode, index, folio);
> + if (r < 0)
> + goto err_unlock_put;
> folio_mark_uptodate(folio);
> - }
> -
> - if (prepare) {
> - int r = kvm_gmem_prepare_folio(inode, index, folio);
> - if (r < 0) {
> - folio_unlock(folio);
> - folio_put(folio);
> - return ERR_PTR(r);
> + } else {
> + if (folio_test_uptodate(folio)) {
> + r = -EEXIST;
> + goto err_unlock_put;
> }
> }

2024-06-11 06:09:55

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

On Tue, Jun 11, 2024 at 1:41 AM Sean Christopherson <[email protected]> wrote:
>
> On Tue, Jun 11, 2024, Paolo Bonzini wrote:
> > On 6/10/24 23:48, Paolo Bonzini wrote:
> > > On Sat, Jun 8, 2024 at 1:03 AM Sean Christopherson <[email protected]> wrote:
> > > > SNP folks and/or Paolo, what's the plan for this? I don't see how what's sitting
> > > > in kvm/next can possibly be correct without conditioning population on the folio
> > > > being !uptodate.
> > >
> > > I don't think I have time to look at it closely until Friday; but
> > > thanks for reminding me.
> >
> > Ok, I'm officially confused. I think I understand what you did in your
> > suggested code. Limiting it to the bare minimum (keeping the callback
> > instead of CONFIG_HAVE_KVM_GMEM_INITIALIZE) it would be something
> > like what I include at the end of the message.
> >
> > But the discussion upthread was about whether to do the check for
> > RMP state in sev.c, or do it in common code using folio_mark_uptodate().
> > I am not sure what you mean by "cannot possibly be correct", and
> > whether it's referring to kvm_gmem_populate() in general or the
> > callback in sev_gmem_post_populate().
>
> Doing fallocate() before KVM_SEV_SNP_LAUNCH_UPDATE will cause the latter to fail.
> That likely works for QEMU, at least for now, but it's unnecessarily restrictive
> and IMO incorrect/wrong.

Ok, I interpreted "incorrect" as meaning it caused incorrect initialization
or something similarly fatal. An overly restrictive check can (almost)
always be lifted later.

> E.g. a more convoluted fallocate() + PUNCH_HOLE + KVM_SEV_SNP_LAUNCH_UPDATE sequence
> will work (I think? AFAICT adding and removing pages directly to/from the RMP doesn't
> affect SNP's measurement, only pages that are added via SNP_LAUNCH_UPDATE affect
> the measurement).

So the starting point is writing testcases (for which indeed I have to
wait until Friday). It's not exactly a rewrite but almost.

> Punting the sanity check to vendor code is also gross and will make it harder to
> provide a consistent, unified ABI for all architectures. E.g. SNP returns -EINVAL
> if the page is already assigned, which is quite misleading.
>
> > The change below looks like just an optimization to me, which
> > suggests that I'm missing something glaring.
>
> I really dislike @prepare. There are two paths that should actually initialize
> the contents of the folio, and they are mutually exclusive and have meaningfully
> different behavior. Faulting in memory via kvm_gmem_get_pfn() explicitly zeros
> the folio _if necessary_, whereas kvm_gmem_populate() initializes the folio with
> user-provided data _and_ requires that the folio be !uptodate.

No complaints there, I just wanted to start with the minimal change to
use the uptodate flag in kvm_gmem_populate(). And yeah,
kvm_gmem_get_folio() at this point can be basically replaced by
filemap_grab_folio() in the kvm_gmem_populate() path. What I need to
think about is that there is still quite a bit of code in
__kvm_gmem_get_pfn() that is common to both paths.

Paolo