This patchset is also available at:
https://github.com/mdroth/linux/commits/gmem-x86-v1-rfc
and is based on top of the kvm-x86 gmem tree (e7af8d17224a):
https://github.com/kvm-x86/linux/commits/guest_memfd
== OVERVIEW ==
This is a series of patches that are currently needed for implementing
SEV-SNP hypervisor support on top of guest_memfd, but have a lot of
potential to overlap with changes that may also be needed for TDX.
The goal of this series is to help reach a consensus on what
functionality implemented here is indeed common between SNP/TDX, and
try to finalize what these interfaces should look like so we can
incorporate them into a common gmem/x86 tree to base on top of to
reduce potential churn from the various SNP/TDX-specific patchsets.
A couple of the patches here are newer versions of patches included in
a similar series posted by Isaku[1], revised to incorporate feedback
from Sean and others.
Some of the approaches implemented here have deviated somewhat from what
may have been discussed/suggested on the list. For these cases I've
tried to provide my rationale below along with some relevant background
so that we can continue these discussions from where we left off and
reach a consensus on what these need to look like to be usable for both
SNP and TDX, and acceptable for gmem in general.
== Hooks for preparing gmem pages (patch #3) ==
The design here is mainly driven by this discussion with Sean[2]. In the
prior version used by the v9 SNP hypervisor patchset[3] and included in
Isaku's patchset[1], this hook was triggered directly by the KVM MMU just
prior to mapping a page into the guest NPT/EPT, to handle things like
putting the page into the initial 'private' state as defined by the
architecture (e.g. the 'guest-owned' state in the SNP RMP table).
Sean was hoping to tie this update to allocation time rather than
mapping time, so it has more symmetry with the 'invalidation' side that
happens when the backing pages are freed back to the host, and allows
for better run-time performance if userspace opts to pre-allocate pages,
since the cost of the 'preparation' hook could then also be
paid up front.
To accommodate this I changed this hook to trigger automatically when
a folio is allocated, either via kvm_gmem_get_pfn() or via an
fallocate() operation. There are a couple of places in this
implementation where things fall a bit short of the original design
goals, however:
1) For SNP, preparing a page as 'guest-owned' in the RMP table
requires the GPA, which can only be known after the guest_memfd
range being allocated has been bound to a memslot, and there's
no guarantee userspace won't attempt to fallocate() in advance of
binding to a memslot unless we enforce that as part of the gmem
API.
Instead, this implementation simply attempts to call the prepare
hook every time a folio is accessed via the common
kvm_gmem_get_folio() path to ensure these 'deferred' preparation
hooks will happen before the KVM MMU maps any pages into a guest.
Thus, it's up to the architecture/platform to keep track of
whether a page is already in the 'prepared' state. For SNP this
is tracked via the RMP table itself, so we sort of already have this
for free.
2) AIUI the design proposed by Sean would involve gmem internally
keeping track of whether or not a page/folio has already been
prepared. As mentioned above, in this version we instead simply
punt that tracking to the architecture.
I experimented with tracking this inside gmem, but it was a
bit of a false start. I tried using an xarray to keep track of 2
states: 'allocated'/'prepared', since both would be needed if we
wanted to be able to do things like handle deferred preparation
hooks for situations like a memslot getting bound to a range that
has already been allocated.
The 'allocated' state was inferred based on whether an entry was
present for a particular gmem offset, and the entry itself was a
set of flags, 'prepared' being one of them. However, at the time
there was a TODO[4] to investigate dropping the use of filemap in
favor of doing that internally in gmem, and this seemed like it
could be an intermediate step toward that, so I started heading
down that road a bit by using higher-order/multi-index xarray
entries with the thought of eventually being able to just track
the folios themselves and drop the reliance on filemap. This got messy
quickly, however, and I dropped it once Paolo's investigations
suggested that replacing filemap completely probably wouldn't be
very worthwhile.[3]
So maybe internally tracking 'allocated'/'prepared' states in gmem
is more doable if we take a simpler approach like 4K-granularity
xarrays or sparse bitmaps, or something else (a rough sketch of that
idea follows at the end of this section), but it's still
enough additional complexity that I think it would be good to
understand better what we really gain by doing this tracking in
gmem. The one thing I can think of is the ability to immediately
move already-allocated pages into the 'prepared' state if userspace
decides to pre-allocate them prior to binding the range to a
memslot, but I'm not sure how much that buys us performance-wise.
At least for SNP, the initial guest payload will necessarily be
put into 'prepared' state prior to boot, and anything other than
that will need to go through the whole shared->private+PVALIDATE
dance where saving on an RMPUPDATE in that path probably wouldn't
make a huge difference.
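For reference, here's roughly what the simpler xarray-based tracking
could look like. This is purely illustrative and not part of the series;
the KVM_GMEM_STATE_* flags, the helpers, and the 'state' xarray added to
struct kvm_gmem are all hypothetical:

  /* Hypothetical per-offset state flags, stored as xarray values. */
  #define KVM_GMEM_STATE_ALLOCATED       BIT(0)
  #define KVM_GMEM_STATE_PREPARED        BIT(1)

  /* Assumes a 'struct xarray state' field added to struct kvm_gmem. */
  static int kvm_gmem_mark_prepared(struct kvm_gmem *gmem, pgoff_t index)
  {
          void *entry = xa_load(&gmem->state, index);
          unsigned long flags = entry ? xa_to_value(entry) : 0;

          flags |= KVM_GMEM_STATE_ALLOCATED | KVM_GMEM_STATE_PREPARED;
          return xa_err(xa_store(&gmem->state, index, xa_mk_value(flags),
                                 GFP_KERNEL));
  }

  static bool kvm_gmem_is_prepared(struct kvm_gmem *gmem, pgoff_t index)
  {
          void *entry = xa_load(&gmem->state, index);

          return entry && (xa_to_value(entry) & KVM_GMEM_STATE_PREPARED);
  }

Sticking to 4K-granularity entries like this avoids the multi-index
complications mentioned above, at the cost of one xarray entry per page.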
== Hooks for invalidating gmem pages (patch #4) ==
In the prior version used by the v9 SNP hypervisor patchset[3] and included
in Isaku's patchset[1], gmem calls these hooks directly during
hole-punching operations or during kvm_gmem_release() to make gmem
pages accessible to the host before freeing them.
There was an open TODO[4] for looking at making use of
.invalidate_folio, .evict_inode, and similar callbacks to handle this
sort of cleanup.
Toward that end I switched to using .free_folio to trigger these
arch-specific hooks since it seemed to be the most direct choice. What's
nice about that is even in the context of future support for things like
intra-host migration to support live update[5], where multiple gmem
instances might share an inode, there is less risk of one gmem instance
clobbering the other when it is released, since the reference counting on
the underlying inode will keep the inode alive.
One downside to using .free_folio is that there is no metadata to pass along
with it: the callback gets PFN/order and that's it. For SNP this is
fine, but maybe for other platforms that's not enough to handle the
cleanup work.
If some sort of metadata *is* needed, .invalidate_folio is an option,
since it can also pass along information associated with the folio via
folio_attach_private() and otherwise behaves similarly to .free_folio.
One major exception, however, is hole-punching, where splitting the folio
results in the private data being lost. And unlike normal pagecache
users, there aren't obvious points to re-attach it like read/write
operations on some file. So we'd probably need to do something like
scan for split folios in the hugepage ranges that contain the range
that got hole-punched and re-attach the private data immediately after
each hole-punch operation. That, or introduce some other flag to ask
mm/truncate.c to handle this for us. Or just stick with the prior
approach used in Isaku's patchset[1]. A rough sketch of the
.invalidate_folio alternative is included below.
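For completeness, here's a rough sketch of what the .invalidate_folio
alternative could look like, setting aside the folio-splitting problem
described above. This is purely illustrative; 'struct kvm_gmem_folio_meta'
and the kvm_arch_gmem_invalidate_range() hook (which takes the KVM
instance in addition to the PFN range) are hypothetical, and the
folio_attach_private() call that attaches the metadata at allocation
time is not shown:

  /* Hypothetical metadata attached to each folio via folio_attach_private(). */
  struct kvm_gmem_folio_meta {
          struct kvm *kvm;
          gfn_t base_gfn;
  };

  static void kvm_gmem_invalidate_folio(struct folio *folio, size_t offset,
                                        size_t len)
  {
          struct kvm_gmem_folio_meta *meta = folio_get_private(folio);
          kvm_pfn_t pfn = folio_pfn(folio) + offset / PAGE_SIZE;

          if (!meta)
                  return;

          /* Unlike .free_folio, folio-associated metadata is available here. */
          kvm_arch_gmem_invalidate_range(meta->kvm, pfn, pfn + len / PAGE_SIZE);

          /* Only drop the metadata once the entire folio is invalidated. */
          if (offset == 0 && len == folio_size(folio)) {
                  folio_detach_private(folio);
                  kfree(meta);
          }
  }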
== Determining whether #NPFs were for private access (patches #5-8) ==
This is mainly driven by these discussions[6][7], which suggest moving
toward special-cased handling based on VM type where necessary, but
consolidating around the use of the AMD-defined encryption bit to
encode whether a guest #NPF / EPT violation was for a private page or
not. Toward that end I defined the SNP vm type here and pulled in a
patch from the SNP patchset that introduces PFERR_GUEST_ENC_MASK, and
used those to initialize struct kvm_page_fault's .is_private field. My
hope is that, with TDX synthesizing the same PFERR_GUEST_ENC_MASK, the
logic there would work the same for both TDX/SNP vm types.
== References ==
[1] https://lkml.kernel.org/kvm/CUU93XA8UKMG.X15YWDK533GB@suppilovahvero/t/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/kvm/CABgObfZiS+e7oDbwuC1Uycsz8Mjsu-FSfSmu=3R0M71vUhpq_Q@mail.gmail.com/
[4] https://lore.kernel.org/kvm/[email protected]/
[5] https://lore.kernel.org/lkml/ZN%[email protected]/t/
[6] https://lkml.kernel.org/kvm/[email protected]/
[7] https://lore.kernel.org/kvm/[email protected]/T/#me8a395cdf682068d8e5152c358016bf2fa4328e5
Any suggestions and feedback are very much appreciated.
-Mike
----------------------------------------------------------------
Brijesh Singh (1):
KVM: x86: Define RMP page fault error bits for #NPF
Michael Roth (7):
mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory
KVM: Use AS_INACCESSIBLE when creating guest_memfd inode
KVM: x86: Add gmem hook for initializing memory
KVM: x86: Add gmem hook for invalidating memory
KVM: x86/mmu: Pass around full 64-bit error code for KVM page faults
KVM: x86: Add KVM_X86_SNP_VM vm_type
KVM: x86: Determine shared/private faults based on vm_type
arch/x86/include/asm/kvm-x86-ops.h | 2 ++
arch/x86/include/asm/kvm_host.h | 13 +++++++
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/mmu/mmu.c | 15 +++++---
arch/x86/kvm/mmu/mmu_internal.h | 24 +++++++++++--
arch/x86/kvm/mmu/mmutrace.h | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/x86.c | 21 ++++++++++-
include/linux/kvm_host.h | 18 ++++++++++
include/linux/pagemap.h | 1 +
mm/truncate.c | 3 +-
virt/kvm/Kconfig | 8 +++++
virt/kvm/guest_memfd.c | 71 +++++++++++++++++++++++++++++++++++---
13 files changed, 165 insertions(+), 16 deletions(-)
filemap users like guest_memfd may use page cache pages to
allocate/manage memory that is only intended to be accessed by guests
via hardware protections like encryption. Writes to memory of this sort
in common paths like truncation may cause unexpected behavior such as
writing garbage instead of zeros when attempting to zero pages, or
worse, triggering hardware protections that are considered fatal as far
as the kernel is concerned.
Introduce a new address_space flag, AS_INACCESSIBLE, and use this
initially to prevent zeroing of pages during truncation, with the
understanding that it is up to the owner of the mapping to handle this
specially if needed.
Link: https://lore.kernel.org/lkml/[email protected]/
Cc: Matthew Wilcox <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
include/linux/pagemap.h | 1 +
mm/truncate.c | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 82c9bf506b79..9e79cf48f67a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -205,6 +205,7 @@ enum mapping_flags {
AS_LARGE_FOLIO_SUPPORT = 6,
AS_RELEASE_ALWAYS = 7, /* Call ->release_folio(), even if no private data */
AS_UNMOVABLE = 8, /* The mapping cannot be moved, ever */
+ AS_INACCESSIBLE = 9, /* Do not attempt direct R/W access to the mapping */
};
/**
diff --git a/mm/truncate.c b/mm/truncate.c
index 8e3aa9e8618e..0d80bcc250af 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -233,7 +233,8 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
* doing a complex calculation here, and then doing the zeroing
* anyway if the page split fails.
*/
- folio_zero_range(folio, offset, length);
+	if (!test_bit(AS_INACCESSIBLE, &folio->mapping->flags))
+ folio_zero_range(folio, offset, length);
if (folio_has_private(folio))
folio_invalidate(folio, offset, length);
--
2.25.1
truncate_inode_pages_range() may attempt to zero pages before truncating
them, and this will occur before arch-specific invalidations can be
triggered via the .invalidate_folio/.free_folio hooks in kvm_gmem_aops. For
AMD SEV-SNP this would result in an RMP #PF being generated by the
hardware, which is currently treated as fatal (and even if specifically
allowed for, would not result in anything other than garbage being
written to guest pages due to encryption). On Intel TDX this would also
result in undesirable behavior.
Set the AS_INACCESSIBLE flag to prevent the MM from attempting
unexpected accesses of this sort during operations like truncation.
This may also in some cases yield a decent performance improvement for
guest_memfd userspace implementations that hole-punch ranges immediately
after private->shared conversions via KVM_SET_MEMORY_ATTRIBUTES, since
the current implementation of truncate_inode_pages_range() always ends
up zeroing an entire 4K range if it is backed by a 2M folio.
Link: https://lore.kernel.org/lkml/[email protected]/
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
virt/kvm/guest_memfd.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 9ffce54555ae..f6f1b17a319c 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -398,6 +398,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
inode->i_private = (void *)(unsigned long)flags;
inode->i_op = &kvm_gmem_iops;
inode->i_mapping->a_ops = &kvm_gmem_aops;
+	set_bit(AS_INACCESSIBLE, &inode->i_mapping->flags);
inode->i_mode |= S_IFREG;
inode->i_size = size;
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
--
2.25.1
guest_memfd pages are generally expected to be in some arch-defined
initial state prior to using them for guest memory. For SEV-SNP this
initial state is 'private', or 'guest-owned', and requires additional
operations to move these pages into a 'private' state by updating the
corresponding entries in the RMP table.
Allow for an arch-defined hook to handle updates of this sort, and go
ahead and implement one for x86 so KVM implementations like AMD SVM can
register a kvm_x86_ops callback to handle these updates for SEV-SNP
guests.
The preparation callback is always called when allocating/grabbing
folios via gmem, and it is up to the architecture to keep track of
whether or not the pages are already in the expected state (e.g. the RMP
table in the case of SEV-SNP).
In some cases, it is necessary to defer the preparation of the pages to
handle things like in-place encryption of initial guest memory payloads
before marking these pages as 'private'/'guest-owned', so also add a
helper that performs the same function as kvm_gmem_get_pfn(), but allows
for the preparation callback to be bypassed to allow for pages to be
accessed beforehand.
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/x86.c | 6 ++++
include/linux/kvm_host.h | 14 ++++++++
virt/kvm/Kconfig | 4 +++
virt/kvm/guest_memfd.c | 56 +++++++++++++++++++++++++++---
6 files changed, 78 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index e3054e3e46d5..0c113f42d5c7 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -134,6 +134,7 @@ KVM_X86_OP(msr_filter_changed)
KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
+KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 95018cc653f5..66fc89d1858f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1752,6 +1752,8 @@ struct kvm_x86_ops {
* Returns vCPU specific APICv inhibit reasons
*/
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
+
+ int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
};
struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 767236b4d771..33a4cc33d86d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13301,6 +13301,12 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_arch_no_poll);
+#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order)
+{
+ return static_call(kvm_x86_gmem_prepare)(kvm, pfn, gfn, max_order);
+}
+#endif
int kvm_spec_ctrl_test_value(u64 value)
{
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8c5c017ab4e9..c7f82c2f1bcf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2403,9 +2403,19 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
#ifdef CONFIG_KVM_PRIVATE_MEM
+int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep);
int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
#else
+static inline int __kvm_gmem_get_pfn(struct kvm *kvm,
+ struct kvm_memory_slot *slot, gfn_t gfn,
+ kvm_pfn_t *pfn, int *max_order)
+{
+ KVM_BUG_ON(1, kvm);
+ return -EIO;
+}
+
static inline int kvm_gmem_get_pfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
kvm_pfn_t *pfn, int *max_order)
@@ -2415,4 +2425,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
}
#endif /* CONFIG_KVM_PRIVATE_MEM */
+#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
+#endif
+
#endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 2c964586aa14..992cf6ed86ef 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -109,3 +109,7 @@ config KVM_GENERIC_PRIVATE_MEM
select KVM_GENERIC_MEMORY_ATTRIBUTES
select KVM_PRIVATE_MEM
bool
+
+config HAVE_KVM_GMEM_PREPARE
+ bool
+ depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index f6f1b17a319c..72ff8b7b31d5 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -44,7 +44,40 @@ static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index)
#endif
}
-static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
+static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct folio *folio)
+{
+#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+ struct list_head *gmem_list = &inode->i_mapping->private_list;
+ struct kvm_gmem *gmem;
+
+ list_for_each_entry(gmem, gmem_list, entry) {
+ struct kvm_memory_slot *slot;
+ struct kvm *kvm = gmem->kvm;
+ struct page *page;
+ kvm_pfn_t pfn;
+ gfn_t gfn;
+ int rc;
+
+ slot = xa_load(&gmem->bindings, index);
+ if (!slot)
+ continue;
+
+ page = folio_file_page(folio, index);
+ pfn = page_to_pfn(page);
+ gfn = slot->base_gfn + index - slot->gmem.pgoff;
+ rc = kvm_arch_gmem_prepare(kvm, gfn, pfn, compound_order(compound_head(page)));
+ if (rc) {
+ pr_warn_ratelimited("gmem: Failed to prepare folio for index %lx, error %d.\n",
+ index, rc);
+ return rc;
+ }
+ }
+
+#endif
+ return 0;
+}
+
+static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prep)
{
struct folio *folio;
@@ -74,6 +107,12 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
folio_mark_uptodate(folio);
}
+ if (prep && kvm_gmem_prepare_folio(inode, index, folio)) {
+ folio_unlock(folio);
+ folio_put(folio);
+ return NULL;
+ }
+
/*
* Ignore accessed, referenced, and dirty flags. The memory is
* unevictable and there is no storage to write back to.
@@ -178,7 +217,7 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
break;
}
- folio = kvm_gmem_get_folio(inode, index);
+ folio = kvm_gmem_get_folio(inode, index, true);
if (!folio) {
r = -ENOMEM;
break;
@@ -537,8 +576,8 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
fput(file);
}
-int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
- gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep)
{
pgoff_t index, huge_index;
struct kvm_gmem *gmem;
@@ -559,7 +598,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
goto out_fput;
}
- folio = kvm_gmem_get_folio(file_inode(file), index);
+ folio = kvm_gmem_get_folio(file_inode(file), index, prep);
if (!folio) {
r = -ENOMEM;
goto out_fput;
@@ -600,4 +639,11 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
return r;
}
+EXPORT_SYMBOL_GPL(__kvm_gmem_get_pfn);
+
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+{
+ return __kvm_gmem_get_pfn(kvm, slot, gfn, pfn, max_order, true);
+}
EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
--
2.25.1
In some cases, like with SEV-SNP, guest memory needs to be updated in a
platform-specific manner before it can be safely freed back to the host.
Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
allow for special handling of this sort when freeing memory in response
to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
ahead and define an arch-specific hook for x86 since it will be needed
for handling memory used for SEV-SNP guests.
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 7 +++++++
include/linux/kvm_host.h | 4 ++++
virt/kvm/Kconfig | 4 ++++
virt/kvm/guest_memfd.c | 14 ++++++++++++++
6 files changed, 31 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 0c113f42d5c7..f1505a5fa781 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -135,6 +135,7 @@ KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
+KVM_X86_OP_OPTIONAL(gmem_invalidate)
#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 66fc89d1858f..dbec74783f48 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1754,6 +1754,7 @@ struct kvm_x86_ops {
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
+ void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
};
struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 33a4cc33d86d..0e95c3a95e59 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13308,6 +13308,13 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
}
#endif
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
+{
+ static_call_cond(kvm_x86_gmem_invalidate)(start, end);
+}
+#endif
+
int kvm_spec_ctrl_test_value(u64 value)
{
/*
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c7f82c2f1bcf..840a5be5962a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2429,4 +2429,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
#endif
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
+#endif
+
#endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 992cf6ed86ef..7fd1362a7ebe 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -113,3 +113,7 @@ config KVM_GENERIC_PRIVATE_MEM
config HAVE_KVM_GMEM_PREPARE
bool
depends on KVM_PRIVATE_MEM
+
+config HAVE_KVM_GMEM_INVALIDATE
+ bool
+ depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 72ff8b7b31d5..b4c4df259fb8 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -369,12 +369,26 @@ static int kvm_gmem_error_page(struct address_space *mapping, struct page *page)
return MF_DELAYED;
}
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+static void kvm_gmem_free_folio(struct folio *folio)
+{
+ struct page *page = folio_page(folio, 0);
+ kvm_pfn_t pfn = page_to_pfn(page);
+ int order = folio_order(folio);
+
+ kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
+}
+#endif
+
static const struct address_space_operations kvm_gmem_aops = {
.dirty_folio = noop_dirty_folio,
#ifdef CONFIG_MIGRATION
.migrate_folio = kvm_gmem_migrate_folio,
#endif
.error_remove_page = kvm_gmem_error_page,
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+ .free_folio = kvm_gmem_free_folio,
+#endif
};
static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
--
2.25.1
In some cases the full 64-bit error code for the KVM page fault will be
needed to determine things like whether or not a fault was for a private
or shared guest page, so update related code to accept the full 64-bit
value so it can be plumbed all the way through to where it is needed.
The accessors of fault->error_code are changed as follows:
- FNAME(page_fault): change to explicitly use lower_32_bits() since that
is no longer done in kvm_mmu_page_fault()
- kvm_mmu_page_fault(): explicit mask with PFERR_RSVD_MASK,
PFERR_NESTED_GUEST_PAGE
- mmutrace: changed u32 -> u64
Signed-off-by: Isaku Yamahata <[email protected]>
Link: https://lore.kernel.org/kvm/[email protected]/T/#mbd0b20c9a2cf50319d5d2a27b63f73c772112076
[mdr: drop references/changes to code not in current gmem tree, update
commit message]
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 3 +--
arch/x86/kvm/mmu/mmu_internal.h | 4 ++--
arch/x86/kvm/mmu/mmutrace.h | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
4 files changed, 5 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bcb812a7f563..686f88c263a9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5802,8 +5802,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
}
if (r == RET_PF_INVALID) {
- r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
- lower_32_bits(error_code), false,
+ r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
&emulation_type);
if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
return -EIO;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 71ba4f833dc1..759c8b718201 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -190,7 +190,7 @@ static inline bool is_nx_huge_page_enabled(struct kvm *kvm)
struct kvm_page_fault {
/* arguments to kvm_mmu_do_page_fault. */
const gpa_t addr;
- const u32 error_code;
+ const u64 error_code;
const bool prefetch;
/* Derived from error_code. */
@@ -280,7 +280,7 @@ enum {
};
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- u32 err, bool prefetch, int *emulation_type)
+ u64 err, bool prefetch, int *emulation_type)
{
struct kvm_page_fault fault = {
.addr = cr2_or_gpa,
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..195d98bc8de8 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -260,7 +260,7 @@ TRACE_EVENT(
TP_STRUCT__entry(
__field(int, vcpu_id)
__field(gpa_t, cr2_or_gpa)
- __field(u32, error_code)
+ __field(u64, error_code)
__field(u64 *, sptep)
__field(u64, old_spte)
__field(u64, new_spte)
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c85255073f67..2f60f68f5f2d 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -787,7 +787,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* The bit needs to be cleared before walking guest page tables.
*/
r = FNAME(walk_addr)(&walker, vcpu, fault->addr,
- fault->error_code & ~PFERR_RSVD_MASK);
+ lower_32_bits(fault->error_code) & ~PFERR_RSVD_MASK);
/*
* The page is not mapped by the guest. Let the guest handle it.
--
2.25.1
In some cases, such as detecting whether a page fault should be handled
as a private fault or not, KVM will need to handle things differently
versus the existing KVM_X86_SW_PROTECTED_VM type.
Add a new KVM_X86_SNP_VM to allow for this, along with a helper to query
the vm_type.
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/x86.c | 8 +++++++-
3 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dbec74783f48..cdc235277a6f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2089,6 +2089,8 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
#define kvm_arch_has_private_mem(kvm) false
#endif
+bool kvm_is_vm_type(struct kvm *kvm, unsigned long type);
+
static inline u16 kvm_read_ldt(void)
{
u16 ldt;
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index a448d0964fc0..57e4ba484aa2 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -564,5 +564,6 @@ struct kvm_pmu_event_filter {
#define KVM_X86_DEFAULT_VM 0
#define KVM_X86_SW_PROTECTED_VM 1
+#define KVM_X86_SNP_VM 3
#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0e95c3a95e59..12f9e99c7ad0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4444,10 +4444,16 @@ static int kvm_ioctl_get_supported_hv_cpuid(struct kvm_vcpu *vcpu,
static bool kvm_is_vm_type_supported(unsigned long type)
{
return type == KVM_X86_DEFAULT_VM ||
- (type == KVM_X86_SW_PROTECTED_VM &&
+ ((type == KVM_X86_SW_PROTECTED_VM ||
+ type == KVM_X86_SNP_VM) &&
IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled);
}
+bool kvm_is_vm_type(struct kvm *kvm, unsigned long type)
+{
+ return kvm->arch.vm_type == type;
+}
+
int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
{
int r = 0;
--
2.25.1
From: Brijesh Singh <[email protected]>
When SEV-SNP is enabled globally, the hardware places restrictions on all
memory accesses based on the RMP entry, whether the access is performed
by the hypervisor or a VM. When hardware encounters an RMP access violation
during a guest access, it will cause a #VMEXIT(NPF) with a number of
additional bits set to indicate the reasons for the #NPF. Define those
here.
See APM2 section 16.36.10 for more details.
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
[mdr: add some additional details to commit message]
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cdc235277a6f..fa401cb1a552 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -253,9 +253,13 @@ enum x86_intercept_stage;
#define PFERR_FETCH_BIT 4
#define PFERR_PK_BIT 5
#define PFERR_SGX_BIT 15
+#define PFERR_GUEST_RMP_BIT 31
#define PFERR_GUEST_FINAL_BIT 32
#define PFERR_GUEST_PAGE_BIT 33
#define PFERR_IMPLICIT_ACCESS_BIT 48
+#define PFERR_GUEST_ENC_BIT 34
+#define PFERR_GUEST_SIZEM_BIT 35
+#define PFERR_GUEST_VMPL_BIT 36
#define PFERR_PRESENT_MASK BIT(PFERR_PRESENT_BIT)
#define PFERR_WRITE_MASK BIT(PFERR_WRITE_BIT)
@@ -267,6 +271,10 @@ enum x86_intercept_stage;
#define PFERR_GUEST_FINAL_MASK BIT_ULL(PFERR_GUEST_FINAL_BIT)
#define PFERR_GUEST_PAGE_MASK BIT_ULL(PFERR_GUEST_PAGE_BIT)
#define PFERR_IMPLICIT_ACCESS BIT_ULL(PFERR_IMPLICIT_ACCESS_BIT)
+#define PFERR_GUEST_RMP_MASK BIT_ULL(PFERR_GUEST_RMP_BIT)
+#define PFERR_GUEST_ENC_MASK BIT_ULL(PFERR_GUEST_ENC_BIT)
+#define PFERR_GUEST_SIZEM_MASK BIT_ULL(PFERR_GUEST_SIZEM_BIT)
+#define PFERR_GUEST_VMPL_MASK BIT_ULL(PFERR_GUEST_VMPL_BIT)
#define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
PFERR_WRITE_MASK | \
--
2.25.1
For KVM_X86_SNP_VM, only the PFERR_GUEST_ENC_MASK flag is needed to
determine whether an #NPF is due to a private/shared access by the guest.
Implement that handling here. Also add handling needed to deal with
SNP guests which in some cases will make MMIO accesses with the
encryption bit set.
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 12 ++++++++++--
arch/x86/kvm/mmu/mmu_internal.h | 20 +++++++++++++++++++-
2 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 686f88c263a9..10c323e2faa4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4327,6 +4327,7 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_memory_slot *slot = fault->slot;
+ bool private_fault = fault->is_private;
bool async;
/*
@@ -4356,12 +4357,19 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return RET_PF_EMULATE;
}
- if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+ /*
+ * In some cases SNP guests will make MMIO accesses with the encryption
+ * bit set. Handle these via the normal MMIO fault path.
+ */
+ if (!slot && private_fault && kvm_is_vm_type(vcpu->kvm, KVM_X86_SNP_VM))
+ private_fault = false;
+
+ if (private_fault != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
return -EFAULT;
}
- if (fault->is_private)
+ if (private_fault)
return kvm_faultin_pfn_private(vcpu, fault);
async = false;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 759c8b718201..e5b973051ad9 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -251,6 +251,24 @@ struct kvm_page_fault {
int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
+{
+ bool private_fault = false;
+
+ if (kvm_is_vm_type(kvm, KVM_X86_SNP_VM)) {
+ private_fault = !!(err & PFERR_GUEST_ENC_MASK);
+ } else if (kvm_is_vm_type(kvm, KVM_X86_SW_PROTECTED_VM)) {
+ /*
+ * This handling is for gmem self-tests and guests that treat
+ * userspace as the authority on whether a fault should be
+ * private or not.
+ */
+ private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
+ }
+
+ return private_fault;
+}
+
/*
* Return values of handle_mmio_page_fault(), mmu.page_fault(), fast_page_fault(),
* and of course kvm_mmu_do_page_fault().
@@ -298,7 +316,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.max_level = KVM_MAX_HUGEPAGE_LEVEL,
.req_level = PG_LEVEL_4K,
.goal_level = PG_LEVEL_4K,
- .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
+ .is_private = kvm_mmu_fault_is_private(vcpu->kvm, cr2_or_gpa, err),
};
int r;
--
2.25.1
On Mon, Oct 16, 2023, Michael Roth wrote:
> For KVM_X86_SNP_VM, only the PFERR_GUEST_ENC_MASK flag is needed to
> determine whether an #NPF is due to a private/shared access by the guest.
> Implement that handling here. Also add handling needed to deal with
> SNP guests which in some cases will make MMIO accesses with the
> encryption bit set.
...
> @@ -4356,12 +4357,19 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> return RET_PF_EMULATE;
> }
>
> - if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> + /*
> + * In some cases SNP guests will make MMIO accesses with the encryption
> + * bit set. Handle these via the normal MMIO fault path.
> + */
> + if (!slot && private_fault && kvm_is_vm_type(vcpu->kvm, KVM_X86_SNP_VM))
> + private_fault = false;
Why? This is inarguably a guest bug.
> + if (private_fault != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> return -EFAULT;
> }
>
> - if (fault->is_private)
> + if (private_fault)
> return kvm_faultin_pfn_private(vcpu, fault);
>
> async = false;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 759c8b718201..e5b973051ad9 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -251,6 +251,24 @@ struct kvm_page_fault {
>
> int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>
> +static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
> +{
> + bool private_fault = false;
> +
> + if (kvm_is_vm_type(kvm, KVM_X86_SNP_VM)) {
> + private_fault = !!(err & PFERR_GUEST_ENC_MASK);
> + } else if (kvm_is_vm_type(kvm, KVM_X86_SW_PROTECTED_VM)) {
> + /*
> + * This handling is for gmem self-tests and guests that treat
> + * userspace as the authority on whether a fault should be
> + * private or not.
> + */
> + private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> + }
This can be more simply:
if (kvm_is_vm_type(kvm, KVM_X86_SNP_VM))
return !!(err & PFERR_GUEST_ENC_MASK);
if (kvm_is_vm_type(kvm, KVM_X86_SW_PROTECTED_VM))
return kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
Hi,
On 16/10/2023 12:50, Michael Roth wrote:
> guest_memfd pages are generally expected to be in some arch-defined
> initial state prior to using them for guest memory. For SEV-SNP this
> initial state is 'private', or 'guest-owned', and requires additional
> operations to move these pages into a 'private' state by updating the
> corresponding entries in the RMP table.
>
> Allow for an arch-defined hook to handle updates of this sort, and go
> ahead and implement one for x86 so KVM implementations like AMD SVM can
> register a kvm_x86_ops callback to handle these updates for SEV-SNP
> guests.
>
> The preparation callback is always called when allocating/grabbing
> folios via gmem, and it is up to the architecture to keep track of
> whether or not the pages are already in the expected state (e.g. the RMP
> table in the case of SEV-SNP).
>
> In some cases, it is necessary to defer the preparation of the pages to
> handle things like in-place encryption of initial guest memory payloads
> before marking these pages as 'private'/'guest-owned', so also add a
> helper that performs the same function as kvm_gmem_get_pfn(), but allows
> for the preparation callback to be bypassed to allow for pages to be
> accessed beforehand.
This will be useful for Arm CCA, where the pages need to be moved into
"Realm state". Some minor comments below.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/x86.c | 6 ++++
> include/linux/kvm_host.h | 14 ++++++++
> virt/kvm/Kconfig | 4 +++
> virt/kvm/guest_memfd.c | 56 +++++++++++++++++++++++++++---
> 6 files changed, 78 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index e3054e3e46d5..0c113f42d5c7 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -134,6 +134,7 @@ KVM_X86_OP(msr_filter_changed)
> KVM_X86_OP(complete_emulated_msr)
> KVM_X86_OP(vcpu_deliver_sipi_vector)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> +KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
>
> #undef KVM_X86_OP
> #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 95018cc653f5..66fc89d1858f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1752,6 +1752,8 @@ struct kvm_x86_ops {
> * Returns vCPU specific APICv inhibit reasons
> */
> unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
> +
> + int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 767236b4d771..33a4cc33d86d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13301,6 +13301,12 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
> }
> EXPORT_SYMBOL_GPL(kvm_arch_no_poll);
>
> +#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
> +int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order)
> +{
> + return static_call(kvm_x86_gmem_prepare)(kvm, pfn, gfn, max_order);
> +}
> +#endif
>
> int kvm_spec_ctrl_test_value(u64 value)
> {
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8c5c017ab4e9..c7f82c2f1bcf 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2403,9 +2403,19 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
>
> #ifdef CONFIG_KVM_PRIVATE_MEM
> +int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep);
> int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
> #else
> +static inline int __kvm_gmem_get_pfn(struct kvm *kvm,
> + struct kvm_memory_slot *slot, gfn_t gfn,
> + kvm_pfn_t *pfn, int *max_order)
Missing "bool prep" here ?
minor nit: Do we need to export both __kvm_gmem_get_pfn and
kvm_gmem_get_pfn ? I don't see anyone else using the former.
We could have :
#ifdef CONFIG_KVM_PRIVATE_MEM
int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep);
#else
static inline int __kvm_gmem_get_pfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
kvm_pfn_t *pfn, int *max_order,
bool prep)
{
KVM_BUG_ON(1, kvm);
return -EIO;
}
#endif
static inline int kvm_gmem_get_pfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
kvm_pfn_t *pfn, int *max_order)
{
return __kvm_gmem_get_pfn(kvm, slot, gfn, pfn, max_order, true);
}
Suzuki
> + KVM_BUG_ON(1, kvm);
> + return -EIO;
> +}
> +
> static inline int kvm_gmem_get_pfn(struct kvm *kvm,
> struct kvm_memory_slot *slot, gfn_t gfn,
> kvm_pfn_t *pfn, int *max_order)
> @@ -2415,4 +2425,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
> }
> #endif /* CONFIG_KVM_PRIVATE_MEM */
>
> +#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
> +int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
> +#endif
> +
> #endif
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 2c964586aa14..992cf6ed86ef 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -109,3 +109,7 @@ config KVM_GENERIC_PRIVATE_MEM
> select KVM_GENERIC_MEMORY_ATTRIBUTES
> select KVM_PRIVATE_MEM
> bool
> +
> +config HAVE_KVM_GMEM_PREPARE
> + bool
> + depends on KVM_PRIVATE_MEM
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index f6f1b17a319c..72ff8b7b31d5 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -44,7 +44,40 @@ static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index)
> #endif
> }
>
> -static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
> +static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct folio *folio)
> +{
> +#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
> + struct list_head *gmem_list = &inode->i_mapping->private_list;
> + struct kvm_gmem *gmem;
> +
> + list_for_each_entry(gmem, gmem_list, entry) {
> + struct kvm_memory_slot *slot;
> + struct kvm *kvm = gmem->kvm;
> + struct page *page;
> + kvm_pfn_t pfn;
> + gfn_t gfn;
> + int rc;
> +
> + slot = xa_load(&gmem->bindings, index);
> + if (!slot)
> + continue;
> +
> + page = folio_file_page(folio, index);
> + pfn = page_to_pfn(page);
> + gfn = slot->base_gfn + index - slot->gmem.pgoff;
> + rc = kvm_arch_gmem_prepare(kvm, gfn, pfn, compound_order(compound_head(page)));
> + if (rc) {
> + pr_warn_ratelimited("gmem: Failed to prepare folio for index %lx, error %d.\n",
> + index, rc);
> + return rc;
> + }
> + }
> +
> +#endif
> + return 0;
> +}
> +
> +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prep)
> {
> struct folio *folio;
>
> @@ -74,6 +107,12 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
> folio_mark_uptodate(folio);
> }
>
> + if (prep && kvm_gmem_prepare_folio(inode, index, folio)) {
> + folio_unlock(folio);
> + folio_put(folio);
> + return NULL;
> + }
> +
> /*
> * Ignore accessed, referenced, and dirty flags. The memory is
> * unevictable and there is no storage to write back to.
> @@ -178,7 +217,7 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
> break;
> }
>
> - folio = kvm_gmem_get_folio(inode, index);
> + folio = kvm_gmem_get_folio(inode, index, true);
> if (!folio) {
> r = -ENOMEM;
> break;
> @@ -537,8 +576,8 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
> fput(file);
> }
>
> -int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> - gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
> +int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep)
> {
> pgoff_t index, huge_index;
> struct kvm_gmem *gmem;
> @@ -559,7 +598,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> goto out_fput;
> }
>
> - folio = kvm_gmem_get_folio(file_inode(file), index);
> + folio = kvm_gmem_get_folio(file_inode(file), index, prep);
> if (!folio) {
> r = -ENOMEM;
> goto out_fput;
> @@ -600,4 +639,11 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>
> return r;
> }
> +EXPORT_SYMBOL_GPL(__kvm_gmem_get_pfn);
> +
> +int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
> +{
> + return __kvm_gmem_get_pfn(kvm, slot, gfn, pfn, max_order, true);
> +}
> EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
On 16/10/2023 12:50, Michael Roth wrote:
> In some cases, like with SEV-SNP, guest memory needs to be updated in a
> platform-specific manner before it can be safely freed back to the host.
> Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
> allow for special handling of this sort when freeing memory in response
> to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
> ahead and define an arch-specific hook for x86 since it will be needed
> for handling memory used for SEV-SNP guests.
Hi all,
Arm CCA has a similar need to prepare/unprepare memory (granule
delegate/undelegate using our terminology) before it is used for
protected memory.
However I see a problem with the current gmem implementation that the
"invalidations" are not precise enough for our RMI API. When punching a
hole in the memfd the code currently hits the same path (ending in
kvm_unmap_gfn_range()) as if a VMA is modified in the same range (for
the shared version). The Arm CCA architecture doesn't allow the
protected memory to be removed and refaulted without the permission of
the guest (the memory contents would be wiped in this case).
One option that I've considered is to implement a separate CCA ioctl to
notify KVM whether the memory should be mapped protected. The
invalidations would then be ignored on ranges that are currently
protected for this guest.
This 'solves' the problem nicely except for the case where the VMM
deliberately punches holes in memory which the guest is using.
The issue in this case is that there's no way of failing the punch hole
operation - we can detect that the memory is in use and shouldn't be
freed, but this callback doesn't give the opportunity to actually block
the freeing of the memory.
Sadly there's no easy way to map from a physical page in a gmem back to
which VM (and where in the VM) the page is mapped. So actually ripping
the page out of the appropriate VM isn't really possible in this case.
How is this situation handled on x86? Is it possible to invalidate and
then refault a protected page without affecting the memory contents? My
guess is yes and that is a CCA specific problem - is my understanding
correct?
My current thoughts for CCA are one of three options:
1. Represent shared and protected memory as two separate memslots. This
matches the underlying architecture more closely (the top address bit is
repurposed as a 'shared' flag), but I don't like it because it's a
deviation from other CoCo architectures (notably pKVM).
2. Allow punch-hole to fail on CCA if the memory is mapped into the
guest's protected space. Again, this is CCA being different and also
creates nasty corner cases where the gmem descriptor could have to
outlive the VMM - so looks like a potential source of memory leaks.
3. 'Fix' the invalidation to provide more precise semantics. I haven't
yet prototyped it but it might be possible to simply provide a flag from
kvm_gmem_invalidate_begin specifying that the invalidation is for the
protected memory. KVM would then only unmap the protected memory when
this flag is set (avoiding issues with VMA updates causing spurious unmaps).
Fairly obviously (3) is my preferred option, but it relies on the
guarantees that the "invalidation" is actually a precise set of
addresses where the memory is actually being freed.
Comments, thoughts, objections welcome!
Steve
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/x86.c | 7 +++++++
> include/linux/kvm_host.h | 4 ++++
> virt/kvm/Kconfig | 4 ++++
> virt/kvm/guest_memfd.c | 14 ++++++++++++++
> 6 files changed, 31 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 0c113f42d5c7..f1505a5fa781 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -135,6 +135,7 @@ KVM_X86_OP(complete_emulated_msr)
> KVM_X86_OP(vcpu_deliver_sipi_vector)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
> +KVM_X86_OP_OPTIONAL(gmem_invalidate)
>
> #undef KVM_X86_OP
> #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 66fc89d1858f..dbec74783f48 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1754,6 +1754,7 @@ struct kvm_x86_ops {
> unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
>
> int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> + void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 33a4cc33d86d..0e95c3a95e59 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13308,6 +13308,13 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
> }
> #endif
>
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> +void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> +{
> + static_call_cond(kvm_x86_gmem_invalidate)(start, end);
> +}
> +#endif
> +
> int kvm_spec_ctrl_test_value(u64 value)
> {
> /*
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c7f82c2f1bcf..840a5be5962a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2429,4 +2429,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
> int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
> #endif
>
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> +void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
> +#endif
> +
> #endif
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 992cf6ed86ef..7fd1362a7ebe 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -113,3 +113,7 @@ config KVM_GENERIC_PRIVATE_MEM
> config HAVE_KVM_GMEM_PREPARE
> bool
> depends on KVM_PRIVATE_MEM
> +
> +config HAVE_KVM_GMEM_INVALIDATE
> + bool
> + depends on KVM_PRIVATE_MEM
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 72ff8b7b31d5..b4c4df259fb8 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -369,12 +369,26 @@ static int kvm_gmem_error_page(struct address_space *mapping, struct page *page)
> return MF_DELAYED;
> }
>
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> +static void kvm_gmem_free_folio(struct folio *folio)
> +{
> + struct page *page = folio_page(folio, 0);
> + kvm_pfn_t pfn = page_to_pfn(page);
> + int order = folio_order(folio);
> +
> + kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
> +}
> +#endif
> +
> static const struct address_space_operations kvm_gmem_aops = {
> .dirty_folio = noop_dirty_folio,
> #ifdef CONFIG_MIGRATION
> .migrate_folio = kvm_gmem_migrate_folio,
> #endif
> .error_remove_page = kvm_gmem_error_page,
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> + .free_folio = kvm_gmem_free_folio,
> +#endif
> };
>
> static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
On Fri, Feb 09, 2024, Steven Price wrote:
> On 16/10/2023 12:50, Michael Roth wrote:
> > In some cases, like with SEV-SNP, guest memory needs to be updated in a
> > platform-specific manner before it can be safely freed back to the host.
> > Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
> > allow for special handling of this sort when freeing memory in response
> > to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
> > ahead and define an arch-specific hook for x86 since it will be needed
> > for handling memory used for SEV-SNP guests.
>
> Hi all,
>
> Arm CCA has a similar need to prepare/unprepare memory (granule
> delegate/undelegate using our terminology) before it is used for
> protected memory.
>
> However I see a problem with the current gmem implementation that the
> "invalidations" are not precise enough for our RMI API. When punching a
> hole in the memfd the code currently hits the same path (ending in
> kvm_unmap_gfn_range()) as if a VMA is modified in the same range (for
> the shared version).
>
> The Arm CCA architecture doesn't allow the protected memory to be removed and
> refaulted without the permission of the guest (the memory contents would be
> wiped in this case).
TDX behaves almost exactly like CCA. Well, that's not technically true, strictly
speaking, as there are TDX APIs that do allow for *temporarily* marking mappings
!PRESENT, but those aren't in play for invalidation events like this.
SNP does allow zapping page table mappings, but fully removing a page, as PUNCH_HOLE
would do, is destructive, so SNP also behaves the same way for all intents and
purposes.
> One option that I've considered is to implement a separate CCA ioctl to
> notify KVM whether the memory should be mapped protected.
That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
> The invalidations would then be ignored on ranges that are currently
> protected for this guest.
That's backwards. Invalidations on a guest_memfd should affect only *protected*
mappings. And for that, the plan/proposal is to plumb only_{shared,private} flags
into "struct kvm_gfn_range"[1] so that guest_memfd invalidations don't zap shared
mappings, and mmu_notifier invalidations don't zap private mappings. Sample usage
in the TDX context[2] (disclaimer: I'm pretty sure I didn't write most of that
patch; I only provided a rough sketch).
[1] https://lore.kernel.org/all/[email protected]
[2] https://lore.kernel.org/all/0b308fb6dd52bafe7153086c7f54bfad03da74b1.1705965635.git.isaku.yamahata@intel.com
> This 'solves' the problem nicely except for the case where the VMM
> deliberately punches holes in memory which the guest is using.
I don't see what problem there is to solve in this case. PUNCH_HOLE is destructive,
so don't do that.
> The issue in this case is that there's no way of failing the punch hole
> operation - we can detect that the memory is in use and shouldn't be
> freed, but this callback doesn't give the opportunity to actually block
> the freeing of the memory.
Why is this KVM's problem? E.g. the same exact thing happens without guest_memfd
if userspace munmap()s memory the guest is using.
> Sadly there's no easy way to map from a physical page in a gmem back to
> which VM (and where in the VM) the page is mapped. So actually ripping
> the page out of the appropriate VM isn't really possible in this case.
I don't follow. guest_memfd has a 1:1 binding with a VM *and* a gfn, how can you
not know what exactly needs to be invalidated?
> How is this situation handled on x86? Is it possible to invalidate and
> then refault a protected page without affecting the memory contents? My
> guess is yes and that is a CCA specific problem - is my understanding
> correct?
>
> My current thoughts for CCA are one of three options:
>
> 1. Represent shared and protected memory as two separate memslots. This
> matches the underlying architecture more closely (the top address bit is
> repurposed as a 'shared' flag), but I don't like it because it's a
> deviation from other CoCo architectures (notably pKVM).
>
> 2. Allow punch-hole to fail on CCA if the memory is mapped into the
> guest's protected space. Again, this is CCA being different and also
> creates nasty corner cases where the gmem descriptor could have to
> outlive the VMM - so looks like a potential source of memory leaks.
>
> 3. 'Fix' the invalidation to provide more precise semantics. I haven't
> yet prototyped it but it might be possible to simply provide a flag from
> kvm_gmem_invalidate_begin specifying that the invalidation is for the
> protected memory. KVM would then only unmap the protected memory when
> this flag is set (avoiding issues with VMA updates causing spurious unmaps).
>
> Fairly obviously (3) is my preferred option, but it relies on the
> guarantees that the "invalidation" is actually a precise set of
> addresses where the memory is actually being freed.
#3 is what we are planning for x86, and except for the only_{shared,private} flags,
the requisite functionality should already be in Linus' tree, though it does need
to be wired up for ARM.
On Fri, Feb 09, 2024, Steven Price wrote:
> >> One option that I've considered is to implement a separate CCA ioctl to
> >> notify KVM whether the memory should be mapped protected.
> >
> > That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
>
> Sorry, I really didn't explain that well. Yes effectively this is the
> attribute flag, but there's corner cases for destruction of the VM. My
> thought was that if the VMM wanted to tear down part of the protected
> range (without making it shared) then a separate ioctl would be needed
> to notify KVM of the unmap.
No new uAPI should be needed, because the only time a benign VMM should
do this is if the guest also knows the memory is being removed, in which case
PUNCH_HOLE will suffice.
> >> This 'solves' the problem nicely except for the case where the VMM
> >> deliberately punches holes in memory which the guest is using.
> >
> > I don't see what problem there is to solve in this case. PUNCH_HOLE is destructive,
> > so don't do that.
>
> A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
> my concern here is a VMM which is trying to break the host. In this case
> either the PUNCH_HOLE needs to fail, or we actually need to recover the
> memory from the guest (effectively killing the guest in the process).
The latter. IIRC, we talked about this exact case somewhere in the hour-long
rambling discussion on guest_memfd at PUCK[1]. And we've definitely discussed
this multiple times on-list, though I don't know that there is a single thread
that captures the entire plan.
The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
instance that's attached to a given guest_memfd inode when a page is being fully
removed, i.e. when a page is being freed back to the normal memory pool. Something
like this proposed SNP patch[2].
Mike, do you have WIP patches you can share?
[1] https://drive.google.com/corp/drive/folders/116YTH1h9yBZmjqeJc03cV4_AhSe-VBkc?resourcekey=0-sOGeFEUi60-znJJmZBsTHQ
[2] https://lore.kernel.org/all/[email protected]
On Fri, Feb 09, 2024 at 07:13:13AM -0800, Sean Christopherson wrote:
> On Fri, Feb 09, 2024, Steven Price wrote:
> > >> One option that I've considered is to implement a separate CCA ioctl to
> > >> notify KVM whether the memory should be mapped protected.
> > >
> > > That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
> >
> > Sorry, I really didn't explain that well. Yes effectively this is the
> > attribute flag, but there's corner cases for destruction of the VM. My
> > thought was that if the VMM wanted to tear down part of the protected
> > range (without making it shared) then a separate ioctl would be needed
> > to notify KVM of the unmap.
>
> No new uAPI should be needed, because the only time a benign VMM should
> do this is if the guest also knows the memory is being removed, in which case
> PUNCH_HOLE will suffice.
>
> > >> This 'solves' the problem nicely except for the case where the VMM
> > >> deliberately punches holes in memory which the guest is using.
> > >
> > > I don't see what problem there is to solve in this case. PUNCH_HOLE is destructive,
> > > so don't do that.
> >
> > A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
> > my concern here is a VMM which is trying to break the host. In this case
> > either the PUNCH_HOLE needs to fail, or we actually need to recover the
> > memory from the guest (effectively killing the guest in the process).
>
> The latter. IIRC, we talked about this exact case somewhere in the hour-long
> rambling discussion on guest_memfd at PUCK[1]. And we've definitely discussed
> this multiple times on-list, though I don't know that there is a single thread
> that captures the entire plan.
>
> The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
> instance that's attached to a given guest_memfd inode when a page is being fully
> removed, i.e. when a page is being freed back to the normal memory pool. Something
> like this proposed SNP patch[2].
>
> Mike, do you have WIP patches you can share?
Sorry, I missed this query earlier. I'm a bit confused though, I thought
the kvm_arch_gmem_invalidate() hook provided in this patch was what we
ended up agreeing on during the PUCK call in question.
There was an open question about what to do if a use-case came along
where we needed to pass additional parameters to
kvm_arch_gmem_invalidate() other than just the start/end PFN range for
the pages being freed, but we'd determined that SNP and TDX did not
currently need this, so I didn't have any changes planned in this
regard.
If we now have such a need, what we had proposed was to modify
__filemap_remove_folio()/page_cache_delete() to defer setting
folio->mapping to NULL, so that kvm_gmem_free_folio() could still access
mapping->i_private_list to get the list of gmem/KVM instances and pass
them on via kvm_arch_gmem_invalidate().
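Just to illustrate (this is not actual code): assuming folio->mapping is
still valid by the time .free_folio runs, and that
kvm_arch_gmem_invalidate() grows a 'struct kvm *' parameter, it would be
something like:

  static void kvm_gmem_free_folio(struct folio *folio)
  {
          struct list_head *gmem_list = &folio->mapping->i_private_list;
          kvm_pfn_t pfn = folio_pfn(folio);
          kvm_pfn_t end = pfn + folio_nr_pages(folio);
          struct kvm_gmem *gmem;

          /* Same per-inode gmem walk that kvm_gmem_prepare_folio() does. */
          list_for_each_entry(gmem, gmem_list, entry)
                  kvm_arch_gmem_invalidate(gmem->kvm, pfn, end);
  }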
So that's doable, but it's not clear from this discussion that that's
needed. If the idea is to block/kill the guest if the VMM tries to
hole-punch, and ARM CCA already has plans to wire up the shared/private
flags in kvm_unmap_gfn_range(), wouldn't that have all the information
needed to kill that guest? At that point, kvm_gmem_free_folio() can
handle additional per-page cleanup (with additional gmem/KVM info
plumbed in if necessary).
-Mike
[1] https://lore.kernel.org/kvm/[email protected]/T/
>
> [1] https://drive.google.com/corp/drive/folders/116YTH1h9yBZmjqeJc03cV4_AhSe-VBkc?resourcekey=0-sOGeFEUi60-znJJmZBsTHQ
> [2] https://lore.kernel.org/all/[email protected]