2023-06-15 20:35:35

by Isaku Yamahata

Subject: [RFC PATCH 0/6] KVM: guest memory: Misc enhancement

From: Isaku Yamahata <[email protected]>

Hello. This is an RFC patch series based on KVM gmem [1] and [2]
for the common use of TDX and SEV-SNP. I'd like to discuss three items.

* Flags for kvm_gfn_range:
As Roth pointed out, by adding flags to kvm_gfn_range,
kvm_arch_gmem_invalidate() won't be needed. The x86 gmem_invalidate
callback may still be needed, though.

* Page fault error code, or how to figure out the private page fault
There is an issue with setting kvm_page_fault.is_private. I came up with two
new error codes. Is this approach acceptable, or is there a better way?
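A minimal sketch of the idea behind deriving kvm_page_fault.is_private from a
new error-code bit. The bit position, struct layout, and helper below are
assumptions for illustration, not code from the posted patches:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical bit in the full 64-bit page-fault error code marking the
 * fault as private; the position (34) is an assumption of this sketch.
 */
#define BIT_ULL(nr)          (1ULL << (nr))
#define PFERR_GUEST_ENC_MASK BIT_ULL(34)

struct kvm_page_fault_sketch {
	uint64_t error_code;
	bool is_private;
};

static void fault_init(struct kvm_page_fault_sketch *fault, uint64_t error_code)
{
	fault->error_code = error_code;
	/* is_private is derived directly from the error code. */
	fault->is_private = !!(error_code & PFERR_GUEST_ENC_MASK);
}
```

This keeps the private/shared decision in one place instead of threading a
separate boolean through the fault paths.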

* VM type: Now we have KVM_X86_PROTECTED_VM. How do we proceed?
- Keep KVM_X86_PROTECTED_VM for its use. Introduce KVM_X86_TDX_VM
- Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce another type in
the future)
- any other way?
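The first option can be sketched with a per-vendor is_vm_type_supported
callback. The numeric values and callback bodies below are assumptions for
illustration, not uapi as merged:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative VM type values; KVM_X86_TDX_VM is the hypothetical addition. */
#define KVM_X86_DEFAULT_VM   0
#define KVM_X86_PROTECTED_VM 1
#define KVM_X86_TDX_VM       2

/* What a TDX-capable vendor module might report. */
static bool vmx_is_vm_type_supported(unsigned long type)
{
	return type == KVM_X86_DEFAULT_VM ||
	       type == KVM_X86_PROTECTED_VM ||
	       type == KVM_X86_TDX_VM;
}

/* What a vendor module without TDX support might report. */
static bool svm_is_vm_type_supported(unsigned long type)
{
	return type == KVM_X86_DEFAULT_VM ||
	       type == KVM_X86_PROTECTED_VM;
}
```

Keeping KVM_X86_PROTECTED_VM generic this way lets both vendor modules accept
it while only the TDX-capable one accepts the specialized type.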

[1] KVM gmem patches
https://github.com/sean-jc/linux/tree/x86/kvm_gmem_solo

[2] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support
https://lore.kernel.org/lkml/[email protected]/

Isaku Yamahata (6):
KVM: selftests: Fix test_add_overlapping_private_memory_regions()
KVM: selftests: Fix guest_memfd()
KVM: x86/mmu: Pass round full 64-bit error code for the KVM page fault
KVM: x86: Introduce PFERR_GUEST_ENC_MASK to indicate fault is private
KVM: Add flags to struct kvm_gfn_range
KVM: x86: Add is_vm_type_supported callback

arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 5 +++++
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/mmu/mmu.c | 14 +++++++++-----
arch/x86/kvm/mmu/mmu_internal.h | 8 ++++----
arch/x86/kvm/mmu/mmutrace.h | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 4 ++--
arch/x86/kvm/svm/svm.c | 7 +++++++
arch/x86/kvm/vmx/vmx.c | 6 ++++++
arch/x86/kvm/x86.c | 10 +++++++++-
arch/x86/kvm/x86.h | 2 ++
include/linux/kvm_host.h | 11 ++++++++++-
tools/testing/selftests/kvm/guest_memfd_test.c | 4 ++--
.../selftests/kvm/set_memory_region_test.c | 16 ++++++++++++++--
virt/kvm/guest_mem.c | 10 +++++++---
virt/kvm/kvm_main.c | 4 +++-
16 files changed, 83 insertions(+), 22 deletions(-)


base-commit: be8abcec83c87d4e15ae04816b685fe260c4bcfd
--
2.25.1



2023-06-15 20:39:06

by Isaku Yamahata

Subject: [RFC PATCH 5/6] KVM: Add flags to struct kvm_gfn_range

From: Isaku Yamahata <[email protected]>

TDX and SEV-SNP need to know the reason for a kvm_unmap_gfn_range() callback:
an mmu notifier, the set memory attributes ioctl, or a KVM gmem callback.
The callback handler changes its behavior or does additional housekeeping
depending on the reason. For the mmu notifier, it zaps shared PTEs. For set
memory attributes, it converts memory attributes (private <=> shared). For
KVM gmem, it punches a hole in the range or releases the file.

Signed-off-by: Isaku Yamahata <[email protected]>
---
include/linux/kvm_host.h | 11 ++++++++++-
virt/kvm/guest_mem.c | 10 +++++++---
virt/kvm/kvm_main.c | 4 +++-
3 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1a47cedae8a1..c049c0aa44d6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -256,12 +256,21 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
#endif

#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
+
+#define KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR BIT(0)
+#define KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE BIT(1)
+#define KVM_GFN_RANGE_FLAGS_GMEM_RELEASE BIT(2)
+
struct kvm_gfn_range {
struct kvm_memory_slot *slot;
gfn_t start;
gfn_t end;
- pte_t pte;
+ union {
+ pte_t pte;
+ u64 attrs;
+ };
bool may_block;
+ unsigned int flags;
};
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index cdf2d84683c8..30b8f66784d4 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -99,7 +99,8 @@ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
}

static void kvm_gmem_invalidate_begin(struct kvm *kvm, struct kvm_gmem *gmem,
- pgoff_t start, pgoff_t end)
+ pgoff_t start, pgoff_t end,
+ unsigned int flags)
{
struct kvm_memory_slot *slot;
unsigned long index;
@@ -118,6 +119,7 @@ static void kvm_gmem_invalidate_begin(struct kvm *kvm, struct kvm_gmem *gmem,
.slot = slot,
.pte = __pte(0),
.may_block = true,
+ .flags = flags,
};

kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end);
@@ -156,7 +158,8 @@ static long kvm_gmem_punch_hole(struct file *file, loff_t offset, loff_t len)
*/
filemap_invalidate_lock(file->f_mapping);

- kvm_gmem_invalidate_begin(kvm, gmem, start, end);
+ kvm_gmem_invalidate_begin(kvm, gmem, start, end,
+ KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE);

truncate_inode_pages_range(file->f_mapping, offset, offset + len - 1);

@@ -263,7 +266,8 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
* Free the backing memory, and more importantly, zap all SPTEs that
* pointed at this file.
*/
- kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul);
+ kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul,
+ KVM_GFN_RANGE_FLAGS_GMEM_RELEASE);
truncate_inode_pages_final(file->f_mapping);
kvm_gmem_invalidate_end(kvm, gmem, 0, -1ul);

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 422d49634c56..9cdfa2fb675f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -613,6 +613,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
gfn_range.slot = slot;
+ gfn_range.flags = 0;

if (!locked) {
locked = true;
@@ -2391,8 +2392,9 @@ static void kvm_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
bool flush = false;
int i;

- gfn_range.pte = __pte(0);
+ gfn_range.attrs = attrs;
gfn_range.may_block = true;
+ gfn_range.flags = KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR;

for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);
--
2.25.1


2023-06-19 19:50:11

by Vishal Annapurve

Subject: Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhancement

On Thu, Jun 15, 2023 at 1:12 PM <[email protected]> wrote:
> ...
>
> * VM type: Now we have KVM_X86_PROTECTED_VM. How do we proceed?
> - Keep KVM_X86_PROTECTED_VM for its use. Introduce KVM_X86_TDX_VM
> - Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce another type in
> the future)
> - any other way?

There are selftests posted[1] in the context of this work, which rely on
KVM_X86_PROTECTED_VM being just the software-only pseudo-confidential
VMs. In the future there might be more work to expand this use case to
full-scale VMs. So it would be better to treat protected VMs as a
separate type which can be used on any platform without needing to
enable TDX/SEV functionality.

The TDX VM type can then serve as a specialized type of protected VM
with additional arch-specific capabilities enabled.

[1] - https://github.com/sean-jc/linux/commits/x86/kvm_gmem_solo

2023-06-19 20:36:30

by Zhi Wang

Subject: Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhancement

On Mon, 19 Jun 2023 12:11:50 -0700
Vishal Annapurve <[email protected]> wrote:

> On Thu, Jun 15, 2023 at 1:12 PM <[email protected]> wrote:
> > ...
> >
> > * VM type: Now we have KVM_X86_PROTECTED_VM. How do we proceed?
> > - Keep KVM_X86_PROTECTED_VM for its use. Introduce KVM_X86_TDX_VM
> > - Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce another type in
> > the future)
> > - any other way?
>
> There are selftests posted[1] in context of this work, which rely on
> KVM_X86_PROTECTED_VM being just the software-only psuedo-confidential
> VMs. In future there might be more work to expand this usecase to
> full-scale VMs. So it would be better to treat protected VMs as a
> separate type which can be used on any platform without the need of
> enabling TDX/SEV functionality.
>

Out of curiosity, is this really a valid use case in practice beyond the
selftests? It sounds to me like whenever KVM_X86_PROTECTED_VM is used, it
has to be tied to a platform-specific CC type.

> TDX VM type can possibly serve as a specialized type of protected VM
> with additional arch specific capabilities enabled.
>
> [1] - https://github.com/sean-jc/linux/commits/x86/kvm_gmem_solo


2023-06-19 22:34:12

by Vishal Annapurve

Subject: Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhancement

On Mon, Jun 19, 2023 at 1:11 PM Zhi Wang <[email protected]> wrote:
>
> On Mon, 19 Jun 2023 12:11:50 -0700
> Vishal Annapurve <[email protected]> wrote:
>
> > On Thu, Jun 15, 2023 at 1:12 PM <[email protected]> wrote:
> > > ...
> > >
> > > * VM type: Now we have KVM_X86_PROTECTED_VM. How do we proceed?
> > > - Keep KVM_X86_PROTECTED_VM for its use. Introduce KVM_X86_TDX_VM
> > > - Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce another type in
> > > the future)
> > > - any other way?
> >
> > There are selftests posted[1] in context of this work, which rely on
> > KVM_X86_PROTECTED_VM being just the software-only psuedo-confidential
> > VMs. In future there might be more work to expand this usecase to
> > full-scale VMs. So it would be better to treat protected VMs as a
> > separate type which can be used on any platform without the need of
> > enabling TDX/SEV functionality.
> >
>
> Out of curiosity, is this really a valid case in practice except selftest?
> It sounds to me whenever KVM_X86_PROTECTED_VM is used, it has to be tied
> with a platform-specific CC type.

The protected VM effort is about being able to have guest memory ranges
that are not mapped into the userspace VMM, and so are unreachable in most
cases from KVM as well. Non-CC VMs can use this support to mitigate any
unintended accesses from the userspace VMM/KVM, possibly using
enlightened kernels.

The exact implementation of such support warrants more discussion, but it
should be in the line of sight here as a future work item.




>
> > TDX VM type can possibly serve as a specialized type of protected VM
> > with additional arch specific capabilities enabled.
> >
> > [1] - https://github.com/sean-jc/linux/commits/x86/kvm_gmem_solo
>

2023-06-20 16:33:59

by Michael Roth

Subject: Re: [RFC PATCH 5/6] KVM: Add flags to struct kvm_gfn_range

On Thu, Jun 15, 2023 at 01:12:18PM -0700, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDX and SEV-SNP need to know the reason for a callback by
> kvm_unmap_gfn_range(). mmu notifier, set memory attributes ioctl or KVM
> gmem callback. The callback handler changes the behavior or does the
> additional housekeeping operation. For mmu notifier, it's zapping shared
> PTE. For set memory attributes, it's the conversion of memory attributes
> (private <=> shared). For KVM gmem, it's punching a hole in the range, and
> releasing the file.

I think it's still an open topic that we need to hear more from Sean about:

https://lore.kernel.org/lkml/[email protected]/

but I *think* we were leaning toward decoupling the act of invalidating
GFNs vs. the act of invalidating/freeing gmem pages.

One concrete example of where this separation makes sense is with
hole-punching. SNP has unique platform-specific stuff it has to do before
freeing that gmem page back to the host. If we try to plumb this through
kvm_unmap_gfn_range() via a special flag then it's a little awkward
because:

a) Presumably that hole-punch would have occurred after a preceding
KVM_SET_MEMORY_ATTRIBUTES was issued to switch the page to shared
state in the xarray. So all it should really need to do is handle
that platform-specific behavior, like updating the RMP table in the
case of SNP. But none of the other details like GFN ranges are relevant
in that case; RMP updates here only need the PFN, so we end up walking
memslots to do GFN->PFN translations, when it would actually be much
more efficient to do these translations by translating the
hole-punched FD offset range to the corresponding folios backing
those ranges.

b) It places an unnecessary dependency on having an active memslot to do
those translations. This ends up not being an issue with the current
version of the gmem patchset because the release() happens *before*
gmem_unbind(), so there is a memslot associated with the ranges at
gmem_release() time, but in the initial version of gmem it was the
reverse, so if things ever changed again in this regard we'd once
again have to completely rework how to issue these platform-specific
invalidation callbacks.

I really *really* like having a separate, simple invalidation mechanism
in place that just converts FD offsets to PFNs and then passes those on
to platform-defined handlers to clean up pages before freeing them back
to the system. It's versatile in that it can be called pretty much
anywhere regardless of where we are in the KVM lifecycle, it's robust in
that it doesn't rely on unnecessary outside dependencies, and it avoids
adding unneeded complexity to paths like kvm_unmap_gfn_range().

That's the approach taken with SNP hypervisor v9 series, with the
gmem hook being introduced here:

https://lore.kernel.org/kvm/[email protected]/T/#m3ad8245235a27ed0f41c359c191dcda6c77af043

and the SEV-SNP implementation of that hook being here:

https://lore.kernel.org/kvm/[email protected]/T/#m6ac04b44722dbc07839011816e94fadf5ad6794e

Would a similar approach work for TDX? At least WRT cleaning up pages
before returning them back to the host? If we could isolate that
requirement/handling from all the other aspects of invalidations it
really seems like it would cause us less headaches down the road.
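The mechanism described above, walking FD offsets to folios and handing bare
(pfn, npages) pairs to an arch hook with no memslot involvement, can be shown
with a toy model. All names and types here are illustrative assumptions, not
the SNP series' actual code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a folio: a run of physically contiguous pages. */
struct toy_folio {
	uint64_t pfn;     /* first backing page frame */
	uint64_t npages;  /* folio size in pages */
};

/* Toy gmem file: uniformly sized folios indexed by page offset. */
struct toy_gmem {
	struct toy_folio *folios;
	size_t nr_folios;
	uint64_t folio_pages;
};

static uint64_t invalidated_pages;

/* Stand-in for a platform-defined hook (e.g. an RMP update on SNP). */
static void arch_gmem_invalidate(uint64_t pfn, uint64_t npages)
{
	(void)pfn;
	invalidated_pages += npages;
}

/* Invalidate [start, end), given as page offsets into the gmem file. */
static void toy_gmem_invalidate(struct toy_gmem *gmem,
				uint64_t start, uint64_t end)
{
	for (uint64_t off = start; off < end; ) {
		struct toy_folio *f = &gmem->folios[off / gmem->folio_pages];
		uint64_t in_folio = off % gmem->folio_pages;
		uint64_t n = gmem->folio_pages - in_folio;

		if (n > end - off)
			n = end - off;
		/* No GFN or memslot in sight: offset -> folio -> PFN. */
		arch_gmem_invalidate(f->pfn + in_folio, n);
		off += n;
	}
}
```

The point of the model is the signature: the walk needs only the file's own
offset-to-folio mapping, so it works regardless of memslot lifecycle.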

>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> include/linux/kvm_host.h | 11 ++++++++++-
> virt/kvm/guest_mem.c | 10 +++++++---
> virt/kvm/kvm_main.c | 4 +++-
> 3 files changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1a47cedae8a1..c049c0aa44d6 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -256,12 +256,21 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> #endif
>
> #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> +
> +#define KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR BIT(0)

Can you go into more detail on why special handling is needed for
SET_MEM_ATTR?

> +#define KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE BIT(1)
> +#define KVM_GFN_RANGE_FLAGS_GMEM_RELEASE BIT(2)

Would the need to distinguish between PUNCH_HOLE/RELEASE go away in the
TDX case if you take the above approach? For SNP, the answer is yes. If
that's also the case for TDX I think that would be another argument in
favor of decoupling these from existing KVM MMU invalidation path.

-Mike

> +
> struct kvm_gfn_range {
> struct kvm_memory_slot *slot;
> gfn_t start;
> gfn_t end;
> - pte_t pte;
> + union {
> + pte_t pte;
> + u64 attrs;
> + };
> bool may_block;
> + unsigned int flags;
> };
> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> index cdf2d84683c8..30b8f66784d4 100644
> --- a/virt/kvm/guest_mem.c
> +++ b/virt/kvm/guest_mem.c
> @@ -99,7 +99,8 @@ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
> }
>
> static void kvm_gmem_invalidate_begin(struct kvm *kvm, struct kvm_gmem *gmem,
> - pgoff_t start, pgoff_t end)
> + pgoff_t start, pgoff_t end,
> + unsigned int flags)
> {
> struct kvm_memory_slot *slot;
> unsigned long index;
> @@ -118,6 +119,7 @@ static void kvm_gmem_invalidate_begin(struct kvm *kvm, struct kvm_gmem *gmem,
> .slot = slot,
> .pte = __pte(0),
> .may_block = true,
> + .flags = flags,
> };
>
> kvm_mmu_invalidate_range_add(kvm, gfn_range.start, gfn_range.end);
> @@ -156,7 +158,8 @@ static long kvm_gmem_punch_hole(struct file *file, loff_t offset, loff_t len)
> */
> filemap_invalidate_lock(file->f_mapping);
>
> - kvm_gmem_invalidate_begin(kvm, gmem, start, end);
> + kvm_gmem_invalidate_begin(kvm, gmem, start, end,
> + KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE);
>
> truncate_inode_pages_range(file->f_mapping, offset, offset + len - 1);
>
> @@ -263,7 +266,8 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
> * Free the backing memory, and more importantly, zap all SPTEs that
> * pointed at this file.
> */
> - kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul);
> + kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul,
> + KVM_GFN_RANGE_FLAGS_GMEM_RELEASE);
> truncate_inode_pages_final(file->f_mapping);
> kvm_gmem_invalidate_end(kvm, gmem, 0, -1ul);
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 422d49634c56..9cdfa2fb675f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -613,6 +613,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
> gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
> gfn_range.slot = slot;
> + gfn_range.flags = 0;
>
> if (!locked) {
> locked = true;
> @@ -2391,8 +2392,9 @@ static void kvm_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
> bool flush = false;
> int i;
>
> - gfn_range.pte = __pte(0);
> + gfn_range.attrs = attrs;
> gfn_range.may_block = true;
> + gfn_range.flags = KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR;
>
> for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> slots = __kvm_memslots(kvm, i);
> --
> 2.25.1
>

2023-06-20 19:29:17

by Isaku Yamahata

Subject: Re: [RFC PATCH 5/6] KVM: Add flags to struct kvm_gfn_range

On Tue, Jun 20, 2023 at 11:28:35AM -0500,
Michael Roth <[email protected]> wrote:

> On Thu, Jun 15, 2023 at 01:12:18PM -0700, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > TDX and SEV-SNP need to know the reason for a callback by
> > kvm_unmap_gfn_range(). mmu notifier, set memory attributes ioctl or KVM
> > gmem callback. The callback handler changes the behavior or does the
> > additional housekeeping operation. For mmu notifier, it's zapping shared
> > PTE. For set memory attributes, it's the conversion of memory attributes
> > (private <=> shared). For KVM gmem, it's punching a hole in the range, and
> > releasing the file.
>
> I think it's still an open topic that we need to hear more from Sean about:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> but I *think* we were leaning toward decoupling the act of invalidating
> GFNs, vs. the act of invalidating/free'ing gmem pages.
>
> One concrete example of where this seperation makes sense if with
> hole-punching. SNP has unique platform-specific stuff it has to do before
> free'ing that gmem page back to the host. If we try to plumb this through
> kvm_unmap_gfn_range() via a special flag then it's a little awkward
> because:
>
> a) Presumably that hole-punch would have occurred after a preceeding
> KVM_SET_MEMORY_ATTRIBUTES was issued to switch the page to shared
> state in the xarray. So all it should really need to do is handle
> that platform-specific behavior, like updating RMP table in case of
> SNP. But none of the other details like GFN ranges are relevant in
> that case, RMP updates here only need the PFN, so we end up walking
> memslots to do GFN->PFN translations, when it would actually be much
> more efficient to do these translations by translating the
> hole-punched FD offset range to the corresponding folio()'s backing
> those ranges
>
> b) It places an unecessary dependency on having an active memslot to do
> those translations. This ends up not being an issue with current
> version of gmem patchset because the release() happens *before*
> gmem_unbind(), so there is a memslot associated with the ranges at
> gmem_release() time, but in the initial version of gmem it was the
> reverse, so if things ever changed again in this regard we'd once
> again have to completely rework how to issue these platform-specific
> invalidation callbacks.
>
> I really *really* like having a separate, simple invalidation mechanism
> in place that just converts FD offsets to PFNs and then passes those on
> to platform-defined handlers to clean up pages before free'ing them back
> to the system. It's versatile in that it can be called pretty much
> anywhere regardless of where we are in KVM lifecycle, it's robust in
> that it doesn't rely on unecessary outside dependencies, and it avoids
> added uneeded complexity to paths like kvm_unmap_gfn_range().
>
> That's the approach taken with SNP hypervisor v9 series, with the
> gmem hook being introduced here:
>
> https://lore.kernel.org/kvm/[email protected]/T/#m3ad8245235a27ed0f41c359c191dcda6c77af043
>
> and the SEV-SNP implementation of that hook being here:
>
> https://lore.kernel.org/kvm/[email protected]/T/#m6ac04b44722dbc07839011816e94fadf5ad6794e
>
> Would a similar approach work for TDX? At least WRT cleaning up pages
> before returning them back to the host? If we could isolate that
> requirement/handling from all the other aspects of invalidations it
> really seems like it would cause us less headaches down the road.

In short, TDX KVM doesn't need an extra callback for invalidating/freeing
gmem pages. The kvm_unmap_gfn_range() callback works. Instead, TDX needs the
attributes (private-or-shared) for it.

The reason is that TDX uses an encrypted page table for the guest (we call it
Secure-EPT) with dedicated operations on it. The TDX KVM MMU hooks the TDP
MMU operations. In the case of invalidating/releasing a page, it eventually
hooks handle_removed_pt() for the additional operation.

Because TDX simply won't use the gmem_invalidate callback, I'm fine with it.


> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > include/linux/kvm_host.h | 11 ++++++++++-
> > virt/kvm/guest_mem.c | 10 +++++++---
> > virt/kvm/kvm_main.c | 4 +++-
> > 3 files changed, 20 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 1a47cedae8a1..c049c0aa44d6 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -256,12 +256,21 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > #endif
> >
> > #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> > +
> > +#define KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR BIT(0)
>
> Can you go into more detail on why special handling is needed for
> SET_MEM_ATTR?

In TDX, when the VMM zaps a private page from the encrypted page table and
then adds the page back to the same GPA, the page is zeroed and the guest
needs to accept the page again. When converting a page from shared to
private, KVM needs to zap only the shared pages. So the callback needs to
know whether this zap is for converting shared-to-private or
private-to-shared.
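The decision described here can be sketched as a small predicate over the
range's flags and attributes. The flag values match this patch; the attribute
bit value and the helper itself are assumptions for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in this patch. */
#define KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR    (1u << 0)
#define KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE (1u << 1)
#define KVM_GFN_RANGE_FLAGS_GMEM_RELEASE    (1u << 2)

/* Assumed attribute bit marking the target state as private. */
#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ull << 3)

struct gfn_range_sketch {
	uint64_t attrs;
	unsigned int flags;
};

/*
 * Zap only shared mappings when a shared->private conversion is in
 * progress: zapping the private side would zero the pages and force the
 * guest to re-accept them.
 */
static bool zap_only_shared(const struct gfn_range_sketch *range)
{
	return (range->flags & KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR) &&
	       (range->attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE);
}
```

A gmem punch-hole or release, by contrast, would take the other branch and
zap the private mappings.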


> > +#define KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE BIT(1)
> > +#define KVM_GFN_RANGE_FLAGS_GMEM_RELEASE BIT(2)
>
> Would the need to distinguish between PUNCH_HOLE/RELEASE go away in the
> TDX case if you take the above approach? For SNP, the answer is yes. If
> that's also the case for TDX I think that would be another argument in
> favor of decoupling these from existing KVM MMU invalidation path.

TDX doesn't need the gmem_invalidate callback, and TDX doesn't need the
difference between punch hole and release. So in short, TDX needs
KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR and KVM_GFN_RANGE_FLAGS_GMEM.

With KVM_GFN_RANGE_FLAGS_GMEM, the callback can know that it is invalidating
private pages. Maybe by (ab)using KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR
(attr=shared), KVM_GFN_RANGE_FLAGS_GMEM can be dropped.

Thanks,
Isaku Yamahata
--
Isaku Yamahata <[email protected]>

2023-06-21 08:25:57

by Zhi Wang

Subject: Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhancement

On Mon, 19 Jun 2023 14:55:09 -0700
Vishal Annapurve <[email protected]> wrote:

> On Mon, Jun 19, 2023 at 1:11 PM Zhi Wang <[email protected]> wrote:
> >
> > On Mon, 19 Jun 2023 12:11:50 -0700
> > Vishal Annapurve <[email protected]> wrote:
> >
> > > On Thu, Jun 15, 2023 at 1:12 PM <[email protected]> wrote:
> > > > ...
> > > >
> > > > * VM type: Now we have KVM_X86_PROTECTED_VM. How do we proceed?
> > > > - Keep KVM_X86_PROTECTED_VM for its use. Introduce KVM_X86_TDX_VM
> > > > - Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce another type in
> > > > the future)
> > > > - any other way?
> > >
> > > There are selftests posted[1] in context of this work, which rely on
> > > KVM_X86_PROTECTED_VM being just the software-only psuedo-confidential
> > > VMs. In future there might be more work to expand this usecase to
> > > full-scale VMs. So it would be better to treat protected VMs as a
> > > separate type which can be used on any platform without the need of
> > > enabling TDX/SEV functionality.
> > >
> >
> > Out of curiosity, is this really a valid case in practice except selftest?
> > It sounds to me whenever KVM_X86_PROTECTED_VM is used, it has to be tied
> > with a platform-specific CC type.
>
> Protected VM effort is about being able to have guest memory ranges
> not mapped into Userspace VMM and so are unreachable for most of the
> cases from KVM as well. Non-CC VMs can use this support to mitigate
> any unintended accesses from userspace VMM/KVM possibly using
> enlightened kernels.
>
> Exact implementation of such a support warrants more discussion but it
> should be in the line of sight here as a future work item.
>
>

IIUC, what you are saying is (KVM_X86_PROTECTED_VM == (VMs with UPM or GMEM))
&& (KVM_X86_PROTECTED_VM != KVM_X86_CC_VM) && KVM_X86_CC_VM requires
KVM_X86_PROTECTED_VM.

If we think of KVM_X86_PROTECTED_VM as a dedicated feature, not tightly
coupled with CC techs, it seems we need another definition, KVM_X86_CC_VM, to
represent a CC VM, plus CC platform types like KVM_X86_CC_TDX_VM to tell
which CC tech sits behind it?

I don't think it is good to mix the usages of KVM_X86_PROTECTED_VM and
KVM_X86_CC_VM together if we are sure KVM_X86_PROTECTED_VM is not going to
represent CC VMs in the code.

>
>
> >
> > > TDX VM type can possibly serve as a specialized type of protected VM
> > > with additional arch specific capabilities enabled.
> > >
> > > [1] - https://github.com/sean-jc/linux/commits/x86/kvm_gmem_solo
> >


2023-06-21 18:55:42

by Dong, Eddie

Subject: RE: [RFC PATCH 0/6] KVM: guest memory: Misc enhancement



> -----Original Message-----
> From: Vishal Annapurve <[email protected]>
> Sent: Monday, June 19, 2023 2:55 PM
> To: Zhi Wang <[email protected]>
> Cc: Yamahata, Isaku <[email protected]>; [email protected];
> [email protected]; [email protected]; Paolo Bonzini
> <[email protected]>; Aktas, Erdem <[email protected]>;
> Christopherson,, Sean <[email protected]>; Shahar, Sagi
> <[email protected]>; David Matlack <[email protected]>; Huang, Kai
> <[email protected]>; Chen, Bo2 <[email protected]>; linux-
> [email protected]; Chao Peng <[email protected]>; Ackerley
> Tng <[email protected]>; Michael Roth <[email protected]>
> Subject: Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhacnement
>
> On Mon, Jun 19, 2023 at 1:11 PM Zhi Wang <[email protected]>
> wrote:
> >
> > On Mon, 19 Jun 2023 12:11:50 -0700
> > Vishal Annapurve <[email protected]> wrote:
> >
> > > > On Thu, Jun 15, 2023 at 1:12 PM <[email protected]> wrote:
> > > > ...
> > > >
> > > > * VM type: Now we have KVM_X86_PROTECTED_VM. How do we
> proceed?
> > > > - Keep KVM_X86_PROTECTED_VM for its use. Introduce
> KVM_X86_TDX_VM
> > > > - Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce
> another type in
> > > > the future)
> > > > - any other way?
> > >
> > > There are selftests posted[1] in context of this work, which rely on
> > > KVM_X86_PROTECTED_VM being just the software-only
> > > psuedo-confidential VMs. In future there might be more work to
> > > expand this usecase to full-scale VMs. So it would be better to
> > > treat protected VMs as a separate type which can be used on any
> > > platform without the need of enabling TDX/SEV functionality.
> > >
> >
> > Out of curiosity, is this really a valid case in practice except selftest?
> > It sounds to me whenever KVM_X86_PROTECTED_VM is used, it has to be
> > tied with a platform-specific CC type.
>
> Protected VM effort is about being able to have guest memory ranges not
> mapped into Userspace VMM and so are unreachable for most of the cases
> from KVM as well. Non-CC VMs can use this support to mitigate any
> unintended accesses from userspace VMM/KVM possibly using enlightened
> kernels.

"PROTECTED" seems to be not very close to what you mean here. "PROTECTED_MEM" ?
What case of non-CC VMs may use this feature in reality? Or do you have any expected cases?

>
> Exact implementation of such a support warrants more discussion but it
> should be in the line of sight here as a future work item.
>
>
>
>
> >
> > > TDX VM type can possibly serve as a specialized type of protected VM
> > > with additional arch specific capabilities enabled.
> > >
> > > [1] - https://github.com/sean-jc/linux/commits/x86/kvm_gmem_solo
> >

2023-06-21 21:28:40

by Vishal Annapurve

Subject: Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhancement

On Wed, Jun 21, 2023 at 11:20 AM Dong, Eddie <[email protected]> wrote:
>
>
>
> > -----Original Message-----
> > From: Vishal Annapurve <[email protected]>
> > Sent: Monday, June 19, 2023 2:55 PM
> > To: Zhi Wang <[email protected]>
> > Cc: Yamahata, Isaku <[email protected]>; [email protected];
> > [email protected]; [email protected]; Paolo Bonzini
> > <[email protected]>; Aktas, Erdem <[email protected]>;
> > Christopherson,, Sean <[email protected]>; Shahar, Sagi
> > <[email protected]>; David Matlack <[email protected]>; Huang, Kai
> > <[email protected]>; Chen, Bo2 <[email protected]>; linux-
> > [email protected]; Chao Peng <[email protected]>; Ackerley
> > Tng <[email protected]>; Michael Roth <[email protected]>
> > Subject: Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhacnement
> >
> > On Mon, Jun 19, 2023 at 1:11 PM Zhi Wang <[email protected]>
> > wrote:
> > >
> > > On Mon, 19 Jun 2023 12:11:50 -0700
> ...
> >
> > Protected VM effort is about being able to have guest memory ranges not
> > mapped into Userspace VMM and so are unreachable for most of the cases
> > from KVM as well. Non-CC VMs can use this support to mitigate any
> > unintended accesses from userspace VMM/KVM possibly using enlightened
> > kernels.
>
> "PROTECTED" seems to be not very close to what you mean here. "PROTECTED_MEM" ?
> What case of non-CC VMs may use this feature in reality? Or do you have any expected cases?
>

Similar to the pKVM efforts [1], PROTECTED_VM functionality may be used to
unmap guest memory ranges from the host and userspace VMM on x86
platforms. If the KVM/host kernel and the guest VMs are enlightened
for this use case, then it should be possible to deploy this feature
for normal VMs irrespective of the platforms they are running on.

The primary use case here would be to prevent unintended accesses from
KVM/userspace VMM which would normally go undetected at runtime or are
hard to trace back to the original culprit.

[1] https://source.android.com/docs/core/virtualization/architecture#hypervisor

2023-06-23 20:14:39

by Sean Christopherson

Subject: Re: [RFC PATCH 5/6] KVM: Add flags to struct kvm_gfn_range

On Tue, Jun 20, 2023, Isaku Yamahata wrote:
> On Tue, Jun 20, 2023 at 11:28:35AM -0500,
> Michael Roth <[email protected]> wrote:
>
> > On Thu, Jun 15, 2023 at 01:12:18PM -0700, [email protected] wrote:
> > > From: Isaku Yamahata <[email protected]>
> > >
> > > TDX and SEV-SNP need to know the reason for a callback by
> > > kvm_unmap_gfn_range(). mmu notifier, set memory attributes ioctl or KVM
> > > gmem callback. The callback handler changes the behavior or does the
> > > additional housekeeping operation. For mmu notifier, it's zapping shared
> > > PTE. For set memory attributes, it's the conversion of memory attributes
> > > (private <=> shared). For KVM gmem, it's punching a hole in the range, and
> > > releasing the file.
> >
> > I think it's still an open topic that we need to hear more from Sean about:
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > but I *think* we were leaning toward decoupling the act of invalidating
> > GFNs, vs. the act of invalidating/free'ing gmem pages.

Yes. Ignore any comments I made about not introducing new hooks, I messed up and
forgot the full context.

> > One concrete example of where this separation makes sense is with
> > hole-punching. SNP has unique platform-specific work it has to do before
> > freeing a gmem page back to the host. If we try to plumb this through
> > kvm_unmap_gfn_range() via a special flag then it's a little awkward
> > because:
> >
> > a) Presumably that hole-punch would have occurred after a preceding
> > KVM_SET_MEMORY_ATTRIBUTES was issued to switch the page to shared
> > state in the xarray. So all it should really need to do is handle
> > the platform-specific behavior, like updating the RMP table in the
> > case of SNP. But none of the other details like GFN ranges are
> > relevant in that case; RMP updates here only need the PFN, so we end
> > up walking memslots to do GFN->PFN translations, when it would
> > actually be much more efficient to do these translations by
> > translating the hole-punched FD offset range to the corresponding
> > folios backing those ranges.
> >
> > b) It places an unnecessary dependency on having an active memslot to do
> > those translations. This ends up not being an issue with the current
> > version of the gmem patchset because release() happens *before*
> > gmem_unbind(), so there is a memslot associated with the ranges at
> > gmem_release() time, but in the initial version of gmem it was the
> > reverse, so if things ever changed again in this regard we'd once
> > again have to completely rework how to issue these platform-specific
> > invalidation callbacks.
> >
> > I really *really* like having a separate, simple invalidation mechanism
> > in place that just converts FD offsets to PFNs and then passes those on
> > to platform-defined handlers to clean up pages before freeing them back
> > to the system. It's versatile in that it can be called pretty much
> > anywhere regardless of where we are in the KVM lifecycle, it's robust in
> > that it doesn't rely on unnecessary outside dependencies, and it avoids
> > adding unneeded complexity to paths like kvm_unmap_gfn_range().
> >
> > That's the approach taken with SNP hypervisor v9 series, with the
> > gmem hook being introduced here:
> >
> > https://lore.kernel.org/kvm/[email protected]/T/#m3ad8245235a27ed0f41c359c191dcda6c77af043
> >
> > and the SEV-SNP implementation of that hook being here:
> >
> > https://lore.kernel.org/kvm/[email protected]/T/#m6ac04b44722dbc07839011816e94fadf5ad6794e
> >
> > Would a similar approach work for TDX? At least WRT cleaning up pages
> > before returning them back to the host? If we could isolate that
> > requirement/handling from all the other aspects of invalidations it
> > really seems like it would cause us less headaches down the road.
>
> In short, TDX KVM doesn't need an extra callback for invalidating/freeing gmem
> pages. The kvm_unmap_gfn_range() callback works. Instead, TDX needs the
> attributes (private-or-shared) for it.

Just because TDX doesn't strictly need a separate callback doesn't mean it
wouldn't be useful and beneficial. SNP doesn't "need" a separate callback either,
i.e. we could make kvm_unmap_gfn_range() work, but it would be ugly and inflexible.

E.g. as Mike pointed out in the other thread, reclaiming physical memory when
SPTEs are zapped is suboptimal if the memory is not actually discarded.

: There's also cases like userspaces that opt to not discard memory after
: conversions because they highly favor performance over memory usage. In
: those cases it would make sense to defer marking the pages as shared in
: the RMP until the FALLOC_FL_PUNCH_HOLE, rather than triggering it via
: KVM MMU invalidation path after a conversion.

And to some extent, I would even argue that TDX does "need" the separate hook,
because doing PAGE.WBINVD while holding mmu_lock for write is going to be slooow.
Even if the PAGE.WBINVD isn't redundant, i.e. the memory is never converted back
to private, deferring the cache flush until the backing store is freed is a win
for guest performance during conversions.

> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 1a47cedae8a1..c049c0aa44d6 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -256,12 +256,21 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > > #endif
> > >
> > > #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> > > +
> > > +#define KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR BIT(0)
> >
> > Can you go into more detail on why special handling is needed for
> > SET_MEM_ATTR?
>
> With TDX, when the VMM zaps a private page from the encrypted page table and
> then adds a page back at the same GPA, the page is zeroed and the guest
> needs to accept the page again. When converting a page from shared to private,
> KVM needs to zap only the shared pages. So the callback needs to know whether
> this zap is for a shared-to-private or a private-to-shared conversion.

That doesn't answer Mike's question. You answered why KVM needs to know whether
to zap shared vs. private, but not why SET_MEM_ATTR is a special case.

> > > +#define KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE BIT(1)
> > > +#define KVM_GFN_RANGE_FLAGS_GMEM_RELEASE BIT(2)
> >
> > Would the need to distinguish between PUNCH_HOLE/RELEASE go away in the
> > TDX case if you take the above approach? For SNP, the answer is yes. If
> > that's also the case for TDX I think that would be another argument in
> > favor of decoupling these from existing KVM MMU invalidation path.
>
> TDX doesn't need a gmem_invalidate callback. TDX doesn't need to distinguish
> between punch-hole and release. So in short, TDX needs
> KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR and KVM_GFN_RANGE_FLAGS_GMEM.

TDX needs to know which flavor of mappings, i.e. EPT vs. S-EPT, is in scope; TDX
doesn't need to know who/what initiated a zap. And for that, a simple private vs.
shared flag would suffice.

However, looking at kvm_mem_attrs_changed() again, I think invoking kvm_unmap_gfn_range()
from generic KVM code is a mistake and shortsighted. Zapping in response to
*any* attribute change is very private/shared centric. E.g. if/when we extend
attributes to provide per-page RWX protections, zapping existing SPTEs in response
to granting *more* permissions may not be necessary or even desirable.

And for SNP, isn't zapping unnecessary? KVM needs to update the RMP, but there's
no need to zap NPT entries. Or do RMP lookups need to be blocked while the RMP
is being updated?

Regardless of what SNP needs to do on attribute changes, rather than using
kvm_unmap_gfn_range(), I think kvm_mem_attrs_changed() should look something like:


	struct kvm_gfn_range gfn_range;
	struct kvm_memory_slot *slot;
	struct kvm_memslots *slots;
	struct kvm_memslot_iter iter;
	bool flush = false;
	int i;

	gfn_range.may_block = true;
	gfn_range.attrs = attrs;

	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
		slots = __kvm_memslots(kvm, i);

		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
			slot = iter.slot;
			gfn_range.start = max(start, slot->base_gfn);
			gfn_range.end = min(end, slot->base_gfn + slot->npages);
			if (gfn_range.start >= gfn_range.end)
				continue;
			gfn_range.slot = slot;

			flush |= kvm_arch_set_memory_attributes(kvm, &gfn_range);
		}
	}

	if (flush)
		kvm_flush_remote_tlbs(kvm);

At that point, a single "is_private" or "is_shared" flag is awkward and arguably
wrong, because the changing attributes affect both. The fact that TDX only needs
to zap one or the other is an implementation detail, not a fundamental truth of
the update itself.

We could just have kvm_arch_set_memory_attributes() take individual params instead
of taking a kvm_gfn_range pointer, but that's a missed opportunity IMO, as this
really is just another variation of a gfn-based notification to arch MMU code.

Going with my union suggestion from the MGLRU thread[*], I think we should aim
for making kvm_gfn_range look like this:

struct kvm_gfn_range {
	struct kvm_memory_slot *slot;
	gfn_t start;
	gfn_t end;
	union {
		struct test_clear_young_metadata *metadata;
		unsigned long attributes;
		pte_t pte;
		unsigned long callback_arg; /* needs a better name */
	};
	bool only_private;
	bool only_shared;
	bool may_block;
};

That way kvm_arch_set_memory_attributes() can communicate that it affects both
shared and private mappings, i.e. can leave it to TDX to precisely zap only the
necessary tree. It's a bit unfortunate that the existing mmu_notifier usage
would need to explicitly say "only_shared", but that's literally one line of code
in __kvm_handle_hva_range(), and on the plus side would clearly document that
hva-based notifications are shared-only.

[*] https://lore.kernel.org/all/[email protected]

2023-06-24 00:42:26

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhacnement

On Thu, Jun 15, 2023, [email protected] wrote:
> * VM type: Now we have KVM_X86_PROTECTED_VM. How do we proceed?
> - Keep KVM_X86_PROTECTED_VM for its use. Introduce KVM_X86_TDX_VM
> - Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce another type in
> the future)

How would KVM differentiate between software-protected VMs and TDX VMs if we go
with this option?

> - any other way?

2023-06-26 21:27:54

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] KVM: guest memory: Misc enhacnement

On Fri, Jun 23, 2023 at 05:09:26PM -0700,
Sean Christopherson <[email protected]> wrote:

> On Thu, Jun 15, 2023, [email protected] wrote:
> > * VM type: Now we have KVM_X86_PROTECTED_VM. How do we proceed?
> > - Keep KVM_X86_PROTECTED_VM for its use. Introduce KVM_X86_TDX_VM
> > - Use KVM_X86_PROTECTED_VM for TDX. (If necessary, introduce another type in
> > the future)
>
> How would KVM differentiate between software-protected VMs and TDX VMs if we go
> with this option?

Let's introduce two new VM types; I'm fine with that. I had a KVM
capability in mind in case the SEV implementation doesn't like new VM types.
--
Isaku Yamahata <[email protected]>