2024-05-01 09:03:14

by Michael Roth

Subject: [PATCH v15 00/20] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

This patchset is also available at:

https://github.com/amdese/linux/commits/snp-host-v15

and is based on top of the series:

"Add SEV-ES hypervisor support for GHCB protocol version 2"
https://lore.kernel.org/kvm/[email protected]/
https://github.com/amdese/linux/commits/sev-init2-ghcb-v1

which in turn is based on commit 20cc50a0410f (just before v14 SNP patches):

https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=kvm-coco-queue


Patch Layout
------------

01-02: These patches revert+replace the existing .gmem_validate_fault hook
with a similar .private_max_mapping_level hook, as suggested by Sean[1]

03-04: These patches add some basic infrastructure and introduce a new
KVM_X86_SNP_VM vm_type to handle differences versus the existing
KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM types.

05-07: These implement the KVM API to handle the creation of a
cryptographic launch context, encrypt/measure the initial image
into guest memory, and finalize it before launching it (a rough
userspace sketch of this flow follows below).

08-12: These implement handling for various guest-generated events such
as page state changes, onlining of additional vCPUs, etc.

13-16: These implement the gmem/mmu hooks needed to prepare gmem-allocated
pages before mapping them into guest private memory ranges, as
well as to clean them up prior to returning them to the host for
use as normal memory. Because this supplants certain activities
like issuing WBINVDs during KVM MMU invalidations, there's also
a patch to avoid duplicating that work and incurring unnecessary
overhead.

17: With all the core support in place, this patch adds a kvm_amd module
parameter to enable SNP support.

18-20: These patches all deal with the servicing of guest requests to handle
things like attestation, as well as some related host-management
interfaces.

[1] https://lore.kernel.org/kvm/[email protected]/#t
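
As a rough illustration of the launch flow in patches 05-07, below is a
minimal userspace sketch of the initial step, driven through the generic
KVM_MEMORY_ENCRYPT_OP mechanism. The struct layout follows this series'
uapi additions, but the policy value and fd handling are illustrative
assumptions, and the LAUNCH_UPDATE/LAUNCH_FINISH steps are elided:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static int snp_launch_start(int vm_fd, int sev_fd)
  {
          struct kvm_sev_snp_launch_start start = {
                  /* bit 17: reserved, must be one; bit 16: SMT allowed */
                  .policy = 0x30000,
          };
          struct kvm_sev_cmd cmd = {
                  .id     = KVM_SEV_SNP_LAUNCH_START,
                  .data   = (__u64)(uintptr_t)&start,
                  .sev_fd = sev_fd,
          };

          return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
  }

This assumes the VM was created with the KVM_X86_SNP_VM type and
initialized via KVM_SEV_INIT2, per patches 03-04.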


Testing
-------

For testing this via QEMU, use the following tree:

https://github.com/amdese/qemu/commits/snp-v4-wip3c

A patched OVMF is also needed due to upstream KVM no longer supporting MMIO
ranges that are mapped as private. It is recommended you build the AmdSevX64
variant, as it provides the kernel-hashing support used by the examples below:

https://github.com/amdese/ovmf/commits/apic-mmio-fix1d

A basic command-line invocation for SNP would be:

qemu-system-x86_64 -smp 32,maxcpus=255 -cpu EPYC-Milan-v2 \
  -machine q35,confidential-guest-support=sev0,memory-backend=ram1 \
  -object memory-backend-memfd,id=ram1,size=4G,share=true,reserve=false \
  -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,id-auth= \
  -bios OVMF_CODE-upstream-20240410-apic-mmio-fix1d-AmdSevX64.fd

With kernel-hashing and certificate data supplied:

qemu-system-x86_64 -smp 32,maxcpus=255 -cpu EPYC-Milan-v2 \
  -machine q35,confidential-guest-support=sev0,memory-backend=ram1 \
  -object memory-backend-memfd,id=ram1,size=4G,share=true,reserve=false \
  -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,id-auth=,certs-path=/home/mroth/cert.blob,kernel-hashes=on \
  -bios OVMF_CODE-upstream-20240410-apic-mmio-fix1d-AmdSevX64.fd \
  -kernel /boot/vmlinuz-$ver \
  -initrd /boot/initrd.img-$ver \
  -append "root=UUID=d72a6d1c-06cf-4b79-af43-f1bac4f620f9 ro console=ttyS0,115200n8"

With the standard X64 OVMF package and a separate image for persistent NVRAM:

qemu-system-x86_64 -smp 32,maxcpus=255 -cpu EPYC-Milan-v2 \
  -machine q35,confidential-guest-support=sev0,memory-backend=ram1 \
  -object memory-backend-memfd,id=ram1,size=4G,share=true,reserve=false \
  -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,id-auth= \
  -bios OVMF_CODE-upstream-20240410-apic-mmio-fix1d.fd \
  -drive if=pflash,format=raw,unit=0,file=OVMF_VARS-upstream-20240410-apic-mmio-fix1d.fd,readonly=off


Known issues / TODOs
--------------------

* Base tree in some cases reports "Unpatched return thunk in use. This should
not happen!" the first time it runs an SVM/SEV/SNP guest. This is a recent
regression upstream and unrelated to this series:

https://lore.kernel.org/linux-kernel/CANpmjNOcKzEvLHoGGeL-boWDHJobwfwyVxUqMq2kWeka3N4tXA@mail.gmail.com/T/

* 2MB hugepage support has been dropped pending discussion on how we plan to
re-enable it in gmem.

* Host kexec should work, but there is a known issue with host kdump support
while SNP guests are running that will be addressed as a follow-up.

* SNP kselftests are currently a WIP and will be included as part of SNP
upstreaming efforts in the near-term.


SEV-SNP Overview
----------------

This part of the Secure Nested Paging (SEV-SNP) series focuses on the
changes required to add KVM support for SEV-SNP. This series builds upon
SEV-SNP guest support, which is now in mainline, and SEV-SNP host
initialization support, which is now in linux-next.

While this series provides the basic building blocks to support booting
SEV-SNP VMs, it does not cover all the security enhancements introduced by
SEV-SNP, such as interrupt protection, which will be added in the future.

With SNP, when pages are marked as guest-owned in the RMP table, they are
assigned to a specific guest/ASID, as well as a specific GFN within the
guest. Any attempt to remap the page in the RMP table to a different
guest/ASID, or a different GFN within a guest/ASID, will result in an RMP
nested page fault.

Prior to accessing a guest-owned page, the guest must validate it with the
special PVALIDATE instruction, which sets the Validated bit in the page's
RMP entry. This is the only way to set the Validated bit outside of the
initial pre-encrypted guest payload/image; any attempt outside the guest to
modify the RMP entry from that point forward will result in the Validated
bit being cleared, at which point the guest will trigger an exception if it
attempts to access that page, so it can be made aware of possible tampering.
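
For reference, below is a sketch of the guest-side validation step, modeled
on the pvalidate() helper in the Linux guest code; the raw opcode bytes
encode PVALIDATE for toolchains that lack the mnemonic, and the
FAIL_NOUPDATE handling is simplified here:

  #include <stdbool.h>

  /* rmp_psize: 0 = 4K page, 1 = 2M page */
  static int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate)
  {
          bool no_rmpupdate;
          int rc;

          asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFF\n\t"
                       "setc %[no_rmpupdate]"
                       : [no_rmpupdate] "=rm" (no_rmpupdate), "=a" (rc)
                       : "a" (vaddr), "c" (rmp_psize), "d" (validate)
                       : "memory", "cc");

          /* CF set: the RMP entry was already in the requested state */
          if (no_rmpupdate)
                  return 255; /* PVALIDATE_FAIL_NOUPDATE */

          return rc;
  }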

One exception to this is the initial guest payload, which is pre-validated
by the firmware prior to launching. The guest can use Guest Message requests
to fetch an attestation report which will include the measurement of the
initial image so that the guest can verify it was booted with the expected
image/environment.
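
As an example of the guest-side interface for this, below is a minimal
sketch using the sev-guest driver's SNP_GET_REPORT ioctl, assuming the
guest kernel exposes /dev/sev-guest (error handling elided):

  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/sev-guest.h>

  static int snp_get_report(struct snp_report_resp *resp)
  {
          struct snp_report_req req = { .vmpl = 0 };
          struct snp_guest_request_ioctl guest_req = {
                  .msg_version = 1,
                  .req_data    = (__u64)(uintptr_t)&req,
                  .resp_data   = (__u64)(uintptr_t)resp,
          };
          int fd = open("/dev/sev-guest", O_RDWR);

          return fd < 0 ? fd : ioctl(fd, SNP_GET_REPORT, &guest_req);
  }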

After boot, guests can use Page State Change requests to switch pages
between shared/hypervisor-owned and private/guest-owned to share data for
things like DMA, virtio buffers, and other GHCB requests.

In this implementation of SEV-SNP, private guest memory is managed by a new
kernel framework called guest_memfd (gmem). With gmem, a new
KVM_SET_MEMORY_ATTRIBUTES KVM ioctl has been added to tell the KVM
MMU whether a particular GFN should be backed by shared (normal) memory or
private (gmem-allocated) memory. To tie into this, Page State Change
requests are forwarded to userspace via KVM_HC_MAP_GPA_RANGE hypercall
exits (see the changelog below), and userspace then issues the
corresponding KVM_SET_MEMORY_ATTRIBUTES call to set the private/shared
state in the KVM MMU.
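
Below is a minimal sketch of that userspace side, assuming the VMM has
enabled forwarding of KVM_HC_MAP_GPA_RANGE via KVM_CAP_EXIT_HYPERCALL and
treats args[1] as a count of 4KiB pages:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>
  #include <linux/kvm_para.h>

  static int handle_map_gpa_range(int vm_fd, struct kvm_run *run)
  {
          struct kvm_memory_attributes attrs = {
                  .address    = run->hypercall.args[0],
                  .size       = run->hypercall.args[1] * 4096,
                  .attributes = (run->hypercall.args[2] & KVM_MAP_GPA_RANGE_ENCRYPTED)
                                ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
          };

          run->hypercall.ret = 0; /* report success back to the guest */
          return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
  }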

The gmem / KVM MMU hooks implemented in this series will then update the RMP
table entries for the backing PFNs to set them to guest-owned/private when
mapping private pages into the guest via the KVM MMU. Shared pages use the
normal KVM MMU handling, and their corresponding RMP table entries are left
in the default shared/hypervisor-owned state.

Feedback/review is very much appreciated!

-Mike


Changes since v14:

* switch to vendor-agnostic KVM_HC_MAP_GPA_RANGE exit for forwarding
page-state change requests to userspace instead of an SNP-specific exit
(Sean)
* drop SNP_PAUSE_ATTESTATION/SNP_RESUME_ATTESTATION interfaces, instead
add handling in KVM_EXIT_VMGEXIT so that VMMs can implement their own
mechanisms for keeping userspace-supplied certificates in-sync with
firmware's TCB/endorsement key (Sean)
* carve out SEV-ES-specific handling for GHCB protocol 2, add control of
the protocol version, and post as a separate prereq patchset (Sean)
* use more consistent error-handling in snp_launch_{start,update,finish},
simplify logic based on review comments (Sean)
* rename .gmem_validate_fault to .private_max_mapping_level and rework
logic based on review suggestions (Sean)
* reduce number of pr_debug()'s in series, avoid multiple WARN's in
succession (Sean)
* improve documentation and comments throughout

Changes since v13:

* rebase to new kvm-coco-queue and wire up to PFERR_PRIVATE_ACCESS (Paolo)
* handle setting kvm->arch.has_private_mem in same location as
kvm->arch.has_protected_state (Paolo)
* add flags and additional padding fields to
snp_launch{start,update,finish} APIs to address alignment and
expandability (Paolo)
* update snp_launch_update() to update input struct values to reflect
current progress of the command in situations where multiple calls are
needed (Paolo)
* update snp_launch_update() to avoid copying/accessing 'src' parameter
when dealing with zero pages. (Paolo)
* update snp_launch_update() to use u64 as length input parameter instead
of u32 and adjust padding accordingly
* modify ordering of SNP_POLICY_MASK_* definitions to be consistent with
bit order of corresponding flags
* let firmware handle enforcement of policy bits corresponding to
user-specified minimum API version
* add missing "0x" prefixes in pr_debug()'s for snp_launch_start()
* fix handling of VMSAs during in-place migration (Paolo)

Changes since v12:

* rebased to latest kvm-coco-queue branch (commit 4d2deb62185f)
* add more input validation for SNP_LAUNCH_START, especially for handling
things like MBO/MBZ policy bits, and API major/minor minimums. (Paolo)
* block SNP KVM instances from being able to run legacy SEV commands (Paolo)
* don't attempt to measure VMSA for vcpu 0/BSP before the others, let
userspace deal with the ordering just like with SEV-ES (Paolo)
* fix up docs for SNP_LAUNCH_FINISH (Paolo)
* introduce svm->sev_es.snp_has_guest_vmsa flag to better distinguish
handling for guest-mapped vs non-guest-mapped VMSAs, rename
'snp_ap_create' flag to 'snp_ap_waiting_for_reset' (Paolo)
* drop "KVM: SEV: Use a VMSA physical address variable for populating VMCB"
as it is no longer needed due to above VMSA rework
* replace pr_debug_ratelimited() messages for RMP #NPFs with a single trace
event
* handle transient PSMASH_FAIL_INUSE return codes in kvm_gmem_invalidate(),
switch to WARN_ON*()'s to indicate remaining error cases are not expected
and should not be seen in practice. (Paolo)
* add a cond_resched() in kvm_gmem_invalidate() to avoid soft lock-ups when
cleaning up large guest memory ranges.
* rename VLEK_REQUIRED to VCEK_DISABLE, as it's more applicable if another
key type ever gets added.
* don't allow attestation to be paused while an attestation request is
being processed by firmware (Tom)
* add missing Documentation entry for SNP_VLEK_LOAD
* collect Reviewed-by's from Paolo and Tom


----------------------------------------------------------------
Ashish Kalra (1):
KVM: SEV: Avoid WBINVD for HVA-based MMU notifications for SNP

Brijesh Singh (8):
KVM: SEV: Add initial SEV-SNP support
KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command
KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command
KVM: SEV: Add KVM_SEV_SNP_LAUNCH_FINISH command
KVM: SEV: Add support to handle GHCB GPA register VMGEXIT
KVM: SEV: Add support to handle RMP nested page faults
KVM: SVM: Add module parameter to enable SEV-SNP
KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event

Michael Roth (10):
Revert "KVM: x86: Add gmem hook for determining max NPT mapping level"
KVM: x86: Add hook for determining max NPT mapping level
KVM: SEV: Select KVM_GENERIC_PRIVATE_MEM when CONFIG_KVM_AMD_SEV=y
KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT
KVM: SEV: Add support to handle Page State Change VMGEXIT
KVM: SEV: Implement gmem hook for initializing private pages
KVM: SEV: Implement gmem hook for invalidating private pages
KVM: x86: Implement hook for determining max NPT mapping level
KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event
crypto: ccp: Add the SNP_VLEK_LOAD command

Tom Lendacky (1):
KVM: SEV: Support SEV-SNP AP Creation NAE event

Documentation/virt/coco/sev-guest.rst | 19 +
Documentation/virt/kvm/api.rst | 87 ++
.../virt/kvm/x86/amd-memory-encryption.rst | 110 +-
arch/x86/include/asm/kvm-x86-ops.h | 2 +-
arch/x86/include/asm/kvm_host.h | 5 +-
arch/x86/include/asm/sev-common.h | 25 +
arch/x86/include/asm/sev.h | 3 +
arch/x86/include/asm/svm.h | 9 +-
arch/x86/include/uapi/asm/kvm.h | 48 +
arch/x86/kvm/Kconfig | 3 +
arch/x86/kvm/mmu.h | 2 -
arch/x86/kvm/mmu/mmu.c | 27 +-
arch/x86/kvm/svm/sev.c | 1538 +++++++++++++++++++-
arch/x86/kvm/svm/svm.c | 44 +-
arch/x86/kvm/svm/svm.h | 52 +
arch/x86/kvm/trace.h | 31 +
arch/x86/kvm/x86.c | 17 +
drivers/crypto/ccp/sev-dev.c | 36 +
include/linux/psp-sev.h | 4 +-
include/uapi/linux/kvm.h | 23 +
include/uapi/linux/psp-sev.h | 27 +
include/uapi/linux/sev-guest.h | 9 +
virt/kvm/guest_memfd.c | 4 +-
23 files changed, 2081 insertions(+), 44 deletions(-)



2024-05-01 09:04:25

by Michael Roth

Subject: [PATCH v15 13/20] KVM: SEV: Implement gmem hook for initializing private pages

This will handle the RMP table updates needed to put a page into a
private state before mapping it into an SEV-SNP guest.

Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/svm/sev.c | 98 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.c | 2 +
arch/x86/kvm/svm/svm.h | 5 +++
arch/x86/kvm/x86.c | 5 +++
virt/kvm/guest_memfd.c | 4 +-
6 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 5e72faca4e8f..10768f13b240 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -137,6 +137,7 @@ config KVM_AMD_SEV
depends on CRYPTO_DEV_SP_PSP && !(KVM_AMD=y && CRYPTO_DEV_CCP_DD=m)
select ARCH_HAS_CC_PLATFORM
select KVM_GENERIC_PRIVATE_MEM
+ select HAVE_KVM_GMEM_PREPARE
help
Provides support for launching Encrypted VMs (SEV) and Encrypted VMs
with Encrypted State (SEV-ES) on AMD processors.
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 69ec8f577763..0439ec12fa90 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -4557,3 +4557,101 @@ void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
out_no_trace:
put_page(pfn_to_page(pfn));
}
+
+static bool is_pfn_range_shared(kvm_pfn_t start, kvm_pfn_t end)
+{
+ kvm_pfn_t pfn = start;
+
+ while (pfn < end) {
+ int ret, rmp_level;
+ bool assigned;
+
+ ret = snp_lookup_rmpentry(pfn, &assigned, &rmp_level);
+ if (ret) {
+ pr_warn_ratelimited("SEV: Failed to retrieve RMP entry: PFN 0x%llx GFN start 0x%llx GFN end 0x%llx RMP level %d error %d\n",
+ pfn, start, end, rmp_level, ret);
+ return false;
+ }
+
+ if (assigned) {
+ pr_debug("%s: overlap detected, PFN 0x%llx start 0x%llx end 0x%llx RMP level %d\n",
+ __func__, pfn, start, end, rmp_level);
+ return false;
+ }
+
+ pfn++;
+ }
+
+ return true;
+}
+
+static u8 max_level_for_order(int order)
+{
+ if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+ return PG_LEVEL_2M;
+
+ return PG_LEVEL_4K;
+}
+
+static bool is_large_rmp_possible(struct kvm *kvm, kvm_pfn_t pfn, int order)
+{
+ kvm_pfn_t pfn_aligned = ALIGN_DOWN(pfn, PTRS_PER_PMD);
+
+ /*
+ * If this is a large folio, and the entire 2M range containing the
+ * PFN is currently shared, then the entire 2M-aligned range can be
+ * set to private via a single 2M RMP entry.
+ */
+ if (max_level_for_order(order) > PG_LEVEL_4K &&
+ is_pfn_range_shared(pfn_aligned, pfn_aligned + PTRS_PER_PMD))
+ return true;
+
+ return false;
+}
+
+int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ kvm_pfn_t pfn_aligned;
+ gfn_t gfn_aligned;
+ int level, rc;
+ bool assigned;
+
+ if (!sev_snp_guest(kvm))
+ return 0;
+
+ rc = snp_lookup_rmpentry(pfn, &assigned, &level);
+ if (rc) {
+ pr_err_ratelimited("SEV: Failed to look up RMP entry: GFN %llx PFN %llx error %d\n",
+ gfn, pfn, rc);
+ return -ENOENT;
+ }
+
+ if (assigned) {
+ pr_debug("%s: already assigned: gfn %llx pfn %llx max_order %d level %d\n",
+ __func__, gfn, pfn, max_order, level);
+ return 0;
+ }
+
+ if (is_large_rmp_possible(kvm, pfn, max_order)) {
+ level = PG_LEVEL_2M;
+ pfn_aligned = ALIGN_DOWN(pfn, PTRS_PER_PMD);
+ gfn_aligned = ALIGN_DOWN(gfn, PTRS_PER_PMD);
+ } else {
+ level = PG_LEVEL_4K;
+ pfn_aligned = pfn;
+ gfn_aligned = gfn;
+ }
+
+ rc = rmp_make_private(pfn_aligned, gfn_to_gpa(gfn_aligned), level, sev->asid, false);
+ if (rc) {
+ pr_err_ratelimited("SEV: Failed to update RMP entry: GFN %llx PFN %llx level %d error %d\n",
+ gfn, pfn, level, rc);
+ return -EINVAL;
+ }
+
+ pr_debug("%s: updated: gfn %llx pfn %llx pfn_aligned %llx max_order %d level %d\n",
+ __func__, gfn, pfn, pfn_aligned, max_order, level);
+
+ return 0;
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b70556608e8d..60783e9f2ae8 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5085,6 +5085,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
.alloc_apic_backing_page = svm_alloc_apic_backing_page,
+
+ .gmem_prepare = sev_gmem_prepare,
};

/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 858e74a26fab..ff1aca7e10e9 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -736,6 +736,7 @@ extern unsigned int max_sev_asid;
void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
void sev_vcpu_unblocking(struct kvm_vcpu *vcpu);
void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
+int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
#else
static inline struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu) {
return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
@@ -752,6 +753,10 @@ static inline int sev_dev_get_attr(u32 group, u64 attr, u64 *val) { return -ENXI
static inline void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code) {}
static inline void sev_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
static inline void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu) {}
+static inline int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order)
+{
+ return 0;
+}

#endif

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b20f6c1b8214..0fb76ef9b7e9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13610,6 +13610,11 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
EXPORT_SYMBOL_GPL(kvm_arch_no_poll);

#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+bool kvm_arch_gmem_prepare_needed(struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_SNP_VM;
+}
+
int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order)
{
return static_call(kvm_x86_gmem_prepare)(kvm, pfn, gfn, max_order);
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index a44f983eb673..7d3932e5a689 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -46,8 +46,8 @@ static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct fol
gfn = slot->base_gfn + index - slot->gmem.pgoff;
rc = kvm_arch_gmem_prepare(kvm, gfn, pfn, compound_order(compound_head(page)));
if (rc) {
- pr_warn_ratelimited("gmem: Failed to prepare folio for index %lx, error %d.\n",
- index, rc);
+ pr_warn_ratelimited("gmem: Failed to prepare folio for index %lx GFN %llx PFN %llx error %d.\n",
+ index, gfn, pfn, rc);
return rc;
}
}
--
2.25.1


2024-05-01 09:04:58

by Michael Roth

Subject: [PATCH v15 14/20] KVM: SEV: Implement gmem hook for invalidating private pages

Implement a platform hook to do the work of restoring the direct map
entries of gmem-managed pages and transitioning the corresponding RMP
table entries back to the default shared/hypervisor-owned state.

Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/svm/sev.c | 64 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.c | 1 +
arch/x86/kvm/svm/svm.h | 2 ++
4 files changed, 68 insertions(+)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 10768f13b240..2a7f69abcac3 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -138,6 +138,7 @@ config KVM_AMD_SEV
select ARCH_HAS_CC_PLATFORM
select KVM_GENERIC_PRIVATE_MEM
select HAVE_KVM_GMEM_PREPARE
+ select HAVE_KVM_GMEM_INVALIDATE
help
Provides support for launching Encrypted VMs (SEV) and Encrypted VMs
with Encrypted State (SEV-ES) on AMD processors.
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 0439ec12fa90..cb89f6eba6ea 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -4655,3 +4655,67 @@ int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order)

return 0;
}
+
+void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
+{
+ kvm_pfn_t pfn;
+
+ pr_debug("%s: PFN start 0x%llx PFN end 0x%llx\n", __func__, start, end);
+
+ for (pfn = start; pfn < end;) {
+ bool use_2m_update = false;
+ int rc, rmp_level;
+ bool assigned;
+
+ rc = snp_lookup_rmpentry(pfn, &assigned, &rmp_level);
+ if (WARN_ONCE(rc, "SEV: Failed to retrieve RMP entry for PFN 0x%llx error %d\n",
+ pfn, rc))
+ goto next_pfn;
+
+ if (!assigned)
+ goto next_pfn;
+
+ use_2m_update = IS_ALIGNED(pfn, PTRS_PER_PMD) &&
+ end >= (pfn + PTRS_PER_PMD) &&
+ rmp_level > PG_LEVEL_4K;
+
+ /*
+ * If an unaligned PFN corresponds to a 2M region assigned as a
+ * large page in the RMP table, PSMASH the region into individual
+ * 4K RMP entries before attempting to convert a 4K sub-page.
+ */
+ if (!use_2m_update && rmp_level > PG_LEVEL_4K) {
+ /*
+ * This shouldn't fail, but if it does, report it, but
+ * still try to update RMP entry to shared and pray this
+ * was a spurious error that can be addressed later.
+ */
+ rc = snp_rmptable_psmash(pfn);
+ WARN_ONCE(rc, "SEV: Failed to PSMASH RMP entry for PFN 0x%llx error %d\n",
+ pfn, rc);
+ }
+
+ rc = rmp_make_shared(pfn, use_2m_update ? PG_LEVEL_2M : PG_LEVEL_4K);
+ if (WARN_ONCE(rc, "SEV: Failed to update RMP entry for PFN 0x%llx error %d\n",
+ pfn, rc))
+ goto next_pfn;
+
+ /*
+ * SEV-ES avoids host/guest cache coherency issues through
+ * WBINVD hooks issued via MMU notifiers during run-time, and
+ * KVM's VM destroy path at shutdown. Those MMU notifier events
+ * don't cover gmem since there is no requirement to map pages
+ * to a HVA in order to use them for a running guest. While the
+ * shutdown path would still likely cover things for SNP guests,
+ * userspace may also free gmem pages during run-time via
+ * hole-punching operations on the guest_memfd, so flush the
+ * cache entries for these pages before free'ing them back to
+ * the host.
+ */
+ clflush_cache_range(__va(pfn_to_hpa(pfn)),
+ use_2m_update ? PMD_SIZE : PAGE_SIZE);
+next_pfn:
+ pfn += use_2m_update ? PTRS_PER_PMD : 1;
+ cond_resched();
+ }
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 60783e9f2ae8..29dc5fa28d97 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5087,6 +5087,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.alloc_apic_backing_page = svm_alloc_apic_backing_page,

.gmem_prepare = sev_gmem_prepare,
+ .gmem_invalidate = sev_gmem_invalidate,
};

/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index ff1aca7e10e9..f91096722e29 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -737,6 +737,7 @@ void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
void sev_vcpu_unblocking(struct kvm_vcpu *vcpu);
void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
+void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
#else
static inline struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu) {
return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
@@ -757,6 +758,7 @@ static inline int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, in
{
return 0;
}
+static inline void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end) {}

#endif

--
2.25.1


2024-05-01 09:05:19

by Michael Roth

Subject: [PATCH v15 15/20] KVM: x86: Implement hook for determining max NPT mapping level

In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
2MB mapping in the guest's nested page table depends on whether or not
any subpages within the range have already been initialized as private
in the RMP table. The existing mixed-attribute tracking in KVM is
insufficient here, for instance:

- gmem allocates 2MB page
- guest issues PVALIDATE on 2MB page
- guest later converts a subpage to shared
- SNP host code issues PSMASH to split 2MB RMP mapping to 4K
- KVM MMU splits NPT mapping to 4K
- guest later converts that shared page back to private

At this point there are no mixed attributes, and KVM would normally
allow for 2MB NPT mappings again, but this is actually not allowed
because the RMP table mappings are 4K and cannot be promoted on the
hypervisor side, so the NPT mappings must still be limited to 4K to
match this.

Implement a kvm_x86_ops.private_max_mapping_level() hook for SEV that
checks for this condition and adjusts the mapping level accordingly.

Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/sev.c | 15 +++++++++++++++
arch/x86/kvm/svm/svm.c | 1 +
arch/x86/kvm/svm/svm.h | 5 +++++
3 files changed, 21 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index cb89f6eba6ea..224fdab32950 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -4719,3 +4719,18 @@ void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
cond_resched();
}
}
+
+int sev_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
+{
+ int level, rc;
+ bool assigned;
+
+ if (!sev_snp_guest(kvm))
+ return 0;
+
+ rc = snp_lookup_rmpentry(pfn, &assigned, &level);
+ if (rc || !assigned)
+ return PG_LEVEL_4K;
+
+ return level;
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 29dc5fa28d97..426ad49325d7 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5088,6 +5088,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {

.gmem_prepare = sev_gmem_prepare,
.gmem_invalidate = sev_gmem_invalidate,
+ .private_max_mapping_level = sev_private_max_mapping_level,
};

/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index f91096722e29..e325ede0f463 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -738,6 +738,7 @@ void sev_vcpu_unblocking(struct kvm_vcpu *vcpu);
void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
+int sev_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
#else
static inline struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu) {
return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
@@ -759,6 +760,10 @@ static inline int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, in
return 0;
}
static inline void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end) {}
+static inline int sev_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
+{
+ return 0;
+}

#endif

--
2.25.1


2024-05-01 09:05:40

by Michael Roth

Subject: [PATCH v15 16/20] KVM: SEV: Avoid WBINVD for HVA-based MMU notifications for SNP

From: Ashish Kalra <[email protected]>

With SNP/guest_memfd, private/encrypted memory should not be mappable,
and MMU notifications for HVA-mapped memory will only be relevant to
unencrypted guest memory. Therefore, the rationale behind issuing a
wbinvd_on_all_cpus() in sev_guest_memory_reclaimed() should not apply
for SNP guests and can be ignored.

Signed-off-by: Ashish Kalra <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
[mdr: Add some clarifications in commit]
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/sev.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 224fdab32950..e94e3aa4d932 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3039,7 +3039,13 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)

void sev_guest_memory_reclaimed(struct kvm *kvm)
{
- if (!sev_guest(kvm))
+ /*
+ * With SNP+gmem, private/encrypted memory is unreachable via the
+ * hva-based mmu notifiers, so these events are only actually
+ * pertaining to shared pages where there is no need to perform
+ * the WBINVD to flush associated caches.
+ */
+ if (!sev_guest(kvm) || sev_snp_guest(kvm))
return;

wbinvd_on_all_cpus();
--
2.25.1


2024-05-01 09:06:04

by Michael Roth

Subject: [PATCH v15 17/20] KVM: SVM: Add module parameter to enable SEV-SNP

From: Brijesh Singh <[email protected]>

Add a module parameter that can be used to enable or disable the SEV-SNP
feature. Now that KVM contains support for SNP, set the GHCB hypervisor
feature flag to indicate that SNP is supported.

Signed-off-by: Brijesh Singh <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
---
arch/x86/kvm/svm/sev.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index e94e3aa4d932..112041ee55e9 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -49,7 +49,8 @@ static bool sev_es_enabled = true;
module_param_named(sev_es, sev_es_enabled, bool, 0444);

/* enable/disable SEV-SNP support */
-static bool sev_snp_enabled;
+static bool sev_snp_enabled = true;
+module_param_named(sev_snp, sev_snp_enabled, bool, 0444);

/* enable/disable SEV-ES DebugSwap support */
static bool sev_es_debug_swap_enabled = true;
--
2.25.1


2024-05-01 09:06:25

by Michael Roth

Subject: [PATCH v15 18/20] KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event

From: Brijesh Singh <[email protected]>

Version 2 of the GHCB specification added support for the SNP Guest Request
Message NAE event. The event allows an SEV-SNP guest to make
requests to the SEV-SNP firmware through the hypervisor using the
SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification.

This is used by guests primarily to request attestation reports from
firmware. Other request types are available as well, but the
specifics of what guest requests are being made are opaque to the
hypervisor, which only serves as a proxy for the guest requests and
firmware responses.

Implement handling for these events.

Signed-off-by: Brijesh Singh <[email protected]>
Co-developed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Co-developed-by: Ashish Kalra <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Reviewed-by: Tom Lendacky <[email protected]>
[mdr: ensure FW command failures are indicated to guest, drop extended
request handling to be re-written as separate patch, massage commit]
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/sev.c | 86 ++++++++++++++++++++++++++++++++++
include/uapi/linux/sev-guest.h | 9 ++++
2 files changed, 95 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 112041ee55e9..5c6262f3232f 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -19,6 +19,7 @@
#include <linux/misc_cgroup.h>
#include <linux/processor.h>
#include <linux/trace_events.h>
+#include <uapi/linux/sev-guest.h>

#include <asm/pkru.h>
#include <asm/trapnr.h>
@@ -3292,6 +3293,10 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
if (!sev_snp_guest(vcpu->kvm) || !kvm_ghcb_sw_scratch_is_valid(svm))
goto vmgexit_err;
break;
+ case SVM_VMGEXIT_GUEST_REQUEST:
+ if (!sev_snp_guest(vcpu->kvm))
+ goto vmgexit_err;
+ break;
default:
reason = GHCB_ERR_INVALID_EVENT;
goto vmgexit_err;
@@ -3906,6 +3911,83 @@ static int sev_snp_ap_creation(struct vcpu_svm *svm)
return ret;
}

+static int snp_setup_guest_buf(struct kvm *kvm, struct sev_data_snp_guest_request *data,
+ gpa_t req_gpa, gpa_t resp_gpa)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ kvm_pfn_t req_pfn, resp_pfn;
+
+ if (!PAGE_ALIGNED(req_gpa) || !PAGE_ALIGNED(resp_gpa))
+ return -EINVAL;
+
+ req_pfn = gfn_to_pfn(kvm, gpa_to_gfn(req_gpa));
+ if (is_error_noslot_pfn(req_pfn))
+ return -EINVAL;
+
+ resp_pfn = gfn_to_pfn(kvm, gpa_to_gfn(resp_gpa));
+ if (is_error_noslot_pfn(resp_pfn))
+ return -EINVAL;
+
+ if (rmp_make_private(resp_pfn, 0, PG_LEVEL_4K, 0, true))
+ return -EINVAL;
+
+ data->gctx_paddr = __psp_pa(sev->snp_context);
+ data->req_paddr = __sme_set(req_pfn << PAGE_SHIFT);
+ data->res_paddr = __sme_set(resp_pfn << PAGE_SHIFT);
+
+ return 0;
+}
+
+static int snp_cleanup_guest_buf(struct sev_data_snp_guest_request *data)
+{
+ u64 pfn = __sme_clr(data->res_paddr) >> PAGE_SHIFT;
+
+ if (snp_page_reclaim(pfn) || rmp_make_shared(pfn, PG_LEVEL_4K))
+ return -EINVAL;
+
+ return 0;
+}
+
+static int __snp_handle_guest_req(struct kvm *kvm, gpa_t req_gpa, gpa_t resp_gpa,
+ sev_ret_code *fw_err)
+{
+ struct sev_data_snp_guest_request data = {0};
+ struct kvm_sev_info *sev;
+ int ret;
+
+ if (!sev_snp_guest(kvm))
+ return -EINVAL;
+
+ sev = &to_kvm_svm(kvm)->sev_info;
+
+ ret = snp_setup_guest_buf(kvm, &data, req_gpa, resp_gpa);
+ if (ret)
+ return ret;
+
+ ret = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, fw_err);
+ if (ret)
+ return ret;
+
+ ret = snp_cleanup_guest_buf(&data);
+ if (ret)
+ return ret;
+
+ return 0;
+}
+
+static void snp_handle_guest_req(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa)
+{
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct kvm *kvm = vcpu->kvm;
+ sev_ret_code fw_err = 0;
+ int vmm_ret = 0;
+
+ if (__snp_handle_guest_req(kvm, req_gpa, resp_gpa, &fw_err))
+ vmm_ret = SNP_GUEST_VMM_ERR_GENERIC;
+
+ ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SNP_GUEST_ERR(vmm_ret, fw_err));
+}
+
static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
{
struct vmcb_control_area *control = &svm->vmcb->control;
@@ -4178,6 +4260,10 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, GHCB_ERR_INVALID_INPUT);
}

+ ret = 1;
+ break;
+ case SVM_VMGEXIT_GUEST_REQUEST:
+ snp_handle_guest_req(svm, control->exit_info_1, control->exit_info_2);
ret = 1;
break;
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
diff --git a/include/uapi/linux/sev-guest.h b/include/uapi/linux/sev-guest.h
index 154a87a1eca9..7bd78e258569 100644
--- a/include/uapi/linux/sev-guest.h
+++ b/include/uapi/linux/sev-guest.h
@@ -89,8 +89,17 @@ struct snp_ext_report_req {
#define SNP_GUEST_FW_ERR_MASK GENMASK_ULL(31, 0)
#define SNP_GUEST_VMM_ERR_SHIFT 32
#define SNP_GUEST_VMM_ERR(x) (((u64)x) << SNP_GUEST_VMM_ERR_SHIFT)
+#define SNP_GUEST_FW_ERR(x) ((x) & SNP_GUEST_FW_ERR_MASK)
+#define SNP_GUEST_ERR(vmm_err, fw_err) (SNP_GUEST_VMM_ERR(vmm_err) | \
+ SNP_GUEST_FW_ERR(fw_err))

+/*
+ * The GHCB spec only formally defines INVALID_LEN/BUSY VMM errors, but define
+ * a GENERIC error code such that it won't ever conflict with GHCB-defined
+ * errors if any get added in the future.
+ */
#define SNP_GUEST_VMM_ERR_INVALID_LEN 1
#define SNP_GUEST_VMM_ERR_BUSY 2
+#define SNP_GUEST_VMM_ERR_GENERIC BIT(31)

#endif /* __UAPI_LINUX_SEV_GUEST_H_ */
--
2.25.1


2024-05-01 09:06:47

by Michael Roth

Subject: [PATCH v15 19/20] KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event

Version 2 of the GHCB specification added support for the SNP Extended Guest
Request Message NAE event. This event serves a nearly identical purpose
to the previously-added SNP_GUEST_REQUEST event, but additionally allows
certificate data to be supplied via a guest-supplied buffer, used mainly
for verifying the signature of an attestation report as returned by
firmware.

This certificate data is supplied by userspace, so unlike with
SNP_GUEST_REQUEST events, SNP_EXTENDED_GUEST_REQUEST events are first
forwarded to userspace via a KVM_EXIT_VMGEXIT exit structure, and then
the firmware request is made after the certificate data has been fetched
from userspace.

Since there is a potential for race conditions where the
userspace-supplied certificate data may be out-of-sync relative to the
reported TCB or VLEK that firmware will use when signing attestation
reports, a hook is also provided so that userspace can be informed once
the attestation request is actually completed. See the updates to
Documentation/ for more details on these aspects.
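
As a rough fragment of the VMM side of this protocol (cert_blob,
cert_npages, gpa_to_hva(), and the certs_lock_*/certs_unlock() helpers
below are hypothetical placeholders for VMM-specific state and locking):

  static void handle_req_certs(struct kvm_run *run)
  {
          struct kvm_user_vmgexit *vmgexit = &run->vmgexit;

          if (vmgexit->req_certs.status == KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE) {
                  certs_unlock(); /* request fully completed, drop the lock */
                  return;
          }

          certs_lock_shared(); /* keep blob in sync with firmware TCB */

          if (vmgexit->req_certs.data_npages < cert_npages) {
                  /* tell the guest how many pages it actually needs */
                  vmgexit->req_certs.data_npages = cert_npages;
                  vmgexit->req_certs.ret = KVM_USER_VMGEXIT_REQ_CERTS_ERROR_INVALID_LEN;
                  certs_unlock();
                  return;
          }

          memcpy(gpa_to_hva(vmgexit->req_certs.data_gpa), cert_blob,
                 cert_npages * 4096);
          /* request a STATUS_DONE exit so the lock can be dropped later */
          vmgexit->req_certs.flags |= KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE;
          vmgexit->req_certs.ret = 0;
  }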

Signed-off-by: Michael Roth <[email protected]>
---
Documentation/virt/kvm/api.rst | 87 ++++++++++++++++++++++++++++++++++
arch/x86/kvm/svm/sev.c | 86 +++++++++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.h | 3 ++
include/uapi/linux/kvm.h | 23 +++++++++
4 files changed, 199 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f0b76ff5030d..f3780ac98d56 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7060,6 +7060,93 @@ Please note that the kernel is allowed to use the kvm_run structure as the
primary storage for certain register types. Therefore, the kernel may use the
values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.

+::
+
+ /* KVM_EXIT_VMGEXIT */
+ struct kvm_user_vmgexit {
+ #define KVM_USER_VMGEXIT_REQ_CERTS 1
+ __u32 type; /* KVM_USER_VMGEXIT_* type */
+ union {
+ struct {
+ __u64 data_gpa;
+ __u64 data_npages;
+ #define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_INVALID_LEN 1
+ #define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_BUSY 2
+ #define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_GENERIC (1 << 31)
+ __u32 ret;
+ #define KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE (1 << 0)
+ __u8 flags;
+ #define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING 0
+ #define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE 1
+ __u8 status;
+ } req_certs;
+ };
+ };
+
+
+If exit reason is KVM_EXIT_VMGEXIT then it indicates that an SEV-SNP guest
+has issued a VMGEXIT instruction (as documented by the AMD Architecture
+Programmer's Manual (APM)) to the hypervisor that needs to be serviced by
+userspace. These are generally handled by the host kernel, but in some
+cases some aspects of handling a VMGEXIT are done in userspace.
+
+A kvm_user_vmgexit structure is defined to encapsulate the data to be
+sent to or returned by userspace. The type field defines the specific type
+of exit that needs to be serviced, and that type is used as a discriminator
+to determine which union type should be used for input/output.
+
+KVM_USER_VMGEXIT_REQ_CERTS
+--------------------------
+
+When an SEV-SNP guest issues a request for an attestation report, it has the
+option of issuing it in the form of an *extended* guest request, where a
+certificate blob is returned alongside the attestation report so the guest
+can validate the endorsement key used by SNP firmware to sign the report.
+These certificates are managed by userspace and are requested via
+KVM_EXIT_VMGEXITs using the KVM_USER_VMGEXIT_REQ_CERTS type.
+
+For the KVM_USER_VMGEXIT_REQ_CERTS type, the req_certs union type
+is used. The kernel will supply in 'data_gpa' the value the guest supplies
+via the RAX field of the GHCB when issuing extended guest requests.
+'data_npages' will similarly contain the value the guest supplies in RBX
+denoting the number of shared pages available to write the certificate
+data into.
+
+ - If the supplied number of pages is sufficient, userspace should write
+ the certificate data blob (in the format defined by the GHCB spec) in
+ the address indicated by 'data_gpa' and set 'ret' to 0.
+
+ - If the number of pages supplied is not sufficient, userspace must write
+ the required number of pages in 'data_npages' and then set 'ret' to 1.
+
+ - If userspace is temporarily unable to handle the request, 'ret' should
+ be set to 2 to inform the guest to retry later.
+
+ - If some other error occurred, userspace should set 'ret' to a non-zero
+ value that is distinct from the specific return values mentioned above.
+
+Generally some care needs to be taken to keep the returned certificate data in
+sync with the actual endorsement key in use by firmware at the time the
+attestation request is sent to SNP firmware. The recommended scheme to do
+this is for the VMM to obtain a shared or exclusive lock on the path the
+certificate blob file resides at before reading it and returning it to KVM,
+and that it continues to hold the lock until the attestation request is
+actually sent to firmware. To facilitate this, the VMM can set the
+KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE flag before returning the
+certificate blob, in which case another KVM_EXIT_VMGEXIT of type
+KVM_USER_VMGEXIT_REQ_CERTS will be sent to userspace with
+KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE being set in the status field to
+indicate the request is fully-completed and that any associated locks can be
+released.
+
+Tools/libraries that perform updates to SNP firmware TCB values or endorsement
+keys (e.g. firmware interfaces such as SNP_COMMIT, SNP_SET_CONFIG, or
+SNP_VLEK_LOAD, see Documentation/virt/coco/sev-guest.rst for more details) in
+such a way that the certificate blob needs to be updated, should similarly
+take an exclusive lock on the certificate blob for the duration of any updates
+to firmware or the certificate blob contents to ensure that VMMs using the
+above scheme will not return certificate blob data that is out of sync with
+firmware.

6. Capabilities that can be enabled on vCPUs
============================================
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 5c6262f3232f..35f0bd91f92e 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3297,6 +3297,11 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
if (!sev_snp_guest(vcpu->kvm))
goto vmgexit_err;
break;
+ case SVM_VMGEXIT_EXT_GUEST_REQUEST:
+ if (!sev_snp_guest(vcpu->kvm) || !kvm_ghcb_rax_is_valid(svm) ||
+ !kvm_ghcb_rbx_is_valid(svm))
+ goto vmgexit_err;
+ break;
default:
reason = GHCB_ERR_INVALID_EVENT;
goto vmgexit_err;
@@ -3988,6 +3993,84 @@ static void snp_handle_guest_req(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp
ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SNP_GUEST_ERR(vmm_ret, fw_err));
}

+static int snp_complete_ext_guest_req(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb_control_area *control;
+ struct kvm *kvm = vcpu->kvm;
+ sev_ret_code fw_err = 0;
+ int vmm_ret;
+
+ vmm_ret = vcpu->run->vmgexit.req_certs.ret;
+ if (vmm_ret) {
+ if (vmm_ret == SNP_GUEST_VMM_ERR_INVALID_LEN)
+ vcpu->arch.regs[VCPU_REGS_RBX] =
+ vcpu->run->vmgexit.req_certs.data_npages;
+ goto out;
+ }
+
+ /*
+ * The request was completed on the previous completion callback and
+ * this completion is only for the STATUS_DONE userspace notification.
+ */
+ if (vcpu->run->vmgexit.req_certs.status == KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE)
+ goto out_resume;
+
+ control = &svm->vmcb->control;
+
+ if (__snp_handle_guest_req(kvm, control->exit_info_1,
+ control->exit_info_2, &fw_err))
+ vmm_ret = SNP_GUEST_VMM_ERR_GENERIC;
+
+out:
+ ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SNP_GUEST_ERR(vmm_ret, fw_err));
+
+ if (vcpu->run->vmgexit.req_certs.flags & KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE) {
+ vcpu->run->vmgexit.req_certs.status = KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE;
+ vcpu->run->vmgexit.req_certs.flags = 0;
+ return 0; /* notify userspace of completion */
+ }
+
+out_resume:
+ return 1; /* resume guest */
+}
+
+static int snp_begin_ext_guest_req(struct kvm_vcpu *vcpu)
+{
+ int vmm_ret = SNP_GUEST_VMM_ERR_GENERIC;
+ struct vcpu_svm *svm = to_svm(vcpu);
+ unsigned long data_npages;
+ sev_ret_code fw_err = 0;
+ gpa_t data_gpa;
+
+ if (!sev_snp_guest(vcpu->kvm))
+ goto abort_request;
+
+ data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
+ data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
+
+ if (!IS_ALIGNED(data_gpa, PAGE_SIZE))
+ goto abort_request;
+
+ /*
+ * Grab the certificates from userspace so they can be bundled with
+ * attestation/guest requests.
+ */
+ vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
+ vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_REQ_CERTS;
+ vcpu->run->vmgexit.req_certs.data_gpa = data_gpa;
+ vcpu->run->vmgexit.req_certs.data_npages = data_npages;
+ vcpu->run->vmgexit.req_certs.flags = 0;
+ vcpu->run->vmgexit.req_certs.status = KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING;
+ vcpu->arch.complete_userspace_io = snp_complete_ext_guest_req;
+
+ return 0; /* forward request to userspace */
+
+abort_request:
+ ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SNP_GUEST_ERR(vmm_ret, fw_err));
+ return 1; /* resume guest */
+}
+
static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
{
struct vmcb_control_area *control = &svm->vmcb->control;
@@ -4266,6 +4349,9 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
snp_handle_guest_req(svm, control->exit_info_1, control->exit_info_2);
ret = 1;
break;
+ case SVM_VMGEXIT_EXT_GUEST_REQUEST:
+ ret = snp_begin_ext_guest_req(vcpu);
+ break;
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
vcpu_unimpl(vcpu,
"vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index e325ede0f463..83c562b4712a 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -309,6 +309,9 @@ struct vcpu_svm {

/* Guest GIF value, used when vGIF is not enabled */
bool guest_gif;
+
+ /* Transaction ID associated with SNP config updates */
+ u64 snp_transaction_id;
};

struct svm_cpu_data {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..106367d87189 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -135,6 +135,26 @@ struct kvm_xen_exit {
} u;
};

+struct kvm_user_vmgexit {
+#define KVM_USER_VMGEXIT_REQ_CERTS 1
+ __u32 type; /* KVM_USER_VMGEXIT_* type */
+ union {
+ struct {
+ __u64 data_gpa;
+ __u64 data_npages;
+#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_INVALID_LEN 1
+#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_BUSY 2
+#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_GENERIC (1 << 31)
+ __u32 ret;
+#define KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE (1 << 0)
+ __u8 flags;
+#define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING 0
+#define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE 1
+ __u8 status;
+ } req_certs;
+ };
+};
+
#define KVM_S390_GET_SKEYS_NONE 1
#define KVM_S390_SKEYS_MAX 1048576

@@ -178,6 +198,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_NOTIFY 37
#define KVM_EXIT_LOONGARCH_IOCSR 38
#define KVM_EXIT_MEMORY_FAULT 39
+#define KVM_EXIT_VMGEXIT 40

/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -433,6 +454,8 @@ struct kvm_run {
__u64 gpa;
__u64 size;
} memory_fault;
+ /* KVM_EXIT_VMGEXIT */
+ struct kvm_user_vmgexit vmgexit;
/* Fix the size of the union. */
char padding[256];
};
--
2.25.1


2024-05-01 09:07:18

by Michael Roth

Subject: [PATCH v15 20/20] crypto: ccp: Add the SNP_VLEK_LOAD command

When requesting an attestation report, a guest is able to specify whether
it wants SNP firmware to sign the report using either a Versioned Chip
Endorsement Key (VCEK), which is derived from chip-unique secrets, or a
Versioned Loaded Endorsement Key (VLEK), which is obtained from an AMD
Key Derivation Service (KDS) and derived from seeds allocated to
enrolled cloud service providers (CSPs).

For VLEK keys, an SNP_VLEK_LOAD SNP firmware command is used to load
them into the system after obtaining them from the KDS. Add a
corresponding userspace interface so to allow the loading of VLEK keys
into the system.

See SEV-SNP Firmware ABI 1.54, SNP_VLEK_LOAD for more details.
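
A minimal sketch of driving the new ioctl from userspace (error handling
and the KDS fetch are elided; the hashstick buffer is assumed to already
hold the KDS-provided data):

  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/psp-sev.h>

  static int load_vlek(struct sev_user_data_snp_wrapped_vlek_hashstick *hs)
  {
          struct sev_user_data_snp_vlek_load input = {
                  .len = sizeof(input),
                  .vlek_wrapped_version = 0,
                  .vlek_wrapped_address = (__u64)(uintptr_t)hs,
          };
          struct sev_issue_cmd cmd = {
                  .cmd  = SNP_VLEK_LOAD,
                  .data = (__u64)(uintptr_t)&input,
          };
          int fd = open("/dev/sev", O_RDWR);

          return fd < 0 ? fd : ioctl(fd, SEV_ISSUE_CMD, &cmd);
  }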

Reviewed-by: Tom Lendacky <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
Documentation/virt/coco/sev-guest.rst | 19 ++++++++++++++
drivers/crypto/ccp/sev-dev.c | 36 +++++++++++++++++++++++++++
include/uapi/linux/psp-sev.h | 27 ++++++++++++++++++++
3 files changed, 82 insertions(+)

diff --git a/Documentation/virt/coco/sev-guest.rst b/Documentation/virt/coco/sev-guest.rst
index e1eaf6a830ce..de68d3a4b540 100644
--- a/Documentation/virt/coco/sev-guest.rst
+++ b/Documentation/virt/coco/sev-guest.rst
@@ -176,6 +176,25 @@ to SNP_CONFIG command defined in the SEV-SNP spec. The current values of
the firmware parameters affected by this command can be queried via
SNP_PLATFORM_STATUS.

+2.7 SNP_VLEK_LOAD
+-----------------
+:Technology: sev-snp
+:Type: hypervisor ioctl cmd
+:Parameters (in): struct sev_user_data_snp_vlek_load
+:Returns (out): 0 on success, -negative on error
+
+When requesting an attestation report, a guest is able to specify whether
+it wants SNP firmware to sign the report using either a Versioned Chip
+Endorsement Key (VCEK), which is derived from chip-unique secrets, or a
+Versioned Loaded Endorsement Key (VLEK), which is obtained from an AMD
+Key Derivation Service (KDS) and derived from seeds allocated to
+enrolled cloud service providers.
+
+In the case of VLEK keys, the SNP_VLEK_LOAD SNP command is used to load
+them into the system after obtaining them from the KDS, and corresponds
+closely to the SNP_VLEK_LOAD firmware command specified in the SEV-SNP
+spec.
+
3. SEV-SNP CPUID Enforcement
============================

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 2102377f727b..97a7959406ee 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -2027,6 +2027,39 @@ static int sev_ioctl_do_snp_set_config(struct sev_issue_cmd *argp, bool writable
return __sev_do_cmd_locked(SEV_CMD_SNP_CONFIG, &config, &argp->error);
}

+static int sev_ioctl_do_snp_vlek_load(struct sev_issue_cmd *argp, bool writable)
+{
+ struct sev_device *sev = psp_master->sev_data;
+ struct sev_user_data_snp_vlek_load input;
+ void *blob;
+ int ret;
+
+ if (!sev->snp_initialized || !argp->data)
+ return -EINVAL;
+
+ if (!writable)
+ return -EPERM;
+
+ if (copy_from_user(&input, u64_to_user_ptr(argp->data), sizeof(input)))
+ return -EFAULT;
+
+ if (input.len != sizeof(input) || input.vlek_wrapped_version != 0)
+ return -EINVAL;
+
+ blob = psp_copy_user_blob(input.vlek_wrapped_address,
+ sizeof(struct sev_user_data_snp_wrapped_vlek_hashstick));
+ if (IS_ERR(blob))
+ return PTR_ERR(blob);
+
+ input.vlek_wrapped_address = __psp_pa(blob);
+
+ ret = __sev_do_cmd_locked(SEV_CMD_SNP_VLEK_LOAD, &input, &argp->error);
+
+ kfree(blob);
+
+ return ret;
+}
+
static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
{
void __user *argp = (void __user *)arg;
@@ -2087,6 +2120,9 @@ static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
case SNP_SET_CONFIG:
ret = sev_ioctl_do_snp_set_config(&input, writable);
break;
+ case SNP_VLEK_LOAD:
+ ret = sev_ioctl_do_snp_vlek_load(&input, writable);
+ break;
default:
ret = -EINVAL;
goto out;
diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
index b7a2c2ee35b7..2289b7c76c59 100644
--- a/include/uapi/linux/psp-sev.h
+++ b/include/uapi/linux/psp-sev.h
@@ -31,6 +31,7 @@ enum {
SNP_PLATFORM_STATUS,
SNP_COMMIT,
SNP_SET_CONFIG,
+ SNP_VLEK_LOAD,

SEV_MAX,
};
@@ -214,6 +215,32 @@ struct sev_user_data_snp_config {
__u8 rsvd1[52];
} __packed;

+/**
+ * struct sev_data_snp_vlek_load - SNP_VLEK_LOAD structure
+ *
+ * @len: length of the command buffer read by the PSP
+ * @vlek_wrapped_version: version of wrapped VLEK hashstick (Must be 0h)
+ * @rsvd: reserved
+ * @vlek_wrapped_address: address of a wrapped VLEK hashstick
+ * (struct sev_user_data_snp_wrapped_vlek_hashstick)
+ */
+struct sev_user_data_snp_vlek_load {
+ __u32 len; /* In */
+ __u8 vlek_wrapped_version; /* In */
+ __u8 rsvd[3]; /* In */
+ __u64 vlek_wrapped_address; /* In */
+} __packed;
+
+/**
+ * struct sev_user_data_snp_vlek_wrapped_vlek_hashstick - Wrapped VLEK data
+ *
+ * @data: Opaque data provided by AMD KDS (as described in SEV-SNP Firmware ABI
+ * 1.54, SNP_VLEK_LOAD)
+ */
+struct sev_user_data_snp_wrapped_vlek_hashstick {
+ __u8 data[432]; /* In */
+} __packed;
+
/**
* struct sev_issue_cmd - SEV ioctl parameters
*
--
2.25.1


2024-05-01 09:07:32

by Michael Roth

Subject: [PATCH v15 01/20] Revert "KVM: x86: Add gmem hook for determining max NPT mapping level"

This reverts commit 20cc50a0410f338657e23e77fcc21fee2bc291e6.

As pointed out here[1], this patch has a few issues:
- the error response could theoretically kill a guest in cases where
retrying based on mmu_invalidate_seq might have been sufficient, so
the hook should purely be a means to find the max mapping level and
never return an error
- the gpa/private arguments are not currently needed for anything
- it's not really a "gmem" hook but uses the same naming convention
as actual gmem hooks

Revert it so it can be replaced with a fully-intact replacement patch that
addresses the above.

[1] https://lore.kernel.org/kvm/[email protected]/

Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 -
arch/x86/include/asm/kvm_host.h | 2 --
arch/x86/kvm/mmu/mmu.c | 8 --------
3 files changed, 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 2db87a6fd52a..c81990937ab4 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -140,7 +140,6 @@ KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL(get_untagged_addr)
KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
-KVM_X86_OP_OPTIONAL_RET0(gmem_validate_fault)
KVM_X86_OP_OPTIONAL(gmem_invalidate)

#undef KVM_X86_OP
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4c9d8a22840a..c6c5018376be 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1816,8 +1816,6 @@ struct kvm_x86_ops {
void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
- int (*gmem_validate_fault)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, bool is_private,
- u8 *max_level);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index eebb1562c5bc..510eb1117012 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4292,14 +4292,6 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
fault->max_level);
fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);

- r = static_call(kvm_x86_gmem_validate_fault)(vcpu->kvm, fault->pfn,
- fault->gfn, fault->is_private,
- &fault->max_level);
- if (r) {
- kvm_release_pfn_clean(fault->pfn);
- return r;
- }
-
return RET_PF_CONTINUE;
}

--
2.25.1


2024-05-01 09:07:51

by Michael Roth

Subject: [PATCH v15 02/20] KVM: x86: Add hook for determining max NPT mapping level

In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
2MB mapping in the guest's nested page table depends on whether or not
any subpages within the range have already been initialized as private
in the RMP table. The existing mixed-attribute tracking in KVM is
insufficient here, for instance:

- gmem allocates 2MB page
- guest issues PVALIDATE on 2MB page
- guest later converts a subpage to shared
- SNP host code issues PSMASH to split 2MB RMP mapping to 4K
- KVM MMU splits NPT mapping to 4K
- guest later converts that shared page back to private

At this point there are no mixed attributes, and KVM would normally
allow for 2MB NPT mappings again, but this is actually not allowed
because the RMP table mappings are 4K and cannot be promoted on the
hypervisor side, so the NPT mappings must still be limited to 4K to
match this.

Add a hook to determine the max NPT mapping size in situations like
this.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu/mmu.c | 18 ++++++++++++++++--
3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index c81990937ab4..566d19b02483 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -140,6 +140,7 @@ KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL(get_untagged_addr)
KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
+KVM_X86_OP_OPTIONAL_RET0(private_max_mapping_level)
KVM_X86_OP_OPTIONAL(gmem_invalidate)

#undef KVM_X86_OP
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c6c5018376be..87265b73906a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1816,6 +1816,7 @@ struct kvm_x86_ops {
void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
+ int (*private_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 510eb1117012..0d556da052f6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4271,6 +4271,20 @@ static inline u8 kvm_max_level_for_order(int order)
return PG_LEVEL_4K;
}

+static u8 kvm_max_private_mapping_level(struct kvm *kvm, kvm_pfn_t pfn,
+ u8 max_level, int gmem_order)
+{
+ if (max_level == PG_LEVEL_4K)
+ return PG_LEVEL_4K;
+
+ max_level = min(kvm_max_level_for_order(gmem_order), max_level);
+ if (max_level == PG_LEVEL_4K)
+ return PG_LEVEL_4K;
+
+ return min(max_level,
+ static_call(kvm_x86_private_max_mapping_level)(kvm, pfn));
+}
+
static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
{
@@ -4288,9 +4302,9 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
return r;
}

- fault->max_level = min(kvm_max_level_for_order(max_order),
- fault->max_level);
fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
+ fault->max_level = kvm_max_private_mapping_level(vcpu->kvm, fault->pfn,
+ fault->max_level, max_order);

return RET_PF_CONTINUE;
}
--
2.25.1


2024-05-01 09:08:35

by Michael Roth

[permalink] [raw]
Subject: [PATCH v15 04/20] KVM: SEV: Add initial SEV-SNP support

From: Brijesh Singh <[email protected]>

SEV-SNP builds upon existing SEV and SEV-ES functionality while adding
new hardware-based security protection. SEV-SNP adds strong memory
encryption and integrity protection to help prevent malicious
hypervisor-based attacks such as data replay and memory re-mapping,
creating an isolated execution environment.

Define a new KVM_X86_SNP_VM type which makes use of these capabilities
and extend the KVM_SEV_INIT2 ioctl to support it. Also add a basic
helper to check whether SNP is enabled and set PFERR_PRIVATE_ACCESS for
private #NPFs so they are handled appropriately by KVM MMU.
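
For context, a userspace VMM selects the new type at VM creation time
and then initializes it via KVM_SEV_INIT2. A minimal sketch of the
userspace side (hypothetical; error handling and the plumbing that
opens sev_fd for /dev/sev are omitted):

    int kvm_fd = open("/dev/kvm", O_RDWR);
    int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_SNP_VM);

    struct kvm_sev_init init = { 0 };       /* default vmsa_features */
    struct kvm_sev_cmd cmd = {
            .id = KVM_SEV_INIT2,
            .data = (__u64)&init,
            .sev_fd = sev_fd,               /* open fd for /dev/sev */
    };

    ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);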

Signed-off-by: Brijesh Singh <[email protected]>
Co-developed-by: Michael Roth <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/svm.h | 3 ++-
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/svm/sev.c | 21 ++++++++++++++++++++-
arch/x86/kvm/svm/svm.c | 8 +++++++-
arch/x86/kvm/svm/svm.h | 12 ++++++++++++
5 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 728c98175b9c..544a43c1cf11 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -285,7 +285,8 @@ static_assert((X2AVIC_MAX_PHYSICAL_ID & AVIC_PHYSICAL_MAX_INDEX_MASK) == X2AVIC_

#define AVIC_HPA_MASK ~((0xFFFULL << 52) | 0xFFF)

-#define SVM_SEV_FEAT_DEBUG_SWAP BIT(5)
+#define SVM_SEV_FEAT_SNP_ACTIVE BIT(0)
+#define SVM_SEV_FEAT_DEBUG_SWAP BIT(5)

struct vmcb_seg {
u16 selector;
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 9fae1b73b529..d2ae5fcc0275 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -874,5 +874,6 @@ struct kvm_hyperv_eventfd {
#define KVM_X86_SW_PROTECTED_VM 1
#define KVM_X86_SEV_VM 2
#define KVM_X86_SEV_ES_VM 3
+#define KVM_X86_SNP_VM 4

#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index a4bde1193b92..be831e2c06eb 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -47,6 +47,9 @@ module_param_named(sev, sev_enabled, bool, 0444);
static bool sev_es_enabled = true;
module_param_named(sev_es, sev_es_enabled, bool, 0444);

+/* enable/disable SEV-SNP support */
+static bool sev_snp_enabled;
+
/* enable/disable SEV-ES DebugSwap support */
static bool sev_es_debug_swap_enabled = true;
module_param_named(debug_swap, sev_es_debug_swap_enabled, bool, 0444);
@@ -288,6 +291,9 @@ static int __sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp,
if (sev->es_active && !sev->ghcb_version)
sev->ghcb_version = GHCB_VERSION_DEFAULT;

+ if (vm_type == KVM_X86_SNP_VM)
+ sev->vmsa_features |= SVM_SEV_FEAT_SNP_ACTIVE;
+
ret = sev_asid_new(sev);
if (ret)
goto e_no_asid;
@@ -348,7 +354,8 @@ static int sev_guest_init2(struct kvm *kvm, struct kvm_sev_cmd *argp)
return -EINVAL;

if (kvm->arch.vm_type != KVM_X86_SEV_VM &&
- kvm->arch.vm_type != KVM_X86_SEV_ES_VM)
+ kvm->arch.vm_type != KVM_X86_SEV_ES_VM &&
+ kvm->arch.vm_type != KVM_X86_SNP_VM)
return -EINVAL;

if (copy_from_user(&data, u64_to_user_ptr(argp->data), sizeof(data)))
@@ -2328,11 +2335,16 @@ void __init sev_set_cpu_caps(void)
kvm_cpu_cap_set(X86_FEATURE_SEV_ES);
kvm_caps.supported_vm_types |= BIT(KVM_X86_SEV_ES_VM);
}
+ if (sev_snp_enabled) {
+ kvm_cpu_cap_set(X86_FEATURE_SEV_SNP);
+ kvm_caps.supported_vm_types |= BIT(KVM_X86_SNP_VM);
+ }
}

void __init sev_hardware_setup(void)
{
unsigned int eax, ebx, ecx, edx, sev_asid_count, sev_es_asid_count;
+ bool sev_snp_supported = false;
bool sev_es_supported = false;
bool sev_supported = false;

@@ -2413,6 +2425,7 @@ void __init sev_hardware_setup(void)
sev_es_asid_count = min_sev_asid - 1;
WARN_ON_ONCE(misc_cg_set_capacity(MISC_CG_RES_SEV_ES, sev_es_asid_count));
sev_es_supported = true;
+ sev_snp_supported = sev_snp_enabled && cc_platform_has(CC_ATTR_HOST_SEV_SNP);

out:
if (boot_cpu_has(X86_FEATURE_SEV))
@@ -2425,9 +2438,15 @@ void __init sev_hardware_setup(void)
pr_info("SEV-ES %s (ASIDs %u - %u)\n",
sev_es_supported ? "enabled" : "disabled",
min_sev_asid > 1 ? 1 : 0, min_sev_asid - 1);
+ if (boot_cpu_has(X86_FEATURE_SEV_SNP))
+ pr_info("SEV-SNP %s (ASIDs %u - %u)\n",
+ sev_snp_supported ? "enabled" : "disabled",
+ min_sev_asid > 1 ? 1 : 0, min_sev_asid - 1);

sev_enabled = sev_supported;
sev_es_enabled = sev_es_supported;
+ sev_snp_enabled = sev_snp_supported;
+
if (!sev_es_enabled || !cpu_feature_enabled(X86_FEATURE_DEBUG_SWAP) ||
!cpu_feature_enabled(X86_FEATURE_NO_NESTED_DATA_BP))
sev_es_debug_swap_enabled = false;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 535018f152a3..422b452fbc3b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2056,6 +2056,9 @@ static int npf_interception(struct kvm_vcpu *vcpu)
if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
error_code &= ~PFERR_SYNTHETIC_MASK;

+ if (sev_snp_guest(vcpu->kvm) && (error_code & PFERR_GUEST_ENC_MASK))
+ error_code |= PFERR_PRIVATE_ACCESS;
+
trace_kvm_page_fault(vcpu, fault_address, error_code);
return kvm_mmu_page_fault(vcpu, fault_address, error_code,
static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
@@ -4899,8 +4902,11 @@ static int svm_vm_init(struct kvm *kvm)

if (type != KVM_X86_DEFAULT_VM &&
type != KVM_X86_SW_PROTECTED_VM) {
- kvm->arch.has_protected_state = (type == KVM_X86_SEV_ES_VM);
+ kvm->arch.has_protected_state =
+ (type == KVM_X86_SEV_ES_VM || type == KVM_X86_SNP_VM);
to_kvm_sev_info(kvm)->need_init = true;
+
+ kvm->arch.has_private_mem = (type == KVM_X86_SNP_VM);
}

if (!pause_filter_count || !pause_filter_thresh)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 9ae0c57c7d20..1407acf45a23 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -349,6 +349,18 @@ static __always_inline bool sev_es_guest(struct kvm *kvm)
#endif
}

+static __always_inline bool sev_snp_guest(struct kvm *kvm)
+{
+#ifdef CONFIG_KVM_AMD_SEV
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+
+ return (sev->vmsa_features & SVM_SEV_FEAT_SNP_ACTIVE) &&
+ !WARN_ON_ONCE(!sev_es_guest(kvm));
+#else
+ return false;
+#endif
+}
+
static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
{
vmcb->control.clean = 0;
--
2.25.1


2024-05-01 09:08:55

by Michael Roth

[permalink] [raw]
Subject: [PATCH v15 05/20] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command

From: Brijesh Singh <[email protected]>

KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
The command initializes a cryptographic digest context used to construct
the measurement of the guest. Other commands can then be used to
load/encrypt data into the guest's initial launch image.

For more information see the SEV-SNP specification.
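
A hedged sketch of the userspace side, using the uAPI added below (the
policy value is illustrative: bit 17 is reserved-must-be-one and bit 16
permits SMT, both of which this patch requires to be set):

    struct kvm_sev_snp_launch_start start = {
            .policy = (1ULL << 17) | (1ULL << 16),
    };
    struct kvm_sev_cmd cmd = {
            .id = KVM_SEV_SNP_LAUNCH_START,
            .data = (__u64)&start,
            .sev_fd = sev_fd,
    };

    ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);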

Signed-off-by: Brijesh Singh <[email protected]>
Co-developed-by: Michael Roth <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
---
.../virt/kvm/x86/amd-memory-encryption.rst | 28 ++-
arch/x86/include/uapi/asm/kvm.h | 11 ++
arch/x86/kvm/svm/sev.c | 176 +++++++++++++++++-
arch/x86/kvm/svm/svm.h | 1 +
4 files changed, 212 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index 9677a0714a39..dd179e162a87 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -466,6 +466,30 @@ issued by the hypervisor to make the guest ready for execution.

Returns: 0 on success, -negative on error

+18. KVM_SEV_SNP_LAUNCH_START
+----------------------------
+
+The KVM_SEV_SNP_LAUNCH_START command is used for creating the memory encryption
+context for the SEV-SNP guest. It must be called prior to issuing
+KVM_SEV_SNP_LAUNCH_UPDATE or KVM_SEV_SNP_LAUNCH_FINISH.
+
+Parameters (in): struct kvm_sev_snp_launch_start
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_snp_launch_start {
+ __u64 policy; /* Guest policy to use. */
+ __u8 gosvw[16]; /* Guest OS visible workarounds. */
+ __u16 flags; /* Must be zero. */
+ __u8 pad0[6];
+ __u64 pad1[4];
+ };
+
+See SNP_LAUNCH_START in the SEV-SNP specification [snp-fw-abi]_ for further
+details on the input parameters in ``struct kvm_sev_snp_launch_start``.
+
Device attribute API
====================

@@ -497,9 +521,11 @@ References
==========


-See [white-paper]_, [api-spec]_, [amd-apm]_ and [kvm-forum]_ for more info.
+See [white-paper]_, [api-spec]_, [amd-apm]_, [kvm-forum]_, and [snp-fw-abi]_
+for more info.

.. [white-paper] https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
.. [api-spec] https://support.amd.com/TechDocs/55766_SEV-KM_API_Specification.pdf
.. [amd-apm] https://support.amd.com/TechDocs/24593.pdf (section 15.34)
.. [kvm-forum] https://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
+.. [snp-fw-abi] https://www.amd.com/system/files/TechDocs/56860.pdf
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index d2ae5fcc0275..693a80ffe40a 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -697,6 +697,9 @@ enum sev_cmd_id {
/* Second time is the charm; improved versions of the above ioctls. */
KVM_SEV_INIT2,

+ /* SNP-specific commands */
+ KVM_SEV_SNP_LAUNCH_START = 100,
+
KVM_SEV_NR_MAX,
};

@@ -824,6 +827,14 @@ struct kvm_sev_receive_update_data {
__u32 pad2;
};

+struct kvm_sev_snp_launch_start {
+ __u64 policy;
+ __u8 gosvw[16];
+ __u16 flags;
+ __u8 pad0[6];
+ __u64 pad1[4];
+};
+
#define KVM_X2APIC_API_USE_32BIT_IDS (1ULL << 0)
#define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK (1ULL << 1)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index be831e2c06eb..4676ce171aaa 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -25,6 +25,7 @@
#include <asm/fpu/xcr.h>
#include <asm/fpu/xstate.h>
#include <asm/debugreg.h>
+#include <asm/sev.h>

#include "mmu.h"
#include "x86.h"
@@ -59,6 +60,21 @@ static u64 sev_supported_vmsa_features;
#define AP_RESET_HOLD_NAE_EVENT 1
#define AP_RESET_HOLD_MSR_PROTO 2

+/* As defined by SEV-SNP Firmware ABI, under "Guest Policy". */
+#define SNP_POLICY_MASK_API_MINOR GENMASK_ULL(7, 0)
+#define SNP_POLICY_MASK_API_MAJOR GENMASK_ULL(15, 8)
+#define SNP_POLICY_MASK_SMT BIT_ULL(16)
+#define SNP_POLICY_MASK_RSVD_MBO BIT_ULL(17)
+#define SNP_POLICY_MASK_DEBUG BIT_ULL(19)
+#define SNP_POLICY_MASK_SINGLE_SOCKET BIT_ULL(20)
+
+#define SNP_POLICY_MASK_VALID (SNP_POLICY_MASK_API_MINOR | \
+ SNP_POLICY_MASK_API_MAJOR | \
+ SNP_POLICY_MASK_SMT | \
+ SNP_POLICY_MASK_RSVD_MBO | \
+ SNP_POLICY_MASK_DEBUG | \
+ SNP_POLICY_MASK_SINGLE_SOCKET)
+
static u8 sev_enc_bit;
static DECLARE_RWSEM(sev_deactivate_lock);
static DEFINE_MUTEX(sev_bitmap_lock);
@@ -69,6 +85,8 @@ static unsigned int nr_asids;
static unsigned long *sev_asid_bitmap;
static unsigned long *sev_reclaim_asid_bitmap;

+static int snp_decommission_context(struct kvm *kvm);
+
struct enc_region {
struct list_head list;
unsigned long npages;
@@ -95,12 +113,17 @@ static int sev_flush_asids(unsigned int min_asid, unsigned int max_asid)
down_write(&sev_deactivate_lock);

wbinvd_on_all_cpus();
- ret = sev_guest_df_flush(&error);
+
+ if (sev_snp_enabled)
+ ret = sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, &error);
+ else
+ ret = sev_guest_df_flush(&error);

up_write(&sev_deactivate_lock);

if (ret)
- pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error);
+ pr_err("SEV%s: DF_FLUSH failed, ret=%d, error=%#x\n",
+ sev_snp_enabled ? "-SNP" : "", ret, error);

return ret;
}
@@ -1998,6 +2021,106 @@ int sev_dev_get_attr(u32 group, u64 attr, u64 *val)
}
}

+/*
+ * The guest context contains all the information, keys and metadata
+ * associated with the guest that the firmware tracks to implement SEV
+ * and SNP features. The firmware stores the guest context in a
+ * hypervisor-provided page via the SNP_GCTX_CREATE command.
+ */
+static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct sev_data_snp_addr data = {};
+ void *context;
+ int rc;
+
+ /* Allocate memory for context page */
+ context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
+ if (!context)
+ return NULL;
+
+ data.address = __psp_pa(context);
+ rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
+ if (rc) {
+ pr_warn("Failed to create SEV-SNP context, rc %d fw_error %d",
+ rc, argp->error);
+ snp_free_firmware_page(context);
+ return NULL;
+ }
+
+ return context;
+}
+
+static int snp_bind_asid(struct kvm *kvm, int *error)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct sev_data_snp_activate data = {0};
+
+ data.gctx_paddr = __psp_pa(sev->snp_context);
+ data.asid = sev_get_asid(kvm);
+ return sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
+}
+
+static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct sev_data_snp_launch_start start = {0};
+ struct kvm_sev_snp_launch_start params;
+ int rc;
+
+ if (!sev_snp_guest(kvm))
+ return -ENOTTY;
+
+ if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+ return -EFAULT;
+
+ /* Don't allow userspace to allocate memory for more than 1 SNP context. */
+ if (sev->snp_context)
+ return -EINVAL;
+
+ sev->snp_context = snp_context_create(kvm, argp);
+ if (!sev->snp_context)
+ return -ENOTTY;
+
+ if (params.flags)
+ return -EINVAL;
+
+ if (params.policy & ~SNP_POLICY_MASK_VALID)
+ return -EINVAL;
+
+ /* Check for policy bits that must be set */
+ if (!(params.policy & SNP_POLICY_MASK_RSVD_MBO) ||
+ !(params.policy & SNP_POLICY_MASK_SMT))
+ return -EINVAL;
+
+ if (params.policy & SNP_POLICY_MASK_SINGLE_SOCKET)
+ return -EINVAL;
+
+ start.gctx_paddr = __psp_pa(sev->snp_context);
+ start.policy = params.policy;
+ memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw));
+ rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_START, &start, &argp->error);
+ if (rc) {
+ pr_debug("%s: SEV_CMD_SNP_LAUNCH_START firmware command failed, rc %d\n",
+ __func__, rc);
+ goto e_free_context;
+ }
+
+ sev->fd = argp->sev_fd;
+ rc = snp_bind_asid(kvm, &argp->error);
+ if (rc) {
+ pr_debug("%s: Failed to bind ASID to SEV-SNP context, rc %d\n",
+ __func__, rc);
+ goto e_free_context;
+ }
+
+ return 0;
+
+e_free_context:
+ snp_decommission_context(kvm);
+
+ return rc;
+}
+
int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_sev_cmd sev_cmd;
@@ -2021,6 +2144,15 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
goto out;
}

+ /*
+ * Once KVM_SEV_INIT2 initializes a KVM instance as an SNP guest, only
+ * allow the use of SNP-specific commands.
+ */
+ if (sev_snp_guest(kvm) && sev_cmd.id < KVM_SEV_SNP_LAUNCH_START) {
+ r = -EPERM;
+ goto out;
+ }
+
switch (sev_cmd.id) {
case KVM_SEV_ES_INIT:
if (!sev_es_enabled) {
@@ -2085,6 +2217,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
case KVM_SEV_RECEIVE_FINISH:
r = sev_receive_finish(kvm, &sev_cmd);
break;
+ case KVM_SEV_SNP_LAUNCH_START:
+ r = snp_launch_start(kvm, &sev_cmd);
+ break;
default:
r = -EINVAL;
goto out;
@@ -2280,6 +2415,31 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
return ret;
}

+static int snp_decommission_context(struct kvm *kvm)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct sev_data_snp_addr data = {};
+ int ret;
+
+ /* If context is not created then do nothing */
+ if (!sev->snp_context)
+ return 0;
+
+ /* Do the decommission, which will unbind the ASID from the SNP context */
+ data.address = __sme_pa(sev->snp_context);
+ down_write(&sev_deactivate_lock);
+ ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
+ up_write(&sev_deactivate_lock);
+
+ if (WARN_ONCE(ret, "Failed to release guest context, ret %d", ret))
+ return ret;
+
+ snp_free_firmware_page(sev->snp_context);
+ sev->snp_context = NULL;
+
+ return 0;
+}
+
void sev_vm_destroy(struct kvm *kvm)
{
struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -2321,7 +2481,17 @@ void sev_vm_destroy(struct kvm *kvm)
}
}

- sev_unbind_asid(kvm, sev->handle);
+ if (sev_snp_guest(kvm)) {
+ /*
+ * Decommission handles unbinding of the ASID. If it fails for
+ * some unexpected reason, just leak the ASID.
+ */
+ if (snp_decommission_context(kvm))
+ return;
+ } else {
+ sev_unbind_asid(kvm, sev->handle);
+ }
+
sev_asid_free(sev);
}

diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 1407acf45a23..d175059fa7c8 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -93,6 +93,7 @@ struct kvm_sev_info {
struct list_head mirror_entry; /* Use as a list entry of mirrors */
struct misc_cg *misc_cg; /* For misc cgroup accounting */
atomic_t migration_in_progress;
+ void *snp_context; /* SNP guest context page */
};

struct kvm_svm {
--
2.25.1


2024-05-01 09:09:21

by Michael Roth

[permalink] [raw]
Subject: [PATCH v15 06/20] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command

From: Brijesh Singh <[email protected]>

A key aspect of launching an SNP guest is initializing it with a
known/measured payload which is then encrypted into guest memory as
pre-validated private pages and then measured into the cryptographic
launch context created with KVM_SEV_SNP_LAUNCH_START so that the guest
can attest itself after booting.

Since all private pages are provided by guest_memfd, make use of the
kvm_gmem_populate() interface to handle this. The general flow is that
guest_memfd will handle allocating the pages associated with the GPA
ranges being initialized by each particular call of
KVM_SEV_SNP_LAUNCH_UPDATE, copying data from userspace into those pages,
and then the post_populate callback will do the work of setting the
RMP entries for these pages to private and issuing the SNP firmware
calls to encrypt/measure them.

For more information see the SEV-SNP specification.
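
Because the command may make partial progress, the documentation added
below instructs userspace to retry until the whole range is processed.
A hedged sketch of that caller loop (gpa, src_buf, and len are assumed
inputs):

    struct kvm_sev_snp_launch_update update = {
            .gfn_start = gpa >> 12,
            .uaddr = (__u64)src_buf,
            .len = len,                     /* page-aligned */
            .type = KVM_SEV_SNP_PAGE_TYPE_NORMAL,
    };
    struct kvm_sev_cmd cmd = {
            .id = KVM_SEV_SNP_LAUNCH_UPDATE,
            .data = (__u64)&update,
            .sev_fd = sev_fd,
    };
    int ret;

    do {
            ret = ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
    } while (update.len && (!ret || errno == EAGAIN));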

Signed-off-by: Brijesh Singh <[email protected]>
Co-developed-by: Michael Roth <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
---
.../virt/kvm/x86/amd-memory-encryption.rst | 54 ++++
arch/x86/include/uapi/asm/kvm.h | 19 ++
arch/x86/kvm/svm/sev.c | 230 ++++++++++++++++++
3 files changed, 303 insertions(+)

diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index dd179e162a87..cc16a7426d18 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -490,6 +490,60 @@ Returns: 0 on success, -negative on error
See SNP_LAUNCH_START in the SEV-SNP specification [snp-fw-abi]_ for further
details on the input parameters in ``struct kvm_sev_snp_launch_start``.

+19. KVM_SEV_SNP_LAUNCH_UPDATE
+-----------------------------
+
+The KVM_SEV_SNP_LAUNCH_UPDATE command is used for loading userspace-provided
+data into a guest GPA range, measuring the contents into the SNP guest context
+created by KVM_SEV_SNP_LAUNCH_START, and then encrypting/validating that GPA
+range so that it will be immediately readable using the encryption key
+associated with the guest context once it is booted, after which point it can
+attest the measurement associated with its context before unlocking any
+secrets.
+
+It is required that the GPA ranges initialized by this command have had the
+KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set in advance. See the documentation
+for KVM_SET_MEMORY_ATTRIBUTES for more details on this aspect.
+
+Upon success, this command is not guaranteed to have processed the entire
+range requested. Instead, the ``gfn_start``, ``uaddr``, and ``len`` fields of
+``struct kvm_sev_snp_launch_update`` will be updated to correspond to the
+remaining range that has yet to be processed. The caller should continue
+calling this command until those fields indicate the entire range has been
+processed, e.g. ``len`` is 0, ``gfn_start`` is equal to the last GFN in the
+range plus 1, and ``uaddr`` is the last byte of the userspace-provided source
+buffer address plus 1. In the case where ``type`` is KVM_SEV_SNP_PAGE_TYPE_ZERO,
+``uaddr`` will be ignored completely.
+
+Parameters (in): struct kvm_sev_snp_launch_update
+
+Returns: 0 on success, < 0 on error, -EAGAIN if caller should retry
+
+::
+
+ struct kvm_sev_snp_launch_update {
+ __u64 gfn_start; /* Guest page number to load/encrypt data into. */
+ __u64 uaddr; /* Userspace address of data to be loaded/encrypted. */
+ __u64 len; /* 4k-aligned length in bytes to copy into guest memory.*/
+ __u8 type; /* The type of the guest pages being initialized. */
+ __u8 pad0;
+ __u16 flags; /* Must be zero. */
+ __u32 pad1;
+ __u64 pad2[4];
+
+ };
+
+where the allowed values for page_type are #define'd as::
+
+ KVM_SEV_SNP_PAGE_TYPE_NORMAL
+ KVM_SEV_SNP_PAGE_TYPE_ZERO
+ KVM_SEV_SNP_PAGE_TYPE_UNMEASURED
+ KVM_SEV_SNP_PAGE_TYPE_SECRETS
+ KVM_SEV_SNP_PAGE_TYPE_CPUID
+
+See the SEV-SNP spec [snp-fw-abi]_ for further details on how each page type is
+used/measured.
+
Device attribute API
====================

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 693a80ffe40a..5935dc8a7e02 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -699,6 +699,7 @@ enum sev_cmd_id {

/* SNP-specific commands */
KVM_SEV_SNP_LAUNCH_START = 100,
+ KVM_SEV_SNP_LAUNCH_UPDATE,

KVM_SEV_NR_MAX,
};
@@ -835,6 +836,24 @@ struct kvm_sev_snp_launch_start {
__u64 pad1[4];
};

+/* Kept in sync with firmware values for simplicity. */
+#define KVM_SEV_SNP_PAGE_TYPE_NORMAL 0x1
+#define KVM_SEV_SNP_PAGE_TYPE_ZERO 0x3
+#define KVM_SEV_SNP_PAGE_TYPE_UNMEASURED 0x4
+#define KVM_SEV_SNP_PAGE_TYPE_SECRETS 0x5
+#define KVM_SEV_SNP_PAGE_TYPE_CPUID 0x6
+
+struct kvm_sev_snp_launch_update {
+ __u64 gfn_start;
+ __u64 uaddr;
+ __u64 len;
+ __u8 type;
+ __u8 pad0;
+ __u16 flags;
+ __u32 pad1;
+ __u64 pad2[4];
+};
+
#define KVM_X2APIC_API_USE_32BIT_IDS (1ULL << 0)
#define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK (1ULL << 1)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 4676ce171aaa..f31f87655a67 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -259,6 +259,45 @@ static void sev_decommission(unsigned int handle)
sev_guest_decommission(&decommission, NULL);
}

+/*
+ * Certain page-states, such as Pre-Guest and Firmware pages (as documented
+ * in Chapter 5 of the SEV-SNP Firmware ABI under "Page States") cannot be
+ * directly transitioned back to normal/hypervisor-owned state via RMPUPDATE
+ * unless they are reclaimed first.
+ *
+ * Until they are reclaimed and subsequently transitioned via RMPUPDATE, they
+ * might not be usable by the host due to being set as immutable or still
+ * being associated with a guest ASID.
+ */
+static int snp_page_reclaim(u64 pfn)
+{
+ struct sev_data_snp_page_reclaim data = {0};
+ int err, rc;
+
+ data.paddr = __sme_set(pfn << PAGE_SHIFT);
+ rc = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
+ if (WARN_ONCE(rc, "Failed to reclaim PFN %llx", pfn))
+ snp_leak_pages(pfn, 1);
+
+ return rc;
+}
+
+/*
+ * Transition a page to hypervisor-owned/shared state in the RMP table. This
+ * should not fail under normal conditions, but leak the page should that
+ * happen since it will no longer be usable by the host due to RMP protections.
+ */
+static int host_rmp_make_shared(u64 pfn, enum pg_level level)
+{
+ int rc;
+
+ rc = rmp_make_shared(pfn, level);
+ if (WARN_ON_ONCE(rc))
+ snp_leak_pages(pfn, page_level_size(level) >> PAGE_SHIFT);
+
+ return rc;
+}
+
static void sev_unbind_asid(struct kvm *kvm, unsigned int handle)
{
struct sev_data_deactivate deactivate;
@@ -2121,6 +2160,194 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
return rc;
}

+struct sev_gmem_populate_args {
+ __u8 type;
+ int sev_fd;
+ int fw_error;
+};
+
+static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn,
+ void __user *src, int order, void *opaque)
+{
+ struct sev_gmem_populate_args *sev_populate_args = opaque;
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ int n_private = 0, ret, i;
+ int npages = (1 << order);
+ gfn_t gfn;
+
+ if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src))
+ return -EINVAL;
+
+ for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++) {
+ struct sev_data_snp_launch_update fw_args = {0};
+ bool assigned;
+ int level;
+
+ if (!kvm_mem_is_private(kvm, gfn)) {
+ pr_debug("%s: Failed to ensure GFN 0x%llx has private memory attribute set\n",
+ __func__, gfn);
+ ret = -EINVAL;
+ goto err;
+ }
+
+ ret = snp_lookup_rmpentry((u64)pfn + i, &assigned, &level);
+ if (ret || assigned) {
+ pr_debug("%s: Failed to ensure GFN 0x%llx RMP entry is initial shared state, ret: %d assigned: %d\n",
+ __func__, gfn, ret, assigned);
+ ret = -EINVAL;
+ goto err;
+ }
+
+ if (src) {
+ void *vaddr = kmap_local_pfn(pfn + i);
+
+ ret = copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE);
+ if (ret)
+ goto err;
+ kunmap_local(vaddr);
+ }
+
+ ret = rmp_make_private(pfn + i, gfn << PAGE_SHIFT, PG_LEVEL_4K,
+ sev_get_asid(kvm), true);
+ if (ret)
+ goto err;
+
+ n_private++;
+
+ fw_args.gctx_paddr = __psp_pa(sev->snp_context);
+ fw_args.address = __sme_set(pfn_to_hpa(pfn + i));
+ fw_args.page_size = PG_LEVEL_TO_RMP(PG_LEVEL_4K);
+ fw_args.page_type = sev_populate_args->type;
+
+ ret = __sev_issue_cmd(sev_populate_args->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
+ &fw_args, &sev_populate_args->fw_error);
+ if (ret)
+ goto fw_err;
+ }
+
+ return 0;
+
+fw_err:
+ /*
+ * If the firmware command failed handle the reclaim and cleanup of that
+ * PFN specially vs. prior pages which can be cleaned up below without
+ * needing to reclaim in advance.
+ *
+ * Additionally, when invalid CPUID function entries are detected,
+ * firmware writes the expected values into the page and leaves it
+ * unencrypted so it can be used for debugging and error-reporting.
+ *
+ * Copy this page back into the source buffer so userspace can use this
+ * information to provide information on which CPUID leaves/fields
+ * failed CPUID validation.
+ */
+ if (!snp_page_reclaim(pfn + i) && !host_rmp_make_shared(pfn + i, PG_LEVEL_4K) &&
+ sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
+ sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) {
+ void *vaddr = kmap_local_pfn(pfn + i);
+
+ if (copy_to_user(src + i * PAGE_SIZE, vaddr, PAGE_SIZE))
+ pr_debug("Failed to write CPUID page back to userspace\n");
+
+ kunmap_local(vaddr);
+ }
+
+ /* pfn + i is hypervisor-owned now, so skip below cleanup for it. */
+ n_private--;
+
+err:
+ pr_debug("%s: exiting with error ret %d (fw_error %d), restoring %d gmem PFNs to shared.\n",
+ __func__, ret, sev_populate_args->fw_error, n_private);
+ for (i = 0; i < n_private; i++)
+ host_rmp_make_shared(pfn + i, PG_LEVEL_4K);
+
+ return ret;
+}
+
+static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct sev_gmem_populate_args sev_populate_args = {0};
+ struct kvm_sev_snp_launch_update params;
+ struct kvm_memory_slot *memslot;
+ long npages, count;
+ void __user *src;
+ int ret = 0;
+
+ if (!sev_snp_guest(kvm) || !sev->snp_context)
+ return -EINVAL;
+
+ if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+ return -EFAULT;
+
+ pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d\n", __func__,
+ params.gfn_start, params.len, params.type, params.flags);
+
+ if (!PAGE_ALIGNED(params.len) || params.flags ||
+ (params.type != KVM_SEV_SNP_PAGE_TYPE_NORMAL &&
+ params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO &&
+ params.type != KVM_SEV_SNP_PAGE_TYPE_UNMEASURED &&
+ params.type != KVM_SEV_SNP_PAGE_TYPE_SECRETS &&
+ params.type != KVM_SEV_SNP_PAGE_TYPE_CPUID))
+ return -EINVAL;
+
+ npages = params.len / PAGE_SIZE;
+
+ /*
+ * For each GFN that's being prepared as part of the initial guest
+ * state, the following pre-conditions are verified:
+ *
+ * 1) The backing memslot is a valid private memslot.
+ * 2) The GFN has been set to private via KVM_SET_MEMORY_ATTRIBUTES
+ * beforehand.
+ * 3) The PFN of the guest_memfd has not already been set to private
+ * in the RMP table.
+ *
+ * The KVM MMU relies on kvm->mmu_invalidate_seq to retry nested page
+ * faults if there's a race between a fault and an attribute update via
+ * KVM_SET_MEMORY_ATTRIBUTES, and a similar approach could be utilized
+ * here. However, kvm->slots_lock guards against both this as well as
+ * concurrent memslot updates occurring while these checks are being
+ * performed, so use that here to make it easier to reason about the
+ * initial expected state and better guard against unexpected
+ * situations.
+ */
+ mutex_lock(&kvm->slots_lock);
+
+ memslot = gfn_to_memslot(kvm, params.gfn_start);
+ if (!kvm_slot_can_be_private(memslot)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ sev_populate_args.sev_fd = argp->sev_fd;
+ sev_populate_args.type = params.type;
+ src = params.type == KVM_SEV_SNP_PAGE_TYPE_ZERO ? NULL : u64_to_user_ptr(params.uaddr);
+
+ count = kvm_gmem_populate(kvm, params.gfn_start, src, npages,
+ sev_gmem_post_populate, &sev_populate_args);
+ if (count < 0) {
+ argp->error = sev_populate_args.fw_error;
+ pr_debug("%s: kvm_gmem_populate failed, ret %ld (fw_error %d)\n",
+ __func__, count, argp->error);
+ ret = -EIO;
+ } else {
+ params.gfn_start += count;
+ params.len -= count * PAGE_SIZE;
+ if (params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
+ params.uaddr += count * PAGE_SIZE;
+
+ ret = 0;
+ if (copy_to_user(u64_to_user_ptr(argp->data), &params, sizeof(params)))
+ ret = -EFAULT;
+ }
+
+out:
+ mutex_unlock(&kvm->slots_lock);
+
+ return ret;
+}
+
int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_sev_cmd sev_cmd;
@@ -2220,6 +2447,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
case KVM_SEV_SNP_LAUNCH_START:
r = snp_launch_start(kvm, &sev_cmd);
break;
+ case KVM_SEV_SNP_LAUNCH_UPDATE:
+ r = snp_launch_update(kvm, &sev_cmd);
+ break;
default:
r = -EINVAL;
goto out;
--
2.25.1


2024-05-01 09:09:45

by Michael Roth

[permalink] [raw]
Subject: [PATCH v15 07/20] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_FINISH command

From: Brijesh Singh <[email protected]>

Add a KVM_SEV_SNP_LAUNCH_FINISH command to finalize the cryptographic
launch digest which stores the measurement of the guest at launch time.
Also extend the existing SNP firmware data structures to support
disabling the use of Versioned Chip Endorsement Keys (VCEK) by guests as
part of this command.

While finalizing the launch flow, the code also issues the LAUNCH_UPDATE
SNP firmware commands to encrypt/measure the initial VMSA pages for each
configured vCPU, which requires setting the RMP entries for those pages
to private. Also add handling to clean up the RMP entries for these
pages when freeing vCPUs during shutdown.
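
A hedged sketch of the userspace side for the common case where no ID
block is supplied (all fields zero is a valid invocation):

    struct kvm_sev_snp_launch_finish finish = { 0 };
    struct kvm_sev_cmd cmd = {
            .id = KVM_SEV_SNP_LAUNCH_FINISH,
            .data = (__u64)&finish,
            .sev_fd = sev_fd,
    };

    ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);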

Signed-off-by: Brijesh Singh <[email protected]>
Co-developed-by: Michael Roth <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Signed-off-by: Harald Hoyer <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
---
.../virt/kvm/x86/amd-memory-encryption.rst | 28 ++++
arch/x86/include/uapi/asm/kvm.h | 17 +++
arch/x86/kvm/svm/sev.c | 127 ++++++++++++++++++
include/linux/psp-sev.h | 4 +-
4 files changed, 175 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index cc16a7426d18..1ddb6a86ce7f 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -544,6 +544,34 @@ where the allowed values for page_type are #define'd as::
See the SEV-SNP spec [snp-fw-abi]_ for further details on how each page type is
used/measured.

+20. KVM_SEV_SNP_LAUNCH_FINISH
+-----------------------------
+
+After completion of the SNP guest launch flow, the KVM_SEV_SNP_LAUNCH_FINISH
+command can be issued to make the guest ready for execution.
+
+Parameters (in): struct kvm_sev_snp_launch_finish
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_snp_launch_finish {
+ __u64 id_block_uaddr;
+ __u64 id_auth_uaddr;
+ __u8 id_block_en;
+ __u8 auth_key_en;
+ __u8 vcek_disabled;
+ __u8 host_data[32];
+ __u8 pad0[3];
+ __u16 flags; /* Must be zero */
+ __u64 pad1[4];
+ };
+
+
+See SNP_LAUNCH_FINISH in the SEV-SNP specification [snp-fw-abi]_ for further
+details on the input parameters in ``struct kvm_sev_snp_launch_finish``.
+
Device attribute API
====================

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 5935dc8a7e02..988b5204d636 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -700,6 +700,7 @@ enum sev_cmd_id {
/* SNP-specific commands */
KVM_SEV_SNP_LAUNCH_START = 100,
KVM_SEV_SNP_LAUNCH_UPDATE,
+ KVM_SEV_SNP_LAUNCH_FINISH,

KVM_SEV_NR_MAX,
};
@@ -854,6 +855,22 @@ struct kvm_sev_snp_launch_update {
__u64 pad2[4];
};

+#define KVM_SEV_SNP_ID_BLOCK_SIZE 96
+#define KVM_SEV_SNP_ID_AUTH_SIZE 4096
+#define KVM_SEV_SNP_FINISH_DATA_SIZE 32
+
+struct kvm_sev_snp_launch_finish {
+ __u64 id_block_uaddr;
+ __u64 id_auth_uaddr;
+ __u8 id_block_en;
+ __u8 auth_key_en;
+ __u8 vcek_disabled;
+ __u8 host_data[KVM_SEV_SNP_FINISH_DATA_SIZE];
+ __u8 pad0[3];
+ __u16 flags;
+ __u64 pad1[4];
+};
+
#define KVM_X2APIC_API_USE_32BIT_IDS (1ULL << 0)
#define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK (1ULL << 1)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index f31f87655a67..797230f810f8 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -75,6 +75,8 @@ static u64 sev_supported_vmsa_features;
SNP_POLICY_MASK_DEBUG | \
SNP_POLICY_MASK_SINGLE_SOCKET)

+#define INITIAL_VMSA_GPA 0xFFFFFFFFF000
+
static u8 sev_enc_bit;
static DECLARE_RWSEM(sev_deactivate_lock);
static DEFINE_MUTEX(sev_bitmap_lock);
@@ -2348,6 +2350,115 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
return ret;
}

+static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct sev_data_snp_launch_update data = {};
+ struct kvm_vcpu *vcpu;
+ unsigned long i;
+ int ret;
+
+ data.gctx_paddr = __psp_pa(sev->snp_context);
+ data.page_type = SNP_PAGE_TYPE_VMSA;
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
+
+ ret = sev_es_sync_vmsa(svm);
+ if (ret)
+ return ret;
+
+ /* Transition the VMSA page to a firmware state. */
+ ret = rmp_make_private(pfn, INITIAL_VMSA_GPA, PG_LEVEL_4K, sev->asid, true);
+ if (ret)
+ return ret;
+
+ /* Issue the SNP command to encrypt the VMSA */
+ data.address = __sme_pa(svm->sev_es.vmsa);
+ ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
+ &data, &argp->error);
+ if (ret) {
+ if (!snp_page_reclaim(pfn))
+ host_rmp_make_shared(pfn, PG_LEVEL_4K);
+
+ return ret;
+ }
+
+ svm->vcpu.arch.guest_state_protected = true;
+ }
+
+ return 0;
+}
+
+static int snp_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct kvm_sev_snp_launch_finish params;
+ struct sev_data_snp_launch_finish *data;
+ void *id_block = NULL, *id_auth = NULL;
+ int ret;
+
+ if (!sev_snp_guest(kvm))
+ return -ENOTTY;
+
+ if (!sev->snp_context)
+ return -EINVAL;
+
+ if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+ return -EFAULT;
+
+ if (params.flags)
+ return -EINVAL;
+
+ /* Measure all vCPUs using LAUNCH_UPDATE before finalizing the launch flow. */
+ ret = snp_launch_update_vmsa(kvm, argp);
+ if (ret)
+ return ret;
+
+ data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
+ if (!data)
+ return -ENOMEM;
+
+ if (params.id_block_en) {
+ id_block = psp_copy_user_blob(params.id_block_uaddr, KVM_SEV_SNP_ID_BLOCK_SIZE);
+ if (IS_ERR(id_block)) {
+ ret = PTR_ERR(id_block);
+ goto e_free;
+ }
+
+ data->id_block_en = 1;
+ data->id_block_paddr = __sme_pa(id_block);
+
+ id_auth = psp_copy_user_blob(params.id_auth_uaddr, KVM_SEV_SNP_ID_AUTH_SIZE);
+ if (IS_ERR(id_auth)) {
+ ret = PTR_ERR(id_auth);
+ goto e_free_id_block;
+ }
+
+ data->id_auth_paddr = __sme_pa(id_auth);
+
+ if (params.auth_key_en)
+ data->auth_key_en = 1;
+ }
+
+ data->vcek_disabled = params.vcek_disabled;
+
+ memcpy(data->host_data, params.host_data, KVM_SEV_SNP_FINISH_DATA_SIZE);
+ data->gctx_paddr = __psp_pa(sev->snp_context);
+ ret = sev_issue_cmd(kvm, SEV_CMD_SNP_LAUNCH_FINISH, data, &argp->error);
+
+ kfree(id_auth);
+
+e_free_id_block:
+ kfree(id_block);
+
+e_free:
+ kfree(data);
+
+ return ret;
+}
+
int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_sev_cmd sev_cmd;
@@ -2450,6 +2561,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
case KVM_SEV_SNP_LAUNCH_UPDATE:
r = snp_launch_update(kvm, &sev_cmd);
break;
+ case KVM_SEV_SNP_LAUNCH_FINISH:
+ r = snp_launch_finish(kvm, &sev_cmd);
+ break;
default:
r = -EINVAL;
goto out;
@@ -2940,11 +3054,24 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)

svm = to_svm(vcpu);

+ /*
+ * If it's an SNP guest, then the VMSA was marked in the RMP table as
+ * a guest-owned page. Transition the page to hypervisor state before
+ * releasing it back to the system.
+ */
+ if (sev_snp_guest(vcpu->kvm)) {
+ u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
+
+ if (host_rmp_make_shared(pfn, PG_LEVEL_4K))
+ goto skip_vmsa_free;
+ }
+
if (vcpu->arch.guest_state_protected)
sev_flush_encrypted_page(vcpu, svm->sev_es.vmsa);

__free_page(virt_to_page(svm->sev_es.vmsa));

+skip_vmsa_free:
if (svm->sev_es.ghcb_sa_free)
kvfree(svm->sev_es.ghcb_sa);
}
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index 3705c2044fc0..903ddfea8585 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -658,6 +658,7 @@ struct sev_data_snp_launch_update {
* @id_auth_paddr: system physical address of ID block authentication structure
* @id_block_en: indicates whether ID block is present
* @auth_key_en: indicates whether author key is present in authentication structure
+ * @vcek_disabled: indicates whether use of VCEK is allowed for attestation reports
* @rsvd: reserved
* @host_data: host-supplied data for guest, not interpreted by firmware
*/
@@ -667,7 +668,8 @@ struct sev_data_snp_launch_finish {
u64 id_auth_paddr;
u8 id_block_en:1;
u8 auth_key_en:1;
- u64 rsvd:62;
+ u8 vcek_disabled:1;
+ u64 rsvd:61;
u8 host_data[32];
} __packed;

--
2.25.1


2024-05-01 09:10:11

by Michael Roth

[permalink] [raw]
Subject: [PATCH v15 08/20] KVM: SEV: Add support to handle GHCB GPA register VMGEXIT

From: Brijesh Singh <[email protected]>

SEV-SNP guests are required to perform GHCB GPA registration: before
using a GHCB GPA for a vCPU for the first time, the guest must register
that GPA with the hypervisor. If the hypervisor can work with the
guest-requested GPA, it must respond back with the same GPA; otherwise
it must return -1.

On VMGEXIT, verify that the GHCB GPA matches the registered value. If a
mismatch is detected, abort the guest.
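
For reference, the registration travels over the GHCB MSR protocol: the
request code sits in bits 11:0 and the GFN in bits 63:12. A guest-side
sketch, roughly mirroring the Linux guest's snp_register_ghcb_early()
(not part of this patch):

    u64 gfn = ghcb_gpa >> PAGE_SHIFT;
    u64 val;

    sev_es_wr_ghcb_msr(GHCB_MSR_REG_GPA_REQ_VAL(gfn));
    VMGEXIT();
    val = sev_es_rd_ghcb_msr();

    /* The hypervisor must echo the same GFN back in its response. */
    if (GHCB_MSR_INFO(val) != GHCB_MSR_REG_GPA_RESP ||
        GHCB_MSR_REG_GPA_RESP_VAL(val) != gfn)
            sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_REGISTER);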

Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/sev-common.h | 8 ++++++
arch/x86/kvm/svm/sev.c | 48 +++++++++++++++++++++++++++----
arch/x86/kvm/svm/svm.h | 7 +++++
3 files changed, 57 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 5a8246dd532f..1006bfffe07a 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -59,6 +59,14 @@
#define GHCB_MSR_AP_RESET_HOLD_RESULT_POS 12
#define GHCB_MSR_AP_RESET_HOLD_RESULT_MASK GENMASK_ULL(51, 0)

+/* Preferred GHCB GPA Request */
+#define GHCB_MSR_PREF_GPA_REQ 0x010
+#define GHCB_MSR_GPA_VALUE_POS 12
+#define GHCB_MSR_GPA_VALUE_MASK GENMASK_ULL(51, 0)
+
+#define GHCB_MSR_PREF_GPA_RESP 0x011
+#define GHCB_MSR_PREF_GPA_NONE 0xfffffffffffff
+
/* GHCB GPA Register */
#define GHCB_MSR_REG_GPA_REQ 0x012
#define GHCB_MSR_REG_GPA_REQ_VAL(v) \
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 797230f810f8..e1ac5af4cb74 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3540,6 +3540,32 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
set_ghcb_msr_bits(svm, GHCB_MSR_HV_FT_RESP,
GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
break;
+ case GHCB_MSR_PREF_GPA_REQ:
+ if (!sev_snp_guest(vcpu->kvm))
+ goto out_terminate;
+
+ set_ghcb_msr_bits(svm, GHCB_MSR_PREF_GPA_NONE, GHCB_MSR_GPA_VALUE_MASK,
+ GHCB_MSR_GPA_VALUE_POS);
+ set_ghcb_msr_bits(svm, GHCB_MSR_PREF_GPA_RESP, GHCB_MSR_INFO_MASK,
+ GHCB_MSR_INFO_POS);
+ break;
+ case GHCB_MSR_REG_GPA_REQ: {
+ u64 gfn;
+
+ if (!sev_snp_guest(vcpu->kvm))
+ goto out_terminate;
+
+ gfn = get_ghcb_msr_bits(svm, GHCB_MSR_GPA_VALUE_MASK,
+ GHCB_MSR_GPA_VALUE_POS);
+
+ svm->sev_es.ghcb_registered_gpa = gfn_to_gpa(gfn);
+
+ set_ghcb_msr_bits(svm, gfn, GHCB_MSR_GPA_VALUE_MASK,
+ GHCB_MSR_GPA_VALUE_POS);
+ set_ghcb_msr_bits(svm, GHCB_MSR_REG_GPA_RESP, GHCB_MSR_INFO_MASK,
+ GHCB_MSR_INFO_POS);
+ break;
+ }
case GHCB_MSR_TERM_REQ: {
u64 reason_set, reason_code;

@@ -3552,12 +3578,7 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
pr_info("SEV-ES guest requested termination: %#llx:%#llx\n",
reason_set, reason_code);

- vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
- vcpu->run->system_event.type = KVM_SYSTEM_EVENT_SEV_TERM;
- vcpu->run->system_event.ndata = 1;
- vcpu->run->system_event.data[0] = control->ghcb_gpa;
-
- return 0;
+ goto out_terminate;
}
default:
/* Error, keep GHCB MSR value as-is */
@@ -3568,6 +3589,14 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
control->ghcb_gpa, ret);

return ret;
+
+out_terminate:
+ vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
+ vcpu->run->system_event.type = KVM_SYSTEM_EVENT_SEV_TERM;
+ vcpu->run->system_event.ndata = 1;
+ vcpu->run->system_event.data[0] = control->ghcb_gpa;
+
+ return 0;
}

int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
@@ -3603,6 +3632,13 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
trace_kvm_vmgexit_enter(vcpu->vcpu_id, svm->sev_es.ghcb);

sev_es_sync_from_ghcb(svm);
+
+ /* SEV-SNP guest requires that the GHCB GPA must be registered */
+ if (sev_snp_guest(svm->vcpu.kvm) && !ghcb_gpa_is_registered(svm, ghcb_gpa)) {
+ vcpu_unimpl(&svm->vcpu, "vmgexit: GHCB GPA [%#llx] is not registered.\n", ghcb_gpa);
+ return -EINVAL;
+ }
+
ret = sev_es_validate_vmgexit(svm);
if (ret)
return ret;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index d175059fa7c8..bbfbeed4c676 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -209,6 +209,8 @@ struct vcpu_sev_es_state {
u32 ghcb_sa_len;
bool ghcb_sa_sync;
bool ghcb_sa_free;
+
+ u64 ghcb_registered_gpa;
};

struct vcpu_svm {
@@ -362,6 +364,11 @@ static __always_inline bool sev_snp_guest(struct kvm *kvm)
#endif
}

+static inline bool ghcb_gpa_is_registered(struct vcpu_svm *svm, u64 val)
+{
+ return svm->sev_es.ghcb_registered_gpa == val;
+}
+
static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
{
vmcb->control.clean = 0;
--
2.25.1


2024-05-07 18:15:02

by Michael Roth

[permalink] [raw]
Subject: Re: [PATCH v15 00/20] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

On Tue, May 07, 2024 at 08:04:50PM +0200, Paolo Bonzini wrote:
> On Wed, May 1, 2024 at 11:03 AM Michael Roth <[email protected]> wrote:
> >
> > This patchset is also available at:
> >
> > https://github.com/amdese/linux/commits/snp-host-v15
> >
> > and is based on top of the series:
> >
> > "Add SEV-ES hypervisor support for GHCB protocol version 2"
> > https://lore.kernel.org/kvm/[email protected]/
> > https://github.com/amdese/linux/commits/sev-init2-ghcb-v1
> >
> > which in turn is based on commit 20cc50a0410f (just before v14 SNP patches):
> >
> > https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=kvm-coco-queue
>
> I have mostly reviewed this, with the exception of the
> snp_begin/complete_psc parts.

Thanks Paolo. We actually recently uncovered some issues with
snp_begin/complete_psc using some internal kvm-unit-tests that exercise
some edge cases, so I would hold off on reviewing that. Will send a
fix-up patch today after a bit more testing.

>
> I am not sure about removing all the pr_debug() - I am sure it will be
> a bit more painful for userspace developers to figure out what exactly
> has gone wrong in some cases. But we can add them later, if needed -
> I'm certainly not going to make a fuss about it.

Yah, they do tend to be useful for that purpose. I think if we do add
them back we can consolidate the information a little better versus what
I had previously.

-Mike

>
> Paolo
>
>
> > Patch Layout
> > ------------
> >
> > 01-02: These patches revert+replace the existing .gmem_validate_fault hook
> > with a similar .private_max_mapping_level as suggested by Sean[1]
> >
> > 03-04: These patches add some basic infrastructure and introduces a new
> > KVM_X86_SNP_VM vm_type to handle differences verses the existing
> > KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM types.
> >
> > 05-07: These implement the KVM API to handle the creation of a
> > cryptographic launch context, encrypt/measure the initial image
> > into guest memory, and finalize it before launching it.
> >
> > 08-12: These implement handling for various guest-generated events such
> > as page state changes, onlining of additional vCPUs, etc.
> >
> > 13-16: These implement the gmem/mmu hooks needed to prepare gmem-allocated
> > pages before mapping them into guest private memory ranges as
> > well as cleaning them up prior to returning them to the host for
> > use as normal memory. Because this supplants certain activities
> > like issued WBINVDs during KVM MMU invalidations, there's also
> > a patch to avoid duplicating that work to avoid unecessary
> > overhead.
> >
> > 17: With all the core support in place, the patch adds a kvm_amd module
> > parameter to enable SNP support.
> >
> > 18-20: These patches all deal with the servicing of guest requests to handle
> > things like attestation, as well as some related host-management
> > interfaces.
> >
> > [1] https://lore.kernel.org/kvm/[email protected]/#t
> >
> >
> > Testing
> > -------
> >
> > For testing this via QEMU, use the following tree:
> >
> > https://github.com/amdese/qemu/commits/snp-v4-wip3c
> >
> > A patched OVMF is also needed due to upstream KVM no longer supporting MMIO
> > ranges that are mapped as private. It is recommended you build the AmdSevX64
> > variant as it provides the kernel-hashing support present in this series:
> >
> > https://github.com/amdese/ovmf/commits/apic-mmio-fix1d
> >
> > A basic command-line invocation for SNP would be:
> >
> > qemu-system-x86_64 -smp 32,maxcpus=255 -cpu EPYC-Milan-v2
> > -machine q35,confidential-guest-support=sev0,memory-backend=ram1
> > -object memory-backend-memfd,id=ram1,size=4G,share=true,reserve=false
> > -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,id-auth=
> > -bios OVMF_CODE-upstream-20240410-apic-mmio-fix1d-AmdSevX64.fd
> >
> > With kernel-hashing and certificate data supplied:
> >
> > qemu-system-x86_64 -smp 32,maxcpus=255 -cpu EPYC-Milan-v2
> > -machine q35,confidential-guest-support=sev0,memory-backend=ram1
> > -object memory-backend-memfd,id=ram1,size=4G,share=true,reserve=false
> > -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,id-auth=,certs-path=/home/mroth/cert.blob,kernel-hashes=on
> > -bios OVMF_CODE-upstream-20240410-apic-mmio-fix1d-AmdSevX64.fd
> > -kernel /boot/vmlinuz-$ver
> > -initrd /boot/initrd.img-$ver
> > -append "root=UUID=d72a6d1c-06cf-4b79-af43-f1bac4f620f9 ro console=ttyS0,115200n8"
> >
> > With standard X64 OVMF package with separate image for persistent NVRAM:
> >
> > qemu-system-x86_64 -smp 32,maxcpus=255 -cpu EPYC-Milan-v2
> > -machine q35,confidential-guest-support=sev0,memory-backend=ram1
> > -object memory-backend-memfd,id=ram1,size=4G,share=true,reserve=false
> > -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,id-auth=
> > -bios OVMF_CODE-upstream-20240410-apic-mmio-fix1d.fd
> > -drive if=pflash,format=raw,unit=0,file=OVMF_VARS-upstream-20240410-apic-mmio-fix1d.fd,readonly=off
> >
> >
> > Known issues / TODOs
> > --------------------
> >
> > * Base tree in some cases reports "Unpatched return thunk in use. This should
> > not happen!" the first time it runs an SVM/SEV/SNP guests. This a recent
> > regression upstream and unrelated to this series:
> >
> > https://lore.kernel.org/linux-kernel/CANpmjNOcKzEvLHoGGeL-boWDHJobwfwyVxUqMq2kWeka3N4tXA@mail.gmail.com/T/
> >
> > * 2MB hugepage support has been dropped pending discussion on how we plan to
> > re-enable it in gmem.
> >
> > * Host kexec should work, but there is a known issue with host kdump support
> > while SNP guests are running that will be addressed as a follow-up.
> >
> > * SNP kselftests are currently a WIP and will be included as part of SNP
> > upstreaming efforts in the near-term.
> >
> >
> > SEV-SNP Overview
> > ----------------
> >
> > This part of the Secure Encrypted Paging (SEV-SNP) series focuses on the
> > changes required to add KVM support for SEV-SNP. This series builds upon
> > SEV-SNP guest support, which is now in mainline, and SEV-SNP host
> > initialization support, which is now in linux-next.
> >
> > While this series provides the basic building blocks to support booting
> > SEV-SNP VMs, it does not cover all the security enhancements introduced by
> > SEV-SNP, such as interrupt protection, which will be added in the future.
> >
> > With SNP, when pages are marked as guest-owned in the RMP table, they are
> > assigned to a specific guest/ASID, as well as a specific GFN within the
> > guest. Any attempts to map it in the RMP table to a different guest/ASID,
> > or a different GFN within a guest/ASID, will result in an RMP nested page
> > fault.
> >
> > Prior to accessing a guest-owned page, the guest must validate it with a
> > special PVALIDATE instruction which will set a special bit in the RMP table
> > for the guest. This is the only way to set the validated bit outside of the
> > initial pre-encrypted guest payload/image; any attempts outside the guest to
> > modify the RMP entry from that point forward will result in the validated
> > bit being cleared, at which point the guest will trigger an exception if it
> > attempts to access that page so it can be made aware of possible tampering.
> >
> > One exception to this is the initial guest payload, which is pre-validated
> > by the firmware prior to launching. The guest can use Guest Message requests
> > to fetch an attestation report which will include the measurement of the
> > initial image so that the guest can verify it was booted with the expected
> > image/environment.
> >
> > After boot, guests can use Page State Change requests to switch pages
> > between shared/hypervisor-owned and private/guest-owned to share data for
> > things like DMA, virtio buffers, and other GHCB requests.
> >
> > In this implementation of SEV-SNP, private guest memory is managed by a new
> > kernel framework called guest_memfd (gmem). With gmem, a new
> > KVM_SET_MEMORY_ATTRIBUTES KVM ioctl has been added to tell the KVM
> > MMU whether a particular GFN should be backed by shared (normal) memory or
> > private (gmem-allocated) memory. To tie into this, Page State Change
> > requests are forwarded to userspace via KVM_EXIT_VMGEXIT exits, which will
> > then issue the corresponding KVM_SET_MEMORY_ATTRIBUTES call to set the
> > private/shared state in the KVM MMU.
> >
> > The gmem / KVM MMU hooks implemented in this series will then update the RMP
> > table entries for the backing PFNs to set them to guest-owned/private when
> > mapping private pages into the guest via KVM MMU, or use the normal KVM MMU
> > handling in the case of shared pages where the corresponding RMP table
> > entries are left in the default shared/hypervisor-owned state.
> >
> > Feedback/review is very much appreciated!
> >
> > -Mike
> >
> >
> > Changes since v14:
> >
> > * switch to vendor-agnostic KVM_HC_MAP_GPA_RANGE exit for forwarding
> > page-state change requests to userspace instead of an SNP-specific exit
> > (Sean)
> > * drop SNP_PAUSE_ATTESTATION/SNP_RESUME_ATTESTATION interfaces, instead
> > add handling in KVM_EXIT_VMGEXIT so that VMMs can implement their own
> > mechanisms for keeping userspace-supplied certificates in-sync with
> > firmware's TCB/endorsement key (Sean)
> > * carve out SEV-ES-specific handling for GHCB protocol 2, add control of
> > the protocol version, and post as a separate prereq patchset (Sean)
> > * use more consistent error-handling in snp_launch_{start,update,finish},
> > simplify logic based on review comments (Sean)
> > * rename .gmem_validate_fault to .private_max_mapping_level and rework
> > logic based on review suggestions (Sean)
> > * reduce number of pr_debug()'s in series, avoid multiple WARN's in
> > succession (Sean)
> > * improve documentation and comments throughout
> >
> > Changes since v13:
> >
> > * rebase to new kvm-coco-queue and wire up to PFERR_PRIVATE_ACCESS (Paolo)
> > * handle setting kvm->arch.has_private_mem in same location as
> > kvm->arch.has_protected_state (Paolo)
> > * add flags and additional padding fields to
> > snp_launch{start,update,finish} APIs to address alignment and
> > expandability (Paolo)
> > * update snp_launch_update() to update input struct values to reflect
> > current progress of the command in situations where multiple calls are
> > needed (Paolo)
> > * update snp_launch_update() to avoid copying/accessing 'src' parameter
> > when dealing with zero pages. (Paolo)
> > * update snp_launch_update() to use u64 as length input parameter instead
> > of u32 and adjust padding accordingly
> > * modify ordering of SNP_POLICY_MASK_* definitions to be consistent with
> > bit order of corresponding flags
> > * let firmware handle enforcement of policy bits corresponding to
> > user-specified minimum API version
> > * add missing "0x" prefixs in pr_debug()'s for snp_launch_start()
> > * fix handling of VMSAs during in-place migration (Paolo)
> >
> > Changes since v12:
> >
> > * rebased to latest kvm-coco-queue branch (commit 4d2deb62185f)
> > * add more input validation for SNP_LAUNCH_START, especially for handling
> > things like MBO/MBZ policy bits, and API major/minor minimums. (Paolo)
> > * block SNP KVM instances from being able to run legacy SEV commands (Paolo)
> > * don't attempt to measure VMSA for vcpu 0/BSP before the others, let
> > userspace deal with the ordering just like with SEV-ES (Paolo)
> > * fix up docs for SNP_LAUNCH_FINISH (Paolo)
> > * introduce svm->sev_es.snp_has_guest_vmsa flag to better distinguish
> > handling for guest-mapped vs non-guest-mapped VMSAs, rename
> > 'snp_ap_create' flag to 'snp_ap_waiting_for_reset' (Paolo)
> > * drop "KVM: SEV: Use a VMSA physical address variable for populating VMCB"
> > as it is no longer needed due to above VMSA rework
> > * replace pr_debug_ratelimited() messages for RMP #NPFs with a single trace
> > event
> > * handle transient PSMASH_FAIL_INUSE return codes in kvm_gmem_invalidate(),
> > switch to WARN_ON*()'s to indicate remaining error cases are not expected
> > and should not be seen in practice. (Paolo)
> > * add a cond_resched() in kvm_gmem_invalidate() to avoid soft lock-ups when
> > cleaning up large guest memory ranges.
> > * rename VLEK_REQUIRED to VCEK_DISABLE; it'd be more applicable if another
> > key type ever gets added.
> > * don't allow attestation to be paused while an attestation request is
> > being processed by firmware (Tom)
> > * add missing Documentation entry for SNP_VLEK_LOAD
> > * collect Reviewed-by's from Paolo and Tom
> >
> >
> > ----------------------------------------------------------------
> > Ashish Kalra (1):
> > KVM: SEV: Avoid WBINVD for HVA-based MMU notifications for SNP
> >
> > Brijesh Singh (8):
> > KVM: SEV: Add initial SEV-SNP support
> > KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command
> > KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command
> > KVM: SEV: Add KVM_SEV_SNP_LAUNCH_FINISH command
> > KVM: SEV: Add support to handle GHCB GPA register VMGEXIT
> > KVM: SEV: Add support to handle RMP nested page faults
> > KVM: SVM: Add module parameter to enable SEV-SNP
> > KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event
> >
> > Michael Roth (10):
> > Revert "KVM: x86: Add gmem hook for determining max NPT mapping level"
> > KVM: x86: Add hook for determining max NPT mapping level
> > KVM: SEV: Select KVM_GENERIC_PRIVATE_MEM when CONFIG_KVM_AMD_SEV=y
> > KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT
> > KVM: SEV: Add support to handle Page State Change VMGEXIT
> > KVM: SEV: Implement gmem hook for initializing private pages
> > KVM: SEV: Implement gmem hook for invalidating private pages
> > KVM: x86: Implement hook for determining max NPT mapping level
> > KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event
> > crypto: ccp: Add the SNP_VLEK_LOAD command
> >
> > Tom Lendacky (1):
> > KVM: SEV: Support SEV-SNP AP Creation NAE event
> >
> > Documentation/virt/coco/sev-guest.rst | 19 +
> > Documentation/virt/kvm/api.rst | 87 ++
> > .../virt/kvm/x86/amd-memory-encryption.rst | 110 +-
> > arch/x86/include/asm/kvm-x86-ops.h | 2 +-
> > arch/x86/include/asm/kvm_host.h | 5 +-
> > arch/x86/include/asm/sev-common.h | 25 +
> > arch/x86/include/asm/sev.h | 3 +
> > arch/x86/include/asm/svm.h | 9 +-
> > arch/x86/include/uapi/asm/kvm.h | 48 +
> > arch/x86/kvm/Kconfig | 3 +
> > arch/x86/kvm/mmu.h | 2 -
> > arch/x86/kvm/mmu/mmu.c | 27 +-
> > arch/x86/kvm/svm/sev.c | 1538 +++++++++++++++++++-
> > arch/x86/kvm/svm/svm.c | 44 +-
> > arch/x86/kvm/svm/svm.h | 52 +
> > arch/x86/kvm/trace.h | 31 +
> > arch/x86/kvm/x86.c | 17 +
> > drivers/crypto/ccp/sev-dev.c | 36 +
> > include/linux/psp-sev.h | 4 +-
> > include/uapi/linux/kvm.h | 23 +
> > include/uapi/linux/psp-sev.h | 27 +
> > include/uapi/linux/sev-guest.h | 9 +
> > virt/kvm/guest_memfd.c | 4 +-
> > 23 files changed, 2081 insertions(+), 44 deletions(-)
> >
>
>

2024-05-10 02:36:45

by Michael Roth

[permalink] [raw]
Subject: [PATCH v15 21/23] KVM: MMU: Disable fast path for private memslots

For hardware-protected VMs like SEV-SNP guests, certain conditions like
attempting to perform a write to a page which is not in the state that
the guest expects it to be in can result in a nested/extended #PF which
can only be satisfied by the host performing an implicit page state
change to transition the page into the expected shared/private state.
This is generally handled by generating a KVM_EXIT_MEMORY_FAULT event
that gets forwarded to userspace to handle via
KVM_SET_MEMORY_ATTRIBUTES.
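
As a rough sketch of the userspace side of that flow (illustrative
only, assuming the upstream KVM_EXIT_MEMORY_FAULT and
KVM_SET_MEMORY_ATTRIBUTES UAPI from <linux/kvm.h>; error handling
omitted):

    /*
     * VMM run-loop fragment: convert the faulting range to the
     * state the guest demanded, then re-enter the guest to retry.
     */
    if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
            struct kvm_memory_attributes attrs = {
                    .address    = run->memory_fault.gpa,
                    .size       = run->memory_fault.size,
                    .attributes = (run->memory_fault.flags &
                                   KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
                                  KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
            };

            ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
    }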

However, the fast_page_fault() code might misconstrue this situation as
being the result of a write-protected access, and treat it as a spurious
case when it sees that writes are already allowed for the sPTE. This
results in the KVM MMU trying to resume the guest rather than taking any
action to satisfy the real source of the #PF such as generating a
KVM_EXIT_MEMORY_FAULT, resulting in the guest spinning on nested #PFs.

For now, just skip the fast path for hardware-protected VMs since they
don't currently utilize any of this access-tracking machinery anyway. In
the future, these considerations will need to be taken into account if
there's any need/desire to re-enable the fast path for
hardware-protected VMs.

Since software-protected VMs don't have a notion of a shared vs. private
that's separate from what KVM is tracking, the above
KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the special
handling for that case for now.

Cc: Isaku Yamahata <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 62ad38b2a8c9..cecd8360378f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3296,7 +3296,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
return RET_PF_CONTINUE;
}

-static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
+static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
/*
* Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
@@ -3307,6 +3307,32 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
if (fault->rsvd)
return false;

+ /*
+ * For hardware-protected VMs, certain conditions like attempting to
+ * perform a write to a page which is not in the state that the guest
+ * expects it to be in can result in a nested/extended #PF. In this
+ * case, the below code might misconstrue this situation as being the
+ * result of a write-protected access, and treat it as a spurious case
+ * rather than taking any action to satisfy the real source of the #PF
+ * such as generating a KVM_EXIT_MEMORY_FAULT. This can lead to the
+ * guest spinning on a #PF indefinitely.
+ *
+ * For now, just skip the fast path for hardware-protected VMs since
+ * they don't currently utilize any of this machinery anyway. In the
+ * future, these considerations will need to be taken into account if
+ * there's any need/desire to re-enable the fast path for
+ * hardware-protected VMs.
+ *
+ * Since software-protected VMs don't have a notion of a shared vs.
+ * private that's separate from what KVM is tracking, the above
+ * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
+ * special handling for that case for now.
+ */
+ if (kvm_slot_can_be_private(fault->slot) &&
+ !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
+ vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
+ return false;
+
/*
* #PF can be fast if:
*
@@ -3407,7 +3433,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
u64 *sptep;
uint retry_count = 0;

- if (!page_fault_can_be_fast(fault))
+ if (!page_fault_can_be_fast(vcpu, fault))
return ret;

walk_shadow_page_lockless_begin(vcpu);
--
2.25.1


2024-05-10 02:37:08

by Michael Roth

[permalink] [raw]
Subject: [PATCH v15 22/23] KVM: SEV: Fix return code interpretation for RMP nested page faults

The intended logic when handling #NPFs with the RMP bit set (31) is to
first check to see if the #NPF requires a shared<->private transition
and, if so, to go ahead and let the corresponding KVM_EXIT_MEMORY_FAULT
get forwarded on to userspace before proceeding with any handling of
other potential RMP fault conditions like needing to PSMASH the RMP
entry/etc (which will be done later if the guest still re-faults after
the KVM_EXIT_MEMORY_FAULT is processed by userspace).

The determination of whether any userspace handling of
KVM_EXIT_MEMORY_FAULT is needed is done by interpreting the return code
of kvm_mmu_page_fault(). However, the current code misinterprets the
return code, expecting 0 to indicate a userspace exit rather than less
than 0 (-EFAULT). This leads to the following unexpected behavior:

- for KVM_EXIT_MEMORY_FAULTs resulting from implicit shared->private
conversions, warnings get printed from sev_handle_rmp_fault()
because it does not expect to be called for GPAs where
KVM_MEMORY_ATTRIBUTE_PRIVATE is not set. Standard Linux guests don't
generally do this, but it is allowed and should be handled
similarly to private->shared conversions rather than triggering any
sort of warnings.

- if gmem support for 2MB folios is enabled (via currently out-of-tree
code), implicit shared<->private conversions will always result in
a PSMASH being attempted, even if it's not actually needed to
resolve the RMP fault. This doesn't cause any harm, but results in a
needless PSMASH and zapping of the sPTE

Resolve these issues by calling sev_handle_rmp_fault() only when
kvm_mmu_page_fault()'s return code is greater than or equal to 0,
indicating a KVM_EXIT_MEMORY_FAULT/-EFAULT isn't needed. While here,
simplify the code slightly and fix up the associated comments for better
clarity.
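
For reference, a simplified caller-side sketch of the return-value
convention being interpreted here (a simplification, not the
authoritative semantics; see kvm_mmu_page_fault() itself):

    rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
                            insn_bytes, insn_len);
    if (rc < 0)
            return rc;      /* -errno, e.g. -EFAULT after filling in
                             * a KVM_EXIT_MEMORY_FAULT exit */
    if (rc == 0)
            return rc;      /* some other exit to userspace */
    /* rc == 1: fault fixed in-kernel, OK to resume the guest */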

Fixes: ccc9d836c5c3 ("KVM: SEV: Add support to handle RMP nested page faults")

Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/svm.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 426ad49325d7..9431ce74c7d4 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2070,14 +2070,12 @@ static int npf_interception(struct kvm_vcpu *vcpu)
svm->vmcb->control.insn_len);

/*
- * rc == 0 indicates a userspace exit is needed to handle page
- * transitions, so do that first before updating the RMP table.
+ * rc < 0 indicates a userspace exit may be needed to handle page
+ * attribute updates, so deal with that first before handling other
+ * potential RMP fault conditions.
*/
- if (error_code & PFERR_GUEST_RMP_MASK) {
- if (rc == 0)
- return rc;
+ if (rc >= 0 && error_code & PFERR_GUEST_RMP_MASK)
sev_handle_rmp_fault(vcpu, fault_address, error_code);
- }

return rc;
}
--
2.25.1


2024-05-10 02:37:48

by Michael Roth

[permalink] [raw]
Subject: [PATCH v15 23/23] KVM: SEV: Fix PSC handling for SMASH/UNSMASH and partial update ops

There are a few edge-cases that the current processing for GHCB PSC
requests doesn't handle properly:

- KVM properly ignores SMASH/UNSMASH ops when they are embedded in a
PSC request buffer which contains other PSC operations, but
inadvertently forwards them to userspace as private->shared PSC
requests if they appear at the end of the buffer. Make sure these are
ignored instead, just like cases where they are not at the end of the
request buffer.

- Current code handles non-zero 'cur_page' fields when they are at the
beginning of a new GPA range, but does not handle them properly when
iterating through subsequent entries that are otherwise part of a
contiguous range. Fix up the handling so that these entries are not
combined into a larger contiguous range that includes unintended GPA
ranges, and are instead processed later as the start of a new
contiguous range.

- The page size variable used to track 2M entries in KVM for inflight PSCs
might be artificially set to a different value, which can lead to
unexpected values in the entry's final 'cur_page' update. Use the
entry's 'pagesize' field instead to determine what the value of
'cur_page' should be upon completion of processing.

While here, also add a small helper for clearing the in-flight PSC
variables and fix up comments for better readability.
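
For context, a sketch of the GHCB page-state-change buffer these fixes
operate on (field widths per the GHCB spec; the authoritative
definitions are the ones in sev.c, and VMGEXIT_PSC_MAX_COUNT is the
spec-defined entry limit):

    struct psc_hdr {
            u16 cur_entry;          /* first entry not yet completed */
            u16 end_entry;          /* last valid entry in the request */
            u32 reserved;
    } __packed;

    struct psc_entry {
            u64 cur_page    : 12,   /* sub-pages already processed */
                gfn         : 40,
                operation   :  4,   /* PRIVATE/SHARED/SMASH/UNSMASH */
                pagesize    :  1,   /* 0 = 4K, 1 = 2M */
                reserved    :  7;
    } __packed;

    struct psc_buffer {
            struct psc_hdr hdr;
            struct psc_entry entries[VMGEXIT_PSC_MAX_COUNT];
    } __packed;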

Fixes: 266205d810d2 ("KVM: SEV: Add support to handle Page State Change VMGEXIT")
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/sev.c | 73 +++++++++++++++++++++++++++---------------
1 file changed, 47 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 35f0bd91f92e..ab23329e2bd0 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3555,43 +3555,50 @@ struct psc_buffer {

static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc);

-static int snp_complete_psc(struct kvm_vcpu *vcpu)
+static void snp_reset_inflight_psc(struct vcpu_svm *svm)
+{
+ svm->sev_es.psc_idx = 0;
+ svm->sev_es.psc_inflight = 0;
+ svm->sev_es.psc_2m = false;
+}
+
+static void __snp_complete_psc(struct vcpu_svm *svm)
{
- struct vcpu_svm *svm = to_svm(vcpu);
struct psc_buffer *psc = svm->sev_es.ghcb_sa;
struct psc_entry *entries = psc->entries;
struct psc_hdr *hdr = &psc->hdr;
- __u64 psc_ret;
__u16 idx;

- if (vcpu->run->hypercall.ret) {
- psc_ret = VMGEXIT_PSC_ERROR_GENERIC;
- goto out_resume;
- }
-
/*
* Everything in-flight has been processed successfully. Update the
- * corresponding entries in the guest's PSC buffer.
+ * corresponding entries in the guest's PSC buffer and zero out the
+ * count of in-flight PSC entries.
*/
for (idx = svm->sev_es.psc_idx; svm->sev_es.psc_inflight;
svm->sev_es.psc_inflight--, idx++) {
struct psc_entry *entry = &entries[idx];

- entry->cur_page = svm->sev_es.psc_2m ? 512 : 1;
+ entry->cur_page = entry->pagesize ? 512 : 1;
}

hdr->cur_entry = idx;
+}

- /* Handle the next range (if any). */
- return snp_begin_psc(svm, psc);
+static int snp_complete_psc(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct psc_buffer *psc = svm->sev_es.ghcb_sa;

-out_resume:
- svm->sev_es.psc_idx = 0;
- svm->sev_es.psc_inflight = 0;
- svm->sev_es.psc_2m = false;
- ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);
+ if (vcpu->run->hypercall.ret) {
+ snp_reset_inflight_psc(svm);
+ ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, VMGEXIT_PSC_ERROR_GENERIC);
+ return 1; /* resume guest */
+ }

- return 1; /* resume guest */
+ __snp_complete_psc(svm);
+
+ /* Handle the next range (if any). */
+ return snp_begin_psc(svm, psc);
}

static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
@@ -3634,6 +3641,7 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
goto out_resume;
}

+next_range:
/* Find the start of the next range which needs processing. */
for (idx = idx_start; idx <= idx_end; idx++, hdr->cur_entry++) {
__u16 cur_page;
@@ -3642,11 +3650,6 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)

entry_start = entries[idx];

- /* Only private/shared conversions are currently supported. */
- if (entry_start.operation != VMGEXIT_PSC_OP_PRIVATE &&
- entry_start.operation != VMGEXIT_PSC_OP_SHARED)
- continue;
-
gfn = entry_start.gfn;
cur_page = entry_start.cur_page;
huge = entry_start.pagesize;
@@ -3687,6 +3690,7 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)

if (entry.operation != entry_start.operation ||
entry.gfn != entry_start.gfn + npages ||
+ entry.cur_page != 0 ||
!!entry.pagesize != svm->sev_es.psc_2m)
break;

@@ -3694,6 +3698,25 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
npages += entry_start.pagesize ? 512 : 1;
}

+ /*
+ * Only shared/private PSC operations are currently supported, so if the
+ * entire range consists of unsupported operations (e.g. SMASH/UNSMASH),
+ * then consider the entire range completed and avoid exiting to
+ * userspace. In theory snp_complete_psc() can always be called directly
+ * at this point to complete the current range and start the next one,
+ * but that could lead to unexpected levels of recursion, so only do
+ * that if there are no more entries to process and the entire request
+ * has been completed.
+ */
+ if (entry_start.operation != VMGEXIT_PSC_OP_PRIVATE &&
+ entry_start.operation != VMGEXIT_PSC_OP_SHARED) {
+ if (idx > idx_end)
+ return snp_complete_psc(vcpu);
+
+ __snp_complete_psc(svm);
+ goto next_range;
+ }
+
vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
vcpu->run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
vcpu->run->hypercall.args[0] = gpa;
@@ -3709,9 +3732,7 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
return 0; /* forward request to userspace */

out_resume:
- svm->sev_es.psc_idx = 0;
- svm->sev_es.psc_inflight = 0;
- svm->sev_es.psc_2m = false;
+ snp_reset_inflight_psc(svm);
ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);

return 1; /* resume guest */
--
2.25.1


2024-05-10 02:38:19

by Michael Roth

[permalink] [raw]
Subject: Re: [PATCH v15 00/20] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

On Tue, May 07, 2024 at 01:14:24PM -0500, Michael Roth wrote:
> On Tue, May 07, 2024 at 08:04:50PM +0200, Paolo Bonzini wrote:
> > On Wed, May 1, 2024 at 11:03 AM Michael Roth <[email protected]> wrote:
> > >
> > > This patchset is also available at:
> > >
> > > https://github.com/amdese/linux/commits/snp-host-v15
> > >
> > > and is based on top of the series:
> > >
> > > "Add SEV-ES hypervisor support for GHCB protocol version 2"
> > > https://lore.kernel.org/kvm/[email protected]/
> > > https://github.com/amdese/linux/commits/sev-init2-ghcb-v1
> > >
> > > which in turn is based on commit 20cc50a0410f (just before v14 SNP patches):
> > >
> > > https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=kvm-coco-queue
> >
> > I have mostly reviewed this, with the exception of the
> > snp_begin/complete_psc parts.
>
> Thanks Paolo. We actually recently uncovered some issues with
> snp_begin/complete_psc using some internal kvm-unit-tests that exercise
> some edge cases, so I would hold off on reviewing that. Will send a
> fix-up patch today after a bit more testing.

In the process of adding some additional unit tests we uncovered a
couple of other issues in addition to the fixups for PSC:

[PATCH 21/20] KVM: MMU: Disable fast path for private memslots

addresses an issue with fast_page_fault() handling that can lead to
KVM_EXIT_MEMORY_FAULT cases being treated as spurious #NPFs which
results in the guest spinning forever. This seems like it could be
generally needed for both SNP/TDX, and would likely replace the need
for this patch from the TDX series:

KVM: x86/mmu: Disallow fast page fault on private GPA
https://lore.kernel.org/lkml/91c797997b57056224571e22362321a23947172f.1705965635.git.isaku.yamahata@intel.com/

This is a standalone patch and not really a fixup for anything.

[PATCH 22/20] KVM: SEV: Fix return code interpretation for RMP nested page faults

addresses an issue where the return code of kvm_mmu_page_fault() was
being misinterpreted, leading to sev_handle_rmp_fault() being called
unnecessarily in some cases. Interestingly, because
sev_handle_rmp_fault() results in zapping sPTEs after PSMASH'ing them,
this bug was hiding the issue addressed in the above PATCH 21 by
forcing the fast path to get skipped. This can be squashed into:

KVM: SEV: Add support to handle RMP nested page faults

[PATCH 23/20] KVM: SEV: Fix PSC handling for SMASH/UNSMASH and partial update ops

fixes up the GHCB PSC handling code to address a number of situations
that aren't triggered by normal SNP guests, but are allowed by the
GHCB spec and could become issues with future/other guest
implementations. This can be squashed into:

KVM: SEV: Add support to handle Page State Change VMGEXIT

I've sent them all as a response to this series, but have them available
here applied on top of your current kvm/queue (commit 15889fca49df):

https://github.com/mdroth/linux/commits/snp-host-v15c2-unsquashed
(the patch at the top can be ignored, it's only for testing 2MB gmem
backing pages)

I've also put together a branch with the patches already squashed in
(except for "KVM: MMU: Disable fast path for private memslots" which is
a standalone patch that is likely applicable to both TDX and SNP, so
I've simply moved it to the beginning of the SNP series)

https://github.com/mdroth/linux/commits/snp-host-v15c2

Sorry for the late fixes. Let me know if you want me to submit any of
these by some other means.

-Mike


>
>
> -Mike
>
> >
> > Paolo
> >
> >
> > > Patch Layout
> > > ------------
> > >
> > > 01-02: These patches revert+replace the existing .gmem_validate_fault hook
> > > with a similar .private_max_mapping_level as suggested by Sean[1]
> > >
> > > 03-04: These patches add some basic infrastructure and introduce a new
> > > KVM_X86_SNP_VM vm_type to handle differences versus the existing
> > > KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM types.
> > >
> > > 05-07: These implement the KVM API to handle the creation of a
> > > cryptographic launch context, encrypt/measure the initial image
> > > into guest memory, and finalize it before launching it.
> > >
> > > 08-12: These implement handling for various guest-generated events such
> > > as page state changes, onlining of additional vCPUs, etc.
> > >
> > > 13-16: These implement the gmem/mmu hooks needed to prepare gmem-allocated
> > > pages before mapping them into guest private memory ranges as
> > > well as cleaning them up prior to returning them to the host for
> > > use as normal memory. Because this supplants certain activities
> > > like issued WBINVDs during KVM MMU invalidations, there's also
> > > a patch to avoid duplicating that work to avoid unnecessary
> > > overhead.
> > >
> > > 17: With all the core support in place, the patch adds a kvm_amd module
> > > parameter to enable SNP support.
> > >
> > > 18-20: These patches all deal with the servicing of guest requests to handle
> > > things like attestation, as well as some related host-management
> > > interfaces.
> > >
> > > [1] https://lore.kernel.org/kvm/[email protected]/#t
> > >
> > >
> > > Testing
> > > -------
> > >
> > > For testing this via QEMU, use the following tree:
> > >
> > > https://github.com/amdese/qemu/commits/snp-v4-wip3c
> > >
> > > A patched OVMF is also needed due to upstream KVM no longer supporting MMIO
> > > ranges that are mapped as private. It is recommended you build the AmdSevX64
> > > variant as it provides the kernel-hashing support present in this series:
> > >
> > > https://github.com/amdese/ovmf/commits/apic-mmio-fix1d
> > >
> > > A basic command-line invocation for SNP would be:
> > >
> > > qemu-system-x86_64 -smp 32,maxcpus=255 -cpu EPYC-Milan-v2
> > > -machine q35,confidential-guest-support=sev0,memory-backend=ram1
> > > -object memory-backend-memfd,id=ram1,size=4G,share=true,reserve=false
> > > -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,id-auth=
> > > -bios OVMF_CODE-upstream-20240410-apic-mmio-fix1d-AmdSevX64.fd
> > >
> > > With kernel-hashing and certificate data supplied:
> > >
> > > qemu-system-x86_64 -smp 32,maxcpus=255 -cpu EPYC-Milan-v2
> > > -machine q35,confidential-guest-support=sev0,memory-backend=ram1
> > > -object memory-backend-memfd,id=ram1,size=4G,share=true,reserve=false
> > > -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,id-auth=,certs-path=/home/mroth/cert.blob,kernel-hashes=on
> > > -bios OVMF_CODE-upstream-20240410-apic-mmio-fix1d-AmdSevX64.fd
> > > -kernel /boot/vmlinuz-$ver
> > > -initrd /boot/initrd.img-$ver
> > > -append "root=UUID=d72a6d1c-06cf-4b79-af43-f1bac4f620f9 ro console=ttyS0,115200n8"
> > >
> > > With standard X64 OVMF package with separate image for persistent NVRAM:
> > >
> > > qemu-system-x86_64 -smp 32,maxcpus=255 -cpu EPYC-Milan-v2
> > > -machine q35,confidential-guest-support=sev0,memory-backend=ram1
> > > -object memory-backend-memfd,id=ram1,size=4G,share=true,reserve=false
> > > -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,id-auth=
> > > -bios OVMF_CODE-upstream-20240410-apic-mmio-fix1d.fd
> > > -drive if=pflash,format=raw,unit=0,file=OVMF_VARS-upstream-20240410-apic-mmio-fix1d.fd,readonly=off
> > >
> > >
> > > Known issues / TODOs
> > > --------------------
> > >
> > > * Base tree in some cases reports "Unpatched return thunk in use. This should
> > > not happen!" the first time it runs an SVM/SEV/SNP guest. This is a recent
> > > regression upstream and unrelated to this series:
> > >
> > > https://lore.kernel.org/linux-kernel/CANpmjNOcKzEvLHoGGeL-boWDHJobwfwyVxUqMq2kWeka3N4tXA@mail.gmail.com/T/
> > >
> > > * 2MB hugepage support has been dropped pending discussion on how we plan to
> > > re-enable it in gmem.
> > >
> > > * Host kexec should work, but there is a known issue with host kdump support
> > > while SNP guests are running that will be addressed as a follow-up.
> > >
> > > * SNP kselftests are currently a WIP and will be included as part of SNP
> > > upstreaming efforts in the near-term.
> > >
> > >
> > > SEV-SNP Overview
> > > ----------------
> > >
> > > This part of the Secure Nested Paging (SEV-SNP) series focuses on the
> > > changes required to add KVM support for SEV-SNP. This series builds upon
> > > SEV-SNP guest support, which is now in mainline, and SEV-SNP host
> > > initialization support, which is now in linux-next.
> > >
> > > While this series provides the basic building blocks to support booting
> > > SEV-SNP VMs, it does not cover all the security enhancements introduced
> > > by SEV-SNP, such as interrupt protection, which will be added in the
> > > future.
> > >
> > > With SNP, when pages are marked as guest-owned in the RMP table, they are
> > > assigned to a specific guest/ASID, as well as a specific GFN within the
> > > guest. Any attempts to map it in the RMP table to a different guest/ASID,
> > > or a different GFN within a guest/ASID, will result in an RMP nested page
> > > fault.
> > >
> > > Prior to accessing a guest-owned page, the guest must validate it with a
> > > special PVALIDATE instruction which will set a special bit in the RMP table
> > > for the guest. This is the only way to set the validated bit outside of the
> > > initial pre-encrypted guest payload/image; any attempts outside the guest to
> > > modify the RMP entry from that point forward will result in the validated
> > > bit being cleared, at which point the guest will trigger an exception if it
> > > attempts to access that page so it can be made aware of possible tampering.
> > >
> > > One exception to this is the initial guest payload, which is pre-validated
> > > by the firmware prior to launching. The guest can use Guest Message requests
> > > to fetch an attestation report which will include the measurement of the
> > > initial image so that the guest can verify it was booted with the expected
> > > image/environment.
> > >
> > > After boot, guests can use Page State Change requests to switch pages
> > > between shared/hypervisor-owned and private/guest-owned to share data for
> > > things like DMA, virtio buffers, and other GHCB requests.
> > >
> > > In this implementation of SEV-SNP, private guest memory is managed by a new
> > > kernel framework called guest_memfd (gmem). With gmem, a new
> > > KVM_SET_MEMORY_ATTRIBUTES KVM ioctl has been added to tell the KVM
> > > MMU whether a particular GFN should be backed by shared (normal) memory or
> > > private (gmem-allocated) memory. To tie into this, Page State Change
> > > requests are forwarded to userspace via KVM_EXIT_VMGEXIT exits, which will
> > > then issue the corresponding KVM_SET_MEMORY_ATTRIBUTES call to set the
> > > private/shared state in the KVM MMU.
> > >
> > > The gmem / KVM MMU hooks implemented in this series will then update the RMP
> > > table entries for the backing PFNs to set them to guest-owned/private when
> > > mapping private pages into the guest via KVM MMU, or use the normal KVM MMU
> > > handling in the case of shared pages where the corresponding RMP table
> > > entries are left in the default shared/hypervisor-owned state.
> > >
> > > Feedback/review is very much appreciated!
> > >
> > > -Mike
> > >
> > >
> > > Changes since v14:
> > >
> > > * switch to vendor-agnostic KVM_HC_MAP_GPA_RANGE exit for forwarding
> > > page-state change requests to userspace instead of an SNP-specific exit
> > > (Sean)
> > > * drop SNP_PAUSE_ATTESTATION/SNP_RESUME_ATTESTATION interfaces, instead
> > > add handling in KVM_EXIT_VMGEXIT so that VMMs can implement their own
> > > mechanisms for keeping userspace-supplied certificates in-sync with
> > > firmware's TCB/endorsement key (Sean)
> > > * carve out SEV-ES-specific handling for GHCB protocol 2, add control of
> > > the protocol version, and post as a separate prereq patchset (Sean)
> > > * use more consistent error-handling in snp_launch_{start,update,finish},
> > > simplify logic based on review comments (Sean)
> > > * rename .gmem_validate_fault to .private_max_mapping_level and rework
> > > logic based on review suggestions (Sean)
> > > * reduce number of pr_debug()'s in series, avoid multiple WARN's in
> > > succession (Sean)
> > > * improve documentation and comments throughout
> > >
> > > Changes since v13:
> > >
> > > * rebase to new kvm-coco-queue and wire up to PFERR_PRIVATE_ACCESS (Paolo)
> > > * handle setting kvm->arch.has_private_mem in same location as
> > > kvm->arch.has_protected_state (Paolo)
> > > * add flags and additional padding fields to
> > > snp_launch{start,update,finish} APIs to address alignment and
> > > expandability (Paolo)
> > > * update snp_launch_update() to update input struct values to reflect
> > > current progress of the command in situations where multiple calls are
> > > needed (Paolo)
> > > * update snp_launch_update() to avoid copying/accessing 'src' parameter
> > > when dealing with zero pages. (Paolo)
> > > * update snp_launch_update() to use u64 as length input parameter instead
> > > of u32 and adjust padding accordingly
> > > * modify ordering of SNP_POLICY_MASK_* definitions to be consistent with
> > > bit order of corresponding flags
> > > * let firmware handle enforcement of policy bits corresponding to
> > > user-specified minimum API version
> > > * add missing "0x" prefixes in pr_debug()'s for snp_launch_start()
> > > * fix handling of VMSAs during in-place migration (Paolo)
> > >
> > > Changes since v12:
> > >
> > > * rebased to latest kvm-coco-queue branch (commit 4d2deb62185f)
> > > * add more input validation for SNP_LAUNCH_START, especially for handling
> > > things like MBO/MBZ policy bits, and API major/minor minimums. (Paolo)
> > > * block SNP KVM instances from being able to run legacy SEV commands (Paolo)
> > > * don't attempt to measure VMSA for vcpu 0/BSP before the others, let
> > > userspace deal with the ordering just like with SEV-ES (Paolo)
> > > * fix up docs for SNP_LAUNCH_FINISH (Paolo)
> > > * introduce svm->sev_es.snp_has_guest_vmsa flag to better distinguish
> > > handling for guest-mapped vs non-guest-mapped VMSAs, rename
> > > 'snp_ap_create' flag to 'snp_ap_waiting_for_reset' (Paolo)
> > > * drop "KVM: SEV: Use a VMSA physical address variable for populating VMCB"
> > > as it is no longer needed due to above VMSA rework
> > > * replace pr_debug_ratelimited() messages for RMP #NPFs with a single trace
> > > event
> > > * handle transient PSMASH_FAIL_INUSE return codes in kvm_gmem_invalidate(),
> > > switch to WARN_ON*()'s to indicate remaining error cases are not expected
> > > and should not be seen in practice. (Paolo)
> > > * add a cond_resched() in kvm_gmem_invalidate() to avoid soft lock-ups when
> > > cleaning up large guest memory ranges.
> > > * rename VLEK_REQUIRED to VCEK_DISABLE; it'd be more applicable if another
> > > key type ever gets added.
> > > * don't allow attestation to be paused while an attestation request is
> > > being processed by firmware (Tom)
> > > * add missing Documentation entry for SNP_VLEK_LOAD
> > > * collect Reviewed-by's from Paolo and Tom
> > >
> > >
> > > ----------------------------------------------------------------
> > > Ashish Kalra (1):
> > > KVM: SEV: Avoid WBINVD for HVA-based MMU notifications for SNP
> > >
> > > Brijesh Singh (8):
> > > KVM: SEV: Add initial SEV-SNP support
> > > KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command
> > > KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command
> > > KVM: SEV: Add KVM_SEV_SNP_LAUNCH_FINISH command
> > > KVM: SEV: Add support to handle GHCB GPA register VMGEXIT
> > > KVM: SEV: Add support to handle RMP nested page faults
> > > KVM: SVM: Add module parameter to enable SEV-SNP
> > > KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event
> > >
> > > Michael Roth (10):
> > > Revert "KVM: x86: Add gmem hook for determining max NPT mapping level"
> > > KVM: x86: Add hook for determining max NPT mapping level
> > > KVM: SEV: Select KVM_GENERIC_PRIVATE_MEM when CONFIG_KVM_AMD_SEV=y
> > > KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT
> > > KVM: SEV: Add support to handle Page State Change VMGEXIT
> > > KVM: SEV: Implement gmem hook for initializing private pages
> > > KVM: SEV: Implement gmem hook for invalidating private pages
> > > KVM: x86: Implement hook for determining max NPT mapping level
> > > KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event
> > > crypto: ccp: Add the SNP_VLEK_LOAD command
> > >
> > > Tom Lendacky (1):
> > > KVM: SEV: Support SEV-SNP AP Creation NAE event
> > >
> > > Documentation/virt/coco/sev-guest.rst | 19 +
> > > Documentation/virt/kvm/api.rst | 87 ++
> > > .../virt/kvm/x86/amd-memory-encryption.rst | 110 +-
> > > arch/x86/include/asm/kvm-x86-ops.h | 2 +-
> > > arch/x86/include/asm/kvm_host.h | 5 +-
> > > arch/x86/include/asm/sev-common.h | 25 +
> > > arch/x86/include/asm/sev.h | 3 +
> > > arch/x86/include/asm/svm.h | 9 +-
> > > arch/x86/include/uapi/asm/kvm.h | 48 +
> > > arch/x86/kvm/Kconfig | 3 +
> > > arch/x86/kvm/mmu.h | 2 -
> > > arch/x86/kvm/mmu/mmu.c | 27 +-
> > > arch/x86/kvm/svm/sev.c | 1538 +++++++++++++++++++-
> > > arch/x86/kvm/svm/svm.c | 44 +-
> > > arch/x86/kvm/svm/svm.h | 52 +
> > > arch/x86/kvm/trace.h | 31 +
> > > arch/x86/kvm/x86.c | 17 +
> > > drivers/crypto/ccp/sev-dev.c | 36 +
> > > include/linux/psp-sev.h | 4 +-
> > > include/uapi/linux/kvm.h | 23 +
> > > include/uapi/linux/psp-sev.h | 27 +
> > > include/uapi/linux/sev-guest.h | 9 +
> > > virt/kvm/guest_memfd.c | 4 +-
> > > 23 files changed, 2081 insertions(+), 44 deletions(-)
> > >
> >
> >
>

2024-05-10 13:50:52

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v15 21/23] KVM: MMU: Disable fast path for private memslots

On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <[email protected]> wrote:
>
> > + * Since software-protected VMs don't have a notion of a shared vs.
> > + * private that's separate from what KVM is tracking, the above
> > + * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > + * special handling for that case for now.
>
> Very technically, it can occur if userspace _just_ modified the attributes. And
> as I've said multiple times, at least for now, I want to avoid special casing
> SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
is to allow testing flows that are impossible to exercise without SNP/TDX hardware.

Yep, it is not like they have to be optimized.

> > + */
> > + if (kvm_slot_can_be_private(fault->slot) &&
> > + !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> > + vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
>
> Heh, !(x && y) kills me, I misread this like 4 times.
>
> Anyways, I don't like the heuristic. It doesn't tie the restriction back to the
> cause in any reasonable way. Can't this simply be?
>
> if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> return false;

You beat me to it by seconds. And it can also be guarded by a check on
kvm->arch.has_private_mem to avoid the attributes lookup.
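
Something like the following minimal sketch, combining both
suggestions (illustrative only; the has_private_mem field name is
taken from the changelog earlier in the series):

    if (vcpu->kvm->arch.has_private_mem &&
        fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
            return false;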

> Which is much, much more self-explanatory.

Both more self-explanatory and more correct.

Paolo


2024-05-10 13:58:56

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v15 22/23] KVM: SEV: Fix return code interpretation for RMP nested page faults

On Thu, May 09, 2024, Michael Roth wrote:
> The intended logic when handling #NPFs with the RMP bit set (31) is to
> first check to see if the #NPF requires a shared<->private transition
> and, if so, to go ahead and let the corresponding KVM_EXIT_MEMORY_FAULT
> get forwarded on to userspace before proceeding with any handling of
> other potential RMP fault conditions like needing to PSMASH the RMP
> entry/etc (which will be done later if the guest still re-faults after
> the KVM_EXIT_MEMORY_FAULT is processed by userspace).
>
> The determination of whether any userspace handling of
> KVM_EXIT_MEMORY_FAULT is needed is done by interpreting the return code
> of kvm_mmu_page_fault(). However, the current code misinterprets the
> return code, expecting 0 to indicate a userspace exit rather than less
> than 0 (-EFAULT). This leads to the following unexpected behavior:
>
> - for KVM_EXIT_MEMORY_FAULTs resulting from implicit shared->private
> conversions, warnings get printed from sev_handle_rmp_fault()
> because it does not expect to be called for GPAs where
> KVM_MEMORY_ATTRIBUTE_PRIVATE is not set. Standard Linux guests don't
> generally do this, but it is allowed and should be handled
> similarly to private->shared conversions rather than triggering any
> sort of warnings.
>
> - if gmem support for 2MB folios is enabled (via currently out-of-tree
> code), implicit shared<->private conversions will always result in
> a PSMASH being attempted, even if it's not actually needed to
> resolve the RMP fault. This doesn't cause any harm, but results in a
> needless PSMASH and zapping of the sPTE
>
> Resolve these issues by calling sev_handle_rmp_fault() only when
> kvm_mmu_page_fault()'s return code is greater than or equal to 0,
> indicating a KVM_EXIT_MEMORY_FAULT/-EFAULT isn't needed. While here,
> simplify the code slightly and fix up the associated comments for better
> clarity.
>
> Fixes: ccc9d836c5c3 ("KVM: SEV: Add support to handle RMP nested page faults")
>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/kvm/svm/svm.c | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 426ad49325d7..9431ce74c7d4 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2070,14 +2070,12 @@ static int npf_interception(struct kvm_vcpu *vcpu)
> svm->vmcb->control.insn_len);
>
> /*
> - * rc == 0 indicates a userspace exit is needed to handle page
> - * transitions, so do that first before updating the RMP table.
> + * rc < 0 indicates a userspace exit may be needed to handle page
> + * attribute updates, so deal with that first before handling other
> + * potential RMP fault conditions.
> */
> - if (error_code & PFERR_GUEST_RMP_MASK) {
> - if (rc == 0)
> - return rc;
> + if (rc >= 0 && error_code & PFERR_GUEST_RMP_MASK)

This isn't correct either. A return of '0' also indicates "exit to userspace",
it just doesn't happen with SNP because '0' is returned only when KVM attempts
emulation, and that too gets short-circuited by svm_check_emulate_instruction().

And I would honestly drop the comment, KVM's less-than-pleasant 1/0/-errno return
values overload is ubiquitous enough that it should be relatively self-explanatory.

Or if you prefer to keep a comment, drop the part that specifically calls out
attributes updates, because that incorrectly implies that's the _only_ reason
why KVM checks the return. But my vote is to drop the comment, because it
essentially becomes "don't proceed to step 2 if step 1 failed", which kind of
makes the reader go "well, yeah".

2024-05-10 15:38:17

by Michael Roth

[permalink] [raw]
Subject: Re: [PATCH v15 21/23] KVM: MMU: Disable fast path for private memslots

On Fri, May 10, 2024 at 03:50:26PM +0200, Paolo Bonzini wrote:
> On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <[email protected]> wrote:
> >
> > > + * Since software-protected VMs don't have a notion of a shared vs.
> > > + * private that's separate from what KVM is tracking, the above
> > > + * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > > + * special handling for that case for now.
> >
> > Very technically, it can occur if userspace _just_ modified the attributes. And
> > as I've said multiple times, at least for now, I want to avoid special casing
> > SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> > is to allow testing flows that are impossible to exercise without SNP/TDX hardware.
>
> Yep, it is not like they have to be optimized.

Ok, I thought there were maybe some future plans to use sw-protected VMs
to get some added protections from userspace. But even then there'd
probably still be extra considerations for how to handle access tracking,
so whitelisting them probably isn't right anyway.

I was also partly tempted to take this route because it would cover this
TDX patch as well:

https://lore.kernel.org/lkml/91c797997b57056224571e22362321a23947172f.1705965635.git.isaku.yamahata@intel.com/

and avoid any weirdness about checking kvm_mem_is_private() without
checking mmu_invalidate_seq, but I think those cases all end up
resolving themselves eventually, and I added some comments around that.

>
> > > + */
> > > + if (kvm_slot_can_be_private(fault->slot) &&
> > > + !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> > > + vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
> >
> > Heh, !(x && y) kills me, I misread this like 4 times.
> >
> > Anyways, I don't like the heuristic. It doesn't tie the restriction back to the
> > cause in any reasonable way. Can't this simply be?
> >
> > if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> > return false;
>
> You beat me to it by seconds. And it can also be guarded by a check on
> kvm->arch.has_private_mem to avoid the attributes lookup.

I re-tested with things implemented this way and everything seems to
look good. It's not clear to me whether this would cover the cases the
above-mentioned TDX patch handles, but no biggie if that's still needed.

The new version of the patch is here:

https://github.com/mdroth/linux/commit/39643f9f6da6265d39d633a703c53997985c1208

And I've updated my branches to replace the old patch and also
incorporate Sean's suggestions for patch 22:

https://github.com/mdroth/linux/commits/snp-host-v15c3-unsquashed

and have them here with things already squashed in/relocated:

https://github.com/mdroth/linux/commits/snp-host-v15c3

Thanks for the feedback Sean, Paolo.

-Mike

>
> > Which is much, much more self-explanatory.
>
> Both more self-explanatory and more correct.
>
> Paolo
>
>

2024-05-10 15:59:20

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v15 21/23] KVM: MMU: Disable fast path for private memslots

On Fri, May 10, 2024, Michael Roth wrote:
> On Fri, May 10, 2024 at 03:50:26PM +0200, Paolo Bonzini wrote:
> > On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > > + * Since software-protected VMs don't have a notion of a shared vs.
> > > > + * private that's separate from what KVM is tracking, the above
> > > > + * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > > > + * special handling for that case for now.
> > >
> > > Very technically, it can occur if userspace _just_ modified the attributes. And
> > > as I've said multiple times, at least for now, I want to avoid special casing
> > > SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> > > is to allow testing flows that are impossible to excercise without SNP/TDX hardware.
> >
> > Yep, it is not like they have to be optimized.
>
> Ok, I thought there were maybe some future plans to use sw-protected VMs
> to get some added protections from userspace. But even then there'd
> probably still be extra considerations for how to handle access tracking
> so white-listing them probably isn't right anyway.
>
> I was also partly tempted to take this route because it would cover this
> TDX patch as well:
>
> https://lore.kernel.org/lkml/91c797997b57056224571e22362321a23947172f.1705965635.git.isaku.yamahata@intel.com/

Hmm, I'm pretty sure that patch is trying to fix the exact same issue you are
fixing, just in a less precise way. S-EPT entries only support RWX=0 and RWX=111b,
i.e. it should be impossible to have a write-fault to a present S-EPT entry.

And if TDX is running afoul of this code:

if (!fault->present)
return !kvm_ad_enabled();

then KVM should do the sane thing and require A/D support be enabled for TDX.

And if it's something else entirely, that changelog has some explaining to do.

> and avoid any weirdness about checking kvm_mem_is_private() without
> checking mmu_invalidate_seq, but I think those cases all end up
> resolving themselves eventually and added some comments around that.

Yep, checking state that is protected by mmu_invalidate_seq outside of mmu_lock
is definitely allowed, e.g. the entire fast page fault path operates outside of
mmu_lock and thus outside of mmu_invalidate_seq's purview.

It's a-ok because the SPTE updates are done with an atomic CMPXCHG, and so KVM only needs
to ensure its page tables aren't outright _freed_. If the zap triggered by the
attributes change "wins", then the fast #PF path will fail the CMPXCHG and be an
expensive NOP. If the fast #PF wins, the zap will pave over the fast #PF fix,
and the IPI+flush that is needed for all zaps, to ensure vCPUs don't have stale
references, does the rest.

And if there's an attributes race that causes the fast #PF to bail early, the vCPU
will see the correct state on the next page fault.
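
(Roughly the pattern being described, as a sketch of the publish step;
fast_pf_fix_direct_spte() in mmu.c is the actual implementation:)

    /*
     * Publish the fixed SPTE only if it still holds the value the
     * fast path computed against; if a zap won the race, the
     * cmpxchg fails and the fast path bails instead of clobbering
     * the zapped entry.
     */
    if (!try_cmpxchg64(sptep, &old_spte, new_spte))
            return false;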

2024-05-10 16:39:19

by Michael Roth

[permalink] [raw]
Subject: Re: [PATCH v15 22/23] KVM: SEV: Fix return code interpretation for RMP nested page faults

On Fri, May 10, 2024 at 06:01:52PM +0200, Paolo Bonzini wrote:
> On 5/10/24 15:58, Sean Christopherson wrote:
> > On Thu, May 09, 2024, Michael Roth wrote:
> > > The intended logic when handling #NPFs with the RMP bit set (31) is to
> > > first check to see if the #NPF requires a shared<->private transition
> > > and, if so, to go ahead and let the corresponding KVM_EXIT_MEMORY_FAULT
> > > get forwarded on to userspace before proceeding with any handling of
> > > other potential RMP fault conditions like needing to PSMASH the RMP
> > > entry/etc (which will be done later if the guest still re-faults after
> > > the KVM_EXIT_MEMORY_FAULT is processed by userspace).
> > >
> > > The determination of whether any userspace handling of
> > > KVM_EXIT_MEMORY_FAULT is needed is done by interpreting the return code
> > > of kvm_mmu_page_fault(). However, the current code misinterprets the
> > > return code, expecting 0 to indicate a userspace exit rather than less
> > > than 0 (-EFAULT). This leads to the following unexpected behavior:
> > >
> > > - for KVM_EXIT_MEMORY_FAULTs resulting from implicit shared->private
> > > conversions, warnings get printed from sev_handle_rmp_fault()
> > > because it does not expect to be called for GPAs where
> > > KVM_MEMORY_ATTRIBUTE_PRIVATE is not set. Standard Linux guests don't
> > > generally do this, but it is allowed and should be handled
> > > similarly to private->shared conversions rather than triggering any
> > > sort of warnings.
> > >
> > > - if gmem support for 2MB folios is enabled (via currently out-of-tree
> > > code), implicit shared<->private conversions will always result in
> > > a PSMASH being attempted, even if it's not actually needed to
> > > resolve the RMP fault. This doesn't cause any harm, but results in a
> > > needless PSMASH and zapping of the sPTE
> > >
> > > Resolve these issues by calling sev_handle_rmp_fault() only when
> > > kvm_mmu_page_fault()'s return code is greater than or equal to 0,
> > > indicating a KVM_EXIT_MEMORY_FAULT/-EFAULT isn't needed. While here,
> > > simplify the code slightly and fix up the associated comments for better
> > > clarity.
> > >
> > > Fixes: ccc9d836c5c3 ("KVM: SEV: Add support to handle RMP nested page faults")
> > >
> > > Signed-off-by: Michael Roth <[email protected]>
> > > ---
> > > arch/x86/kvm/svm/svm.c | 10 ++++------
> > > 1 file changed, 4 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > index 426ad49325d7..9431ce74c7d4 100644
> > > --- a/arch/x86/kvm/svm/svm.c
> > > +++ b/arch/x86/kvm/svm/svm.c
> > > @@ -2070,14 +2070,12 @@ static int npf_interception(struct kvm_vcpu *vcpu)
> > > svm->vmcb->control.insn_len);
> > > /*
> > > - * rc == 0 indicates a userspace exit is needed to handle page
> > > - * transitions, so do that first before updating the RMP table.
> > > + * rc < 0 indicates a userspace exit may be needed to handle page
> > > + * attribute updates, so deal with that first before handling other
> > > + * potential RMP fault conditions.
> > > */
> > > - if (error_code & PFERR_GUEST_RMP_MASK) {
> > > - if (rc == 0)
> > > - return rc;
> > > + if (rc >= 0 && error_code & PFERR_GUEST_RMP_MASK)
> >
> > This isn't correct either. A return of '0' also indicates "exit to userspace",
> > it just doesn't happen with SNP because '0' is returned only when KVM attempts
> > emulation, and that too gets short-circuited by svm_check_emulate_instruction().
> >
> > And I would honestly drop the comment, KVM's less-than-pleasant 1/0/-errno return
> > values overload is ubiquitous enough that it should be relatively self-explanatory.
> >
> > Or if you prefer to keep a comment, drop the part that specifically calls out
> > attributes updates, because that incorrectly implies that's the _only_ reason
> > why KVM checks the return. But my vote is to drop the comment, because it
> > essentially becomes "don't proceed to step 2 if step 1 failed", which kind of
> > makes the reader go "well, yeah".
>
> So IIUC you're suggesting
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 426ad49325d7..c39eaeb21981 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2068,16 +2068,11 @@ static int npf_interception(struct kvm_vcpu *vcpu)
> static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
> svm->vmcb->control.insn_bytes : NULL,
> svm->vmcb->control.insn_len);
> + if (rc <= 0)
> + return rc;
> - /*
> - * rc == 0 indicates a userspace exit is needed to handle page
> - * transitions, so do that first before updating the RMP table.
> - */
> - if (error_code & PFERR_GUEST_RMP_MASK) {
> - if (rc == 0)
> - return rc;
> + if (error_code & PFERR_GUEST_RMP_MASK)
> sev_handle_rmp_fault(vcpu, fault_address, error_code);
> - }
> return rc;
> }
>
> ?
>
> So, we're... a bit tight for 6.10 to include SNP and that is an
> understatement. My plan is to merge it for 6.11, but do so
> immediately after the merge window ends. In other words, it
> is a delay in terms of release but not in terms of time. I
> don't want QEMU and kvm-unit-tests work to be delayed any
> further, in particular.

That's unfortunate; I'd thought from the PUCK call that we still had
some time to stabilize things before the merge window. But whatever you
think is best.

>
> Once we sort out the loose ends of patches 21-23, you could send
> it as a pull request.

Ok, as a pull request against kvm/next, or kvm/queue?

Thanks,

Mike

>
> Paolo
>
>

2024-05-10 17:25:55

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v15 22/23] KVM: SEV: Fix return code interpretation for RMP nested page faults

On Fri, May 10, 2024 at 6:59 PM Paolo Bonzini <[email protected]> wrote:
> Well, the merge window starts next Sunday, doesn't it? If there's an
> -rc8 I agree there's some leeway, but that is not too likely.
>
> >> Once we sort out the loose ends of patches 21-23, you could send
> >> it as a pull request.
> > Ok, as a pull request against kvm/next, or kvm/queue?
>
> Against kvm/next.

Ah no, only kvm/queue has the preparatory hooks - they make no sense
without something that uses them. kvm/queue is ready now.

Also, please send the pull request "QEMU style", i.e. with patches
as replies.
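
(For example, one common way to do that, shown here only as an
illustration; branch names are taken from earlier in the thread, and
the message-id and recipients are placeholders:)

    # generate the summary mail body for the PULL request
    git request-pull kvm/next https://github.com/mdroth/linux \
        snp-host-v15c3 > pull-snp.txt

    # send the individual patches as threaded replies to the PULL mail
    git format-patch kvm/next..snp-host-v15c3 -o snp-pull/
    git send-email --in-reply-to='<pull-mail-msgid>' snp-pull/*.patch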

If there's an -rc8, I'll probably pull it on Thursday morning.

Paolo


2024-05-10 19:09:19

by Michael Roth

[permalink] [raw]
Subject: Re: [PATCH v15 23/23] KVM: SEV: Fix PSC handling for SMASH/UNSMASH and partial update ops

On Fri, May 10, 2024 at 07:09:07PM +0200, Paolo Bonzini wrote:
> On 5/10/24 03:58, Michael Roth wrote:
> > There are a few edge-cases that the current processing for GHCB PSC
> > requests doesn't handle properly:
> >
> > - KVM properly ignores SMASH/UNSMASH ops when they are embedded in a
> > PSC request buffer which contains other PSC operations, but
> > inadvertently forwards them to userspace as private->shared PSC
> > requests if they appear at the end of the buffer. Make sure these are
> > ignored instead, just like cases where they are not at the end of the
> > request buffer.
> >
> > - Current code handles non-zero 'cur_page' fields when they are at the
> > beginning of a new GPA range, but does not handle them properly when
> > iterating through subsequent entries that are otherwise part of a
> > contiguous range. Fix up the handling so that these entries are not
> > combined into a larger contiguous range that includes unintended GPA
> > ranges, and are instead processed later as the start of a new
> > contiguous range.
> >
> > - The page size variable used to track 2M entries in KVM for inflight PSCs
> > might be artificially set to a different value, which can lead to
> > unexpected values in the entry's final 'cur_page' update. Use the
> > entry's 'pagesize' field instead to determine what the value of
> > 'cur_page' should be upon completion of processing.
> >
> > While here, also add a small helper for clearing the in-flight PSC
> > variables and fix up comments for better readability.
> >
> > Fixes: 266205d810d2 ("KVM: SEV: Add support to handle Page State Change VMGEXIT")
> > Signed-off-by: Michael Roth <[email protected]>
>
> There are some more improvements that can be made to the readability of
> the code... this one is already better than the patch is fixing up, but I
> don't like the code that is in the loop even though it is unconditionally
> followed by "break".
>
> Here's my attempt at replacing this patch, which is really more of a
> rewrite of the whole function... Untested beyond compilation.

Thanks for the suggested rework. I tested with/without 2MB pages and
everything worked as-written. This is the full/squashed patch I plan to
include in the pull request:

https://github.com/mdroth/linux/commit/91f6d31c4dfc88dd1ac378e2db6117b0c982e63c

-Mike

>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 35f0bd91f92e..6e612789c35f 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3555,23 +3555,25 @@ struct psc_buffer {
>  static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc);
>
> -static int snp_complete_psc(struct kvm_vcpu *vcpu)
> +static void snp_complete_psc(struct vcpu_svm *svm, u64 psc_ret)
> +{
> +	svm->sev_es.psc_inflight = 0;
> +	svm->sev_es.psc_idx = 0;
> +	svm->sev_es.psc_2m = 0;
> +	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);
> +}
> +
> +static void __snp_complete_one_psc(struct vcpu_svm *svm)
>  {
> -	struct vcpu_svm *svm = to_svm(vcpu);
>  	struct psc_buffer *psc = svm->sev_es.ghcb_sa;
>  	struct psc_entry *entries = psc->entries;
>  	struct psc_hdr *hdr = &psc->hdr;
> -	__u64 psc_ret;
>  	__u16 idx;
>
> -	if (vcpu->run->hypercall.ret) {
> -		psc_ret = VMGEXIT_PSC_ERROR_GENERIC;
> -		goto out_resume;
> -	}
> -
>  	/*
>  	 * Everything in-flight has been processed successfully. Update the
> -	 * corresponding entries in the guest's PSC buffer.
> +	 * corresponding entries in the guest's PSC buffer and zero out the
> +	 * count of in-flight PSC entries.
>  	 */
>  	for (idx = svm->sev_es.psc_idx; svm->sev_es.psc_inflight;
>  	     svm->sev_es.psc_inflight--, idx++) {
> @@ -3581,17 +3583,22 @@ static int snp_complete_psc(struct kvm_vcpu *vcpu)
>  	}
>
>  	hdr->cur_entry = idx;
> +}
> +
> +static int snp_complete_one_psc(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +	struct psc_buffer *psc = svm->sev_es.ghcb_sa;
> +
> +	if (vcpu->run->hypercall.ret) {
> +		snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
> +		return 1; /* resume guest */
> +	}
> +
> +	__snp_complete_one_psc(svm);
>
>  	/* Handle the next range (if any). */
>  	return snp_begin_psc(svm, psc);
> -
> -out_resume:
> -	svm->sev_es.psc_idx = 0;
> -	svm->sev_es.psc_inflight = 0;
> -	svm->sev_es.psc_2m = false;
> -	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);
> -
> -	return 1; /* resume guest */
>  }
>
>  static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
> @@ -3601,18 +3608,20 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
>  	struct psc_hdr *hdr = &psc->hdr;
>  	struct psc_entry entry_start;
>  	u16 idx, idx_start, idx_end;
> -	__u64 psc_ret, gpa;
> +	u64 gfn;
>  	int npages;
> -
> -	/* There should be no other PSCs in-flight at this point. */
> -	if (WARN_ON_ONCE(svm->sev_es.psc_inflight)) {
> -		psc_ret = VMGEXIT_PSC_ERROR_GENERIC;
> -		goto out_resume;
> -	}
> +	bool huge;
>
>  	if (!(vcpu->kvm->arch.hypercall_exit_enabled & (1 << KVM_HC_MAP_GPA_RANGE))) {
> -		psc_ret = VMGEXIT_PSC_ERROR_GENERIC;
> -		goto out_resume;
> +		snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
> +		return 1;
> +	}
> +
> +next_range:
> +	/* There should be no other PSCs in-flight at this point. */
> +	if (WARN_ON_ONCE(svm->sev_es.psc_inflight)) {
> +		snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
> +		return 1;
>  	}
>
>  	/*
> @@ -3624,97 +3633,99 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
>  	idx_end = hdr->end_entry;
>
>  	if (idx_end >= VMGEXIT_PSC_MAX_COUNT) {
> -		psc_ret = VMGEXIT_PSC_ERROR_INVALID_HDR;
> -		goto out_resume;
> -	}
> -
> -	/* Nothing more to process. */
> -	if (idx_start > idx_end) {
> -		psc_ret = 0;
> -		goto out_resume;
> +		snp_complete_psc(svm, VMGEXIT_PSC_ERROR_INVALID_HDR);
> +		return 1;
>  	}
>
>  	/* Find the start of the next range which needs processing. */
>  	for (idx = idx_start; idx <= idx_end; idx++, hdr->cur_entry++) {
> -		__u16 cur_page;
> -		gfn_t gfn;
> -		bool huge;
> -
>  		entry_start = entries[idx];
> -
> -		/* Only private/shared conversions are currently supported. */
> -		if (entry_start.operation != VMGEXIT_PSC_OP_PRIVATE &&
> -		    entry_start.operation != VMGEXIT_PSC_OP_SHARED)
> -			continue;
> -
>  		gfn = entry_start.gfn;
> -		cur_page = entry_start.cur_page;
>  		huge = entry_start.pagesize;
> +		npages = huge ? 512 : 1;
>
> -		if ((huge && (cur_page > 512 || !IS_ALIGNED(gfn, 512))) ||
> -		    (!huge && cur_page > 1)) {
> -			psc_ret = VMGEXIT_PSC_ERROR_INVALID_ENTRY;
> -			goto out_resume;
> +		if (entry_start.cur_page > npages || !IS_ALIGNED(gfn, npages)) {
> +			snp_complete_psc(svm, VMGEXIT_PSC_ERROR_INVALID_ENTRY);
> +			return 1;
>  		}
>
> +		if (entry_start.cur_page) {
> +			/*
> +			 * If this is a partially-completed 2M range, force 4K
> +			 * handling for the remaining pages since they're effectively
> +			 * split at this point. Subsequent code should ensure this
> +			 * doesn't get combined with adjacent PSC entries where 2M
> +			 * handling is still possible.
> +			 */
> +			npages -= entry_start.cur_page;
> +			gfn += entry_start.cur_page;
> +			huge = false;
> +		}
> +
> +		if (npages)
> +			break;
> +
>  		/* All sub-pages already processed. */
> -		if ((huge && cur_page == 512) || (!huge && cur_page == 1))
> -			continue;
> -
> -		/*
> -		 * If this is a partially-completed 2M range, force 4K handling
> -		 * for the remaining pages since they're effectively split at
> -		 * this point. Subsequent code should ensure this doesn't get
> -		 * combined with adjacent PSC entries where 2M handling is still
> -		 * possible.
> -		 */
> -		svm->sev_es.psc_2m = cur_page ? false : huge;
> -		svm->sev_es.psc_idx = idx;
> -		svm->sev_es.psc_inflight = 1;
> -
> -		gpa = gfn_to_gpa(gfn + cur_page);
> -		npages = huge ? 512 - cur_page : 1;
> -		break;
>  	}
>
> +	if (idx > idx_end) {
> +		/* Nothing more to process. */
> +		snp_complete_psc(svm, 0);
> +		return 1;
> +	}
> +
> +	svm->sev_es.psc_2m = huge;
> +	svm->sev_es.psc_idx = idx;
> +	svm->sev_es.psc_inflight = 1;
> +
>  	/*
>  	 * Find all subsequent PSC entries that contain adjacent GPA
>  	 * ranges/operations and can be combined into a single
>  	 * KVM_HC_MAP_GPA_RANGE exit.
>  	 */
> -	for (idx = svm->sev_es.psc_idx + 1; idx <= idx_end; idx++) {
> +	while (++idx <= idx_end) {
>  		struct psc_entry entry = entries[idx];
>
>  		if (entry.operation != entry_start.operation ||
> -		    entry.gfn != entry_start.gfn + npages ||
> -		    !!entry.pagesize != svm->sev_es.psc_2m)
> +		    entry.gfn != gfn + npages ||
> +		    entry.cur_page ||
> +		    !!entry.pagesize != huge)
>  			break;
>
>  		svm->sev_es.psc_inflight++;
> -		npages += entry_start.pagesize ? 512 : 1;
> +		npages += huge ? 512 : 1;
>  	}
>
> -	vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
> -	vcpu->run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
> -	vcpu->run->hypercall.args[0] = gpa;
> -	vcpu->run->hypercall.args[1] = npages;
> -	vcpu->run->hypercall.args[2] = entry_start.operation == VMGEXIT_PSC_OP_PRIVATE
> -				       ? KVM_MAP_GPA_RANGE_ENCRYPTED
> -				       : KVM_MAP_GPA_RANGE_DECRYPTED;
> -	vcpu->run->hypercall.args[2] |= entry_start.pagesize
> -					? KVM_MAP_GPA_RANGE_PAGE_SZ_2M
> -					: KVM_MAP_GPA_RANGE_PAGE_SZ_4K;
> -	vcpu->arch.complete_userspace_io = snp_complete_psc;
> +	switch (entry_start.operation) {
> +	case VMGEXIT_PSC_OP_PRIVATE:
> +	case VMGEXIT_PSC_OP_SHARED:
> +		vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
> +		vcpu->run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
> +		vcpu->run->hypercall.args[0] = gfn_to_gpa(gfn);
> +		vcpu->run->hypercall.args[1] = npages;
> +		vcpu->run->hypercall.args[2] = entry_start.operation == VMGEXIT_PSC_OP_PRIVATE
> +					       ? KVM_MAP_GPA_RANGE_ENCRYPTED
> +					       : KVM_MAP_GPA_RANGE_DECRYPTED;
> +		vcpu->run->hypercall.args[2] |= huge
> +						? KVM_MAP_GPA_RANGE_PAGE_SZ_2M
> +						: KVM_MAP_GPA_RANGE_PAGE_SZ_4K;
> +		vcpu->arch.complete_userspace_io = snp_complete_one_psc;
>
> -	return 0; /* forward request to userspace */
> +		return 0; /* forward request to userspace */
>
> -out_resume:
> -	svm->sev_es.psc_idx = 0;
> -	svm->sev_es.psc_inflight = 0;
> -	svm->sev_es.psc_2m = false;
> -	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);
> +	default:
> +		/*
> +		 * Only shared/private PSC operations are currently supported, so if the
> +		 * entire range consists of unsupported operations (e.g. SMASH/UNSMASH),
> +		 * then consider the entire range completed and avoid exiting to
> +		 * userspace. In theory snp_complete_psc() can be called directly
> +		 * at this point to complete the current range and start the next one,
> +		 * but that could lead to unexpected levels of recursion.
> +		 */
> +		__snp_complete_one_psc(svm);
> +		goto next_range;
> +	}
>
> -	return 1; /* resume guest */
> +	unreachable();
>  }
>
>  static int __sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
>
>
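For context, the KVM_EXIT_HYPERCALL/KVM_HC_MAP_GPA_RANGE exits that
snp_begin_psc() forwards are serviced in the VMM with
KVM_SET_MEMORY_ATTRIBUTES. A minimal sketch of such a handler, assuming 4K
guest pages (the helper name and error handling are illustrative, not from
this series):

#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <linux/kvm_para.h>

/* Service a KVM_HC_MAP_GPA_RANGE exit generated by a guest PSC. */
static int handle_map_gpa_range(int vm_fd, struct kvm_run *run)
{
	struct kvm_memory_attributes attrs = {
		.address    = run->hypercall.args[0],		/* GPA */
		.size       = run->hypercall.args[1] * 4096,	/* # of 4K pages */
		.attributes = (run->hypercall.args[2] & KVM_MAP_GPA_RANGE_ENCRYPTED) ?
			      KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
	};

	/* Flip shared<->private; KVM zaps/rebuilds NPT mappings as needed. */
	if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs) < 0)
		return -1;

	run->hypercall.ret = 0;	/* checked by snp_complete_one_psc() on re-entry */
	return 0;
}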

2024-05-13 23:48:39

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v15 19/20] KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event

On Wed, May 01, 2024, Michael Roth wrote:
> Version 2 of GHCB specification added support for the SNP Extended Guest
> Request Message NAE event. This event serves a nearly identical purpose
> to the previously-added SNP_GUEST_REQUEST event, but allows for
> additional certificate data to be supplied via an additional
> guest-supplied buffer to be used mainly for verifying the signature of
> an attestation report as returned by firmware.
>
> This certificate data is supplied by userspace, so unlike with
> SNP_GUEST_REQUEST events, SNP_EXTENDED_GUEST_REQUEST events are first
> forwarded to userspace via a KVM_EXIT_VMGEXIT exit structure, and then
> the firmware request is made after the certificate data has been fetched
> from userspace.
>
> Since there is a potential for race conditions where the
> userspace-supplied certificate data may be out-of-sync relative to the
> reported TCB or VLEK that firmware will use when signing attestation
> reports, a hook is also provided so that userspace can be informed once
> the attestation request is actually completed. See the updates to
> Documentation/ for more details on these aspects.
>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> Documentation/virt/kvm/api.rst | 87 ++++++++++++++++++++++++++++++++++
> arch/x86/kvm/svm/sev.c | 86 +++++++++++++++++++++++++++++++++
> arch/x86/kvm/svm/svm.h | 3 ++
> include/uapi/linux/kvm.h | 23 +++++++++
> 4 files changed, 199 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f0b76ff5030d..f3780ac98d56 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -7060,6 +7060,93 @@ Please note that the kernel is allowed to use the kvm_run structure as the
> primary storage for certain register types. Therefore, the kernel may use the
> values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
>
> +::
> +
> + /* KVM_EXIT_VMGEXIT */
> + struct kvm_user_vmgexit {

LOL, it looks dumb, but maybe kvm_vmgexit_exit to avoid confusion about whether
the struct refers to host userspace vs. guest userspace?

Actually, I vote to punt on naming until more exits need to be kicked to userspace,
and just do (see below for details on how I got here):

	/* KVM_EXIT_VMGEXIT */
	struct {
		__u64 exit_code;
		union {
			struct {
				__u64 data_gpa;
				__u64 data_npages;
				__u64 ret;
			} req_certs;
		};
	} vmgexit;
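For illustration, a VMM consuming that layout would key off the raw GHCB exit
code, roughly as follows (fill_certs() is a hypothetical helper, and the
kvm_run member name is assumed from the sketch above):

#include <stdlib.h>
#include <linux/kvm.h>

/* GHCB-defined NAE event code for the extended guest request. */
#define SVM_VMGEXIT_EXT_GUEST_REQUEST	0x80000011ULL

static void handle_vmgexit(struct kvm_run *run)
{
	switch (run->vmgexit.exit_code) {
	case SVM_VMGEXIT_EXT_GUEST_REQUEST:
		/* fill_certs() (hypothetical) copies the cert blob into guest memory. */
		run->vmgexit.req_certs.ret =
			fill_certs(run->vmgexit.req_certs.data_gpa,
				   &run->vmgexit.req_certs.data_npages);
		break;
	default:
		abort();	/* unknown #VMGEXIT punted to userspace */
	}
}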

> +#define KVM_USER_VMGEXIT_REQ_CERTS	1
> +	__u32 type; /* KVM_USER_VMGEXIT_* type */

Regardless of whether or not requesting a certificate is vendor specific enough
to justify its own exit reason, I don't think KVM should have a #VMGEXIT that
adds its own layer. Structuring the user exit this way will make it weird and/or
difficult to handle #VMGEXITs that _do_ fit a generic pattern, e.g. a user might
wonder why PSC #VMGEXITs don't show up here.

And defining an exit reason that is, for all intents and purposes, a regurgitation
of the raw #VMGEXIT reason, but with a different value, is also confusing. E.g.
it wouldn't be unreasonable for a reader to expect that "type" matches the value
defined in the GHCB (or wherever the values are defined).

Ah, you copied what KVM does for Hyper-V and Xen emulation. Hrm. But only
partially.

Assuming it's impractical to have a generic user exit for this, and we think
there is a high likelihood of needing to punt more #VMGEXITs to userspace, then
we should more closely (perhaps even exactly) follow the Hyper-V and Xen models.
I.e. for all values and whatnot that are controlled/defined by a third party
(Hyper-V, Xen, the GHCB, etc.) #define those values in a header that is clearly
"owned" by the third party.

E.g. IIRC, include/xen/interface/xen.h is copied verbatim from Xen documentation
(source?). And include/asm-generic/hyperv-tlfs.h is the kernel's copy of the
TLFS, which dictates all of the Hyper-V hypercalls.

If we do that, then my concerns/objections largely go away, e.g. KVM isn't
defining magic values, there's less chance for confusion about what "type" holds,
etc.

Oh, and if we go that route, the sizes for all fields should follow the GHCB,
e.g. I believe the "type" should be a __u64.

> +	union {
> +		struct {
> +			__u64 data_gpa;
> +			__u64 data_npages;
> +#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_INVALID_LEN	1
> +#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_BUSY		2
> +#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_GENERIC	(1 << 31)

Hopefully it won't matter, but are BUSY and GENERIC actually defined somewhere?
I don't see them in GHCB 2.0.

In a perfect world, it would be nice for KVM to not have to care about the error
codes. But KVM disallows KVM_{G,S}ET_REGS for guest with protected state, which
means it's not feasible for userspace to set registers, at least not in any sane
way.

Heh, we could abuse KVM_SYNC_X86_REGS to let userspace specify RBX, but (a) that's
gross, and (b) KVM_SYNC_X86_REGS and KVM_SYNC_X86_SREGS really ought to be rejected
if guest state is protected.

> +			__u32 ret;
> +#define KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE	BIT(0)

This has no business being buried in the VMGEXIT_REQ_CERTS flags. Notifying
userspace that KVM completed its portion of a userspace exit is completely generic.

And aside from where the notification flag lives, _if_ we add a notification
mechanism, it belongs in a separate patch, because it's purely a performance
optimization. Userspace can use immediate_exit to force KVM to re-exit after
completing an exit.

Actually, I take that back, this isn't even an optimization, it's literally a
non-generic implementation of kvm_run.immediate_exit.

If this were an optimization, i.e. KVM truly notified userspace without exiting,
then it would need to be a lot more robust, e.g. to ensure userspace actually
received the notification before KVM moved on.

> +			__u8 flags;
> +#define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING	0
> +#define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE		1

This is also a weird reimplementation of generic functionality. KVM nullifies
vcpu->arch.complete_userspace_io _before_ invoking the callback. So if a callback
needs to run again on the next KVM_RUN, it can simply set complete_userspace_io
again. In other words, literally doing nothing will get you what you want :-)

> +			__u8 status;
> +		} req_certs;
> +	};
> +};
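The re-arm pattern being described would look roughly like this on the KVM
side (a minimal sketch using the proposed field names; not code from the
series):

/* Sketch: a completion callback that re-arms itself for a second exit. */
static int snp_req_certs_done(struct kvm_vcpu *vcpu)
{
	if (vcpu->run->vmgexit.req_certs.status == KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING) {
		/* First re-entry: issue the firmware request, then exit again. */
		vcpu->run->vmgexit.req_certs.status = KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE;
		/* KVM cleared complete_userspace_io before calling us; just set it again. */
		vcpu->arch.complete_userspace_io = snp_req_certs_done;
		return 0;	/* exit to userspace once more */
	}

	return 1;	/* resume guest */
}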

2024-05-14 02:55:06

by Michael Roth

[permalink] [raw]
Subject: Re: [PATCH v15 19/20] KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event

On Mon, May 13, 2024 at 04:48:25PM -0700, Sean Christopherson wrote:
> On Wed, May 01, 2024, Michael Roth wrote:
> > Version 2 of GHCB specification added support for the SNP Extended Guest
> > Request Message NAE event. This event serves a nearly identical purpose
> > to the previously-added SNP_GUEST_REQUEST event, but allows for
> > additional certificate data to be supplied via an additional
> > guest-supplied buffer to be used mainly for verifying the signature of
> > an attestation report as returned by firmware.
> >
> > This certificate data is supplied by userspace, so unlike with
> > SNP_GUEST_REQUEST events, SNP_EXTENDED_GUEST_REQUEST events are first
> > forwarded to userspace via a KVM_EXIT_VMGEXIT exit structure, and then
> > the firmware request is made after the certificate data has been fetched
> > from userspace.
> >
> > Since there is a potential for race conditions where the
> > userspace-supplied certificate data may be out-of-sync relative to the
> > reported TCB or VLEK that firmware will use when signing attestation
> > reports, a hook is also provided so that userspace can be informed once
> > the attestation request is actually completed. See the updates to
> > Documentation/ for more details on these aspects.
> >
> > Signed-off-by: Michael Roth <[email protected]>
> > ---
> > Documentation/virt/kvm/api.rst | 87 ++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/svm/sev.c | 86 +++++++++++++++++++++++++++++++++
> > arch/x86/kvm/svm/svm.h | 3 ++
> > include/uapi/linux/kvm.h | 23 +++++++++
> > 4 files changed, 199 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f0b76ff5030d..f3780ac98d56 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -7060,6 +7060,93 @@ Please note that the kernel is allowed to use the kvm_run structure as the
> > primary storage for certain register types. Therefore, the kernel may use the
> > values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
> >
> > +::
> > +
> > + /* KVM_EXIT_VMGEXIT */
> > + struct kvm_user_vmgexit {
>
> LOL, it looks dumb, but maybe kvm_vmgexit_exit to avoid confusion about whether
> the struct refers to host userspace vs. guest userspace?
>
> Actually, I vote to punt on naming until more exits need to be kicked to userspace,
> and just do (see below for details on how I got here):
>
> 	/* KVM_EXIT_VMGEXIT */
> 	struct {
> 		__u64 exit_code;
> 		union {
> 			struct {
> 				__u64 data_gpa;
> 				__u64 data_npages;
> 				__u64 ret;
> 			} req_certs;
> 		};
> 	} vmgexit;
>
> > +#define KVM_USER_VMGEXIT_REQ_CERTS	1
> > +	__u32 type; /* KVM_USER_VMGEXIT_* type */
>
> Regardless of whether or not requesting a certificate is vendor specific enough
> to justify its own exit reason, I don't think KVM should have a #VMGEXIT that
> adds its own layer. Structuring the user exit this way will make it weird and/or
> difficult to handle #VMGEXITs that _do_ fit a generic pattern, e.g. a user might
> wonder why PSC #VMGEXITs don't show up here.
>
> And defining an exit reason that is, for all intents and purposes, a regurgitation
> of the raw #VMGEXIT reason, but with a different value, is also confusing. E.g.
> it wouldn't be unreasonable for a reader to expect that "type" matches the value
> defined in the GHCB (or wherever the values are defined).

The type in this case is actually "extended guest request". You'd rightly
pointed out that that is miles away from describing what KVM wants
userspace to do, so I named it "request certificate". And now with PSC being
handled as a separate KVM_HC_MAP_GPA_RANGE event with no exposure of GHCB/etc.
to userspace, it made further sense to not lean too heavily on the GHCB for
defining the types.

But continuing to name it KVM_EXIT_VMGEXIT sort of goes against that
decoupling, so I can see some potential for confusion there. KVM_EXIT_SNP is
probably a better generic name for what this exit is meant to cover. But I'm
not aware of anything specific that would require extending this in the
near-term, though maybe there's some potential with live migration. So
renaming to something more generic and less specific to VMGEXIT/GHCB, like
KVM_EXIT_SNP, or to something more specific like KVM_EXIT_SNP_REQ_CERTS, both
seem warranted, but I don't think moving to something more coupled to
VMGEXIT/GHCB would provide much benefit long-term.

>
> Ah, you copied what KVM does for Hyper-V and Xen emulation. Hrm. But only
> partially.
>
> Assuming it's impractical to have a generic user exit for this, and we think
> there is a high likelihood of needing to punt more #VMGEXITs to userspace, then
> we should more closely (perhaps even exactly) follow the Hyper-V and Xen models.
> I.e. for all values and whatnot that are controlled/defined by a third party
> (Hyper-V, Xen, the GHCB, etc.) #define those values in a header that is clearly
> "owned" by the third party.
>
> E.g. IIRC, include/xen/interface/xen.h is copied verbatim from Xen documentation
> (source?). And include/asm-generic/hyperv-tlfs.h is the kernel's copy of the
> TLFS, which dictates all of the Hyper-V hypercalls.
>
> If we do that, then my concerns/objections largely go away, e.g. KVM isn't
> defining magic values, there's less chance for confusion about what "type" holds,
> etc.
>
> Oh, and if we go that route, the sizes for all fields should follow the GHCB,
> e.g. I believe the "type" should be a __u64.
>
> > +	union {
> > +		struct {
> > +			__u64 data_gpa;
> > +			__u64 data_npages;
> > +#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_INVALID_LEN	1
> > +#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_BUSY		2
> > +#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_GENERIC	(1 << 31)
>
> Hopefully it won't matter, but are BUSY and GENERIC actually defined somewhere?
> I don't see them in GHCB 2.0.

BUSY is defined in 4.1.7:

It is not expected that a guest would issue many Guest Request NAE
events. However, access to the SNP firmware is a sequential and
synchronous operation. To avoid the possibility of a guest creating a
denial-of-service attack against the SNP firmware, it is recommended
that some form of rate limiting be implemented should it be detected
that a high number of Guest Request NAE events are being issued. To
allow for this, the hypervisor may set the SW_EXITINFO2 field to
0x0000000200000000, which will inform the guest to retry the request.

INVALID_LEN in 4.1.8.1:

The hypervisor must validate that the guest has supplied enough pages
to hold the certificates that will be returned before performing the SNP
guest request. If there are not enough guest pages to hold the certificate
table and certificate data, the hypervisor will return the required number
of pages needed to hold the certificate table and certificate data in the
RBX register and set the SW_EXITINFO2 field to 0x0000000100000000.

and GENERIC was chosen to provide a non-zero error code that doesn't
conflict with the above (or future) GHCB-defined values. But KVM isn't
trying to expose the actual GHCB details, like how these values are to be
placed in the upper 32 bits of SW_EXITINFO2; it just re-uses the values to
avoid purposefully obfuscating the GHCB return codes they relate to.
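For reference, these codes land in the upper 32 bits of SW_EXITINFO2; the
guest-side kernel carries roughly the following in sev-common.h:

/* VMM error codes occupy bits 63:32 of SW_EXITINFO2 (GHCB 2.0). */
#define SNP_GUEST_VMM_ERR_SHIFT		32
#define SNP_GUEST_VMM_ERR(x)		(((u64)(x)) << SNP_GUEST_VMM_ERR_SHIFT)

/* e.g. SNP_GUEST_VMM_ERR(1) == 0x0000000100000000 (INVALID_LEN),
 *      SNP_GUEST_VMM_ERR(2) == 0x0000000200000000 (BUSY) */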

>
> In a perfect world, it would be nice for KVM to not have to care about the error
> codes. But KVM disallows KVM_{G,S}ET_REGS for guest with protected state, which
> means it's not feasible for userspace to set registers, at least not in any sane
> way.
>
> Heh, we could abuse KVM_SYNC_X86_REGS to let userspace specify RBX, but (a) that's
> gross, and (b) KVM_SYNC_X86_REGS and KVM_SYNC_X86_SREGS really ought to be rejected
> if guest state is protected.
>
> > +			__u32 ret;
> > +#define KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE	BIT(0)
>
> This has no business being buried in the VMGEXIT_REQ_CERTS flags. Notifying
> userspace that KVM completed its portion of a userspace exit is completely generic.
>
> And aside from where the notification flag lives, _if_ we add a notification
> mechanism, it belongs in a separate patch, because it's purely a performance
> optimization. Userspace can use immediate_exit to force KVM to re-exit after
> completing an exit.
>
> Actually, I take that back, this isn't even an optimization, it's literally a
> non-generic implementation of kvm_run.immediate_exit.

Relying on a generic -EINTR response resulting from kvm_run.immediate_exit
doesn't seem like a very robust way to ensure the attestation request
was made to firmware. It seems fully possible that future code changes
could result in EINTR being returned for other reasons. So how do you
reliably detect that the EINTR is a result of immediate_exit being called
after the attestation request is made to firmware? We could squirrel something
away in struct kvm_run to probe for, but delivering another
KVM_EXIT_SNP_REQ_CERT with an extra flag set seems to be reasonably
userspace-friendly.
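For reference, the kvm_run.immediate_exit pattern in question looks roughly
like this from the userspace side (a sketch):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Force KVM to run the pending complete_userspace_io callback and
 * return without entering the guest. */
static int kick_completion(int vcpu_fd, struct kvm_run *run)
{
	int ret;

	run->immediate_exit = 1;
	ret = ioctl(vcpu_fd, KVM_RUN, 0);
	run->immediate_exit = 0;

	/*
	 * On -EINTR the completion callback (e.g. the firmware request) has
	 * run, but as noted above, -EINTR alone can't prove *why* KVM
	 * returned, which is the objection here.
	 */
	return (ret < 0 && errno == EINTR) ? 0 : -1;
}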

>
> If this were an optimization, i.e. KVM truly notified userspace without exiting,
> then it would need to be a lot more robust, e.g. to ensure userspace actually
> received the notification before KVM moved on.

Right, this does rely on an actual exit to userspace, not on userspace
polling for flags or anything along those lines.

>
> > +			__u8 flags;
> > +#define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING	0
> > +#define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE		1
>
> This is also a weird reimplementation of generic functionality. KVM nullifies
> vcpu->arch.complete_userspace_io _before_ invoking the callback. So if a callback
> needs to run again on the next KVM_RUN, it can simply set complete_userspace_io
> again. In other words, literally doing nothing will get you what you want :-)

We could just have the completion callback set complete_userspace_io
again, but then you'd always get 2 userspace exit events per attestation
request. There could be some userspaces that don't implement the
file-locking scheme, in which case they wouldn't need the 2nd notification.
That's why the KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE flag is provided
as an opt-in.

The pending/done status bits are so userspace can distinguish between the
start of a certificate request and the completion side of it, after it gets
bound to a completed attestation request and the filelock can be released.
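The opt-in flow being described, sketched from the userspace side (certs_fd
and the exit layout are the proposed/hypothetical pieces):

#include <sys/file.h>

static void handle_req_certs(struct kvm_run *run, int certs_fd)
{
	switch (run->vmgexit.req_certs.status) {
	case KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING:
		/* Pin the certs so they stay in sync with the reported TCB. */
		flock(certs_fd, LOCK_SH);
		/* ... copy the cert blob into data_gpa/data_npages ... */
		run->vmgexit.req_certs.flags |= KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE;
		break;
	case KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE:
		/* Attestation report is now bound; safe to update the certs. */
		flock(certs_fd, LOCK_UN);
		break;
	}
}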

Thanks,

Mike

>
> > +			__u8 status;
> > +		} req_certs;
> > +	};
> > +};

2024-05-14 08:11:02

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v15 22/23] KVM: SEV: Fix return code interpretation for RMP nested page faults

On May 10, 2024 6:59:37 PM GMT+02:00, Paolo Bonzini <[email protected]> wrote:
>Well, the merge window starts next sunday, doesn't it? If there's an -rc8 I agree there's some leeway, but that is not too likely.

Nah, the merge window just opened yesterday.

--
Sent from a small device: formatting sucks and brevity is inevitable.

2024-05-20 10:17:05

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v15 13/20] KVM: SEV: Implement gmem hook for initializing private pages

On Wed, 2024-05-01 at 03:52 -0500, Michael Roth wrote:
> This will handle the RMP table updates needed to put a page into a
> private state before mapping it into an SEV-SNP guest.
>
>

[...]

> +int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order)
> +{
> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +	kvm_pfn_t pfn_aligned;
> +	gfn_t gfn_aligned;
> +	int level, rc;
> +	bool assigned;
> +
> +	if (!sev_snp_guest(kvm))
> +		return 0;
> +
> +	rc = snp_lookup_rmpentry(pfn, &assigned, &level);
> +	if (rc) {
> +		pr_err_ratelimited("SEV: Failed to look up RMP entry: GFN %llx PFN %llx error %d\n",
> +				   gfn, pfn, rc);
> +		return -ENOENT;
> +	}
> +
> +	if (assigned) {
> +		pr_debug("%s: already assigned: gfn %llx pfn %llx max_order %d level %d\n",
> +			 __func__, gfn, pfn, max_order, level);
> +		return 0;
> +	}
> +
> +	if (is_large_rmp_possible(kvm, pfn, max_order)) {
> +		level = PG_LEVEL_2M;
> +		pfn_aligned = ALIGN_DOWN(pfn, PTRS_PER_PMD);
> +		gfn_aligned = ALIGN_DOWN(gfn, PTRS_PER_PMD);
> +	} else {
> +		level = PG_LEVEL_4K;
> +		pfn_aligned = pfn;
> +		gfn_aligned = gfn;
> +	}
> +
> +	rc = rmp_make_private(pfn_aligned, gfn_to_gpa(gfn_aligned), level, sev->asid, false);
> +	if (rc) {
> +		pr_err_ratelimited("SEV: Failed to update RMP entry: GFN %llx PFN %llx level %d error %d\n",
> +				   gfn, pfn, level, rc);
> +		return -EINVAL;
> +	}
> +
> +	pr_debug("%s: updated: gfn %llx pfn %llx pfn_aligned %llx max_order %d level %d\n",
> +		 __func__, gfn, pfn, pfn_aligned, max_order, level);
> +
> +	return 0;
> +}
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index b70556608e8d..60783e9f2ae8 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -5085,6 +5085,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>  	.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
>  	.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
>  	.alloc_apic_backing_page = svm_alloc_apic_backing_page,
> +
> +	.gmem_prepare = sev_gmem_prepare,
>  };
>
>

+Rick, Isaku,

I am wondering whether this can be done in the KVM page fault handler?

The reason that I am asking is KVM will introduce several new
kvm_x86_ops::xx_private_spte() ops for TDX to handle setting up the
private mapping, and I am wondering whether SNP can just reuse some of
them so we can avoid having this .gmem_prepare():

	/* Add a page as page table page into private page table */
	int (*link_private_spt)(struct kvm *kvm, gfn_t gfn,
				enum pg_level level, void *private_spt);
	/*
	 * Free a page table page of private page table.
	 * ...
	 */
	int (*free_private_spt)(struct kvm *kvm, gfn_t gfn,
				enum pg_level level, void *private_spt);

	/* Add a guest private page into private page table */
	int (*set_private_spte)(struct kvm *kvm, gfn_t gfn,
				enum pg_level level, kvm_pfn_t pfn);

	/* Remove a guest private page from private page table */
	int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn,
				   enum pg_level level, kvm_pfn_t pfn);
	/*
	 * Keep a guest private page mapped in private page table,
	 * but clear its present bit
	 */
	int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn,
				enum pg_level level);

The idea behind these is in the fault handler:

	bool use_private_pt = fault->is_private &&
			      kvm_use_private_pt(kvm);

	root_pt = use_private_pt ? mmu->private_root_hpa : mmu->root_hpa;

	tdp_mmu_for_each_pte(&iter, root_pt, gfn, gfn+1, ..) {

		if (use_private_pt)
			kvm_x86_ops->xx_private_spte();
		else
			// normal TDP MMU ops
	}

Which means: if the fault is for a private GPA, _AND_ the VM has a
separate private table, use the specific xx_private_spte() ops to handle
the private mapping.

But I am thinking we can use those hooks for SNP too, because
"conceptually", SNP also has a concept of "private GPA" and must at least
issue some command to update the RMP table when a private mapping is
set up or torn down.

So if we change the above logic to use fault->is_private, rather than
'use_private_pt', to decide whether to invoke the
kvm_x86_ops::xx_private_spte(), then we can also implement SNP commands in
those callbacks IIUC:

	if (fault->is_private && kvm_x86_ops::xx_private_spte())
		kvm_x86_ops::xx_private_spte();
	else
		// normal TDP MMU operation

For SNP, these callbacks will operate on the normal page table using the
normal TDP MMU code, but can do additional things like issuing commands as
shown in this patch.

My understanding is SNP doesn't need specific handling for mid-level page
tables, but should be able to utilize the ops when setting up /
tearing down the leaf SPTE?
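To make the idea concrete: a hypothetical SNP implementation of the sketched
.set_private_spte() hook could simply relocate the RMP update that
sev_gmem_prepare() performs above (names follow the sketch; this is not code
from either series):

/* Hypothetical: SNP backing for the proposed hook, reusing the RMP
 * update from sev_gmem_prepare() but driven from the fault path. */
static int snp_set_private_spte(struct kvm *kvm, gfn_t gfn,
				enum pg_level level, kvm_pfn_t pfn)
{
	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;

	return rmp_make_private(pfn, gfn_to_gpa(gfn), level, sev->asid, false);
}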

2024-05-20 17:35:31

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v15 13/20] KVM: SEV: Implement gmem hook for initializing private pages

On Mon, May 20, 2024, Kai Huang wrote:
> On Wed, 2024-05-01 at 03:52 -0500, Michael Roth wrote:
> > This will handle the RMP table updates needed to put a page into a
> > private state before mapping it into an SEV-SNP guest.
> >
> >
>
> [...]
>
> > +int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order)

...

> +Rick, Isaku,
>
> I am wondering whether this can be done in the KVM page fault handler?

No, because the state of a pfn in the RMP is tied to the guest_memfd inode, not
to the file descriptor, i.e. not to an individual VM. And the NPT page tables
are treated as ephemeral for SNP.

2024-05-20 19:14:57

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v15 13/20] KVM: SEV: Implement gmem hook for initializing private pages

On Mon, May 20, 2024 at 10:16:54AM +0000,
"Huang, Kai" <[email protected]> wrote:

> On Wed, 2024-05-01 at 03:52 -0500, Michael Roth wrote:
> > This will handle the RMP table updates needed to put a page into a
> > private state before mapping it into an SEV-SNP guest.
> >
> >
>
> [...]
>
> > +int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order)
> > +{
> > +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> > +	kvm_pfn_t pfn_aligned;
> > +	gfn_t gfn_aligned;
> > +	int level, rc;
> > +	bool assigned;
> > +
> > +	if (!sev_snp_guest(kvm))
> > +		return 0;
> > +
> > +	rc = snp_lookup_rmpentry(pfn, &assigned, &level);
> > +	if (rc) {
> > +		pr_err_ratelimited("SEV: Failed to look up RMP entry: GFN %llx PFN %llx error %d\n",
> > +				   gfn, pfn, rc);
> > +		return -ENOENT;
> > +	}
> > +
> > +	if (assigned) {
> > +		pr_debug("%s: already assigned: gfn %llx pfn %llx max_order %d level %d\n",
> > +			 __func__, gfn, pfn, max_order, level);
> > +		return 0;
> > +	}
> > +
> > +	if (is_large_rmp_possible(kvm, pfn, max_order)) {
> > +		level = PG_LEVEL_2M;
> > +		pfn_aligned = ALIGN_DOWN(pfn, PTRS_PER_PMD);
> > +		gfn_aligned = ALIGN_DOWN(gfn, PTRS_PER_PMD);
> > +	} else {
> > +		level = PG_LEVEL_4K;
> > +		pfn_aligned = pfn;
> > +		gfn_aligned = gfn;
> > +	}
> > +
> > +	rc = rmp_make_private(pfn_aligned, gfn_to_gpa(gfn_aligned), level, sev->asid, false);
> > +	if (rc) {
> > +		pr_err_ratelimited("SEV: Failed to update RMP entry: GFN %llx PFN %llx level %d error %d\n",
> > +				   gfn, pfn, level, rc);
> > +		return -EINVAL;
> > +	}
> > +
> > +	pr_debug("%s: updated: gfn %llx pfn %llx pfn_aligned %llx max_order %d level %d\n",
> > +		 __func__, gfn, pfn, pfn_aligned, max_order, level);
> > +
> > +	return 0;
> > +}
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index b70556608e8d..60783e9f2ae8 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -5085,6 +5085,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> >  	.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
> >  	.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
> >  	.alloc_apic_backing_page = svm_alloc_apic_backing_page,
> > +
> > +	.gmem_prepare = sev_gmem_prepare,
> >  };
> >
> >
>
> +Rick, Isaku,
>
> I am wondering whether this can be done in the KVM page fault handler?
>
> The reason that I am asking is KVM will introduce several new
> kvm_x86_ops::xx_private_spte() ops for TDX to handle setting up the
> private mapping, and I am wondering whether SNP can just reuse some of
> them so we can avoid having this .gmem_prepare():

Although I can't speak for the SNP folks, I guess those hooks don't make sense
for them. I guess they want to stay away from directly modifying the TDP MMU
by adding hooks to it. Instead, they intentionally chose to add hooks to
guest_memfd. Maybe it's possible for SNP to use those hooks, but what's the
benefit for SNP?

If you're looking for the benefit of allowing the TDP MMU hooks for the shared
page table, what about other VM types? SW_PROTECTED or future ones?
--
Isaku Yamahata <[email protected]>

2024-05-20 21:58:08

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v15 13/20] KVM: SEV: Implement gmem hook for initializing private pages



On 21/05/2024 5:35 am, Sean Christopherson wrote:
> On Mon, May 20, 2024, Kai Huang wrote:
>> On Wed, 2024-05-01 at 03:52 -0500, Michael Roth wrote:
>>> This will handle the RMP table updates needed to put a page into a
>>> private state before mapping it into an SEV-SNP guest.
>>>
>>>
>>
>> [...]
>>
>>> +int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order)
>
> ...
>
>> +Rick, Isaku,
>>
>> I am wondering whether this can be done in the KVM page fault handler?
>
> No, because the state of a pfn in the RMP is tied to the guest_memfd inode, not
> to the file descriptor, i.e. not to an individual VM.

It's strange that the state of a PFN for SNP doesn't bind to an individual
VM, at least for the private pages. The command rmp_make_private()
indeed reflects the mapping between PFN <-> <GFN, SSID>.

rc = rmp_make_private(pfn_aligned, gfn_to_gpa(gfn_aligned),
level, sev->asid, false);

> And the NPT page tables
> are treated as ephemeral for SNP.
>

Do you mean private mappings for SNP guest can be zapped from the VM
(the private pages are still there unchanged) and re-mapped later w/o
needing to have guest's explicit acceptance?

If so, I think "we can zap" doesn't mean "we need to zap"? Because the
privates are now pinned anyway. If we truly want to zap private
mappings for SNP, IIUC it can be done by distinguishing whether a VM
needs to use a separate private table, which is TDX-only.

I'll look into the SNP spec to understand more.

2024-05-20 23:15:48

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v15 13/20] KVM: SEV: Implement gmem hook for initializing private pages

On Tue, May 21, 2024, Kai Huang wrote:
> On 21/05/2024 5:35 am, Sean Christopherson wrote:
> > On Mon, May 20, 2024, Kai Huang wrote:
> > > I am wondering whether this can be done in the KVM page fault handler?
> >
> > No, because the state of a pfn in the RMP is tied to the guest_memfd inode,
> > not to the file descriptor, i.e. not to an individual VM.
>
> It's strange that the state of a PFN for SNP doesn't bind to an individual VM, at
> least for the private pages. The command rmp_make_private() indeed reflects
> the mapping between PFN <-> <GFN, SSID>.

s/SSID/ASID

KVM allows a single ASID to be bound to multiple "struct kvm" instances, e.g.
for intra-host migration. If/when trusted I/O is a thing, presumably KVM will
also need to share the ASID with other entities, e.g. IOMMUFD.

> rc = rmp_make_private(pfn_aligned, gfn_to_gpa(gfn_aligned),
> level, sev->asid, false);
>
> > And the NPT page tables are treated as ephemeral for SNP.
>
> Do you mean private mappings for SNP guest can be zapped from the VM (the
> private pages are still there unchanged) and re-mapped later w/o needing to
> have guest's explicit acceptance?

Correct.

> If so, I think "we can zap" doesn't mean "we need to zap"?

Correct.

> Because the privates are now pinned anyway.

Pinning is an orthogonal issue. And it's not so much that the pfns are pinned
as it is that guest_memfd simply doesn't support page migration or swap at this
time.

Regardless of whether or not guest_memfd supports page migration, KVM needs to
track the state of the physical page in guest_memfd, e.g. if it's been assigned
to the ASID versus if it's still in a shared state.

> If we truly want to zap private mappings for SNP, IIUC it can be done by
> distinguishing whether a VM needs to use a separate private table, which is
> TDX-only.

I wouldn't say we "want" to zap private mappings for SNP, rather that it's a lot
less work to keep KVM's existing behavior (literally do nothing) than it is to
rework the MMU and whatnot to not zap SPTEs. And there's no big motivation to
avoid zapping because SNP VMs are unlikely to delete memslots.

If it turns out that it's easy to preserve SNP mappings after TDX lands, then we
can certainly go that route, but AFAIK there's no reason to force the issue.

2024-05-20 23:42:07

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v15 13/20] KVM: SEV: Implement gmem hook for initializing private pages



On 21/05/2024 11:15 am, Sean Christopherson wrote:
> On Tue, May 21, 2024, Kai Huang wrote:
>> On 21/05/2024 5:35 am, Sean Christopherson wrote:
>>> On Mon, May 20, 2024, Kai Huang wrote:
>>>> I am wondering whether this can be done in the KVM page fault handler?
>>>
>>> No, because the state of a pfn in the RMP is tied to the guest_memfd inode,
>>> not to the file descriptor, i.e. not to an individual VM.
>>
>> It's strange that the state of a PFN for SNP doesn't bind to an individual VM, at
>> least for the private pages. The command rmp_make_private() indeed reflects
>> the mapping between PFN <-> <GFN, SSID>.
>
> s/SSID/ASID
>
> KVM allows a single ASID to be bound to multiple "struct kvm" instances, e.g.
> for intra-host migration. If/when trusted I/O is a thing, presumably KVM will
> also need to share the ASID with other entities, e.g. IOMMUFD.

But is this the case for SNP? I thought due to the nature of private
pages, they cannot be shared between VMs? So to me this RMP entry
mapping for PFN <-> GFN for private page should just be per-VM.

>
>> rc = rmp_make_private(pfn_aligned, gfn_to_gpa(gfn_aligned),
>> level, sev->asid, false);
>>
>>> And the NPT page tables are treated as ephemeral for SNP.
>>
>> Do you mean private mappings for SNP guest can be zapped from the VM (the
>> private pages are still there unchanged) and re-mapped later w/o needing to
>> have guest's explicit acceptance?
>
> Correct.
>
>> If so, I think "we can zap" doesn't mean "we need to zap"?
>
> Correct.
>
>> Because the privates are now pinned anyway.
>
> Pinning is an orthogonal issue. And it's not so much that the pfns are pinned
> as it is that guest_memfd simply doesn't support page migration or swap at this
> time.

Yes.

>
> Regardless of whether or not guest_memfd supports page migration, KVM needs to
> track the state of the physical page in guest_memfd, e.g. if it's been assigned
> to the ASID versus if it's still in a shared state.

I am not certain this can impact whether we want to do RMP commands via
guest_memfd() hooks or TDP MMU hooks?

>
>> If we truly want to zap private mappings for SNP, IIUC it can be done by
>> distinguishing whether a VM needs to use a separate private table, which is
>> TDX-only.
>
> I wouldn't say we "want" to zap private mappings for SNP, rather that it's a lot
> less work to keep KVM's existing behavior (literally do nothing) than it is to
> rework the MMU and whatnot to not zap SPTEs.

My thinking too.

> And there's no big motivation to
> avoid zapping because SNP VMs are unlikely to delete memslots.

I think we should also consider MMU notifier?

>
> If it turns out that it's easy to preserve SNP mappings after TDX lands, then we
> can certainly go that route, but AFAIK there's no reason to force the issue.

No I am certainly not saying we should do SNP after TDX. Sorry I didn't
closely monitor the status of this SNP patchset.

My intention is just to make the TDP MMU common code change more
useful (since we need that for TDX anyway), i.e., not effectively just
for TDX if possible:

Currently the TDP MMU hooks are called depending on whether the page table
type is private (or mirrored, whatever), but I think conceptually we
should decide whether to call TDP MMU hooks based on whether the faulting
GPA is private, _AND_ whether the hook is available.

https://lore.kernel.org/lkml/[email protected]/

If invoking SNP RMP commands is feasible in TDP MMU hooks, then I think
there's value in letting SNP code use them too. And we can simply
split one patch out to only add the TDP MMU hooks for SNP to land first.

2024-05-21 00:30:31

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v15 13/20] KVM: SEV: Implement gmem hook for initializing private pages

On Tue, May 21, 2024, Kai Huang wrote:
> On 21/05/2024 11:15 am, Sean Christopherson wrote:
> > On Tue, May 21, 2024, Kai Huang wrote:
> > > On 21/05/2024 5:35 am, Sean Christopherson wrote:
> > > > On Mon, May 20, 2024, Kai Huang wrote:
> > > > > I am wondering whether this can be done in the KVM page fault handler?
> > > >
> > > > No, because the state of a pfn in the RMP is tied to the guest_memfd inode,
> > > > not to the file descriptor, i.e. not to an individual VM.
> > >
> > > It's strange that the state of a PFN for SNP doesn't bind to an individual VM, at
> > > least for the private pages. The command rmp_make_private() indeed reflects
> > > the mapping between PFN <-> <GFN, SSID>.
> >
> > s/SSID/ASID
> >
> > KVM allows a single ASID to be bound to multiple "struct kvm" instances, e.g.
> > for intra-host migration. If/when trusted I/O is a thing, presumably KVM will
> > also need to share the ASID with other entities, e.g. IOMMUFD.
>
> But is this the case for SNP? I thought due to the nature of private pages,
> they cannot be shared between VMs? So to me this RMP entry mapping for PFN
> <-> GFN for private page should just be per-VM.

Sorry to redirect, but please read this mail (and probably the surrounding
mails). It hopefully answers most of the questions you have.

https://lore.kernel.org/all/[email protected]

> > Regardless of whether or not guest_memfd supports page migration, KVM needs to
> > track the state of the physical page in guest_memfd, e.g. if it's been assigned
> > to the ASID versus if it's still in a shared state.
>
> I am not certain this can impact whether we want to do RMP commands via
> guest_memfd() hooks or TDP MMU hooks?
>
> > > If we truly want to zap private mappings for SNP, IIUC it can be done by
> > > distinguishing whether a VM needs to use a separate private table, which is
> > > TDX-only.
> >
> > I wouldn't say we "want" to zap private mappings for SNP, rather that it's a lot
> > less work to keep KVM's existing behavior (literally do nothing) than it is to
> > rework the MMU and whatnot to not zap SPTEs.
>
> My thinking too.
>
> > And there's no big motivation to avoid zapping because SNP VMs are unlikely
> > to delete memslots.
>
> I think we should also consider MMU notifier?

No, private mappings have no host userspace mappings, i.e. are completely exempt
from MMU notifier events. guest_memfd() can still invalidate mappings, but that
only occurs if userspace punches a hole, which is destructive.
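The destructive invalidation mentioned here is a plain hole punch on the
guest_memfd file descriptor, e.g. (a sketch):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/* Discard a range of guest-private memory backing a gmem fd; this frees
 * the pages and triggers the destructive gmem invalidation, with the
 * cleanup hooks from earlier in the series unassigning them in the RMP. */
static int gmem_discard(int gmem_fd, off_t offset, off_t len)
{
	return fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}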

> > If it turns out that it's easy to preserve SNP mappings after TDX lands, then we
> > can certainly go that route, but AFAIK there's no reason to force the issue.
>
> No I am certainly not saying we should do SNP after TDX. Sorry I didn't
> closely monitor the status of this SNP patchset.
>
> My intention is just to make the TDP MMU common code change more
> useful (since we need that for TDX anyway), i.e., not effectively just for
> TDX if possible:
>
> Currently the TDP MMU hooks are called depending on whether the page table
> type is private (or mirrored, whatever), but I think conceptually we should
> decide whether to call TDP MMU hooks based on whether the faulting GPA is
> private, _AND_ whether the hook is available.
>
> https://lore.kernel.org/lkml/[email protected]/
>
> If invoking SNP RMP commands is feasible in TDP MMU hooks,

Feasible? Yes. Desirable? No. Either KVM tracks the state of the physical page
using the guest_memfd inode, or KVM _guarantees_ the NPT mappings _never_ get
dropped, including during intra-host migration. E.g. to support intra-host
migration of TDX VMs, KVM is pretty much forced to transfer the S-EPT tables as-is,
which is ugly and painful (though performant). We could do the same for NPT, but
there would need to be massive performance benefits to justify the complexity.

> then I think there's value in letting SNP code use them too. And we can
> simply split one patch out to only add the TDP MMU hooks for SNP to land
> first.