This patchset is also available at:
https://github.com/amdese/linux/commits/upmv10-host-snp-v8-rfc
and is based on top of the following tree:
https://github.com/mdroth/linux/commits/upm_base_support_fixes
which in turn is based on Sean Christopherson's UPM base support tree,
with some fixes/workarounds needed for SEV/SNP support.[1]
== OVERVIEW ==
This version is being posted as an RFC due to fairly extensive changes
relating to transitioning the SEV-SNP implementation to using
restricted/private memslots (aka Unmapped Private Memory) to manage
private guest pages instead of the legacy SEV memory registration ioctls.
Alongside that work we've also been investigating leveraging UPM to
implement lazy-pinning support for SEV guests, rather than the legacy
SEV memory registration ioctls which rely on pinning everything in
advance.
For both of these SEV and SEV-SNP use-cases we've needed to add a
number of hooks into the restrictedmem implementation, so we thought it
would be useful for this version at least to include both UPM-based SEV
and SNP implementations, so reviewers can see whether these hooks might
be needed for other archs/platforms and start consolidating around
whether/how they should be defined for general usage. There are still
some TODOs in this area, but we hope this implementation is complete
enough to at least outline the required additions needed for using UPM
for these use-cases.
Outside of UPM-related items, we've also included fairly extensive changes
based on review feedback from v6/v7 and would appreciate any feedback on
those aspects as well.
== LAYOUT ==
PATCH 01-03: pre-patches that add the UPM hooks and KVM capability needed
to switch between UPM and legacy SEV memory registration.
PATCH 04-09: implement SEV lazy-pinning using UPM to manage private memory
PATCH 10-28: general SNP detection/enablement for host and CCP driver
PATCH 29-56: SNP hypervisor support
== TESTING (note updated QEMU command-lines) ==
For testing this via QEMU, use the following tree:
https://github.com/amdese/qemu/commits/upmv10b-snpv3-wip
SEV-SNP with UPM:
qemu-system-x86_64 -cpu EPYC-Milan-v2 \
-object memory-backend-memfd-private,id=ram1,size=1G,share=true \
-object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1 \
-machine q35,confidential-guest-support=sev0,memory-backend=ram1,kvm-type=protected \
...
SEV with UPM (requires patched OVMF[2]):
qemu-system-x86_64 -cpu EPYC-Milan-v2 \
-object memory-backend-memfd-private,id=ram1,size=1G,share=true \
-object sev-guest,id=sev0,cbitpos=51,reduced-phys-bits=1 \
-machine q35,confidential-guest-support=sev0,memory-backend=ram1,kvm-type=protected \
...
KVM selftests for UPM:
cd $kernel_src_dir
make -C tools/testing/selftests TARGETS="kvm" EXTRA_CFLAGS="-DDEBUG"
sudo tools/testing/selftests/kvm/x86_64/private_mem_conversions_test
== BACKGROUND (SEV-SNP) ==
This part of the Secure Nested Paging (SEV-SNP) series focuses on the
changes required in a host OS for SEV-SNP support. The series builds upon
the SEV-SNP guest support that is now part of mainline.
This series provides the basic building blocks to support booting SEV-SNP
VMs; it does not cover all of the security enhancements introduced by
SEV-SNP, such as interrupt protection.
The CCP driver is enhanced to provide new APIs that use the SEV-SNP
specific commands defined in the SEV-SNP firmware specification. The KVM
driver uses those APIs to create and manage SEV-SNP guests.
The GHCB specification version 2 introduces a new set of NAE events that
are used by the SEV-SNP guest to communicate with the hypervisor. The
series provides support to handle the following new NAE events:
- Register GHCB GPA
- Page State Change Request
- Hypervisor feature
- Guest message request
The RMP check is enforced as soon as SEV-SNP is enabled. Not every memory
access requires an RMP check. In particular, read accesses from the
hypervisor do not require RMP checks because data confidentiality is
already protected via memory encryption. When hardware encounters an RMP
check failure, it raises a page-fault exception. If the RMP check failure
is due to a page-size mismatch, the large page is split to resolve the
fault.
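As a rough illustration of that host-side flow (a simplified sketch based
on the handle_user_rmp_page_fault() helper added later in this series,
not verbatim kernel code):

  int rmp_level, level;              /* level = host page-table level    */
  u64 pfn = pte_pfn(*pte);           /* pfn backing the faulting address */

  if (snp_lookup_rmpentry(pfn, &rmp_level)) {
          /* page is guest-owned: fault can't be resolved, SIGBUS */
          return RMP_PF_RETRY;
  }

  if (level > rmp_level)
          /* host mapping is 2M/1G but RMP entry is 4K: split the page */
          return RMP_PF_SPLIT;

  return RMP_PF_RETRY;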
The series does not provide support for interrupt security or migration;
those features will be added on top of the base support.
== BACKGROUND (SEV Lazy-pinning) ==
The current implementation of SEV relies on a KVM_MEM_ENCRYPT_REG_REGION
ioctl to pin all pages in advance, since they may be used as private pages
by the guest. Previous iterations of SNP also relied on this interface.
With UPM however, the initial shared pages are not converted to private;
instead, private pages are allocated from a separate restrictedmem
backend which, by design, handles the pinning internally upon
allocation. That allocation generally occurs when the guest faults the
page in for the first time.
This provides all the necessary characteristics to support lazy-pinning
to improve boot times for SEV guests, as well as a bare-minimum
implementation of a UPM-enabled guest that can provide basic
infrastructure and testing flexibility that SNP can then build on, which
is why it's included with this series for now.
It should be noted that SEV guests don't have a means to issue explicit
page state changes like SNP guests can via GHCB page-state change
requests. They do however need a way to inform the host of shared pages
so that non-restricted memory can be used. This is done via the
MAP_GPA_RANGE KVM hypercall interface, which was introduced in guest
kernels to inform the host of shared pages to support live migration.
This allows support for existing guest kernels; however, OVMF is still
lacking enablement for MAP_GPA_RANGE hypercalls, which is why a patched
version is needed there.
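For reference, a guest kernel marks a range as shared with something
along the lines of the following (illustrative only; KVM_HC_MAP_GPA_RANGE
and its attribute flags come from the existing guest-side hypercall
support, not from this series):

  /* Tell the host that 'npages' pages starting at 'gpa' are now shared
   * (decrypted), so it can back them with non-restricted memory. */
  kvm_hypercall3(KVM_HC_MAP_GPA_RANGE, gpa, npages,
                 KVM_MAP_GPA_RANGE_DECRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);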
== TODO / KNOWN ISSUES ==
* Failures in the rmpupdate() helper have been observed when running many
concurrent SNP guests while THP is disabled for restricted/private guest
pages. This has not been observed when running with THP enabled via
`echo always >/sys/kernel/mm/transparent_hugepage/shmem_enabled`
* CCP driver init ordering considerations (Jarkko)
* Reclaim firmware pages in case of SNP command failure (Alper)
* Return 0 to userspace when generating KVM_EXIT_VMGEXIT (Tom)
* Rework interfaces for issuing SNP guest requests to firmware (Alexey)
* Incorporate guest request throttling patches from Dionna
* Incorporate suggested updates for instant certificate support (Dionna)
[1] https://github.com/mdroth/edk2/commits/upmv8-seves-v1
[2] https://lore.kernel.org/lkml/[email protected]/
Changes since v7:
* Rebase to Sean's updated UPM base support tree
* Drop KVM_CAP_UNMAPPED_MEMORY and .private_mem_enabled x86 op in favor
of kvm_arch_has_private_mem() and vm_type KVM_VM_CREATE arg
* Drop GHCB map/unmap refactoring and post map/unmap hooks as they are no
longer needed with UPM
* Move .fault_is_private implementation to SNP patch range, no longer
needed for SEV.
* Don't call attribute update / invalidation hooks under kvm->mmu_lock
(Tom, Jarkko)
* Revert switch to using set_memory_p()/set_memory_np() in rmpupdate() due
to it causing performance regression
* Commit fixups for 'fault_is_private'/'update_mem_attr' hooks, have
'fault_is_private' return bool (Boris)
* Split kvm_vm_set_region_attr() into separate patch. (Jarkko)
* Copy corrected CPUID page to userspace when firmware rejects it (Tom,
Jarkko)
* Fix sev_dump_rmpentry() error-handling (Alper)
* Use adjusted cmd_buf pointer rather than sev->cmd_buf directly (Alper)
* Correct typo in SNP_GET_EXT_CONFIG documentation (Dov)
* Update struct kvm_sev_snp_launch_finish definition in
amd-memory-encryption.rst (Tom)
* Fix snp_launch_update_vmsa replacing created_vcpus with online_vcpus
* Fix SNP_DBG_DECRYPT to not include len parameter.
* Fix SNP_LAUNCH_FINISH to copy host-data from userspace
Changes since v6:
* Added support for restrictedmem/UPM, and removed SEV-specific
implementation of private memory management. As a result of this rework
the following patches were no longer needed so were dropped:
- KVM: SVM: Mark the private vma unmergable for SEV-SNP guests
- KVM: SVM: Disallow registering memory range from HugeTLB for SNP guest
- KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX and SNP
- KVM: x86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use
* Moved the RMP table entry structure definition (struct rmpentry)
to sev.c so that this non-architectural definition is not exposed to
the rest of the kernel, making the structure private to the SNP code.
Also made the RMP table entry accessors inline functions and
removed all accessors which are not called more than once.
Added a new function rmptable_entry() to index into the RMP table
and return an RMP table entry.
* Moved the RMPUPDATE and PSMASH helper function declarations from the
linux include namespace to the x86 arch-specific include namespace.
Added comments for these helper functions.
* Introduced set_memory_p() to provide a way to change the attributes of
a memory range so it is marked present and added back to the kernel
direct map; invalidating/restoring pages in the direct map is now done
using set_memory_np() and set_memory_p().
* Added detailed comments around user RMP #PF fault handling and
simplified computation of the faulting pfn for large-pages.
* Added support to return the pfn from dump_pagetable() to do
SEV-specific fault handling; this is added as a pre-patch. This support
is now used to dump the RMP entry in case of an RMP #PF in
show_fault_oops().
* Added a new generic SNP command params structure sev_data_snp_addr,
which is used for all SNP firmware API commands requiring a
single physical address parameter.
* Added support for new SNP_INIT_EX command with support for HV-Fixed
page range list.
* Added support for new SNP_SHUTDOWN_EX command which allows
disabling enforcement of SNP in the IOMMU. Also DF_FLUSH is done
at SNP shutdown if it indicates DF_FLUSH is required.
* Make sev_do_cmd() a generic API interface for the hypervisor
to issue commands to manage an SEV and SNP guest. Also removed
the API wrappers used by the hypervisor to manage an SEV-SNP guest.
All these APIs now invoke sev_do_cmd() directly.
* Introduced an SNP leaked-pages list. Pages that are unsafe to release
back to the page allocator, because they can't be reclaimed or
transitioned back to the hypervisor/shared state, are now added to this
internal leaked-pages list to prevent fatal page faults when accessing
these pages. The function snp_leak_pages() is renamed to
snp_mark_pages_offline() and is an external function available to both
the CCP driver and the SNP hypervisor code. Removed the call to
memory_failure() when leaking/marking pages offline.
* Remove snp_set_rmp_state() multiplexor code and add new separate
helpers such as rmp_mark_pages_firmware() & rmp_mark_pages_shared().
The callers now issue snp_reclaim_pages() directly when needed as
done by __snp_free_firmware_pages() and unmap_firmware_writeable().
All callers of snp_set_rmp_state() modified to call helpers
rmp_mark_pages_firmware() or rmp_mark_pages_shared() as required.
* Change snp_reclaim_pages() to take physical address as an argument
and clear C-bit from this physical address argument internally.
* Output parameter sev_user_data_ext_snp_config in sev_ioctl_snp_get_config()
is memset to zero to avoid leaking kernel memory.
* Prevent race between sev_ioctl_snp_set_config() and
snp_guest_ext_guest_request() for sev->snp_certs_data by acquiring
sev->snp_certs_lock mutex.
* Zeroed out struct sev_user_data_snp_config in
sev_ioctl_snp_set_config() to prevent leaking uninitialized
kernel memory.
* Optimized snp_safe_alloc_page() by avoiding multiple calls to
pfn_to_page() and checking for a hugepage using pfn instead of
expanding to full physical address.
* Invoke host_rmp_make_shared() with leak parameter set to true
if VMSA page cannot be transitioned back to shared state.
* Fix snp_launch_finish() to always send the ID_AUTH struct to
the firmware. Use the params.auth_key_en indicator to set
whether the ID_AUTH struct contains an author key or not.
* Cleanup snp_context_create() and allocate certs_data in this
function using kzalloc() to prevent giving the guest
uninitialized kernel memory.
* Remove the check for guest supplied buffer greater than the data
provided by the hypervisor in snp_handle_ext_guest_request().
* Add a check in sev_snp_ap_create() for the case where a malicious
guest RMPADJUSTs a large page into a VMSA, which would hit the SNP
erratum where the CPU incorrectly signals an RMP violation #PF if a
hugepage collides with the RMP entry of a VMSA page; reject the
AP CREATE request if the VMSA address from the guest is 2M-aligned.
* Make the VMSAVE target area memory allocation SNP safe; implemented a
workaround for an SNP erratum where the CPU will incorrectly signal
an RMP violation #PF if a hugepage (2MB or 1GB) collides with the
RMP entry of the VMSAVE target page.
* Fix handle_split_page_fault() to work with memfd backed pages.
* Add KVM commands for per-VM instance certificates.
* Add IOMMU_SNP_SHUTDOWN support; this adds support for host kexec
with SNP enabled.
----------------------------------------------------------------
Ashish Kalra (4):
x86/fault: Return pfn from dump_pagetable() for SEV-specific fault handling.
crypto: ccp: Introduce snp leaked pages list
KVM: SVM: Make VMSAVE target area memory allocation SNP safe
iommu/amd: Add IOMMU_SNP_SHUTDOWN support
Brijesh Singh (32):
x86/cpufeatures: Add SEV-SNP CPU feature
x86/sev: Add the host SEV-SNP initialization support
x86/sev: Add RMP entry lookup helpers
x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction
x86/sev: Invalidate pages from the direct map when adding them to the RMP table
x86/traps: Define RMP violation #PF error code
x86/fault: Add support to handle the RMP fault for user address
crypto:ccp: Define the SEV-SNP commands
crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP
crypto:ccp: Provide API to issue SEV and SNP commands
crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
crypto: ccp: Handle the legacy SEV command when SNP is enabled
crypto: ccp: Add the SNP_PLATFORM_STATUS command
crypto: ccp: Add the SNP_{SET,GET}_EXT_CONFIG command
crypto: ccp: Provide APIs to query extended attestation report
KVM: SVM: Provide the Hypervisor Feature support VMGEXIT
KVM: SVM: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
KVM: SVM: Add initial SEV-SNP support
KVM: SVM: Add KVM_SNP_INIT command
KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command
KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command
KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command
KVM: X86: Keep the NPT and RMP page level in sync
KVM: x86: Define RMP page fault error bits for #NPF
KVM: SVM: Add support to handle GHCB GPA register VMGEXIT
KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT
KVM: SVM: Add support to handle Page State Change VMGEXIT
KVM: x86: Export the kvm_zap_gfn_range() for the SNP use
KVM: SVM: Add support to handle the RMP nested page fault
KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event
KVM: SVM: Add module parameter to enable the SEV-SNP
ccp: Add support to decrypt the page
Dionna Glaze (2):
x86/sev: Add KVM commands for instance certs
x86/sev: Document KVM_SEV_SNP_{G,S}ET_CERTS
Hugh Dickins (1):
x86/fault: fix handle_split_page_fault() to work with memfd backed pages
Michael Roth (10):
KVM: x86: Add 'fault_is_private' x86 op
KVM: x86: Add 'update_mem_attr' x86 op
KVM: x86: Add platform hooks for private memory invalidations
KVM: SEV: Require KVM_PROTECTED_VM when AMD_MEM_ENCRYPT is enabled
KVM: Split out memory attribute xarray updates to helper function
x86/fault: Add helper for dumping RMP entries
KVM: SVM: Add KVM_EXIT_VMGEXIT
KVM: SVM: Add SNP-specific handling for memory attribute updates
KVM: SVM: Implement .fault_is_private callback for SNP
KVM: SEV: Handle restricted memory invalidations for SNP
Nikunj A Dadhania (2):
KVM: SEV: Rename sev_{pin,unpin}_memory
KVM: SEV: Handle memory backed by restricted memfd
Tom Lendacky (3):
KVM: SVM: Add support to handle AP reset MSR protocol
KVM: SVM: Use a VMSA physical address variable for populating VMCB
KVM: SVM: Support SEV-SNP AP Creation NAE event
Vishal Annapurve (2):
KVM: Add HVA range operator
KVM: SEV: Populate private memory fd during LAUNCH_UPDATE_DATA
Documentation/virt/coco/sev-guest.rst | 54 +
.../virt/kvm/x86/amd-memory-encryption.rst | 147 ++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/kvm-x86-ops.h | 5 +
arch/x86/include/asm/kvm_host.h | 19 +
arch/x86/include/asm/msr-index.h | 11 +-
arch/x86/include/asm/sev-common.h | 28 +
arch/x86/include/asm/sev.h | 28 +
arch/x86/include/asm/svm.h | 6 +
arch/x86/include/asm/trap_pf.h | 18 +-
arch/x86/kernel/cpu/amd.c | 5 +-
arch/x86/kernel/sev.c | 472 +++++
arch/x86/kvm/lapic.c | 5 +-
arch/x86/kvm/mmu.h | 2 -
arch/x86/kvm/mmu/mmu.c | 31 +-
arch/x86/kvm/mmu/mmu_internal.h | 37 +-
arch/x86/kvm/svm/sev.c | 2089 +++++++++++++++++---
arch/x86/kvm/svm/svm.c | 55 +-
arch/x86/kvm/svm/svm.h | 43 +-
arch/x86/kvm/trace.h | 34 +
arch/x86/kvm/x86.c | 10 +
arch/x86/mm/fault.c | 118 +-
drivers/crypto/ccp/sev-dev.c | 1054 +++++++++-
drivers/crypto/ccp/sev-dev.h | 18 +
drivers/iommu/amd/init.c | 53 +
include/linux/amd-iommu.h | 1 +
include/linux/kvm_host.h | 14 +
include/linux/mm.h | 3 +-
include/linux/mm_types.h | 3 +
include/linux/psp-sev.h | 351 +++-
include/uapi/linux/kvm.h | 74 +
include/uapi/linux/psp-sev.h | 62 +
mm/memory.c | 15 +
mm/restrictedmem.c | 12 +-
tools/arch/x86/include/asm/cpufeatures.h | 1 +
virt/kvm/kvm_main.c | 117 +-
38 files changed, 4714 insertions(+), 291 deletions(-)
From: Brijesh Singh <[email protected]>
The memory integrity guarantees of SEV-SNP are enforced through a new
structure called the Reverse Map Table (RMP). The RMP is a single data
structure shared across the system that contains one entry for every 4K
page of DRAM that may be used by SEV-SNP VMs. The goal of RMP is to
track the owner of each page of memory. Pages of memory can be owned by
the hypervisor, owned by a specific VM or owned by the AMD-SP. See APM2
section 15.36.3 for more detail on RMP.
The RMP table is used to enforce access control to memory. The table
itself is not directly writable by the software. New CPU instructions
(RMPUPDATE, PVALIDATE, RMPADJUST) are used to manipulate the RMP
entries.
Based on the platform configuration, the BIOS reserves the memory used
for the RMP table. The start and end addresses of the RMP table must be
queried by reading the RMP_BASE and RMP_END MSRs. If RMP_BASE and
RMP_END are not set, then the SEV-SNP feature is disabled.
The SEV-SNP feature is enabled only after the RMP table is successfully
initialized.
Also set SYSCFG.MFDM when enabling SNP, as SEV-SNP firmware >= 1.51
requires that SYSCFG.MFDM be set.
The RMP table entry format is not architectural; it can vary by processor
and is defined by the PPR. Restrict SNP support to the known CPU models
and families for which the RMP table entry format is currently defined.
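As a worked example of the sizing check done below: each 4K page of
system RAM needs a 16-byte RMP entry, plus the 16KB bookkeeping area the
processor keeps at RMP_BASE. So for a machine with 1 TiB of RAM,
1 TiB / 4 KiB = 268,435,456 pages * 16 bytes = 4 GiB, meaning the BIOS
must reserve roughly 4 GiB (plus the bookkeeping area and the entries
covering the RMP table itself), otherwise SNP is disabled.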
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/msr-index.h | 11 +-
arch/x86/kernel/sev.c | 175 +++++++++++++++++++++++
3 files changed, 192 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 33d2cd04d254..9b5a2cc8064a 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -87,6 +87,12 @@
# define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
#endif
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+# define DISABLE_SEV_SNP 0
+#else
+# define DISABLE_SEV_SNP (1 << (X86_FEATURE_SEV_SNP & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -110,7 +116,7 @@
DISABLE_ENQCMD)
#define DISABLED_MASK17 0
#define DISABLED_MASK18 0
-#define DISABLED_MASK19 0
+#define DISABLED_MASK19 (DISABLE_SEV_SNP)
#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 20)
#endif /* _ASM_X86_DISABLED_FEATURES_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 10ac52705892..35100c630617 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -565,6 +565,8 @@
#define MSR_AMD64_SEV_ENABLED BIT_ULL(MSR_AMD64_SEV_ENABLED_BIT)
#define MSR_AMD64_SEV_ES_ENABLED BIT_ULL(MSR_AMD64_SEV_ES_ENABLED_BIT)
#define MSR_AMD64_SEV_SNP_ENABLED BIT_ULL(MSR_AMD64_SEV_SNP_ENABLED_BIT)
+#define MSR_AMD64_RMP_BASE 0xc0010132
+#define MSR_AMD64_RMP_END 0xc0010133
#define MSR_AMD64_VIRT_SPEC_CTRL 0xc001011f
@@ -649,7 +651,14 @@
#define MSR_K8_TOP_MEM2 0xc001001d
#define MSR_AMD64_SYSCFG 0xc0010010
#define MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT 23
-#define MSR_AMD64_SYSCFG_MEM_ENCRYPT BIT_ULL(MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT)
+#define MSR_AMD64_SYSCFG_MEM_ENCRYPT BIT_ULL(MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT)
+#define MSR_AMD64_SYSCFG_SNP_EN_BIT 24
+#define MSR_AMD64_SYSCFG_SNP_EN BIT_ULL(MSR_AMD64_SYSCFG_SNP_EN_BIT)
+#define MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT 25
+#define MSR_AMD64_SYSCFG_SNP_VMPL_EN BIT_ULL(MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT)
+#define MSR_AMD64_SYSCFG_MFDM_BIT 19
+#define MSR_AMD64_SYSCFG_MFDM BIT_ULL(MSR_AMD64_SYSCFG_MFDM_BIT)
+
#define MSR_K8_INT_PENDING_MSG 0xc0010055
/* C1E active bits in int pending message */
#define K8_INTP_C1E_ACTIVE_MASK 0x18000000
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index a428c62330d3..e54e412c9916 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -22,6 +22,9 @@
#include <linux/efi.h>
#include <linux/platform_device.h>
#include <linux/io.h>
+#include <linux/cpumask.h>
+#include <linux/iommu.h>
+#include <linux/amd-iommu.h>
#include <asm/cpu_entry_area.h>
#include <asm/stacktrace.h>
@@ -38,6 +41,7 @@
#include <asm/apic.h>
#include <asm/cpuid.h>
#include <asm/cmdline.h>
+#include <asm/iommu.h>
#define DR7_RESET_VALUE 0x400
@@ -57,6 +61,12 @@
#define AP_INIT_CR0_DEFAULT 0x60000010
#define AP_INIT_MXCSR_DEFAULT 0x1f80
+/*
+ * The first 16KB from the RMP_BASE is used by the processor for the
+ * bookkeeping, the range needs to be added during the RMP entry lookup.
+ */
+#define RMPTABLE_CPU_BOOKKEEPING_SZ 0x4000
+
/* For early boot hypervisor communication in SEV-ES enabled guests */
static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
@@ -69,6 +79,9 @@ static struct ghcb *boot_ghcb __section(".data");
/* Bitmap of SEV features supported by the hypervisor */
static u64 sev_hv_features __ro_after_init;
+static unsigned long rmptable_start __ro_after_init;
+static unsigned long rmptable_end __ro_after_init;
+
/* #VC handler runtime per-CPU data */
struct sev_es_runtime_data {
struct ghcb ghcb_page;
@@ -2260,3 +2273,165 @@ static int __init snp_init_platform_device(void)
return 0;
}
device_initcall(snp_init_platform_device);
+
+#undef pr_fmt
+#define pr_fmt(fmt) "SEV-SNP: " fmt
+
+static int __mfd_enable(unsigned int cpu)
+{
+ u64 val;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return 0;
+
+ rdmsrl(MSR_AMD64_SYSCFG, val);
+
+ val |= MSR_AMD64_SYSCFG_MFDM;
+
+ wrmsrl(MSR_AMD64_SYSCFG, val);
+
+ return 0;
+}
+
+static __init void mfd_enable(void *arg)
+{
+ __mfd_enable(smp_processor_id());
+}
+
+static int __snp_enable(unsigned int cpu)
+{
+ u64 val;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return 0;
+
+ rdmsrl(MSR_AMD64_SYSCFG, val);
+
+ val |= MSR_AMD64_SYSCFG_SNP_EN;
+ val |= MSR_AMD64_SYSCFG_SNP_VMPL_EN;
+
+ wrmsrl(MSR_AMD64_SYSCFG, val);
+
+ return 0;
+}
+
+static __init void snp_enable(void *arg)
+{
+ __snp_enable(smp_processor_id());
+}
+
+static bool get_rmptable_info(u64 *start, u64 *len)
+{
+ u64 calc_rmp_sz, rmp_sz, rmp_base, rmp_end;
+
+ rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
+ rdmsrl(MSR_AMD64_RMP_END, rmp_end);
+
+ if (!rmp_base || !rmp_end) {
+ pr_err("Memory for the RMP table has not been reserved by BIOS\n");
+ return false;
+ }
+
+ rmp_sz = rmp_end - rmp_base + 1;
+
+ /*
+ * Calculate the amount of memory that must be reserved by the BIOS to
+ * address the whole RAM. The reserved memory should also cover the
+ * RMP table itself.
+ */
+ calc_rmp_sz = (((rmp_sz >> PAGE_SHIFT) + totalram_pages()) << 4)
+ + RMPTABLE_CPU_BOOKKEEPING_SZ;
+
+ if (calc_rmp_sz > rmp_sz) {
+ pr_err("Memory reserved for the RMP table does not cover full system RAM (expected 0x%llx got 0x%llx)\n",
+ calc_rmp_sz, rmp_sz);
+ return false;
+ }
+
+ *start = rmp_base;
+ *len = rmp_sz;
+
+ pr_info("RMP table physical address [0x%016llx - 0x%016llx]\n", rmp_base, rmp_end);
+
+ return true;
+}
+
+static __init int snp_rmptable_init(void)
+{
+ u64 rmp_base, sz;
+ void *start;
+ u64 val;
+
+ if (!get_rmptable_info(&rmp_base, &sz))
+ return 1;
+
+ start = memremap(rmp_base, sz, MEMREMAP_WB);
+ if (!start) {
+ pr_err("Failed to map RMP table addr 0x%llx size 0x%llx\n", rmp_base, sz);
+ return 1;
+ }
+
+ /*
+ * Check if SEV-SNP is already enabled, this can happen in case of
+ * kexec boot.
+ */
+ rdmsrl(MSR_AMD64_SYSCFG, val);
+ if (val & MSR_AMD64_SYSCFG_SNP_EN)
+ goto skip_enable;
+
+ memset(start, 0, sz);
+
+ /* Flush the caches to ensure that data is written before SNP is enabled. */
+ wbinvd_on_all_cpus();
+
+ /* MFDM must be enabled on all the CPUs prior to enabling SNP. */
+ on_each_cpu(mfd_enable, NULL, 1);
+
+ /* Enable SNP on all CPUs. */
+ on_each_cpu(snp_enable, NULL, 1);
+
+skip_enable:
+ rmptable_start = (unsigned long)start;
+ rmptable_end = rmptable_start + sz - 1;
+
+ return 0;
+}
+
+static int __init snp_host_init(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return 0;
+
+ /*
+ * RMP table entry format is not architectural and it can vary by processor and
+ * is defined by the per-processor PPR. Restrict SNP support on the known CPU
+ * model and family for which the RMP table entry format is currently defined for.
+ */
+ if (boot_cpu_data.x86 != 0x19 || boot_cpu_data.x86_model > 0xaf)
+ goto nosnp;
+
+ if (amd_iommu_snp_enable())
+ goto nosnp;
+
+ if (snp_rmptable_init())
+ goto nosnp;
+
+ cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/rmptable_init:online", __snp_enable, NULL);
+
+ return 0;
+
+nosnp:
+ setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
+ return -ENODEV;
+}
+
+/*
+ * This must be called after the PCI subsystem. This is because amd_iommu_snp_enable()
+ * is called to ensure the IOMMU supports the SEV-SNP feature, which can only be
+ * called after subsys_initcall().
+ *
+ * NOTE: IOMMU is enforced by SNP to ensure that hypervisor cannot program DMA
+ * directly into guest private memory. In case of SNP, the IOMMU ensures that
+ * the page(s) used for DMA are hypervisor owned.
+ */
+fs_initcall(snp_host_init);
--
2.25.1
From: Brijesh Singh <[email protected]>
snp_lookup_rmpentry() can be used by the host to read the RMP entry for
a given page. The RMP entry format is documented in the AMD PPR, see
https://bugzilla.kernel.org/attachment.cgi?id=296015.
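A minimal usage sketch, based on the interface and return values added
by this patch (not verbatim kernel code):

  int level;
  int ret = snp_lookup_rmpentry(pfn, &level);

  if (ret < 0) {
          /* SNP not enabled, invalid pfn, or lookup failed */
  } else if (ret) {
          /* RMP entry is assigned (guest/firmware owned); 'level' is the
           * x86 page level (PG_LEVEL_4K or PG_LEVEL_2M) of the entry */
  } else {
          /* entry exists but is not assigned, i.e. hypervisor/shared */
  }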
Co-developed-by: Ashish Kalra <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/sev.h | 4 +-
arch/x86/kernel/sev.c | 84 ++++++++++++++++++++++++++++++++++++++
2 files changed, 87 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index ebc271bb6d8e..8d3ce2ad27da 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -83,7 +83,7 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
/* RMP page size */
#define RMP_PG_SIZE_4K 0
-
+#define RMP_TO_X86_PG_LEVEL(level) (((level) == RMP_PG_SIZE_4K) ? PG_LEVEL_4K : PG_LEVEL_2M)
#define RMPADJUST_VMSA_PAGE_BIT BIT(16)
/* SNP Guest message request */
@@ -197,6 +197,7 @@ void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
void __init __noreturn snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
+int snp_lookup_rmpentry(u64 pfn, int *level);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
@@ -221,6 +222,7 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
{
return -ENOTTY;
}
+static inline int snp_lookup_rmpentry(u64 pfn, int *level) { return 0; }
#endif
#endif
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index e54e412c9916..a063c1b98034 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -61,11 +61,36 @@
#define AP_INIT_CR0_DEFAULT 0x60000010
#define AP_INIT_MXCSR_DEFAULT 0x1f80
+/*
+ * The RMP entry format is not architectural. The format is defined in PPR
+ * Family 19h Model 01h, Rev B1 processor.
+ */
+struct rmpentry {
+ union {
+ struct {
+ u64 assigned : 1,
+ pagesize : 1,
+ immutable : 1,
+ rsvd1 : 9,
+ gpa : 39,
+ asid : 10,
+ vmsa : 1,
+ validated : 1,
+ rsvd2 : 1;
+ } info;
+ u64 low;
+ };
+ u64 high;
+} __packed;
+
/*
* The first 16KB from the RMP_BASE is used by the processor for the
* bookkeeping, the range needs to be added during the RMP entry lookup.
*/
#define RMPTABLE_CPU_BOOKKEEPING_SZ 0x4000
+#define RMPENTRY_SHIFT 8
+#define rmptable_page_offset(x) (RMPTABLE_CPU_BOOKKEEPING_SZ + \
+ (((unsigned long)x) >> RMPENTRY_SHIFT))
/* For early boot hypervisor communication in SEV-ES enabled guests */
static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
@@ -2435,3 +2460,62 @@ static int __init snp_host_init(void)
* the page(s) used for DMA are hypervisor owned.
*/
fs_initcall(snp_host_init);
+
+static inline unsigned int rmpentry_assigned(struct rmpentry *e)
+{
+ return e->info.assigned;
+}
+
+static inline unsigned int rmpentry_pagesize(struct rmpentry *e)
+{
+ return e->info.pagesize;
+}
+
+static struct rmpentry *rmptable_entry(unsigned long paddr)
+{
+ unsigned long vaddr;
+
+ vaddr = rmptable_start + rmptable_page_offset(paddr);
+ if (unlikely(vaddr > rmptable_end))
+ return ERR_PTR(-EFAULT);
+
+ return (struct rmpentry *)vaddr;
+}
+
+static struct rmpentry *__snp_lookup_rmpentry(u64 pfn, int *level)
+{
+ unsigned long paddr = pfn << PAGE_SHIFT;
+ struct rmpentry *entry, *large_entry;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return ERR_PTR(-ENXIO);
+
+ if (!pfn_valid(pfn))
+ return ERR_PTR(-EINVAL);
+
+ entry = rmptable_entry(paddr);
+ if (IS_ERR(entry))
+ return entry;
+
+ /* Read a large RMP entry to get the correct page level used in RMP entry. */
+ large_entry = rmptable_entry(paddr & PMD_MASK);
+ *level = RMP_TO_X86_PG_LEVEL(rmpentry_pagesize(large_entry));
+
+ return entry;
+}
+
+/*
+ * Return 1 if the RMP entry is assigned, 0 if it exists but is not assigned,
+ * and -errno if there is no corresponding RMP entry.
+ */
+int snp_lookup_rmpentry(u64 pfn, int *level)
+{
+ struct rmpentry *e;
+
+ e = __snp_lookup_rmpentry(pfn, level);
+ if (IS_ERR(e))
+ return PTR_ERR(e);
+
+ return !!rmpentry_assigned(e);
+}
+EXPORT_SYMBOL_GPL(snp_lookup_rmpentry);
--
2.25.1
This information will be useful for debugging things like page faults
due to RMP access violations and RMPUPDATE failures.
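For example, the #PF path updated later in this series dumps the RMP
entry for the faulting pfn when the error code indicates an RMP
violation:

  pfn = dump_pagetable(address);

  if (error_code & X86_PF_RMP)
          sev_dump_rmpentry(pfn);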
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
[mdr: move helper to standalone patch]
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/sev.h | 2 ++
arch/x86/kernel/sev.c | 48 ++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+)
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 8d3ce2ad27da..edbb7fa488af 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -198,6 +198,7 @@ bool snp_init(struct boot_params *bp);
void __init __noreturn snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
int snp_lookup_rmpentry(u64 pfn, int *level);
+void sev_dump_rmpentry(u64 pfn);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
@@ -223,6 +224,7 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
return -ENOTTY;
}
static inline int snp_lookup_rmpentry(u64 pfn, int *level) { return 0; }
+static inline void sev_dump_rmpentry(u64 pfn) {}
#endif
#endif
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index a063c1b98034..a01741c4a1b8 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -2504,6 +2504,54 @@ static struct rmpentry *__snp_lookup_rmpentry(u64 pfn, int *level)
return entry;
}
+void sev_dump_rmpentry(u64 pfn)
+{
+ unsigned long pfn_end;
+ struct rmpentry *e;
+ int level;
+
+ e = __snp_lookup_rmpentry(pfn, &level);
+ if (IS_ERR(e)) {
+ pr_info("Failed to read RMP entry for PFN 0x%llx\n", pfn);
+ return;
+ }
+
+ if (rmpentry_assigned(e)) {
+ pr_info("RMPEntry paddr 0x%llx [assigned=%d immutable=%d pagesize=%d gpa=0x%lx asid=%d vmsa=%d validated=%d]\n",
+ pfn << PAGE_SHIFT, rmpentry_assigned(e), e->info.immutable,
+ rmpentry_pagesize(e), (unsigned long)e->info.gpa, e->info.asid,
+ e->info.vmsa, e->info.validated);
+
+ /* Dump all the non-zero entries if debug enabled */
+ if (!sev_cfg.debug)
+ return;
+ }
+
+ /*
+ * If the RMP entry at the faulting pfn was not assigned, then it is not
+ * clear what caused the RMP violation. To get some useful debug information,
+ * iterate through the entire 2MB region, and dump the RMP entries if
+ * one of the bits in the RMP entry is set.
+ */
+ pfn = pfn & ~(PTRS_PER_PMD - 1);
+ pfn_end = pfn + PTRS_PER_PMD;
+
+ while (pfn < pfn_end) {
+ e = __snp_lookup_rmpentry(pfn, &level);
+ if (IS_ERR(e)) {
+ pr_info("Failed to read RMP entry for PFN 0x%llx\n", pfn);
+ pfn++;
+ continue;
+ }
+
+ if (e->low || e->high)
+ pr_info("RMPEntry paddr 0x%llx: [high=0x%016llx low=0x%016llx]\n",
+ pfn << PAGE_SHIFT, e->high, e->low);
+ pfn++;
+ }
+}
+EXPORT_SYMBOL_GPL(sev_dump_rmpentry);
+
/*
* Return 1 if the RMP entry is assigned, 0 if it exists but is not assigned,
* and -errno if there is no corresponding RMP entry.
--
2.25.1
From: Brijesh Singh <[email protected]>
The RMPUPDATE instruction writes a new RMP entry in the RMP Table. The
hypervisor will use the instruction to add pages to the RMP table. See
APM3 for details on the instruction operations.
The PSMASH instruction expands a 2MB RMP entry into a corresponding set
of contiguous 4KB-Page RMP entries. The hypervisor will use this
instruction to adjust the RMP entry without invalidating the previous
RMP entry.
Add the following external interface API functions:
psmash():
Used to smash a 2MB aligned page into 4K pages while preserving the
Validated bit in the RMP.
rmp_make_private():
Used to assign a page to guest using the RMPUPDATE instruction.
rmp_make_shared():
Used to transition a page to hypervisor/shared state using the
RMPUPDATE instruction.
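A brief usage sketch of these interfaces (simplified; 'pfn', 'gpa',
'asid' and 'pfn_2mb_aligned' are placeholders, and error handling is
omitted):

  /* Assign the 4K page at 'pfn' to the guest with ASID 'asid', mapped
   * at guest physical address 'gpa': */
  ret = rmp_make_private(pfn, gpa, PG_LEVEL_4K, asid, false);

  /* Smash a 2MB RMP entry (pfn must be 2MB-aligned) into 4K entries
   * while preserving the Validated bit: */
  ret = psmash(pfn_2mb_aligned);

  /* Transition the page back to the hypervisor/shared state: */
  ret = rmp_make_shared(pfn, PG_LEVEL_4K);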
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
[mdr: add RMPUPDATE retry logic for transient FAIL_OVERLAP errors]
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/sev.h | 24 +++++++++
arch/x86/kernel/sev.c | 108 +++++++++++++++++++++++++++++++++++++
2 files changed, 132 insertions(+)
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index edbb7fa488af..7d728b30319c 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -80,10 +80,15 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
/* Software defined (when rFlags.CF = 1) */
#define PVALIDATE_FAIL_NOUPDATE 255
+/* RMPUPDATE detected 4K page and 2MB page overlap. */
+#define RMPUPDATE_FAIL_OVERLAP 7
/* RMP page size */
#define RMP_PG_SIZE_4K 0
+#define RMP_PG_SIZE_2M 1
#define RMP_TO_X86_PG_LEVEL(level) (((level) == RMP_PG_SIZE_4K) ? PG_LEVEL_4K : PG_LEVEL_2M)
+#define X86_TO_RMP_PG_LEVEL(level) (((level) == PG_LEVEL_4K) ? RMP_PG_SIZE_4K : RMP_PG_SIZE_2M)
+
#define RMPADJUST_VMSA_PAGE_BIT BIT(16)
/* SNP Guest message request */
@@ -133,6 +138,15 @@ struct snp_secrets_page_layout {
u8 rsvd3[3840];
} __packed;
+struct rmp_state {
+ u64 gpa;
+ u8 assigned;
+ u8 pagesize;
+ u8 immutable;
+ u8 rsvd;
+ u32 asid;
+} __packed;
+
#ifdef CONFIG_AMD_MEM_ENCRYPT
extern struct static_key_false sev_es_enable_key;
extern void __sev_es_ist_enter(struct pt_regs *regs);
@@ -199,6 +213,9 @@ void __init __noreturn snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
int snp_lookup_rmpentry(u64 pfn, int *level);
void sev_dump_rmpentry(u64 pfn);
+int psmash(u64 pfn);
+int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid, bool immutable);
+int rmp_make_shared(u64 pfn, enum pg_level level);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
@@ -225,6 +242,13 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
}
static inline int snp_lookup_rmpentry(u64 pfn, int *level) { return 0; }
static inline void sev_dump_rmpentry(u64 pfn) {}
+static inline int psmash(u64 pfn) { return -ENXIO; }
+static inline int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid,
+ bool immutable)
+{
+ return -ENODEV;
+}
+static inline int rmp_make_shared(u64 pfn, enum pg_level level) { return -ENODEV; }
#endif
#endif
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index a01741c4a1b8..a49f30c10dc1 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -2567,3 +2567,111 @@ int snp_lookup_rmpentry(u64 pfn, int *level)
return !!rmpentry_assigned(e);
}
EXPORT_SYMBOL_GPL(snp_lookup_rmpentry);
+
+/*
+ * psmash is used to smash a 2MB aligned page into 4K
+ * pages while preserving the Validated bit in the RMP.
+ */
+int psmash(u64 pfn)
+{
+ unsigned long paddr = pfn << PAGE_SHIFT;
+ int ret;
+
+ pr_debug("%s: PFN: 0x%llx\n", __func__, pfn);
+
+ if (!pfn_valid(pfn))
+ return -EINVAL;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return -ENXIO;
+
+ /* Binutils version 2.36 supports the PSMASH mnemonic. */
+ asm volatile(".byte 0xF3, 0x0F, 0x01, 0xFF"
+ : "=a"(ret)
+ : "a"(paddr)
+ : "memory", "cc");
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(psmash);
+
+static int rmpupdate(u64 pfn, struct rmp_state *val)
+{
+ int max_attempts = 4 * num_present_cpus();
+ unsigned long paddr = pfn << PAGE_SHIFT;
+ int ret, level, npages;
+ int attempts = 0;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return -ENXIO;
+
+ do {
+ /* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
+ asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
+ : "=a"(ret)
+ : "a"(paddr), "c"((unsigned long)val)
+ : "memory", "cc");
+
+ attempts++;
+ if (ret)
+ pr_debug("RMPUPDATE retry needed, ASID: %d, ret: %d, pfn: %llx, npages: %d, level: %d, assigned: %d, attempts: %d (max: %d)\n",
+ val->asid, ret, pfn, npages, level, val->assigned,
+ attempts, max_attempts);
+ } while (ret && attempts < max_attempts);
+
+ if (ret) {
+ pr_err("RMPUPDATE failed after %d attempts, ret: %d, pfn: %llx, npages: %d, level: %d\n",
+ attempts, ret, pfn, npages, level);
+ sev_dump_rmpentry(pfn);
+ dump_stack();
+ return -EFAULT;
+ } else if (attempts > 1) {
+ pr_debug("RMPUPDATE succeeded after %d attempts, ASID: %d, ret: %d, pfn: %llx, npages: %d",
+ attempts, val->asid, ret, pfn, npages);
+ }
+
+ return 0;
+}
+
+/*
+ * Assign a page to guest using the RMPUPDATE instruction.
+ */
+int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid, bool immutable)
+{
+ struct rmp_state val;
+
+ pr_debug("%s: GPA: 0x%llx, PFN: 0x%llx, level: %d, immutable: %d\n",
+ __func__, gpa, pfn, level, immutable);
+
+ if (!pfn_valid(pfn))
+ return -EINVAL;
+
+ memset(&val, 0, sizeof(val));
+ val.assigned = 1;
+ val.asid = asid;
+ val.immutable = immutable;
+ val.gpa = gpa;
+ val.pagesize = X86_TO_RMP_PG_LEVEL(level);
+
+ return rmpupdate(pfn, &val);
+}
+EXPORT_SYMBOL_GPL(rmp_make_private);
+
+/*
+ * Transition a page to hypervisor/shared state using the RMPUPDATE instruction.
+ */
+int rmp_make_shared(u64 pfn, enum pg_level level)
+{
+ struct rmp_state val;
+
+ pr_debug("%s: PFN: 0x%llx, level: %d\n", __func__, pfn, level);
+
+ if (!pfn_valid(pfn))
+ return -EINVAL;
+
+ memset(&val, 0, sizeof(val));
+ val.pagesize = X86_TO_RMP_PG_LEVEL(level);
+
+ return rmpupdate(pfn, &val);
+}
+EXPORT_SYMBOL_GPL(rmp_make_shared);
--
2.25.1
From: Brijesh Singh <[email protected]>
The integrity guarantee of SEV-SNP is enforced through the RMP table.
The RMP is used with standard x86 and IOMMU page tables to enforce
memory restrictions and page access rights. The RMP check is enforced as
soon as SEV-SNP is enabled globally in the system. When hardware
encounters an RMP-check failure, it raises a page-fault exception.
The rmp_make_private() and rmp_make_shared() helpers are used to add
or remove pages from the RMP table. Improve rmp_make_private() to
invalidate state so that pages cannot be used in the direct map after
they are added to the RMP table, and to restore them to their default
valid permissions after the pages are removed from the RMP table.
Co-developed-by: Ashish Kalra <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kernel/sev.c | 57 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index a49f30c10dc1..3e5ff5934e83 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -2595,6 +2595,37 @@ int psmash(u64 pfn)
}
EXPORT_SYMBOL_GPL(psmash);
+static int restore_direct_map(u64 pfn, int npages)
+{
+ int i, ret = 0;
+
+ for (i = 0; i < npages; i++) {
+ ret = set_direct_map_default_noflush(pfn_to_page(pfn + i));
+ if (ret)
+ goto cleanup;
+ }
+
+cleanup:
+ WARN(ret > 0, "Failed to restore direct map for pfn 0x%llx\n", pfn + i);
+ return ret;
+}
+
+static int invalidate_direct_map(u64 pfn, int npages)
+{
+ int i, ret = 0;
+
+ for (i = 0; i < npages; i++) {
+ ret = set_direct_map_invalid_noflush(pfn_to_page(pfn + i));
+ if (ret)
+ goto cleanup;
+ }
+
+cleanup:
+ WARN(ret > 0, "Failed to invalidate direct map for pfn 0x%llx\n", pfn + i);
+ restore_direct_map(pfn, i);
+ return ret;
+}
+
static int rmpupdate(u64 pfn, struct rmp_state *val)
{
int max_attempts = 4 * num_present_cpus();
@@ -2605,6 +2636,21 @@ static int rmpupdate(u64 pfn, struct rmp_state *val)
if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
return -ENXIO;
+ level = RMP_TO_X86_PG_LEVEL(val->pagesize);
+ npages = page_level_size(level) / PAGE_SIZE;
+
+ /*
+ * If page is getting assigned in the RMP table then unmap it from the
+ * direct map.
+ */
+ if (val->assigned) {
+ if (invalidate_direct_map(pfn, npages)) {
+ pr_err("Failed to unmap %d pages at pfn 0x%llx from the direct_map\n",
+ npages, pfn);
+ return -EFAULT;
+ }
+ }
+
do {
/* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
@@ -2630,6 +2676,17 @@ static int rmpupdate(u64 pfn, struct rmp_state *val)
attempts, val->asid, ret, pfn, npages);
}
+ /*
+ * Restore the direct map after the page is removed from the RMP table.
+ */
+ if (!val->assigned) {
+ if (restore_direct_map(pfn, npages)) {
+ pr_err("Failed to map %d pages at pfn 0x%llx into the direct_map\n",
+ npages, pfn);
+ return -EFAULT;
+ }
+ }
+
return 0;
}
--
2.25.1
From: Brijesh Singh <[email protected]>
Bit 31 in the page fault error code will be set when the processor
encounters an RMP violation.
While at it, use the BIT() macro.
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/trap_pf.h | 18 +++++++++++-------
arch/x86/mm/fault.c | 1 +
2 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
index 10b1de500ab1..295be06f8db7 100644
--- a/arch/x86/include/asm/trap_pf.h
+++ b/arch/x86/include/asm/trap_pf.h
@@ -2,6 +2,8 @@
#ifndef _ASM_X86_TRAP_PF_H
#define _ASM_X86_TRAP_PF_H
+#include <linux/bits.h> /* BIT() macro */
+
/*
* Page fault error code bits:
*
@@ -12,15 +14,17 @@
* bit 4 == 1: fault was an instruction fetch
* bit 5 == 1: protection keys block access
* bit 15 == 1: SGX MMU page-fault
+ * bit 31 == 1: fault was due to RMP violation
*/
enum x86_pf_error_code {
- X86_PF_PROT = 1 << 0,
- X86_PF_WRITE = 1 << 1,
- X86_PF_USER = 1 << 2,
- X86_PF_RSVD = 1 << 3,
- X86_PF_INSTR = 1 << 4,
- X86_PF_PK = 1 << 5,
- X86_PF_SGX = 1 << 15,
+ X86_PF_PROT = BIT(0),
+ X86_PF_WRITE = BIT(1),
+ X86_PF_USER = BIT(2),
+ X86_PF_RSVD = BIT(3),
+ X86_PF_INSTR = BIT(4),
+ X86_PF_PK = BIT(5),
+ X86_PF_SGX = BIT(15),
+ X86_PF_RMP = BIT(31),
};
#endif /* _ASM_X86_TRAP_PF_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7b0d4ab894c8..f8193b99e9c8 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -567,6 +567,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
!(error_code & X86_PF_PROT) ? "not-present page" :
(error_code & X86_PF_RSVD) ? "reserved bit violation" :
(error_code & X86_PF_PK) ? "protection keys violation" :
+ (error_code & X86_PF_RMP) ? "RMP violation" :
"permissions violation");
if (!(error_code & X86_PF_USER) && user_mode(regs)) {
--
2.25.1
From: Brijesh Singh <[email protected]>
When SEV-SNP is enabled globally, a write from the host goes through the
RMP check. When the host writes to pages, hardware checks the following
conditions at the end of the page walk:
1. The Assigned bit in the RMP table is zero (i.e. the page is shared).
2. If the page table entry that gives the sPA indicates that the target
   page size is a large page, then all RMP entries for the 4KB
   constituting pages of the target must have the Assigned bit 0.
3. The Immutable bit in the RMP table is not zero.
The hardware will raise a page fault if one of the above conditions is
not met. Try resolving the fault instead of taking the fault again and
again. If the host attempts to write to guest private memory, then send
a SIGBUS signal to kill the process. If the page level between the host
and the RMP entry does not match, then split the address range to keep
the RMP and host page levels in sync.
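As a worked example of the pfn computation used below: if the faulting
address is backed by a 2MB host mapping whose head pfn is 0x100200
(illustrative numbers) and the fault hits offset 0x3000 within that 2MB
region, the 4K pfn that is checked against the RMP is
0x100200 | (0x3000 >> 12) = 0x100203.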
Co-developed-by: Jarkko Sakkinen <[email protected]>
Signed-off-by: Jarkko Sakkinen <[email protected]>
Co-developed-by: Ashish Kalra <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/mm/fault.c | 104 ++++++++++++++++++++++++++++++++++++++-
include/linux/mm.h | 3 +-
include/linux/mm_types.h | 3 ++
mm/memory.c | 10 ++++
4 files changed, 118 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f8193b99e9c8..afd4cde17001 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -33,6 +33,7 @@
#include <asm/kvm_para.h> /* kvm_handle_async_pf */
#include <asm/vdso.h> /* fixup_vdso_exception() */
#include <asm/irq_stack.h>
+#include <asm/sev.h> /* snp_lookup_rmpentry() */
#define CREATE_TRACE_POINTS
#include <asm/trace/exceptions.h>
@@ -414,6 +415,7 @@ static void dump_pagetable(unsigned long address)
pr_cont("PTE %lx", pte_val(*pte));
out:
pr_cont("\n");
+
return;
bad:
pr_info("BAD\n");
@@ -527,6 +529,8 @@ static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 index)
static void
show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address)
{
+ unsigned long pfn;
+
if (!oops_may_print())
return;
@@ -599,7 +603,10 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
show_ldttss(&gdt, "TR", tr);
}
- dump_pagetable(address);
+ pfn = dump_pagetable(address);
+
+ if (error_code & X86_PF_RMP)
+ sev_dump_rmpentry(pfn);
}
static noinline void
@@ -1240,6 +1247,90 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
}
NOKPROBE_SYMBOL(do_kern_addr_fault);
+enum rmp_pf_ret {
+ RMP_PF_SPLIT = 0,
+ RMP_PF_RETRY = 1,
+ RMP_PF_UNMAP = 2,
+};
+
+/*
+ * The goal of RMP faulting routine is really to check whether the
+ * page that faulted should be accessible. That can be determined
+ * simply by looking at the RMP entry for the 4k address being accessed.
+ * If that entry has Assigned=1 then it's a bad address. It could be
+ * because the 2MB region was assigned as a large page, or it could be
+ * because the region is all 4k pages and that 4k was assigned.
+ * In either case, it's a bad access.
+ * There are basically two main possibilities:
+ * 1. The 2M entry has Assigned=1 and Page_Size=1. Then all 511 middle
+ * entries also have Assigned=1. This entire 2M region is a guest page.
+ * 2. The 2M entry has Assigned=0 and Page_Size=0. Then the 511 middle
+ * entries can be anything, this region consists of individual 4k assignments.
+ */
+static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
+ unsigned long address)
+{
+ int rmp_level, level;
+ pgd_t *pgd;
+ pte_t *pte;
+ u64 pfn;
+
+ pgd = __va(read_cr3_pa());
+ pgd += pgd_index(address);
+
+ pte = lookup_address_in_pgd(pgd, address, &level);
+
+ /*
+ * It can happen if there was a race between an unmap event and
+ * the RMP fault delivery.
+ */
+ if (!pte || !pte_present(*pte))
+ return RMP_PF_UNMAP;
+
+ /*
+ * RMP page fault handler follows this algorithm:
+ * 1. Compute the pfn for the 4kb page being accessed
+ * 2. Read that RMP entry -- If it is assigned then kill the process
+ * 3. Otherwise, check the level from the host page table
+ * If level=PG_LEVEL_4K then the page is already smashed
+ * so just retry the instruction
+ * 4. If level=PG_LEVEL_2M/1G, then the host page needs to be split
+ */
+
+ pfn = pte_pfn(*pte);
+
+ /* If it's a large page then calculate the fault pfn */
+ if (level > PG_LEVEL_4K)
+ pfn = pfn | PFN_DOWN(address & (page_level_size(level) - 1));
+
+ /*
+ * If it's a guest private page, then the fault cannot be resolved.
+ * Send a SIGBUS to terminate the process.
+ *
+ * As documented in APM vol3 pseudo-code for RMPUPDATE, when the 2M range
+ * is covered by a valid (Assigned=1) 2M entry, the middle 511 4k entries
+ * also have Assigned=1. This means that if there is an access to a page
+ * which happens to lie within an Assigned 2M entry, the 4k RMP entry
+ * will also have Assigned=1. Therefore, the kernel should see that
+ * the page is not a valid page and the fault cannot be resolved.
+ */
+ if (snp_lookup_rmpentry(pfn, &rmp_level)) {
+ pr_info("Fatal RMP page fault, terminating process, entry assigned for pfn 0x%llx\n",
+ pfn);
+ do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
+ return RMP_PF_RETRY;
+ }
+
+ /*
+ * The backing page level is higher than the RMP page level, request
+ * to split the page.
+ */
+ if (level > rmp_level)
+ return RMP_PF_SPLIT;
+
+ return RMP_PF_RETRY;
+}
+
/*
* Handle faults in the user portion of the address space. Nothing in here
* should check X86_PF_USER without a specific justification: for almost
@@ -1337,6 +1428,17 @@ void do_user_addr_fault(struct pt_regs *regs,
if (error_code & X86_PF_INSTR)
flags |= FAULT_FLAG_INSTRUCTION;
+ /*
+ * If it's an RMP violation, try resolving it.
+ */
+ if (error_code & X86_PF_RMP) {
+ if (handle_user_rmp_page_fault(regs, error_code, address))
+ return;
+
+ /* Ask to split the page */
+ flags |= FAULT_FLAG_PAGE_SPLIT;
+ }
+
#ifdef CONFIG_X86_64
/*
* Faults in the vsyscall page might need emulation. The
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3c84f4e48cd7..2fd8e16d149c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -466,7 +466,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
{ FAULT_FLAG_USER, "USER" }, \
{ FAULT_FLAG_REMOTE, "REMOTE" }, \
{ FAULT_FLAG_INSTRUCTION, "INSTRUCTION" }, \
- { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
+ { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
+ { FAULT_FLAG_PAGE_SPLIT, "PAGESPLIT" }
/*
* vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..06ba34d51638 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -962,6 +962,8 @@ typedef struct {
* mapped R/O.
* @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
* We should only access orig_pte if this flag set.
+ * @FAULT_FLAG_PAGE_SPLIT: The fault was due to a page size mismatch; split the
+ * region to smaller page size and retry.
*
* About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
* whether we would allow page faults to retry by specifying these two
@@ -999,6 +1001,7 @@ enum fault_flag {
FAULT_FLAG_INTERRUPTIBLE = 1 << 9,
FAULT_FLAG_UNSHARE = 1 << 10,
FAULT_FLAG_ORIG_PTE_VALID = 1 << 11,
+ FAULT_FLAG_PAGE_SPLIT = 1 << 12,
};
typedef unsigned int __bitwise zap_flags_t;
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..e68da7e403c6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4996,6 +4996,12 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
return 0;
}
+static int handle_split_page_fault(struct vm_fault *vmf)
+{
+ __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
+ return 0;
+}
+
/*
* By the time we get here, we already hold the mm semaphore
*
@@ -5078,6 +5084,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
pmd_migration_entry_wait(mm, vmf.pmd);
return 0;
}
+
+ if (flags & FAULT_FLAG_PAGE_SPLIT)
+ return handle_split_page_fault(&vmf);
+
if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&vmf);
--
2.25.1
From: Hugh Dickins <[email protected]>
When the address is backed by a memfd, the code to split the page does
nothing more than remove the PMD from the page tables. So immediately
install a PTE to ensure that any other pages in that 2MB region are
brought back in as 4K pages.
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
mm/memory.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index e68da7e403c6..33c9020ba1f8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4999,6 +4999,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
static int handle_split_page_fault(struct vm_fault *vmf)
{
__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
+ /*
+ * Install a PTE immediately to ensure that any other pages in
+ * this 2MB region are brought back in as 4K pages.
+ */
+ __pte_alloc(vmf->vma->vm_mm, vmf->pmd);
return 0;
}
--
2.25.1
From: Ashish Kalra <[email protected]>
Return the pfn from dump_pagetable() to do SEV-specific fault handling;
this is used for handling the SNP RMP page fault.
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/mm/fault.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index afd4cde17001..f2b16dcfbd9a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -311,7 +311,7 @@ static bool low_pfn(unsigned long pfn)
return pfn < max_low_pfn;
}
-static void dump_pagetable(unsigned long address)
+static unsigned long dump_pagetable(unsigned long address)
{
pgd_t *base = __va(read_cr3_pa());
pgd_t *pgd = &base[pgd_index(address)];
@@ -345,8 +345,10 @@ static void dump_pagetable(unsigned long address)
pte = pte_offset_kernel(pmd, address);
pr_cont("*pte = %0*Lx ", sizeof(*pte) * 2, (u64)pte_val(*pte));
+ return 0;
out:
pr_cont("\n");
+ return 0;
}
#else /* CONFIG_X86_64: */
@@ -367,10 +369,11 @@ static int bad_address(void *p)
return get_kernel_nofault(dummy, (unsigned long *)p);
}
-static void dump_pagetable(unsigned long address)
+static unsigned long dump_pagetable(unsigned long address)
{
pgd_t *base = __va(read_cr3_pa());
pgd_t *pgd = base + pgd_index(address);
+ unsigned long pfn;
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
@@ -388,6 +391,7 @@ static void dump_pagetable(unsigned long address)
if (bad_address(p4d))
goto bad;
+ pfn = p4d_pfn(*p4d);
pr_cont("P4D %lx ", p4d_val(*p4d));
if (!p4d_present(*p4d) || p4d_large(*p4d))
goto out;
@@ -396,6 +400,7 @@ static void dump_pagetable(unsigned long address)
if (bad_address(pud))
goto bad;
+ pfn = pud_pfn(*pud);
pr_cont("PUD %lx ", pud_val(*pud));
if (!pud_present(*pud) || pud_large(*pud))
goto out;
@@ -404,6 +409,7 @@ static void dump_pagetable(unsigned long address)
if (bad_address(pmd))
goto bad;
+ pfn = pmd_pfn(*pmd);
pr_cont("PMD %lx ", pmd_val(*pmd));
if (!pmd_present(*pmd) || pmd_large(*pmd))
goto out;
@@ -412,13 +418,14 @@ static void dump_pagetable(unsigned long address)
if (bad_address(pte))
goto bad;
+ pfn = pte_pfn(*pte);
pr_cont("PTE %lx", pte_val(*pte));
out:
pr_cont("\n");
-
- return;
+ return pfn;
bad:
pr_info("BAD\n");
+ return -1;
}
#endif /* CONFIG_X86_64 */
--
2.25.1
This callback is used by the KVM MMU to check whether a #NPF was for a
private GPA or not.
In some cases the full 64-bit error code for the #NPF will be needed to
make this determination, so also update kvm_mmu_do_page_fault() to
accept the full 64-bit value so it can be plumbed through to the
callback.
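For SNP, the expectation is that the platform hook can decide this
directly from the error code. A minimal sketch (illustrative only;
sev_snp_guest() and the PFERR_GUEST_ENC_MASK bit are assumptions based
on later patches in this series, not definitions introduced here):

    static bool sev_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 error_code,
                                     bool *private_fault)
    {
            /* Only claim the fault for SNP guests; others use the default path. */
            if (!sev_snp_guest(kvm))
                    return false;

            /* Hardware encodes shared vs. private in the #NPF error code. */
            *private_fault = !!(error_code & PFERR_GUEST_ENC_MASK);

            return true;
    }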
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu/mmu.c | 3 +--
arch/x86/kvm/mmu/mmu_internal.h | 37 +++++++++++++++++++++++++++---
4 files changed, 37 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 8dc345cc6318..72183da010b8 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -131,6 +131,7 @@ KVM_X86_OP(msr_filter_changed)
KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
+KVM_X86_OP_OPTIONAL_RET0(fault_is_private);
#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e552374f2357..f856d689dda0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1643,6 +1643,7 @@ struct kvm_x86_ops {
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);
+ bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code, bool *private_fault);
bool (*has_wbinvd_exit)(void);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index eda615f3951c..fb3f34b7391c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5724,8 +5724,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
}
if (r == RET_PF_INVALID) {
- r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
- lower_32_bits(error_code), false);
+ r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false);
if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
return -EIO;
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index e642d431df4b..557a001210df 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -231,6 +231,37 @@ struct kvm_page_fault {
int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
+{
+ struct kvm_memory_slot *slot;
+ bool private_fault = false;
+ gfn_t gfn = gpa_to_gfn(gpa);
+
+ slot = gfn_to_memslot(kvm, gfn);
+ if (!slot) {
+ pr_debug("%s: no slot, GFN: 0x%llx\n", __func__, gfn);
+ goto out;
+ }
+
+ if (!kvm_slot_can_be_private(slot)) {
+ pr_debug("%s: slot is not private, GFN: 0x%llx\n", __func__, gfn);
+ goto out;
+ }
+
+ if (static_call(kvm_x86_fault_is_private)(kvm, gpa, err, &private_fault))
+ goto out;
+
+ /*
+ * Handling below is for UPM self-tests and guests that treat userspace
+ * as the authority on whether a fault should be private or not.
+ */
+ private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
+
+out:
+ pr_debug("%s: GFN: 0x%llx, private: %d\n", __func__, gfn, private_fault);
+ return private_fault;
+}
+
/*
* Return values of handle_mmio_page_fault(), mmu.page_fault(), fast_page_fault(),
* and of course kvm_mmu_do_page_fault().
@@ -262,11 +293,11 @@ enum {
};
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- u32 err, bool prefetch)
+ u64 err, bool prefetch)
{
struct kvm_page_fault fault = {
.addr = cr2_or_gpa,
- .error_code = err,
+ .error_code = lower_32_bits(err),
.exec = err & PFERR_FETCH_MASK,
.write = err & PFERR_WRITE_MASK,
.present = err & PFERR_PRESENT_MASK,
@@ -280,7 +311,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.max_level = KVM_MAX_HUGEPAGE_LEVEL,
.req_level = PG_LEVEL_4K,
.goal_level = PG_LEVEL_4K,
- .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
+ .is_private = kvm_mmu_fault_is_private(vcpu->kvm, cr2_or_gpa, err),
};
int r;
--
2.25.1
From: Brijesh Singh <[email protected]>
AMD introduced the next generation of SEV called SEV-SNP (Secure Nested
Paging). SEV-SNP builds upon existing SEV and SEV-ES functionality
while adding new hardware security protection.
Define the commands and structures used to communicate with the AMD-SP
when creating and managing the SEV-SNP guests. The SEV-SNP firmware spec
is available at developer.amd.com/sev.
Co-developed-by: Ashish Kalra <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 16 +++
include/linux/psp-sev.h | 247 +++++++++++++++++++++++++++++++++++
include/uapi/linux/psp-sev.h | 44 +++++++
3 files changed, 307 insertions(+)
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 06fc7156c04f..9d84720a41d7 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -126,6 +126,8 @@ static int sev_cmd_buffer_len(int cmd)
switch (cmd) {
case SEV_CMD_INIT: return sizeof(struct sev_data_init);
case SEV_CMD_INIT_EX: return sizeof(struct sev_data_init_ex);
+ case SEV_CMD_SNP_SHUTDOWN_EX: return sizeof(struct sev_data_snp_shutdown_ex);
+ case SEV_CMD_SNP_INIT_EX: return sizeof(struct sev_data_snp_init_ex);
case SEV_CMD_PLATFORM_STATUS: return sizeof(struct sev_user_data_status);
case SEV_CMD_PEK_CSR: return sizeof(struct sev_data_pek_csr);
case SEV_CMD_PEK_CERT_IMPORT: return sizeof(struct sev_data_pek_cert_import);
@@ -154,6 +156,20 @@ static int sev_cmd_buffer_len(int cmd)
case SEV_CMD_GET_ID: return sizeof(struct sev_data_get_id);
case SEV_CMD_ATTESTATION_REPORT: return sizeof(struct sev_data_attestation_report);
case SEV_CMD_SEND_CANCEL: return sizeof(struct sev_data_send_cancel);
+ case SEV_CMD_SNP_GCTX_CREATE: return sizeof(struct sev_data_snp_addr);
+ case SEV_CMD_SNP_LAUNCH_START: return sizeof(struct sev_data_snp_launch_start);
+ case SEV_CMD_SNP_LAUNCH_UPDATE: return sizeof(struct sev_data_snp_launch_update);
+ case SEV_CMD_SNP_ACTIVATE: return sizeof(struct sev_data_snp_activate);
+ case SEV_CMD_SNP_DECOMMISSION: return sizeof(struct sev_data_snp_addr);
+ case SEV_CMD_SNP_PAGE_RECLAIM: return sizeof(struct sev_data_snp_page_reclaim);
+ case SEV_CMD_SNP_GUEST_STATUS: return sizeof(struct sev_data_snp_guest_status);
+ case SEV_CMD_SNP_LAUNCH_FINISH: return sizeof(struct sev_data_snp_launch_finish);
+ case SEV_CMD_SNP_DBG_DECRYPT: return sizeof(struct sev_data_snp_dbg);
+ case SEV_CMD_SNP_DBG_ENCRYPT: return sizeof(struct sev_data_snp_dbg);
+ case SEV_CMD_SNP_PAGE_UNSMASH: return sizeof(struct sev_data_snp_page_unsmash);
+ case SEV_CMD_SNP_PLATFORM_STATUS: return sizeof(struct sev_data_snp_addr);
+ case SEV_CMD_SNP_GUEST_REQUEST: return sizeof(struct sev_data_snp_guest_request);
+ case SEV_CMD_SNP_CONFIG: return sizeof(struct sev_user_data_snp_config);
default: return 0;
}
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index 1595088c428b..31b045e1926f 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -86,6 +86,35 @@ enum sev_cmd {
SEV_CMD_DBG_DECRYPT = 0x060,
SEV_CMD_DBG_ENCRYPT = 0x061,
+ /* SNP specific commands */
+ SEV_CMD_SNP_INIT = 0x81,
+ SEV_CMD_SNP_SHUTDOWN = 0x82,
+ SEV_CMD_SNP_PLATFORM_STATUS = 0x83,
+ SEV_CMD_SNP_DF_FLUSH = 0x84,
+ SEV_CMD_SNP_INIT_EX = 0x85,
+ SEV_CMD_SNP_SHUTDOWN_EX = 0x86,
+ SEV_CMD_SNP_DECOMMISSION = 0x90,
+ SEV_CMD_SNP_ACTIVATE = 0x91,
+ SEV_CMD_SNP_GUEST_STATUS = 0x92,
+ SEV_CMD_SNP_GCTX_CREATE = 0x93,
+ SEV_CMD_SNP_GUEST_REQUEST = 0x94,
+ SEV_CMD_SNP_ACTIVATE_EX = 0x95,
+ SEV_CMD_SNP_LAUNCH_START = 0xA0,
+ SEV_CMD_SNP_LAUNCH_UPDATE = 0xA1,
+ SEV_CMD_SNP_LAUNCH_FINISH = 0xA2,
+ SEV_CMD_SNP_DBG_DECRYPT = 0xB0,
+ SEV_CMD_SNP_DBG_ENCRYPT = 0xB1,
+ SEV_CMD_SNP_PAGE_SWAP_OUT = 0xC0,
+ SEV_CMD_SNP_PAGE_SWAP_IN = 0xC1,
+ SEV_CMD_SNP_PAGE_MOVE = 0xC2,
+ SEV_CMD_SNP_PAGE_MD_INIT = 0xC3,
+ SEV_CMD_SNP_PAGE_MD_RECLAIM = 0xC4,
+ SEV_CMD_SNP_PAGE_RO_RECLAIM = 0xC5,
+ SEV_CMD_SNP_PAGE_RO_RESTORE = 0xC6,
+ SEV_CMD_SNP_PAGE_RECLAIM = 0xC7,
+ SEV_CMD_SNP_PAGE_UNSMASH = 0xC8,
+ SEV_CMD_SNP_CONFIG = 0xC9,
+
SEV_CMD_MAX,
};
@@ -531,6 +560,224 @@ struct sev_data_attestation_report {
u32 len; /* In/Out */
} __packed;
+/**
+ * struct sev_data_snp_download_firmware - SNP_DOWNLOAD_FIRMWARE command params
+ *
+ * @address: physical address of firmware image
+ * @len: length of the firmware image
+ */
+struct sev_data_snp_download_firmware {
+ u64 address; /* In */
+ u32 len; /* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_activate - SNP_ACTIVATE command params
+ *
+ * @gctx_paddr: system physical address of the guest context page
+ * @asid: ASID to bind to the guest
+ */
+struct sev_data_snp_activate {
+ u64 gctx_paddr; /* In */
+ u32 asid; /* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_addr - generic SNP command params
+ *
+ * @gctx_paddr: system physical address of the page used by the command
+ */
+struct sev_data_snp_addr {
+ u64 gctx_paddr; /* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_launch_start - SNP_LAUNCH_START command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @policy: guest policy
+ * @ma_gctx_paddr: system physical address of migration agent
+ * @imi_en: launch flow is launching an IMI for the purpose of
+ * guest-assisted migration.
+ * @ma_en: the guest is associated with a migration agent
+ */
+struct sev_data_snp_launch_start {
+ u64 gctx_paddr; /* In */
+ u64 policy; /* In */
+ u64 ma_gctx_paddr; /* In */
+ u32 ma_en:1; /* In */
+ u32 imi_en:1; /* In */
+ u32 rsvd:30;
+ u8 gosvw[16]; /* In */
+} __packed;
+
+/* SNP support page type */
+enum {
+ SNP_PAGE_TYPE_NORMAL = 0x1,
+ SNP_PAGE_TYPE_VMSA = 0x2,
+ SNP_PAGE_TYPE_ZERO = 0x3,
+ SNP_PAGE_TYPE_UNMEASURED = 0x4,
+ SNP_PAGE_TYPE_SECRET = 0x5,
+ SNP_PAGE_TYPE_CPUID = 0x6,
+
+ SNP_PAGE_TYPE_MAX
+};
+
+/**
+ * struct sev_data_snp_launch_update - SNP_LAUNCH_UPDATE command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @imi_page: indicates that this page is part of the IMI of the guest
+ * @page_type: encoded page type
+ * @page_size: page size 0 indicates 4K and 1 indicates 2MB page
+ * @address: system physical address of destination page to encrypt
+ * @vmpl1_perms: VMPL permission mask for VMPL1
+ * @vmpl2_perms: VMPL permission mask for VMPL2
+ * @vmpl3_perms: VMPL permission mask for VMPL3
+ */
+struct sev_data_snp_launch_update {
+ u64 gctx_paddr; /* In */
+ u32 page_size:1; /* In */
+ u32 page_type:3; /* In */
+ u32 imi_page:1; /* In */
+ u32 rsvd:27;
+ u32 rsvd2;
+ u64 address; /* In */
+ u32 rsvd3:8;
+ u32 vmpl1_perms:8; /* In */
+ u32 vmpl2_perms:8; /* In */
+ u32 vmpl3_perms:8; /* In */
+ u32 rsvd4;
+} __packed;
+
+/**
+ * struct sev_data_snp_launch_finish - SNP_LAUNCH_FINISH command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ */
+struct sev_data_snp_launch_finish {
+ u64 gctx_paddr;
+ u64 id_block_paddr;
+ u64 id_auth_paddr;
+ u8 id_block_en:1;
+ u8 auth_key_en:1;
+ u64 rsvd:62;
+ u8 host_data[32];
+} __packed;
+
+/**
+ * struct sev_data_snp_guest_status - SNP_GUEST_STATUS command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @address: system physical address of guest status page
+ */
+struct sev_data_snp_guest_status {
+ u64 gctx_paddr;
+ u64 address;
+} __packed;
+
+/**
+ * struct sev_data_snp_page_reclaim - SNP_PAGE_RECLAIM command params
+ *
+ * @paddr: system physical address of the page to be reclaimed. Bit 0
+ * of the address indicates the page size: 0h indicates a 4 kB page and
+ * 1h indicates a 2 MB page.
+ */
+struct sev_data_snp_page_reclaim {
+ u64 paddr;
+} __packed;
+
+/**
+ * struct sev_data_snp_page_unsmash - SNP_PAGE_UNSMASH command params
+ *
+ * @paddr: system physical address of the page to be unsmashed. Bit 0
+ * of the address indicates the page size: 0h indicates a 4 kB page and
+ * 1h indicates a 2 MB page.
+ */
+struct sev_data_snp_page_unsmash {
+ u64 paddr;
+} __packed;
+
+/**
+ * struct sev_data_snp_dbg - SNP_DBG_ENCRYPT/SNP_DBG_DECRYPT command parameters
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @src_addr: source address of data to operate on
+ * @dst_addr: destination address of data to operate on
+ * @len: len of data to operate on
+ */
+struct sev_data_snp_dbg {
+ u64 gctx_paddr; /* In */
+ u64 src_addr; /* In */
+ u64 dst_addr; /* In */
+ u32 len; /* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_guest_request - SNP_GUEST_REQUEST command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @req_paddr: system physical address of request page
+ * @res_paddr: system physical address of response page
+ */
+struct sev_data_snp_guest_request {
+ u64 gctx_paddr; /* In */
+ u64 req_paddr; /* In */
+ u64 res_paddr; /* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_init_ex - SNP_INIT_EX structure
+ *
+ * @init_rmp: indicates that the RMP should be initialized.
+ * @list_paddr_en: indicates that list_paddr is valid
+ * @list_paddr: system physical address of range list
+ */
+struct sev_data_snp_init_ex {
+ u32 init_rmp:1;
+ u32 list_paddr_en:1;
+ u32 rsvd:30;
+ u32 rsvd1;
+ u64 list_paddr;
+ u8 rsvd2[48];
+} __packed;
+
+/**
+ * struct sev_data_range - RANGE structure
+ *
+ * @base: system physical address of first byte of range
+ * @page_count: number of 4KB pages in this range
+ */
+struct sev_data_range {
+ u64 base;
+ u32 page_count;
+ u32 rsvd;
+} __packed;
+
+/**
+ * struct sev_data_range_list - RANGE_LIST structure
+ *
+ * @num_elements: number of elements in RANGE_ARRAY
+ * @ranges: array of num_elements of type RANGE
+ */
+struct sev_data_range_list {
+ u32 num_elements;
+ u32 rsvd;
+ struct sev_data_range ranges[0];
+} __packed;
+
+/**
+ * struct sev_data_snp_shutdown_ex - SNP_SHUTDOWN_EX structure
+ *
+ * @length: length of the command buffer read by the PSP
+ * @iommu_snp_shutdown: Disable enforcement of SNP in the IOMMU
+ */
+struct sev_data_snp_shutdown_ex {
+ u32 length;
+ u32 iommu_snp_shutdown:1;
+ u32 rsvd1:31;
+} __packed;
+
#ifdef CONFIG_CRYPTO_DEV_SP_PSP
/**
diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
index 91b4c63d5cbf..c66f7c372645 100644
--- a/include/uapi/linux/psp-sev.h
+++ b/include/uapi/linux/psp-sev.h
@@ -61,6 +61,13 @@ typedef enum {
SEV_RET_INVALID_PARAM,
SEV_RET_RESOURCE_LIMIT,
SEV_RET_SECURE_DATA_INVALID,
+ SEV_RET_INVALID_PAGE_SIZE,
+ SEV_RET_INVALID_PAGE_STATE,
+ SEV_RET_INVALID_MDATA_ENTRY,
+ SEV_RET_INVALID_PAGE_OWNER,
+ SEV_RET_INVALID_PAGE_AEAD_OFLOW,
+ SEV_RET_RMP_INIT_REQUIRED,
+
SEV_RET_MAX,
} sev_ret_code;
@@ -147,6 +154,43 @@ struct sev_user_data_get_id2 {
__u32 length; /* In/Out */
} __packed;
+/**
+ * struct sev_user_data_snp_status - SNP status
+ *
+ * @api_major: API major version
+ * @api_minor: API minor version
+ * @state: current platform state
+ * @build_id: firmware build id for this API version
+ * @guest_count: the number of guests currently managed by the firmware
+ * @tcb_version: current TCB version
+ */
+struct sev_user_data_snp_status {
+ __u8 api_major; /* Out */
+ __u8 api_minor; /* Out */
+ __u8 state; /* Out */
+ __u8 rsvd;
+ __u32 build_id; /* Out */
+ __u32 rsvd1;
+ __u32 guest_count; /* Out */
+ __u64 tcb_version; /* Out */
+ __u64 rsvd2;
+} __packed;
+
+/**
+ * struct sev_user_data_snp_config - system wide configuration value for SNP.
+ *
+ * @reported_tcb: the TCB version to report in the guest attestation report.
+ * @mask_chip_id: indicates that the CHIP_ID field in the attestation report
+ * will always be zero.
+ */
+struct sev_user_data_snp_config {
+ __u64 reported_tcb; /* In */
+ __u32 mask_chip_id:1; /* In */
+ __u32 mask_chip_key:1; /* In */
+ __u32 rsvd:30; /* In */
+ __u8 rsvd1[52];
+} __packed;
+
/**
* struct sev_issue_cmd - SEV ioctl parameters
*
--
2.25.1
From: Brijesh Singh <[email protected]>
Before SNP VMs can be launched, the platform must be appropriately
configured and initialized. Platform initialization is accomplished via
the SNP_INIT command. Make sure to do a WBINVD and issue the DF_FLUSH
command to prepare for the first SNP guest launch after INIT.
During the execution of SNP_INIT command, the firmware configures
and enables SNP security policy enforcement in many system components.
Some system components write to regions of memory reserved by early
x86 firmware (e.g. UEFI). Other system components write to regions
provided by the operating system, hypervisor, or x86 firmware.
Such system components can only write to HV-fixed pages or Default
pages. They will error when attempting to write to other page states
after SNP_INIT enables their SNP enforcement.
Starting in SNP firmware v1.52, the SNP_INIT_EX command takes a list of
system physical address ranges to convert into the HV-fixed page states
during the RMP initialization. If INIT_RMP is 1, hypervisors should
provide all system physical address ranges that the hypervisor will
never assign to a guest until the next RMP re-initialization.
For instance, the memory that UEFI reserves should be included in the
range list. This allows system components that occasionally write to
memory (e.g. logging to UEFI reserved regions) to not fail due to
RMP initialization and SNP enablement.
Co-developed-by: Ashish Kalra <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 225 +++++++++++++++++++++++++++++++++++
drivers/crypto/ccp/sev-dev.h | 2 +
include/linux/psp-sev.h | 17 +++
3 files changed, 244 insertions(+)
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 9d84720a41d7..af20420bd6c2 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -26,6 +26,7 @@
#include <linux/fs_struct.h>
#include <asm/smp.h>
+#include <asm/e820/types.h>
#include "psp-dev.h"
#include "sev-dev.h"
@@ -34,6 +35,10 @@
#define SEV_FW_FILE "amd/sev.fw"
#define SEV_FW_NAME_SIZE 64
+/* Minimum firmware version required for the SEV-SNP support */
+#define SNP_MIN_API_MAJOR 1
+#define SNP_MIN_API_MINOR 51
+
static DEFINE_MUTEX(sev_cmd_mutex);
static struct sev_misc_dev *misc_dev;
@@ -76,6 +81,13 @@ static void *sev_es_tmr;
#define NV_LENGTH (32 * 1024)
static void *sev_init_ex_buffer;
+/*
+ * SEV_DATA_RANGE_LIST:
+ * Array containing the ranges of pages that the firmware transitions to the HV-fixed
+ * page state.
+ */
+struct sev_data_range_list *snp_range_list;
+
static inline bool sev_version_greater_or_equal(u8 maj, u8 min)
{
struct sev_device *sev = psp_master->sev_data;
@@ -830,6 +842,186 @@ static int sev_update_firmware(struct device *dev)
return ret;
}
+static void snp_set_hsave_pa(void *arg)
+{
+ wrmsrl(MSR_VM_HSAVE_PA, 0);
+}
+
+static int snp_filter_reserved_mem_regions(struct resource *rs, void *arg)
+{
+ struct sev_data_range_list *range_list = arg;
+ struct sev_data_range *range = &range_list->ranges[range_list->num_elements];
+ size_t size;
+
+ if ((range_list->num_elements * sizeof(struct sev_data_range) +
+ sizeof(struct sev_data_range_list)) > PAGE_SIZE)
+ return -E2BIG;
+
+ switch (rs->desc) {
+ case E820_TYPE_RESERVED:
+ case E820_TYPE_PMEM:
+ case E820_TYPE_ACPI:
+ range->base = rs->start & PAGE_MASK;
+ size = (rs->end + 1) - rs->start;
+ range->page_count = size >> PAGE_SHIFT;
+ range_list->num_elements++;
+ break;
+ default:
+ break;
+ }
+
+ return 0;
+}
+
+static int __sev_snp_init_locked(int *error)
+{
+ struct psp_device *psp = psp_master;
+ struct sev_data_snp_init_ex data;
+ struct sev_device *sev;
+ int rc = 0;
+
+ if (!psp || !psp->sev_data)
+ return -ENODEV;
+
+ sev = psp->sev_data;
+
+ if (sev->snp_initialized)
+ return 0;
+
+ /*
+ * SNP_INIT requires MSR_VM_HSAVE_PA to be set to 0h
+ * across all cores.
+ */
+ on_each_cpu(snp_set_hsave_pa, NULL, 1);
+
+ /*
+ * Starting in SNP firmware v1.52, the SNP_INIT_EX command takes a list of
+ * system physical address ranges to convert into the HV-fixed page states
+ * during the RMP initialization. For instance, the memory that UEFI
+ * reserves should be included in the range list. This allows system
+ * components that occasionally write to memory (e.g. logging to UEFI
+ * reserved regions) to not fail due to RMP initialization and SNP enablement.
+ */
+ if (sev_version_greater_or_equal(SNP_MIN_API_MAJOR, 52)) {
+ /*
+ * Firmware checks that the pages containing the ranges enumerated
+ * in the RANGES structure are either in the Default page state or in the
+ * firmware page state.
+ */
+ snp_range_list = sev_fw_alloc(PAGE_SIZE);
+ if (!snp_range_list) {
+ dev_err(sev->dev,
+ "SEV: SNP_INIT_EX range list memory allocation failed\n");
+ return -ENOMEM;
+ }
+
+ memset(snp_range_list, 0, PAGE_SIZE);
+
+ /*
+ * Retrieve all reserved memory regions setup by UEFI from the e820 memory map
+ * to be setup as HV-fixed pages.
+ */
+
+ rc = walk_iomem_res_desc(IORES_DESC_NONE, IORESOURCE_MEM, 0, ~0,
+ snp_range_list, snp_filter_reserved_mem_regions);
+ if (rc) {
+ dev_err(sev->dev,
+ "SEV: SNP_INIT_EX walk_iomem_res_desc failed rc = %d\n", rc);
+ return rc;
+ }
+
+ memset(&data, 0, sizeof(data));
+ data.init_rmp = 1;
+ data.list_paddr_en = 1;
+ data.list_paddr = __pa(snp_range_list);
+
+ rc = __sev_do_cmd_locked(SEV_CMD_SNP_INIT_EX, &data, error);
+ if (rc)
+ return rc;
+ } else {
+ rc = __sev_do_cmd_locked(SEV_CMD_SNP_INIT, NULL, error);
+ if (rc)
+ return rc;
+ }
+
+ /* Prepare for first SNP guest launch after INIT */
+ wbinvd_on_all_cpus();
+ rc = __sev_do_cmd_locked(SEV_CMD_SNP_DF_FLUSH, NULL, error);
+ if (rc)
+ return rc;
+
+ sev->snp_initialized = true;
+ dev_dbg(sev->dev, "SEV-SNP firmware initialized\n");
+
+ return rc;
+}
+
+int sev_snp_init(int *error, bool init_on_probe)
+{
+ int rc;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return -ENODEV;
+
+ if (init_on_probe && !psp_init_on_probe)
+ return 0;
+
+ mutex_lock(&sev_cmd_mutex);
+ rc = __sev_snp_init_locked(error);
+ mutex_unlock(&sev_cmd_mutex);
+
+ return rc;
+}
+EXPORT_SYMBOL_GPL(sev_snp_init);
+
+static int __sev_snp_shutdown_locked(int *error)
+{
+ struct sev_device *sev = psp_master->sev_data;
+ struct sev_data_snp_shutdown_ex data;
+ int ret;
+
+ if (!sev->snp_initialized)
+ return 0;
+
+ memset(&data, 0, sizeof(data));
+ data.length = sizeof(data);
+ data.iommu_snp_shutdown = 1;
+
+ wbinvd_on_all_cpus();
+
+retry:
+ ret = __sev_do_cmd_locked(SEV_CMD_SNP_SHUTDOWN_EX, &data, error);
+ /* SHUTDOWN may require DF_FLUSH */
+ if (*error == SEV_RET_DFFLUSH_REQUIRED) {
+ ret = __sev_do_cmd_locked(SEV_CMD_SNP_DF_FLUSH, NULL, NULL);
+ if (ret) {
+ dev_err(sev->dev, "SEV-SNP DF_FLUSH failed\n");
+ return ret;
+ }
+ goto retry;
+ }
+ if (ret) {
+ dev_err(sev->dev, "SEV-SNP firmware shutdown failed\n");
+ return ret;
+ }
+
+ sev->snp_initialized = false;
+ dev_dbg(sev->dev, "SEV-SNP firmware shutdown\n");
+
+ return ret;
+}
+
+static int sev_snp_shutdown(int *error)
+{
+ int rc;
+
+ mutex_lock(&sev_cmd_mutex);
+ rc = __sev_snp_shutdown_locked(error);
+ mutex_unlock(&sev_cmd_mutex);
+
+ return rc;
+}
+
static int sev_ioctl_do_pek_import(struct sev_issue_cmd *argp, bool writable)
{
struct sev_device *sev = psp_master->sev_data;
@@ -1270,6 +1462,8 @@ int sev_dev_init(struct psp_device *psp)
static void sev_firmware_shutdown(struct sev_device *sev)
{
+ int error;
+
sev_platform_shutdown(NULL);
if (sev_es_tmr) {
@@ -1286,6 +1480,14 @@ static void sev_firmware_shutdown(struct sev_device *sev)
get_order(NV_LENGTH));
sev_init_ex_buffer = NULL;
}
+
+ if (snp_range_list) {
+ free_pages((unsigned long)snp_range_list,
+ get_order(PAGE_SIZE));
+ snp_range_list = NULL;
+ }
+
+ sev_snp_shutdown(&error);
}
void sev_dev_destroy(struct psp_device *psp)
@@ -1341,6 +1543,26 @@ void sev_pci_init(void)
}
}
+ /*
+ * If boot CPU supports SNP, then first attempt to initialize
+ * the SNP firmware.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SEV_SNP)) {
+ if (!sev_version_greater_or_equal(SNP_MIN_API_MAJOR, SNP_MIN_API_MINOR)) {
+ dev_err(sev->dev, "SEV-SNP support requires firmware version >= %d:%d\n",
+ SNP_MIN_API_MAJOR, SNP_MIN_API_MINOR);
+ } else {
+ rc = sev_snp_init(&error, true);
+ if (rc) {
+ /*
+ * Don't abort the probe if SNP INIT failed,
+ * continue to initialize the legacy SEV firmware.
+ */
+ dev_err(sev->dev, "SEV-SNP: failed to INIT error %#x\n", error);
+ }
+ }
+ }
+
/* Obtain the TMR memory area for SEV-ES use */
sev_es_tmr = sev_fw_alloc(SEV_ES_TMR_SIZE);
if (!sev_es_tmr)
@@ -1356,6 +1578,9 @@ void sev_pci_init(void)
dev_err(sev->dev, "SEV: failed to INIT error %#x, rc %d\n",
error, rc);
+ dev_info(sev->dev, "SEV%s API:%d.%d build:%d\n", sev->snp_initialized ?
+ "-SNP" : "", sev->api_major, sev->api_minor, sev->build);
+
return;
err:
diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
index 666c21eb81ab..34767657beb5 100644
--- a/drivers/crypto/ccp/sev-dev.h
+++ b/drivers/crypto/ccp/sev-dev.h
@@ -52,6 +52,8 @@ struct sev_device {
u8 build;
void *cmd_buf;
+
+ bool snp_initialized;
};
int sev_dev_init(struct psp_device *psp);
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index 31b045e1926f..8cfe92e82743 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -794,6 +794,21 @@ struct sev_data_snp_shutdown_ex {
*/
int sev_platform_init(int *error);
+/**
+ * sev_snp_init - perform SEV SNP_INIT command
+ *
+ * @error: SEV command return code
+ * @init_on_probe: indicates if called during module probe/init
+ *
+ * Returns:
+ * 0 if the SEV device successfully processed the command
+ * -%ENODEV if the SEV device is not available
+ * -%ENOTSUPP if the SEV device does not support SNP
+ * -%ETIMEDOUT if the SEV command timed out
+ * -%EIO if the SEV device returned a non-zero return code
+ */
+int sev_snp_init(int *error, bool init_on_probe);
+
/**
* sev_platform_status - perform SEV PLATFORM_STATUS command
*
@@ -901,6 +916,8 @@ sev_platform_status(struct sev_user_data_status *status, int *error) { return -E
static inline int sev_platform_init(int *error) { return -ENODEV; }
+static inline int sev_snp_init(int *error, bool init_on_probe) { return -ENODEV; }
+
static inline int
sev_guest_deactivate(struct sev_data_deactivate *data, int *error) { return -ENODEV; }
--
2.25.1
From: Brijesh Singh <[email protected]>
Make sev_do_cmd() a generic API interface that the hypervisor can use
to issue commands to manage SEV and SNP guests. The commands for SEV
and SNP are defined in the SEV and SEV-SNP firmware specifications.
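For example (an illustrative sketch only, not code added by this patch),
KVM could issue an SNP guest request along these lines:

    static int snp_guest_request_sketch(u64 gctx_paddr, u64 req_paddr, u64 res_paddr)
    {
            struct sev_data_snp_guest_request data = {
                    .gctx_paddr = gctx_paddr,
                    .req_paddr  = req_paddr,
                    .res_paddr  = res_paddr,
            };
            int fw_err;

            /* sev_do_cmd() serializes against other PSP commands internally */
            return sev_do_cmd(SEV_CMD_SNP_GUEST_REQUEST, &data, &fw_err);
    }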
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 3 ++-
include/linux/psp-sev.h | 17 +++++++++++++++++
2 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index af20420bd6c2..35f605936f1b 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -415,7 +415,7 @@ static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
return ret;
}
-static int sev_do_cmd(int cmd, void *data, int *psp_ret)
+int sev_do_cmd(int cmd, void *data, int *psp_ret)
{
int rc;
@@ -425,6 +425,7 @@ static int sev_do_cmd(int cmd, void *data, int *psp_ret)
return rc;
}
+EXPORT_SYMBOL_GPL(sev_do_cmd);
static int __sev_init_locked(int *error)
{
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index 8cfe92e82743..46f61e3ae33b 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -907,6 +907,20 @@ int sev_guest_df_flush(int *error);
*/
int sev_guest_decommission(struct sev_data_decommission *data, int *error);
+/**
+ * sev_do_cmd - issue an SEV or SEV-SNP firmware command
+ *
+ * @cmd: SEV command to issue
+ * @data: command buffer for the command
+ * @psp_ret: SEV command return code
+ *
+ * Returns:
+ * 0 if the SEV device successfully processed the command
+ * -%ENODEV if the SEV device is not available
+ * -%ENOTSUPP if the SEV device does not support the command
+ * -%ETIMEDOUT if the SEV command timed out
+ * -%EIO if the SEV device returned a non-zero return code
+ */
+int sev_do_cmd(int cmd, void *data, int *psp_ret);
+
void *psp_copy_user_blob(u64 uaddr, u32 len);
#else /* !CONFIG_CRYPTO_DEV_SP_PSP */
@@ -924,6 +938,9 @@ sev_guest_deactivate(struct sev_data_deactivate *data, int *error) { return -ENO
static inline int
sev_guest_decommission(struct sev_data_decommission *data, int *error) { return -ENODEV; }
+static inline int
+sev_do_cmd(int cmd, void *data, int *psp_ret) { return -ENODEV; }
+
static inline int
sev_guest_activate(struct sev_data_activate *data, int *error) { return -ENODEV; }
--
2.25.1
From: Brijesh Singh <[email protected]>
The behavior and requirements for the SEV-legacy commands are altered when
the SNP firmware is in the INIT state. See the SEV-SNP firmware specification
for more details.
Allocate the Trusted Memory Region (TMR) as a 2MB-sized/aligned region
when SNP is enabled to satisfy the new requirements for SNP. Continue
allocating a 1MB region for the !SNP configuration.
While at it, provide an API that others can use to allocate a page
that can be used by the firmware. The immediate user for this API will
be the KVM driver. The KVM driver needs to allocate a firmware context
page during guest creation, and that page must be updated
by the firmware. See the SEV-SNP specification for further details.
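For example, the expected usage from KVM would look roughly like the
following (an illustrative sketch only; the wrapper names are assumptions):

    /* Allocate a page the firmware may own, e.g. an SNP guest context page. */
    static void *snp_context_create_sketch(void)
    {
            void *context;

            context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
            if (!context)
                    return NULL;

            /* ... pass __pa(context) to SEV_CMD_SNP_GCTX_CREATE ... */

            return context;
    }

    static void snp_context_destroy_sketch(void *context)
    {
            /* Reclaims the page from the firmware state before freeing it. */
            snp_free_firmware_page(context);
    }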
Co-developed-by: Ashish Kalra <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 148 +++++++++++++++++++++++++++++++++--
include/linux/psp-sev.h | 9 +++
2 files changed, 149 insertions(+), 8 deletions(-)
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index eca4e59b0f44..4c12e98a1219 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -94,6 +94,13 @@ static void *sev_init_ex_buffer;
*/
struct sev_data_range_list *snp_range_list;
+/* When SEV-SNP is enabled the TMR needs to be 2MB-aligned and 2MB in size. */
+#define SEV_SNP_ES_TMR_SIZE (2 * 1024 * 1024)
+
+static size_t sev_es_tmr_size = SEV_ES_TMR_SIZE;
+
+static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret);
+
static inline bool sev_version_greater_or_equal(u8 maj, u8 min)
{
struct sev_device *sev = psp_master->sev_data;
@@ -216,11 +223,134 @@ void snp_mark_pages_offline(unsigned long pfn, unsigned int npages)
}
EXPORT_SYMBOL_GPL(snp_mark_pages_offline);
+static int snp_reclaim_pages(unsigned long paddr, unsigned int npages, bool locked)
+{
+ /* Cbit maybe set in the paddr */
+ unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT;
+ int ret, err, i, n = 0;
+
+ if (!pfn_valid(pfn)) {
+ pr_err("%s: Invalid PFN %lx\n", __func__, pfn);
+ return 0;
+ }
+
+ for (i = 0; i < npages; i++, pfn++, n++) {
+ paddr = pfn << PAGE_SHIFT;
+
+ if (locked)
+ ret = __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &paddr, &err);
+ else
+ ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &paddr, &err);
+
+ if (ret)
+ goto cleanup;
+
+ ret = rmp_make_shared(pfn, PG_LEVEL_4K);
+ if (ret)
+ goto cleanup;
+ }
+
+ return 0;
+
+cleanup:
+ /*
+ * If the page cannot be reclaimed, it is no longer safe to
+ * release it back to the system; leak it.
+ */
+ snp_mark_pages_offline(pfn, npages - n);
+ return ret;
+}
+
+static int rmp_mark_pages_firmware(unsigned long paddr, unsigned int npages, bool locked)
+{
+ /* Cbit maybe set in the paddr */
+ unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT;
+ int rc, n = 0, i;
+
+ for (i = 0; i < npages; i++, n++, pfn++) {
+ rc = rmp_make_private(pfn, 0, PG_LEVEL_4K, 0, true);
+ if (rc)
+ goto cleanup;
+ }
+
+ return 0;
+
+cleanup:
+ /*
+ * Try to roll back the firmware state changes by
+ * reclaiming the pages which were already transitioned to the
+ * firmware state.
+ */
+ snp_reclaim_pages(paddr, n, locked);
+
+ return rc;
+}
+
+static struct page *__snp_alloc_firmware_pages(gfp_t gfp_mask, int order, bool locked)
+{
+ unsigned long npages = 1ul << order, paddr;
+ struct sev_device *sev;
+ struct page *page;
+
+ if (!psp_master || !psp_master->sev_data)
+ return NULL;
+
+ page = alloc_pages(gfp_mask, order);
+ if (!page)
+ return NULL;
+
+ /* If SEV-SNP is initialized then add the page to the RMP table. */
+ sev = psp_master->sev_data;
+ if (!sev->snp_initialized)
+ return page;
+
+ paddr = __pa((unsigned long)page_address(page));
+ if (rmp_mark_pages_firmware(paddr, npages, locked))
+ return NULL;
+
+ return page;
+}
+
+void *snp_alloc_firmware_page(gfp_t gfp_mask)
+{
+ struct page *page;
+
+ page = __snp_alloc_firmware_pages(gfp_mask, 0, false);
+
+ return page ? page_address(page) : NULL;
+}
+EXPORT_SYMBOL_GPL(snp_alloc_firmware_page);
+
+static void __snp_free_firmware_pages(struct page *page, int order, bool locked)
+{
+ struct sev_device *sev = psp_master->sev_data;
+ unsigned long paddr, npages = 1ul << order;
+
+ if (!page)
+ return;
+
+ paddr = __pa((unsigned long)page_address(page));
+ if (sev->snp_initialized &&
+ snp_reclaim_pages(paddr, npages, locked))
+ return;
+
+ __free_pages(page, order);
+}
+
+void snp_free_firmware_page(void *addr)
+{
+ if (!addr)
+ return;
+
+ __snp_free_firmware_pages(virt_to_page(addr), 0, false);
+}
+EXPORT_SYMBOL_GPL(snp_free_firmware_page);
+
static void *sev_fw_alloc(unsigned long len)
{
struct page *page;
- page = alloc_pages(GFP_KERNEL, get_order(len));
+ page = __snp_alloc_firmware_pages(GFP_KERNEL, get_order(len), false);
if (!page)
return NULL;
@@ -468,7 +598,7 @@ static int __sev_init_locked(int *error)
data.tmr_address = __pa(sev_es_tmr);
data.flags |= SEV_INIT_FLAGS_SEV_ES;
- data.tmr_len = SEV_ES_TMR_SIZE;
+ data.tmr_len = sev_es_tmr_size;
}
return __sev_do_cmd_locked(SEV_CMD_INIT, &data, error);
@@ -491,7 +621,7 @@ static int __sev_init_ex_locked(int *error)
data.tmr_address = __pa(sev_es_tmr);
data.flags |= SEV_INIT_FLAGS_SEV_ES;
- data.tmr_len = SEV_ES_TMR_SIZE;
+ data.tmr_len = sev_es_tmr_size;
}
return __sev_do_cmd_locked(SEV_CMD_INIT_EX, &data, error);
@@ -982,6 +1112,8 @@ static int __sev_snp_init_locked(int *error)
sev->snp_initialized = true;
dev_dbg(sev->dev, "SEV-SNP firmware initialized\n");
+ sev_es_tmr_size = SEV_SNP_ES_TMR_SIZE;
+
return rc;
}
@@ -1499,8 +1631,9 @@ static void sev_firmware_shutdown(struct sev_device *sev)
/* The TMR area was encrypted, flush it from the cache */
wbinvd_on_all_cpus();
- free_pages((unsigned long)sev_es_tmr,
- get_order(SEV_ES_TMR_SIZE));
+ __snp_free_firmware_pages(virt_to_page(sev_es_tmr),
+ get_order(sev_es_tmr_size),
+ false);
sev_es_tmr = NULL;
}
@@ -1511,8 +1644,7 @@ static void sev_firmware_shutdown(struct sev_device *sev)
}
if (snp_range_list) {
- free_pages((unsigned long)snp_range_list,
- get_order(PAGE_SIZE));
+ snp_free_firmware_page(snp_range_list);
snp_range_list = NULL;
}
@@ -1593,7 +1725,7 @@ void sev_pci_init(void)
}
/* Obtain the TMR memory area for SEV-ES use */
- sev_es_tmr = sev_fw_alloc(SEV_ES_TMR_SIZE);
+ sev_es_tmr = sev_fw_alloc(sev_es_tmr_size);
if (!sev_es_tmr)
dev_warn(sev->dev,
"SEV: TMR allocation failed, SEV-ES support unavailable\n");
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index 8edf5c548fbf..d19744807471 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -922,6 +922,8 @@ int sev_guest_decommission(struct sev_data_decommission *data, int *error);
int sev_do_cmd(int cmd, void *data, int *psp_ret);
void *psp_copy_user_blob(u64 uaddr, u32 len);
+void *snp_alloc_firmware_page(gfp_t mask);
+void snp_free_firmware_page(void *addr);
/**
* sev_mark_pages_offline - insert non-reclaimed firmware/guest pages
@@ -959,6 +961,13 @@ static inline void *psp_copy_user_blob(u64 __user uaddr, u32 len) { return ERR_P
static inline void snp_mark_pages_offline(unsigned long pfn, unsigned int npages) {}
+static inline void *snp_alloc_firmware_page(gfp_t mask)
+{
+ return NULL;
+}
+
+static inline void snp_free_firmware_page(void *addr) { }
+
#endif /* CONFIG_CRYPTO_DEV_SP_PSP */
#endif /* __PSP_SEV_H__ */
--
2.25.1
From: Ashish Kalra <[email protected]>
Pages are unsafe to release back to the page allocator if they
have been transitioned to firmware/guest state and can't be reclaimed
or transitioned back to hypervisor/shared state. In this case, add
them to an internal leaked-pages list to ensure that they are neither
freed nor touched/accessed, which would cause fatal page faults.
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 28 ++++++++++++++++++++++++++++
include/linux/psp-sev.h | 8 ++++++++
2 files changed, 36 insertions(+)
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 35f605936f1b..eca4e59b0f44 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -42,6 +42,12 @@
static DEFINE_MUTEX(sev_cmd_mutex);
static struct sev_misc_dev *misc_dev;
+/* list of pages which are leaked and cannot be reclaimed */
+static LIST_HEAD(snp_leaked_pages_list);
+static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
+
+static atomic_long_t snp_nr_leaked_pages = ATOMIC_LONG_INIT(0);
+
static int psp_cmd_timeout = 100;
module_param(psp_cmd_timeout, int, 0644);
MODULE_PARM_DESC(psp_cmd_timeout, " default timeout value, in seconds, for PSP commands");
@@ -188,6 +194,28 @@ static int sev_cmd_buffer_len(int cmd)
return 0;
}
+void snp_mark_pages_offline(unsigned long pfn, unsigned int npages)
+{
+ struct page *page;
+
+ WARN(1, "psc failed, pfn 0x%lx pages %d (marked offline)\n", pfn, npages);
+
+ spin_lock(&snp_leaked_pages_list_lock);
+ while (npages--) {
+ page = pfn_to_page(pfn);
+ /*
+ * Reuse the page's buddy list for chaining into the leaked
+ * pages list. This page should not be on a free list currently
+ * and is also unsafe to be added to a free list.
+ */
+ list_add_tail(&page->buddy_list, &snp_leaked_pages_list);
+ sev_dump_rmpentry(pfn);
+ pfn++;
+ atomic_long_inc(&snp_nr_leaked_pages);
+ }
+ spin_unlock(&snp_leaked_pages_list_lock);
+}
+EXPORT_SYMBOL_GPL(snp_mark_pages_offline);
+
static void *sev_fw_alloc(unsigned long len)
{
struct page *page;
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index 46f61e3ae33b..8edf5c548fbf 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -923,6 +923,12 @@ int sev_do_cmd(int cmd, void *data, int *psp_ret);
void *psp_copy_user_blob(u64 uaddr, u32 len);
+/**
+ * snp_mark_pages_offline - insert non-reclaimed firmware/guest pages
+ * into a leaked pages list.
+ */
+void snp_mark_pages_offline(unsigned long pfn, unsigned int npages);
+
#else /* !CONFIG_CRYPTO_DEV_SP_PSP */
static inline int
@@ -951,6 +957,8 @@ sev_issue_cmd_external_user(struct file *filep, unsigned int id, void *data, int
static inline void *psp_copy_user_blob(u64 __user uaddr, u32 len) { return ERR_PTR(-EINVAL); }
+static inline void snp_mark_pages_offline(unsigned long pfn, unsigned int npages) {}
+
#endif /* CONFIG_CRYPTO_DEV_SP_PSP */
#endif /* __PSP_SEV_H__ */
--
2.25.1
From: Brijesh Singh <[email protected]>
The behavior of the SEV-legacy commands is altered when the SNP firmware
is in the INIT state. When SNP is in the INIT state, the memory that the
firmware writes to on behalf of an SEV-legacy command must be in the
firmware state before the command is issued.
A command buffer may contain a system physical address that the firmware
may write to. There are two cases that need to be handled:
1) the system physical address points to guest memory
2) the system physical address points to host memory
To handle case #1, change the page state to firmware in the RMP
table before issuing the command and restore the state to shared after the
command completes.
For case #2, use a bounce buffer to complete the request.
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 371 ++++++++++++++++++++++++++++++++++-
drivers/crypto/ccp/sev-dev.h | 12 ++
2 files changed, 373 insertions(+), 10 deletions(-)
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 4c12e98a1219..fd8893af6ed7 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -27,6 +27,7 @@
#include <asm/smp.h>
#include <asm/e820/types.h>
+#include <asm/sev.h>
#include "psp-dev.h"
#include "sev-dev.h"
@@ -286,6 +287,30 @@ static int rmp_mark_pages_firmware(unsigned long paddr, unsigned int npages, boo
return rc;
}
+static int rmp_mark_pages_shared(unsigned long paddr, unsigned int npages)
+{
+ /* The C-bit may be set in the paddr */
+ unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT;
+ int rc, n = 0, i;
+
+ for (i = 0; i < npages; i++, pfn++, n++) {
+ rc = rmp_make_shared(pfn, PG_LEVEL_4K);
+ if (rc)
+ goto cleanup;
+ }
+
+ return 0;
+
+cleanup:
+ /*
+ * If the page state cannot be changed back to shared, it is not safe
+ * to release the page back to the system; leak it.
+ */
+ snp_mark_pages_offline(pfn, npages - n);
+
+ return rc;
+}
+
static struct page *__snp_alloc_firmware_pages(gfp_t gfp_mask, int order, bool locked)
{
unsigned long npages = 1ul << order, paddr;
@@ -487,12 +512,295 @@ static int sev_write_init_ex_file_if_required(int cmd_id)
return sev_write_init_ex_file();
}
+static int alloc_snp_host_map(struct sev_device *sev)
+{
+ struct page *page;
+ int i;
+
+ for (i = 0; i < MAX_SNP_HOST_MAP_BUFS; i++) {
+ struct snp_host_map *map = &sev->snp_host_map[i];
+
+ memset(map, 0, sizeof(*map));
+
+ page = alloc_pages(GFP_KERNEL_ACCOUNT, get_order(SEV_FW_BLOB_MAX_SIZE));
+ if (!page)
+ return -ENOMEM;
+
+ map->host = page_address(page);
+ }
+
+ return 0;
+}
+
+static void free_snp_host_map(struct sev_device *sev)
+{
+ int i;
+
+ for (i = 0; i < MAX_SNP_HOST_MAP_BUFS; i++) {
+ struct snp_host_map *map = &sev->snp_host_map[i];
+
+ if (map->host) {
+ __free_pages(virt_to_page(map->host), get_order(SEV_FW_BLOB_MAX_SIZE));
+ memset(map, 0, sizeof(*map));
+ }
+ }
+}
+
+static int map_firmware_writeable(u64 *paddr, u32 len, bool guest, struct snp_host_map *map)
+{
+ unsigned int npages = PAGE_ALIGN(len) >> PAGE_SHIFT;
+
+ map->active = false;
+
+ if (!paddr || !len)
+ return 0;
+
+ map->paddr = *paddr;
+ map->len = len;
+
+ /* If paddr points to guest memory then change the page state to firmware. */
+ if (guest) {
+ if (rmp_mark_pages_firmware(*paddr, npages, true))
+ return -EFAULT;
+
+ goto done;
+ }
+
+ if (!map->host)
+ return -ENOMEM;
+
+ /* Check if the pre-allocated buffer can be used to fulfill the request. */
+ if (len > SEV_FW_BLOB_MAX_SIZE)
+ return -EINVAL;
+
+ /* Transition the pre-allocated buffer to the firmware state. */
+ if (rmp_mark_pages_firmware(__pa(map->host), npages, true))
+ return -EFAULT;
+
+ /* Set the paddr to use pre-allocated firmware buffer */
+ *paddr = __psp_pa(map->host);
+
+done:
+ map->active = true;
+ return 0;
+}
+
+static int unmap_firmware_writeable(u64 *paddr, u32 len, bool guest, struct snp_host_map *map)
+{
+ unsigned int npages = PAGE_ALIGN(len) >> PAGE_SHIFT;
+
+ if (!map->active)
+ return 0;
+
+ /* If paddr points to guest memory then restore the page state to hypervisor/shared. */
+ if (guest) {
+ if (snp_reclaim_pages(*paddr, npages, true))
+ return -EFAULT;
+
+ goto done;
+ }
+
+ /*
+ * Transition the pre-allocated buffer to hypervisor state before the access.
+ *
+ * This is because while changing the page state to firmware, the kernel unmaps
+ * the pages from the direct map, and to restore the direct map the pages must
+ * be transitioned back to the shared state.
+ */
+ if (snp_reclaim_pages(__pa(map->host), npages, true))
+ return -EFAULT;
+
+ /* Copy the response data from the firmware buffer to the caller's buffer. */
+ memcpy(__va(__sme_clr(map->paddr)), map->host, min_t(size_t, len, map->len));
+ *paddr = map->paddr;
+
+done:
+ map->active = false;
+ return 0;
+}
+
+static bool sev_legacy_cmd_buf_writable(int cmd)
+{
+ switch (cmd) {
+ case SEV_CMD_PLATFORM_STATUS:
+ case SEV_CMD_GUEST_STATUS:
+ case SEV_CMD_LAUNCH_START:
+ case SEV_CMD_RECEIVE_START:
+ case SEV_CMD_LAUNCH_MEASURE:
+ case SEV_CMD_SEND_START:
+ case SEV_CMD_SEND_UPDATE_DATA:
+ case SEV_CMD_SEND_UPDATE_VMSA:
+ case SEV_CMD_PEK_CSR:
+ case SEV_CMD_PDH_CERT_EXPORT:
+ case SEV_CMD_GET_ID:
+ case SEV_CMD_ATTESTATION_REPORT:
+ return true;
+ default:
+ return false;
+ }
+}
+
+#define prep_buffer(name, addr, len, guest, map) \
+ func(&((typeof(name *))cmd_buf)->addr, ((typeof(name *))cmd_buf)->len, guest, map)
+
+static int __snp_cmd_buf_copy(int cmd, void *cmd_buf, bool to_fw, int fw_err)
+{
+ int (*func)(u64 *paddr, u32 len, bool guest, struct snp_host_map *map);
+ struct sev_device *sev = psp_master->sev_data;
+ bool from_fw = !to_fw;
+
+ /*
+ * After the command is completed, change the command buffer memory to
+ * hypervisor state.
+ *
+ * The immutable bit is automatically cleared by the firmware, so
+ * there is no need to reclaim the page.
+ */
+ if (from_fw && sev_legacy_cmd_buf_writable(cmd)) {
+ if (rmp_mark_pages_shared(__pa(cmd_buf), 1))
+ return -EFAULT;
+
+ /* No need to go further if firmware failed to execute command. */
+ if (fw_err)
+ return 0;
+ }
+
+ if (to_fw)
+ func = map_firmware_writeable;
+ else
+ func = unmap_firmware_writeable;
+
+ /*
+ * A command buffer may contain a system physical address. If the address
+ * points to host memory then use an intermediate firmware page, otherwise
+ * change the page state in the RMP table.
+ */
+ switch (cmd) {
+ case SEV_CMD_PDH_CERT_EXPORT:
+ if (prep_buffer(struct sev_data_pdh_cert_export, pdh_cert_address,
+ pdh_cert_len, false, &sev->snp_host_map[0]))
+ goto err;
+ if (prep_buffer(struct sev_data_pdh_cert_export, cert_chain_address,
+ cert_chain_len, false, &sev->snp_host_map[1]))
+ goto err;
+ break;
+ case SEV_CMD_GET_ID:
+ if (prep_buffer(struct sev_data_get_id, address, len,
+ false, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_PEK_CSR:
+ if (prep_buffer(struct sev_data_pek_csr, address, len,
+ false, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_LAUNCH_UPDATE_DATA:
+ if (prep_buffer(struct sev_data_launch_update_data, address, len,
+ true, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_LAUNCH_UPDATE_VMSA:
+ if (prep_buffer(struct sev_data_launch_update_vmsa, address, len,
+ true, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_LAUNCH_MEASURE:
+ if (prep_buffer(struct sev_data_launch_measure, address, len,
+ false, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_LAUNCH_UPDATE_SECRET:
+ if (prep_buffer(struct sev_data_launch_secret, guest_address, guest_len,
+ true, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_DBG_DECRYPT:
+ if (prep_buffer(struct sev_data_dbg, dst_addr, len, false,
+ &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_DBG_ENCRYPT:
+ if (prep_buffer(struct sev_data_dbg, dst_addr, len, true,
+ &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_ATTESTATION_REPORT:
+ if (prep_buffer(struct sev_data_attestation_report, address, len,
+ false, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_SEND_START:
+ if (prep_buffer(struct sev_data_send_start, session_address,
+ session_len, false, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_SEND_UPDATE_DATA:
+ if (prep_buffer(struct sev_data_send_update_data, hdr_address, hdr_len,
+ false, &sev->snp_host_map[0]))
+ goto err;
+ if (prep_buffer(struct sev_data_send_update_data, trans_address,
+ trans_len, false, &sev->snp_host_map[1]))
+ goto err;
+ break;
+ case SEV_CMD_SEND_UPDATE_VMSA:
+ if (prep_buffer(struct sev_data_send_update_vmsa, hdr_address, hdr_len,
+ false, &sev->snp_host_map[0]))
+ goto err;
+ if (prep_buffer(struct sev_data_send_update_vmsa, trans_address,
+ trans_len, false, &sev->snp_host_map[1]))
+ goto err;
+ break;
+ case SEV_CMD_RECEIVE_UPDATE_DATA:
+ if (prep_buffer(struct sev_data_receive_update_data, guest_address,
+ guest_len, true, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_RECEIVE_UPDATE_VMSA:
+ if (prep_buffer(struct sev_data_receive_update_vmsa, guest_address,
+ guest_len, true, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ default:
+ break;
+ }
+
+ /* The command buffer needs to be in the firmware state. */
+ if (to_fw && sev_legacy_cmd_buf_writable(cmd)) {
+ if (rmp_mark_pages_firmware(__pa(cmd_buf), 1, true))
+ return -EFAULT;
+ }
+
+ return 0;
+
+err:
+ return -EINVAL;
+}
+
+static inline bool need_firmware_copy(int cmd)
+{
+ struct sev_device *sev = psp_master->sev_data;
+
+ /* After SNP is INIT'ed, the behavior of the legacy SEV commands is changed. */
+ return cmd < SEV_CMD_SNP_INIT && sev->snp_initialized;
+}
+
+static int snp_aware_copy_to_firmware(int cmd, void *data)
+{
+ return __snp_cmd_buf_copy(cmd, data, true, 0);
+}
+
+static int snp_aware_copy_from_firmware(int cmd, void *data, int fw_err)
+{
+ return __snp_cmd_buf_copy(cmd, data, false, fw_err);
+}
+
static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
{
struct psp_device *psp = psp_master;
struct sev_device *sev;
unsigned int phys_lsb, phys_msb;
unsigned int reg, ret = 0;
+ void *cmd_buf;
int buf_len;
if (!psp || !psp->sev_data)
@@ -512,12 +820,28 @@ static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
* work for some memory, e.g. vmalloc'd addresses, and @data may not be
* physically contiguous.
*/
- if (data)
- memcpy(sev->cmd_buf, data, buf_len);
+ if (data) {
+ if (sev->cmd_buf_active > 2)
+ return -EBUSY;
+
+ cmd_buf = sev->cmd_buf_active ? sev->cmd_buf_backup : sev->cmd_buf;
+
+ memcpy(cmd_buf, data, buf_len);
+ sev->cmd_buf_active++;
+
+ /*
+ * The behavior of the SEV-legacy commands is altered when the
+ * SNP firmware is in the INIT state.
+ */
+ if (need_firmware_copy(cmd) && snp_aware_copy_to_firmware(cmd, cmd_buf))
+ return -EFAULT;
+ } else {
+ cmd_buf = sev->cmd_buf;
+ }
/* Get the physical address of the command buffer */
- phys_lsb = data ? lower_32_bits(__psp_pa(sev->cmd_buf)) : 0;
- phys_msb = data ? upper_32_bits(__psp_pa(sev->cmd_buf)) : 0;
+ phys_lsb = data ? lower_32_bits(__psp_pa(cmd_buf)) : 0;
+ phys_msb = data ? upper_32_bits(__psp_pa(cmd_buf)) : 0;
dev_dbg(sev->dev, "sev command id %#x buffer 0x%08x%08x timeout %us\n",
cmd, phys_msb, phys_lsb, psp_timeout);
@@ -560,15 +884,24 @@ static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
ret = sev_write_init_ex_file_if_required(cmd);
}
- print_hex_dump_debug("(out): ", DUMP_PREFIX_OFFSET, 16, 2, data,
- buf_len, false);
-
/*
* Copy potential output from the PSP back to data. Do this even on
* failure in case the caller wants to glean something from the error.
*/
- if (data)
- memcpy(data, sev->cmd_buf, buf_len);
+ if (data) {
+ /*
+ * Restore the page state after the command completes.
+ */
+ if (need_firmware_copy(cmd) &&
+ snp_aware_copy_from_firmware(cmd, cmd_buf, ret))
+ return -EFAULT;
+
+ memcpy(data, cmd_buf, buf_len);
+ sev->cmd_buf_active--;
+ }
+
+ print_hex_dump_debug("(out): ", DUMP_PREFIX_OFFSET, 16, 2, data,
+ buf_len, false);
return ret;
}
@@ -1579,10 +1912,12 @@ int sev_dev_init(struct psp_device *psp)
if (!sev)
goto e_err;
- sev->cmd_buf = (void *)devm_get_free_pages(dev, GFP_KERNEL, 0);
+ sev->cmd_buf = (void *)devm_get_free_pages(dev, GFP_KERNEL, 1);
if (!sev->cmd_buf)
goto e_sev;
+ sev->cmd_buf_backup = (uint8_t *)sev->cmd_buf + PAGE_SIZE;
+
psp->sev_data = sev;
sev->dev = dev;
@@ -1648,6 +1983,12 @@ static void sev_firmware_shutdown(struct sev_device *sev)
snp_range_list = NULL;
}
+ /*
+ * The host map needs to clear the immutable bit, so it must be freed before the
+ * SNP firmware shutdown.
+ */
+ free_snp_host_map(sev);
+
sev_snp_shutdown(&error);
}
@@ -1722,6 +2063,14 @@ void sev_pci_init(void)
dev_err(sev->dev, "SEV-SNP: failed to INIT error %#x\n", error);
}
}
+
+ /*
+ * Allocate the intermediate buffers used for the legacy command handling.
+ */
+ if (alloc_snp_host_map(sev)) {
+ dev_notice(sev->dev, "Failed to alloc host map (disabling legacy SEV)\n");
+ goto skip_legacy;
+ }
}
/* Obtain the TMR memory area for SEV-ES use */
@@ -1739,12 +2088,14 @@ void sev_pci_init(void)
dev_err(sev->dev, "SEV: failed to INIT error %#x, rc %d\n",
error, rc);
+skip_legacy:
dev_info(sev->dev, "SEV%s API:%d.%d build:%d\n", sev->snp_initialized ?
"-SNP" : "", sev->api_major, sev->api_minor, sev->build);
return;
err:
+ free_snp_host_map(sev);
psp_master->sev_data = NULL;
}
diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
index 34767657beb5..19d79f9d4212 100644
--- a/drivers/crypto/ccp/sev-dev.h
+++ b/drivers/crypto/ccp/sev-dev.h
@@ -29,11 +29,20 @@
#define SEV_CMDRESP_CMD_SHIFT 16
#define SEV_CMDRESP_IOC BIT(0)
+#define MAX_SNP_HOST_MAP_BUFS 2
+
struct sev_misc_dev {
struct kref refcount;
struct miscdevice misc;
};
+struct snp_host_map {
+ u64 paddr;
+ u32 len;
+ void *host;
+ bool active;
+};
+
struct sev_device {
struct device *dev;
struct psp_device *psp;
@@ -52,8 +61,11 @@ struct sev_device {
u8 build;
void *cmd_buf;
+ void *cmd_buf_backup;
+ int cmd_buf_active;
bool snp_initialized;
+ struct snp_host_map snp_host_map[MAX_SNP_HOST_MAP_BUFS];
};
int sev_dev_init(struct psp_device *psp);
--
2.25.1
From: Brijesh Singh <[email protected]>
The command can be used by userspace to query the SNP platform status
report. See the SEV-SNP spec for more details.
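A rough userspace sketch of the expected usage (illustrative only, with
error handling trimmed):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/psp-sev.h>

    int main(void)
    {
            struct sev_user_data_snp_status status = {};
            struct sev_issue_cmd cmd = {
                    .cmd  = SNP_PLATFORM_STATUS,
                    .data = (unsigned long)&status,
            };
            int fd = open("/dev/sev", O_RDWR);

            if (fd < 0)
                    return 1;

            if (ioctl(fd, SEV_ISSUE_CMD, &cmd)) {
                    fprintf(stderr, "SNP_PLATFORM_STATUS failed, fw error %#x\n", cmd.error);
                    return 1;
            }

            printf("SNP API %u.%u build %u, guests: %u\n",
                   status.api_major, status.api_minor, status.build_id,
                   status.guest_count);
            return 0;
    }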
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
Documentation/virt/coco/sev-guest.rst | 27 ++++++++++++++++
drivers/crypto/ccp/sev-dev.c | 45 +++++++++++++++++++++++++++
include/uapi/linux/psp-sev.h | 1 +
3 files changed, 73 insertions(+)
diff --git a/Documentation/virt/coco/sev-guest.rst b/Documentation/virt/coco/sev-guest.rst
index bf593e88cfd9..11ea67c944df 100644
--- a/Documentation/virt/coco/sev-guest.rst
+++ b/Documentation/virt/coco/sev-guest.rst
@@ -61,6 +61,22 @@ counter (e.g. counter overflow), then -EIO will be returned.
__u64 fw_err;
};
+The host ioctls should be issued on the /dev/sev device. The ioctl accepts a
+command ID and a command input structure.
+
+::
+ struct sev_issue_cmd {
+ /* Command ID */
+ __u32 cmd;
+
+ /* Command request structure */
+ __u64 data;
+
+ /* firmware error code on failure (see psp-sev.h) */
+ __u32 error;
+ };
+
+
2.1 SNP_GET_REPORT
------------------
@@ -118,6 +134,17 @@ be updated with the expected value.
See GHCB specification for further detail on how to parse the certificate blob.
+2.4 SNP_PLATFORM_STATUS
+-----------------------
+:Technology: sev-snp
+:Type: hypervisor ioctl cmd
+:Parameters (out): struct sev_user_data_snp_status
+:Returns (out): 0 on success, -negative on error
+
+The SNP_PLATFORM_STATUS command is used to query the SNP platform status. The
+status includes API major, minor version and more. See the SEV-SNP
+specification for further details.
+
3. SEV-SNP CPUID Enforcement
============================
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index fd8893af6ed7..65e13a562f3b 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1751,6 +1751,48 @@ static int sev_ioctl_do_pdh_export(struct sev_issue_cmd *argp, bool writable)
return ret;
}
+static int sev_ioctl_snp_platform_status(struct sev_issue_cmd *argp)
+{
+ struct sev_device *sev = psp_master->sev_data;
+ struct sev_data_snp_addr buf;
+ struct page *status_page;
+ void *data;
+ int ret;
+
+ if (!sev->snp_initialized || !argp->data)
+ return -EINVAL;
+
+ status_page = alloc_page(GFP_KERNEL_ACCOUNT);
+ if (!status_page)
+ return -ENOMEM;
+
+ data = page_address(status_page);
+ if (rmp_mark_pages_firmware(__pa(data), 1, true)) {
+ __free_pages(status_page, 0);
+ return -EFAULT;
+ }
+
+ buf.gctx_paddr = __psp_pa(data);
+ ret = __sev_do_cmd_locked(SEV_CMD_SNP_PLATFORM_STATUS, &buf, &argp->error);
+
+ /* Change the page state before accessing it */
+ if (snp_reclaim_pages(__pa(data), 1, true)) {
+ snp_mark_pages_offline(__pa(data) >> PAGE_SHIFT, 1);
+ return -EFAULT;
+ }
+
+ if (ret)
+ goto cleanup;
+
+ if (copy_to_user((void __user *)argp->data, data,
+ sizeof(struct sev_user_data_snp_status)))
+ ret = -EFAULT;
+
+cleanup:
+ __free_pages(status_page, 0);
+ return ret;
+}
+
static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
{
void __user *argp = (void __user *)arg;
@@ -1802,6 +1844,9 @@ static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
case SEV_GET_ID2:
ret = sev_ioctl_do_get_id2(&input);
break;
+ case SNP_PLATFORM_STATUS:
+ ret = sev_ioctl_snp_platform_status(&input);
+ break;
default:
ret = -EINVAL;
goto out;
diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
index c66f7c372645..5adfaea7df97 100644
--- a/include/uapi/linux/psp-sev.h
+++ b/include/uapi/linux/psp-sev.h
@@ -28,6 +28,7 @@ enum {
SEV_PEK_CERT_IMPORT,
SEV_GET_ID, /* This command is deprecated, use SEV_GET_ID2 */
SEV_GET_ID2,
+ SNP_PLATFORM_STATUS,
SEV_MAX,
};
--
2.25.1
From: Brijesh Singh <[email protected]>
The SEV-SNP firmware provides the SNP_CONFIG command used to set the
system-wide configuration value for SNP guests. The information includes
the TCB version string to be reported in guest attestation reports.
Version 2 of the GHCB specification adds an NAE (SNP extended guest
request) that a guest can use to query the reports that include additional
certificates.
In both cases, userspace-provided additional data is included in the
attestation reports. Userspace will use the SNP_SET_EXT_CONFIG command to
provide the certificate blob and the reported TCB version string at once.
Note that the specification defines the certificate blob with a specific
GUID format; userspace is responsible for building the proper certificate
blob. The ioctl treats it as an opaque blob.
While it is not defined in the spec, also add an SNP_GET_EXT_CONFIG command
that can be used to obtain the data programmed through
SNP_SET_EXT_CONFIG.
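
As a rough illustration (not part of this patch), a userspace agent could
drive the new command through the /dev/sev device along the lines of the
sketch below; it assumes the SEV_ISSUE_CMD ioctl and the struct layouts
from include/uapi/linux/psp-sev.h as extended by this series, and trims
error handling down to a message:

        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <linux/psp-sev.h>

        /* Minimal sketch: program a new reported TCB and certificate blob. */
        static int snp_set_ext_config(int sev_fd, void *certs, __u32 certs_len,
                                      struct sev_user_data_snp_config *config)
        {
                struct sev_user_data_ext_snp_config ext = {
                        .config_address = (__u64)(unsigned long)config,
                        .certs_address  = (__u64)(unsigned long)certs,
                        .certs_len      = certs_len,    /* must be page aligned */
                };
                struct sev_issue_cmd cmd = {
                        .cmd  = SNP_SET_EXT_CONFIG,
                        .data = (__u64)(unsigned long)&ext,
                };
                int ret = ioctl(sev_fd, SEV_ISSUE_CMD, &cmd);

                if (ret)
                        fprintf(stderr, "SNP_SET_EXT_CONFIG failed, fw error %#x\n",
                                cmd.error);
                return ret;
        }

Here sev_fd is an open file descriptor for /dev/sev; passing certs as NULL
with certs_len of 0 would drop the previously cached certificate blob, per
the semantics described above.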
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
Documentation/virt/coco/sev-guest.rst | 27 ++++++
drivers/crypto/ccp/sev-dev.c | 123 ++++++++++++++++++++++++++
drivers/crypto/ccp/sev-dev.h | 4 +
include/uapi/linux/psp-sev.h | 17 ++++
4 files changed, 171 insertions(+)
diff --git a/Documentation/virt/coco/sev-guest.rst b/Documentation/virt/coco/sev-guest.rst
index 11ea67c944df..6cad4226c348 100644
--- a/Documentation/virt/coco/sev-guest.rst
+++ b/Documentation/virt/coco/sev-guest.rst
@@ -145,6 +145,33 @@ The SNP_PLATFORM_STATUS command is used to query the SNP platform status. The
status includes API major, minor version and more. See the SEV-SNP
specification for further details.
+2.5 SNP_SET_EXT_CONFIG
+----------------------
+:Technology: sev-snp
+:Type: hypervisor ioctl cmd
+:Parameters (in): struct sev_data_snp_ext_config
+:Returns (out): 0 on success, -negative on error
+
+The SNP_SET_EXT_CONFIG command is used to set the system-wide configuration,
+such as the reported TCB version in the attestation report. The command is
+similar to the SNP_CONFIG command defined in the SEV-SNP spec. The main
+difference is that this command also accepts an additional certificate blob
+defined in the GHCB specification.
+
+If certs_address is zero, then the previous certificate blob will be deleted.
+For more information on the certificate blob layout, see the GHCB spec
+(extended guest request message).
+
+2.6 SNP_GET_EXT_CONFIG
+----------------------
+:Technology: sev-snp
+:Type: hypervisor ioctl cmd
+:Parameters (in): struct sev_data_snp_ext_config
+:Returns (out): 0 on success, -negative on error
+
+The SNP_GET_EXT_CONFIG command is used to query the system-wide configuration
+set through SNP_SET_EXT_CONFIG.
+
3. SEV-SNP CPUID Enforcement
============================
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 65e13a562f3b..b56b00ca2cd4 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1481,6 +1481,10 @@ static int __sev_snp_shutdown_locked(int *error)
data.length = sizeof(data);
data.iommu_snp_shutdown = 1;
+ /* Free the memory used for caching the certificate data */
+ kfree(sev->snp_certs_data);
+ sev->snp_certs_data = NULL;
+
wbinvd_on_all_cpus();
retry:
@@ -1793,6 +1797,118 @@ static int sev_ioctl_snp_platform_status(struct sev_issue_cmd *argp)
return ret;
}
+static int sev_ioctl_snp_get_config(struct sev_issue_cmd *argp)
+{
+ struct sev_device *sev = psp_master->sev_data;
+ struct sev_user_data_ext_snp_config input;
+ int ret;
+
+ if (!sev->snp_initialized || !argp->data)
+ return -EINVAL;
+
+ memset(&input, 0, sizeof(input));
+
+ if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
+ return -EFAULT;
+
+ /* Copy the TCB version programmed through the SET_CONFIG to userspace */
+ if (input.config_address) {
+		if (copy_to_user((void __user *)input.config_address,
+ &sev->snp_config, sizeof(struct sev_user_data_snp_config)))
+ return -EFAULT;
+ }
+
+ /* Copy the extended certs programmed through the SNP_SET_CONFIG */
+ if (input.certs_address && sev->snp_certs_data) {
+ if (input.certs_len < sev->snp_certs_len) {
+ /* Return the certs length to userspace */
+ input.certs_len = sev->snp_certs_len;
+
+ ret = -ENOSR;
+ goto e_done;
+ }
+
+		if (copy_to_user((void __user *)input.certs_address,
+ sev->snp_certs_data, sev->snp_certs_len))
+ return -EFAULT;
+ }
+
+ ret = 0;
+
+e_done:
+ if (copy_to_user((void __user *)argp->data, &input, sizeof(input)))
+ ret = -EFAULT;
+
+ return ret;
+}
+
+static int sev_ioctl_snp_set_config(struct sev_issue_cmd *argp, bool writable)
+{
+ struct sev_device *sev = psp_master->sev_data;
+ struct sev_user_data_ext_snp_config input;
+ struct sev_user_data_snp_config config;
+ void *certs = NULL;
+ int ret = 0;
+
+ if (!sev->snp_initialized || !argp->data)
+ return -EINVAL;
+
+ if (!writable)
+ return -EPERM;
+
+ memset(&input, 0, sizeof(input));
+
+ if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
+ return -EFAULT;
+
+ /* Copy the certs from userspace */
+ if (input.certs_address) {
+ if (!input.certs_len || !IS_ALIGNED(input.certs_len, PAGE_SIZE))
+ return -EINVAL;
+
+ certs = psp_copy_user_blob(input.certs_address, input.certs_len);
+ if (IS_ERR(certs))
+ return PTR_ERR(certs);
+ }
+
+ /* Issue the PSP command to update the TCB version using the SNP_CONFIG. */
+ if (input.config_address) {
+ memset(&config, 0, sizeof(config));
+ if (copy_from_user(&config,
+ (void __user *)input.config_address, sizeof(config))) {
+ ret = -EFAULT;
+ goto e_free;
+ }
+
+ ret = __sev_do_cmd_locked(SEV_CMD_SNP_CONFIG, &config, &argp->error);
+ if (ret)
+ goto e_free;
+
+ memcpy(&sev->snp_config, &config, sizeof(config));
+ }
+
+ /*
+	 * If new certs were provided, cache them; otherwise free the old certs.
+ */
+ mutex_lock(&sev->snp_certs_lock);
+ if (certs) {
+ kfree(sev->snp_certs_data);
+ sev->snp_certs_data = certs;
+ sev->snp_certs_len = input.certs_len;
+ } else {
+ kfree(sev->snp_certs_data);
+ sev->snp_certs_data = NULL;
+ sev->snp_certs_len = 0;
+ }
+ mutex_unlock(&sev->snp_certs_lock);
+
+ return 0;
+
+e_free:
+ kfree(certs);
+ return ret;
+}
+
static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
{
void __user *argp = (void __user *)arg;
@@ -1847,6 +1963,12 @@ static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
case SNP_PLATFORM_STATUS:
ret = sev_ioctl_snp_platform_status(&input);
break;
+ case SNP_SET_EXT_CONFIG:
+ ret = sev_ioctl_snp_set_config(&input, writable);
+ break;
+ case SNP_GET_EXT_CONFIG:
+ ret = sev_ioctl_snp_get_config(&input);
+ break;
default:
ret = -EINVAL;
goto out;
@@ -1962,6 +2084,7 @@ int sev_dev_init(struct psp_device *psp)
goto e_sev;
sev->cmd_buf_backup = (uint8_t *)sev->cmd_buf + PAGE_SIZE;
+ mutex_init(&sev->snp_certs_lock);
psp->sev_data = sev;
diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
index 19d79f9d4212..41d5353d5bab 100644
--- a/drivers/crypto/ccp/sev-dev.h
+++ b/drivers/crypto/ccp/sev-dev.h
@@ -66,6 +66,10 @@ struct sev_device {
bool snp_initialized;
struct snp_host_map snp_host_map[MAX_SNP_HOST_MAP_BUFS];
+ void *snp_certs_data;
+ u32 snp_certs_len;
+ struct mutex snp_certs_lock;
+ struct sev_user_data_snp_config snp_config;
};
int sev_dev_init(struct psp_device *psp);
diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
index 5adfaea7df97..c20d37586d21 100644
--- a/include/uapi/linux/psp-sev.h
+++ b/include/uapi/linux/psp-sev.h
@@ -29,6 +29,8 @@ enum {
SEV_GET_ID, /* This command is deprecated, use SEV_GET_ID2 */
SEV_GET_ID2,
SNP_PLATFORM_STATUS,
+ SNP_SET_EXT_CONFIG,
+ SNP_GET_EXT_CONFIG,
SEV_MAX,
};
@@ -192,6 +194,21 @@ struct sev_user_data_snp_config {
__u8 rsvd1[52];
} __packed;
+/**
+ * struct sev_user_data_ext_snp_config - system wide configuration value for SNP.
+ *
+ * @config_address: address of the struct sev_user_data_snp_config or 0 when
+ * reported_tcb does not need to be updated.
+ * @certs_address: address of extended guest request certificate chain or
+ * 0 when the previous certificate blob should be removed on SNP_SET_EXT_CONFIG.
+ * @certs_len: length of the certificate blob, in bytes (must be page aligned)
+ */
+struct sev_user_data_ext_snp_config {
+ __u64 config_address; /* In */
+ __u64 certs_address; /* In */
+ __u32 certs_len; /* In */
+};
+
/**
* struct sev_issue_cmd - SEV ioctl parameters
*
--
2.25.1
From: Brijesh Singh <[email protected]>
Version 2 of the GHCB specification defines a VMGEXIT that is used to get
the extended attestation report. The extended attestation report includes
the certificate blobs provided through SNP_SET_EXT_CONFIG.
The snp_guest_ext_guest_request() helper will be used by the hypervisor to
get the extended attestation report. See the GHCB specification for more
details.
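
For context, a hedged sketch of how a caller might honor the GHCB-defined
retry semantics when using this helper; the handler name is hypothetical,
and the real VMGEXIT handling is added later in the series:

        /*
         * Hypothetical caller sketch: the guest supplies a scratch buffer and a
         * page count; when the buffer is too small, the firmware error is set to
         * SNP_GUEST_REQ_INVALID_LEN and *npages holds the required page count,
         * which is reported back so the guest can retry with a larger buffer.
         */
        static int handle_ext_guest_request(struct sev_data_snp_guest_request *req,
                                            unsigned long vaddr,
                                            unsigned long *npages,
                                            unsigned long *fw_err)
        {
                int rc = snp_guest_ext_guest_request(req, vaddr, npages, fw_err);

                if (rc && *fw_err == SNP_GUEST_REQ_INVALID_LEN)
                        pr_debug("certs buffer too small, need %lu pages\n",
                                 *npages);

                return rc;
        }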
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 47 ++++++++++++++++++++++++++++++++++++
include/linux/psp-sev.h | 33 +++++++++++++++++++++++++
2 files changed, 80 insertions(+)
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index b56b00ca2cd4..e65563bc8298 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -2017,6 +2017,53 @@ int sev_guest_df_flush(int *error)
}
EXPORT_SYMBOL_GPL(sev_guest_df_flush);
+int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
+ unsigned long vaddr, unsigned long *npages, unsigned long *fw_err)
+{
+ unsigned long expected_npages;
+ struct sev_device *sev;
+ int rc;
+
+ if (!psp_master || !psp_master->sev_data)
+ return -ENODEV;
+
+ sev = psp_master->sev_data;
+
+ if (!sev->snp_initialized)
+ return -EINVAL;
+
+ mutex_lock(&sev->snp_certs_lock);
+ /*
+ * Check if there is enough space to copy the certificate chain. Otherwise
+ * return ERROR code defined in the GHCB specification.
+ */
+ expected_npages = sev->snp_certs_len >> PAGE_SHIFT;
+ if (*npages < expected_npages) {
+ *npages = expected_npages;
+ *fw_err = SNP_GUEST_REQ_INVALID_LEN;
+ mutex_unlock(&sev->snp_certs_lock);
+ return -EINVAL;
+ }
+
+ rc = sev_do_cmd(SEV_CMD_SNP_GUEST_REQUEST, data, (int *)fw_err);
+ if (rc) {
+ mutex_unlock(&sev->snp_certs_lock);
+ return rc;
+ }
+
+ /* Copy the certificate blob */
+ if (sev->snp_certs_data) {
+ *npages = expected_npages;
+ memcpy((void *)vaddr, sev->snp_certs_data, *npages << PAGE_SHIFT);
+ } else {
+ *npages = 0;
+ }
+
+ mutex_unlock(&sev->snp_certs_lock);
+ return rc;
+}
+EXPORT_SYMBOL_GPL(snp_guest_ext_guest_request);
+
static void sev_exit(struct kref *ref)
{
misc_deregister(&misc_dev->misc);
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index d19744807471..81bafc049eca 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -931,6 +931,32 @@ void snp_free_firmware_page(void *addr);
*/
void snp_mark_pages_offline(unsigned long pfn, unsigned int npages);
+/**
+ * snp_guest_ext_guest_request - perform the SNP extended guest request command
+ * defined in the GHCB specification.
+ *
+ * @data: the input guest request structure
+ * @vaddr: address where the certificate blob needs to be copied.
+ * @npages: number of pages for the certificate blob.
+ * If the specified page count is less than the certificate blob size, then the
+ * required page count is returned with an error code defined in the GHCB spec.
+ * If the specified page count is more than the certificate blob size, then the
+ * page count is updated to reflect the amount of valid data copied to vaddr.
+ *
+ * @error: firmware error code on failure (see psp-sev.h)
+ *
+ * Returns:
+ * 0 if the SEV device successfully processed the command
+ * -%ENODEV if the SEV device is not available
+ * -%ENOTSUPP if the SEV device does not support SNP
+ * -%ETIMEDOUT if the SEV command timed out
+ * -%EIO if the SEV device returned a non-zero return code
+ */
+int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
+ unsigned long vaddr, unsigned long *npages,
+ unsigned long *error);
+
#else /* !CONFIG_CRYPTO_DEV_SP_PSP */
static inline int
@@ -968,6 +994,13 @@ static inline void *snp_alloc_firmware_page(gfp_t mask)
static inline void snp_free_firmware_page(void *addr) { }
+static inline int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
+ unsigned long vaddr, unsigned long *n,
+ unsigned long *error)
+{
+ return -ENODEV;
+}
+
#endif /* CONFIG_CRYPTO_DEV_SP_PSP */
#endif /* __PSP_SEV_H__ */
--
2.25.1
From: Tom Lendacky <[email protected]>
Add support for AP Reset Hold being invoked using the GHCB MSR protocol,
available in version 2 of the GHCB specification.
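
For reference, the MSR-protocol response ends up being composed as in the
hedged sketch below, using the constants added by this patch; the helper
itself is hypothetical and only illustrates the bit layout:

        /* Hypothetical helper: build the AP Reset Hold MSR-protocol response. */
        static u64 ap_reset_hold_msr_resp(u64 result)
        {
                /* GHCBData[63:12] = result, GHCBInfo[11:0] = response code */
                return ((result & GHCB_MSR_AP_RESET_HOLD_RESULT_MASK) <<
                        GHCB_MSR_AP_RESET_HOLD_RESULT_POS) |
                       GHCB_MSR_AP_RESET_HOLD_RESP;
        }

A result of 0 is pre-set when entering the hold, and it is flipped to a
non-zero value when the SIPI is delivered, matching the set_ghcb_msr_bits()
calls in the diff below.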
Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/sev-common.h | 2 ++
arch/x86/kvm/svm/sev.c | 56 ++++++++++++++++++++++++++-----
arch/x86/kvm/svm/svm.h | 1 +
3 files changed, 51 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index b8357d6ecd47..e15548d88f2a 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -56,6 +56,8 @@
/* AP Reset Hold */
#define GHCB_MSR_AP_RESET_HOLD_REQ 0x006
#define GHCB_MSR_AP_RESET_HOLD_RESP 0x007
+#define GHCB_MSR_AP_RESET_HOLD_RESULT_POS 12
+#define GHCB_MSR_AP_RESET_HOLD_RESULT_MASK GENMASK_ULL(51, 0)
/* GHCB GPA Register */
#define GHCB_MSR_REG_GPA_REQ 0x012
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index ad9b29ff4590..05eda0940e22 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -58,6 +58,10 @@ module_param_named(sev_es, sev_es_enabled, bool, 0444);
#define sev_es_enabled false
#endif /* CONFIG_KVM_AMD_SEV */
+#define AP_RESET_HOLD_NONE 0
+#define AP_RESET_HOLD_NAE_EVENT 1
+#define AP_RESET_HOLD_MSR_PROTO 2
+
static u8 sev_enc_bit;
static DECLARE_RWSEM(sev_deactivate_lock);
static DEFINE_MUTEX(sev_bitmap_lock);
@@ -2706,6 +2710,9 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
void sev_es_unmap_ghcb(struct vcpu_svm *svm)
{
+ /* Clear any indication that the vCPU is in a type of AP Reset Hold */
+ svm->sev_es.ap_reset_hold_type = AP_RESET_HOLD_NONE;
+
if (!svm->sev_es.ghcb)
return;
@@ -2918,6 +2925,22 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
GHCB_MSR_INFO_POS);
break;
}
+ case GHCB_MSR_AP_RESET_HOLD_REQ:
+ svm->sev_es.ap_reset_hold_type = AP_RESET_HOLD_MSR_PROTO;
+ ret = kvm_emulate_ap_reset_hold(&svm->vcpu);
+
+ /*
+ * Preset the result to a non-SIPI return and then only set
+ * the result to non-zero when delivering a SIPI.
+ */
+ set_ghcb_msr_bits(svm, 0,
+ GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
+ GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
+
+ set_ghcb_msr_bits(svm, GHCB_MSR_AP_RESET_HOLD_RESP,
+ GHCB_MSR_INFO_MASK,
+ GHCB_MSR_INFO_POS);
+ break;
case GHCB_MSR_TERM_REQ: {
u64 reason_set, reason_code;
@@ -3017,6 +3040,7 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
ret = svm_invoke_exit_handler(vcpu, SVM_EXIT_IRET);
break;
case SVM_VMGEXIT_AP_HLT_LOOP:
+ svm->sev_es.ap_reset_hold_type = AP_RESET_HOLD_NAE_EVENT;
ret = kvm_emulate_ap_reset_hold(vcpu);
break;
case SVM_VMGEXIT_AP_JUMP_TABLE: {
@@ -3177,13 +3201,29 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
return;
}
- /*
- * Subsequent SIPI: Return from an AP Reset Hold VMGEXIT, where
- * the guest will set the CS and RIP. Set SW_EXIT_INFO_2 to a
- * non-zero value.
- */
- if (!svm->sev_es.ghcb)
- return;
+ /* Subsequent SIPI */
+ switch (svm->sev_es.ap_reset_hold_type) {
+ case AP_RESET_HOLD_NAE_EVENT:
+ /*
+ * Return from an AP Reset Hold VMGEXIT, where the guest will
+ * set the CS and RIP. Set SW_EXIT_INFO_2 to a non-zero value.
+ */
+ ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
+ break;
+ case AP_RESET_HOLD_MSR_PROTO:
+ /*
+ * Return from an AP Reset Hold VMGEXIT, where the guest will
+ * set the CS and RIP. Set GHCB data field to a non-zero value.
+ */
+ set_ghcb_msr_bits(svm, 1,
+ GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
+ GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
- ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
+ set_ghcb_msr_bits(svm, GHCB_MSR_AP_RESET_HOLD_RESP,
+ GHCB_MSR_INFO_MASK,
+ GHCB_MSR_INFO_POS);
+ break;
+ default:
+ break;
+ }
}
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 56e306a1f0c7..f4848e6aba28 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -191,6 +191,7 @@ struct vcpu_sev_es_state {
struct ghcb *ghcb;
struct kvm_host_map ghcb_map;
bool received_first_sipi;
+ unsigned int ap_reset_hold_type;
/* SEV-ES scratch area support */
void *ghcb_sa;
--
2.25.1
From: Brijesh Singh <[email protected]>
Version 2 of the GHCB specification introduced advertisement of features
that are supported by the hypervisor.
Now that KVM supports version 2 of the GHCB specification, bump the
maximum supported protocol version.
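
For reference, the feature response written back over the MSR protocol is
composed as in the hedged sketch below; since GHCB_HV_FT_SUPPORTED is still
0 at this point in the series, the response carries no feature bits yet:

        /* Hypothetical helper: build the hypervisor feature response MSR value. */
        static u64 hv_features_msr_resp(void)
        {
                /* GHCBData[63:12] = supported features, GHCBInfo[11:0] = 0x081 */
                return ((GHCB_HV_FT_SUPPORTED & GHCB_MSR_HV_FT_MASK) <<
                        GHCB_MSR_HV_FT_POS) | GHCB_MSR_HV_FT_RESP;
        }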
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/sev-common.h | 2 ++
arch/x86/kvm/svm/sev.c | 14 ++++++++++++++
arch/x86/kvm/svm/svm.h | 3 ++-
3 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index e15548d88f2a..539de6b93420 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -101,6 +101,8 @@ enum psc_op {
/* GHCB Hypervisor Feature Request/Response */
#define GHCB_MSR_HV_FT_REQ 0x080
#define GHCB_MSR_HV_FT_RESP 0x081
+#define GHCB_MSR_HV_FT_POS 12
+#define GHCB_MSR_HV_FT_MASK GENMASK_ULL(51, 0)
#define GHCB_MSR_HV_FT_RESP_VAL(v) \
/* GHCBData[63:12] */ \
(((u64)(v) & GENMASK_ULL(63, 12)) >> 12)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 05eda0940e22..c1f0d4898ce3 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2675,6 +2675,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
case SVM_VMGEXIT_AP_HLT_LOOP:
case SVM_VMGEXIT_AP_JUMP_TABLE:
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
+ case SVM_VMGEXIT_HV_FEATURES:
break;
default:
reason = GHCB_ERR_INVALID_EVENT;
@@ -2941,6 +2942,13 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
GHCB_MSR_INFO_MASK,
GHCB_MSR_INFO_POS);
break;
+ case GHCB_MSR_HV_FT_REQ: {
+ set_ghcb_msr_bits(svm, GHCB_HV_FT_SUPPORTED,
+ GHCB_MSR_HV_FT_MASK, GHCB_MSR_HV_FT_POS);
+ set_ghcb_msr_bits(svm, GHCB_MSR_HV_FT_RESP,
+ GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
+ break;
+ }
case GHCB_MSR_TERM_REQ: {
u64 reason_set, reason_code;
@@ -3065,6 +3073,12 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
ret = 1;
break;
}
+ case SVM_VMGEXIT_HV_FEATURES: {
+ ghcb_set_sw_exit_info_2(ghcb, GHCB_HV_FT_SUPPORTED);
+
+ ret = 1;
+ break;
+ }
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
vcpu_unimpl(vcpu,
"vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index f4848e6aba28..c249c360fe36 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -662,9 +662,10 @@ void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu);
/* sev.c */
-#define GHCB_VERSION_MAX 1ULL
+#define GHCB_VERSION_MAX 2ULL
#define GHCB_VERSION_MIN 1ULL
+#define GHCB_HV_FT_SUPPORTED 0
extern unsigned int max_sev_asid;
--
2.25.1
This callback will do any platform-specific handling needed for
converting pages between shared/private.
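
As a hedged illustration of how a platform might hook this (the SNP
implementation lands later in the series), an update_mem_attr callback
could look roughly like the sketch below; the function name is a
placeholder and the private-attribute check assumes the
KVM_MEMORY_ATTRIBUTE_PRIVATE flag from the UPM base series:

        /* Hypothetical platform hook: react to a shared<->private conversion. */
        static int xyz_update_mem_attr(struct kvm_memory_slot *slot,
                                       unsigned int attr, gfn_t start, gfn_t end)
        {
                bool to_private = attr & KVM_MEMORY_ATTRIBUTE_PRIVATE;

                pr_debug("%s GFNs 0x%llx-0x%llx in slot %d\n",
                         to_private ? "privatizing" : "sharing",
                         start, end, slot->id);

                /* platform-specific conversion of the range would go here */
                return 0;
        }

The callback would then be wired up via .update_mem_attr in kvm_x86_ops,
and is invoked from kvm_arch_post_set_memory_attributes() added below.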
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
include/linux/kvm_host.h | 4 ++++
virt/kvm/kvm_main.c | 29 +++++++++++++++++++++++++++++
5 files changed, 49 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 72183da010b8..a8aaf532c2ab 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -132,6 +132,7 @@ KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL_RET0(fault_is_private);
+KVM_X86_OP_OPTIONAL_RET0(update_mem_attr)
#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f856d689dda0..2da3fb2d5d1b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1644,6 +1644,8 @@ struct kvm_x86_ops {
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);
bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code, bool *private_fault);
+ int (*update_mem_attr)(struct kvm_memory_slot *slot, unsigned int attr,
+ gfn_t start, gfn_t end);
bool (*has_wbinvd_exit)(void);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fb3f34b7391c..053bd77bbf52 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7251,4 +7251,17 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
linfo_update_mixed(gfn, slot, level, mixed);
}
}
+
+void kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ unsigned long attrs,
+ gfn_t start, gfn_t end)
+{
+ int ret;
+
+ ret = static_call(kvm_x86_update_mem_attr)(slot, attrs, start, end);
+ if (ret)
+ pr_warn_ratelimited("Failed to update GFN range 0x%llx-0x%llx with attributes 0x%lx. Ret: %d\n",
+ start, end, attrs, ret);
+}
#endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fdc59479b3e2..d200b8f45583 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2330,6 +2330,10 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
struct kvm_memory_slot *slot,
unsigned long attrs,
gfn_t start, gfn_t end);
+void kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ unsigned long attrs,
+ gfn_t start, gfn_t end);
static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
{
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b68574ff6c30..8ec985f1c57d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2561,6 +2561,32 @@ static void kvm_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
kvm_flush_remote_tlbs(kvm);
}
+static void kvm_post_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
+ gfn_t start_orig, gfn_t end_orig)
+{
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ struct kvm_memslot_iter iter;
+ int i;
+
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
+ slots = __kvm_memslots(kvm, i);
+
+ kvm_for_each_memslot_in_gfn_range(&iter, slots, start_orig, end_orig) {
+ gfn_t start, end;
+
+ slot = iter.slot;
+ start = max(start_orig, slot->base_gfn);
+ end = min(end_orig, slot->base_gfn + slot->npages);
+
+ if (start >= end)
+ continue;
+
+ kvm_arch_post_set_memory_attributes(kvm, slot, attrs, start, end);
+ }
+ }
+}
+
static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
struct kvm_memory_attributes *attrs)
{
@@ -2602,6 +2628,9 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
kvm_mmu_invalidate_end(kvm);
KVM_MMU_UNLOCK(kvm);
+ if (i > start)
+ kvm_post_mem_attrs_changed(kvm, attrs->attributes, start, i);
+
mutex_unlock(&kvm->slots_lock);
attrs->address = i << PAGE_SHIFT;
--
2.25.1
From: Brijesh Singh <[email protected]>
Implement a workaround for an SNP erratum where the CPU will incorrectly
signal an RMP violation #PF if a hugepage (2MB or 1GB) collides with the
RMP entry of a VMCB, VMSA, or AVIC backing page.
When SEV-SNP is globally enabled, the CPU marks the VMCB, VMSA, and AVIC
backing pages as "in-use" in the RMP after a successful VMRUN. This
is done for _all_ VMs, not just SNP-Active VMs.
If the hypervisor accesses an in-use page through a writable
translation, the CPU will throw an RMP violation #PF. On early SNP
hardware, if an in-use page is 2MB aligned and software accesses any
part of the associated 2MB region with a hugepage, the CPU will
incorrectly treat the entire 2MB region as in-use and signal a spurious
RMP violation #PF.
The recommended workaround is to not use a hugepage for the VMCB, VMSA, or
AVIC backing page. Add a generic allocator that will ensure that the
page returned is not part of a hugepage (2MB or 1GB) and is safe to use when
SEV-SNP is enabled.
Co-developed-by: Marc Orr <[email protected]>
Signed-off-by: Marc Orr <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/lapic.c | 5 ++++-
arch/x86/kvm/svm/sev.c | 33 ++++++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.c | 15 ++++++++++++--
arch/x86/kvm/svm/svm.h | 1 +
6 files changed, 54 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 6a885f024a00..e116405cbb5f 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -131,6 +131,7 @@ KVM_X86_OP(msr_filter_changed)
KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
+KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
KVM_X86_OP_OPTIONAL_RET0(fault_is_private);
KVM_X86_OP_OPTIONAL_RET0(update_mem_attr)
KVM_X86_OP_OPTIONAL(invalidate_restricted_mem)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 37c92412035f..a9363a6f779d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1729,6 +1729,8 @@ struct kvm_x86_ops {
* Returns vCPU specific APICv inhibit reasons
*/
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
+
+ void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
};
struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 80f92cbc4029..72e46d5b4201 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2740,7 +2740,10 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
vcpu->arch.apic = apic;
- apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+ if (kvm_x86_ops.alloc_apic_backing_page)
+ apic->regs = static_call(kvm_x86_alloc_apic_backing_page)(vcpu);
+ else
+ apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
if (!apic->regs) {
printk(KERN_ERR "malloc apic regs error for vcpu %x\n",
vcpu->vcpu_id);
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index c1f0d4898ce3..9e9efb42a766 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3241,3 +3241,36 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
break;
}
}
+
+struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
+{
+ unsigned long pfn;
+ struct page *p;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+
+ /*
+ * Allocate an SNP safe page to workaround the SNP erratum where
+ * the CPU will incorrectly signal an RMP violation #PF if a
+	 * hugepage (2MB or 1GB) collides with the RMP entry of a VMCB, VMSA,
+	 * or AVIC backing page. The recommended workaround is to not use a
+	 * hugepage.
+ *
+ * Allocate one extra page, use a page which is not 2mb aligned
+ * and free the other.
+ */
+ p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
+ if (!p)
+ return NULL;
+
+ split_page(p, 1);
+
+ pfn = page_to_pfn(p);
+ if (IS_ALIGNED(pfn, PTRS_PER_PMD))
+ __free_page(p++);
+ else
+ __free_page(p + 1);
+
+ return p;
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 213593dbd7a1..1061aaf66f0a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1372,7 +1372,7 @@ static int svm_vcpu_create(struct kvm_vcpu *vcpu)
svm = to_svm(vcpu);
err = -ENOMEM;
- vmcb01_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ vmcb01_page = snp_safe_alloc_page(vcpu);
if (!vmcb01_page)
goto out;
@@ -1381,7 +1381,7 @@ static int svm_vcpu_create(struct kvm_vcpu *vcpu)
* SEV-ES guests require a separate VMSA page used to contain
* the encrypted register state of the guest.
*/
- vmsa_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ vmsa_page = snp_safe_alloc_page(vcpu);
if (!vmsa_page)
goto error_free_vmcb_page;
@@ -4696,6 +4696,16 @@ static int svm_vm_init(struct kvm *kvm)
return 0;
}
+static void *svm_alloc_apic_backing_page(struct kvm_vcpu *vcpu)
+{
+ struct page *page = snp_safe_alloc_page(vcpu);
+
+ if (!page)
+ return NULL;
+
+ return page_address(page);
+}
+
static struct kvm_x86_ops svm_x86_ops __initdata = {
.name = KBUILD_MODNAME,
@@ -4824,6 +4834,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
+ .alloc_apic_backing_page = svm_alloc_apic_backing_page,
};
/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index c249c360fe36..5efcf036ccad 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -692,6 +692,7 @@ void sev_es_vcpu_reset(struct vcpu_svm *svm);
void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa);
void sev_es_unmap_ghcb(struct vcpu_svm *svm);
+struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
/* vmenter.S */
--
2.25.1
From: Brijesh Singh <[email protected]>
The KVM_SNP_INIT command is used by the hypervisor to initialize the
SEV-SNP platform context. In a typical workflow, this command should be the
first command issued. When creating an SEV-SNP guest, the VMM must use this
command instead of KVM_SEV_INIT or KVM_SEV_ES_INIT.
The flags value must be zero; it will be extended in future SNP support to
communicate optional features (such as restricted interrupt injection).
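
For orientation, a minimal (hypothetical) userspace sketch of issuing the
command through the existing KVM_MEMORY_ENCRYPT_OP VM ioctl, assuming
headers from this series are installed and with error handling omitted:

        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Minimal sketch: vm_fd is the VM fd, sev_fd is an open /dev/sev fd. */
        static int kvm_snp_init(int vm_fd, int sev_fd)
        {
                struct kvm_snp_init init = { .flags = 0 };
                struct kvm_sev_cmd cmd = {
                        .id     = KVM_SEV_SNP_INIT,
                        .data   = (__u64)(unsigned long)&init,
                        .sev_fd = sev_fd,
                };

                return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
        }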
Co-developed-by: Pavan Kumar Paluri <[email protected]>
Signed-off-by: Pavan Kumar Paluri <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
.../virt/kvm/x86/amd-memory-encryption.rst | 27 ++++++++++++
arch/x86/include/asm/svm.h | 1 +
arch/x86/kvm/svm/sev.c | 44 ++++++++++++++++++-
arch/x86/kvm/svm/svm.h | 4 ++
include/uapi/linux/kvm.h | 13 ++++++
5 files changed, 87 insertions(+), 2 deletions(-)
diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index 935aaeb97fe6..2432213bd0ea 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -434,6 +434,33 @@ issued by the hypervisor to make the guest ready for execution.
Returns: 0 on success, -negative on error
+18. KVM_SNP_INIT
+----------------
+
+The KVM_SNP_INIT command can be used by the hypervisor to initialize SEV-SNP
+context. In a typical workflow, this command should be the first command issued.
+
+Parameters (in/out): struct kvm_snp_init
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_snp_init {
+ __u64 flags;
+ };
+
+The flags bitmap is defined as::
+
+ /* enable the restricted injection */
+ #define KVM_SEV_SNP_RESTRICTED_INJET (1<<0)
+
+ /* enable the restricted injection timer */
+ #define KVM_SEV_SNP_RESTRICTED_TIMER_INJET (1<<1)
+
+If the specified flags are not supported, -EOPNOTSUPP is returned, and the
+supported flags are returned to userspace.
+
References
==========
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index cb1ee53ad3b1..c18d78d5e505 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -278,6 +278,7 @@ enum avic_ipi_failure_cause {
#define AVIC_HPA_MASK ~((0xFFFULL << 52) | 0xFFF)
#define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL
+#define SVM_SEV_FEAT_SNP_ACTIVE BIT(0)
struct vmcb_seg {
u16 selector;
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 51db01b282eb..a8efe1f6bf77 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -243,6 +243,25 @@ static void sev_unbind_asid(struct kvm *kvm, unsigned int handle)
sev_decommission(handle);
}
+static int verify_snp_init_flags(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_snp_init params;
+ int ret = 0;
+
+	if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+ return -EFAULT;
+
+ if (params.flags & ~SEV_SNP_SUPPORTED_FLAGS)
+ ret = -EOPNOTSUPP;
+
+ params.flags = SEV_SNP_SUPPORTED_FLAGS;
+
+	if (copy_to_user((void __user *)(uintptr_t)argp->data, &params, sizeof(params)))
+ ret = -EFAULT;
+
+ return ret;
+}
+
static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
{
struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -256,13 +275,23 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
return ret;
sev->active = true;
- sev->es_active = argp->id == KVM_SEV_ES_INIT;
+ sev->es_active = (argp->id == KVM_SEV_ES_INIT || argp->id == KVM_SEV_SNP_INIT);
+ sev->snp_active = argp->id == KVM_SEV_SNP_INIT;
asid = sev_asid_new(sev);
if (asid < 0)
goto e_no_asid;
sev->asid = asid;
- ret = sev_platform_init(&argp->error);
+ if (sev->snp_active) {
+ ret = verify_snp_init_flags(kvm, argp);
+ if (ret)
+ goto e_free;
+
+ ret = sev_snp_init(&argp->error, false);
+ } else {
+ ret = sev_platform_init(&argp->error);
+ }
+
if (ret)
goto e_free;
@@ -277,6 +306,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
sev_asid_free(sev);
sev->asid = 0;
e_no_asid:
+ sev->snp_active = false;
sev->es_active = false;
sev->active = false;
return ret;
@@ -749,6 +779,10 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm)
save->xss = svm->vcpu.arch.ia32_xss;
save->dr6 = svm->vcpu.arch.dr6;
+ /* Enable the SEV-SNP feature */
+ if (sev_snp_guest(svm->vcpu.kvm))
+ save->sev_features |= SVM_SEV_FEAT_SNP_ACTIVE;
+
pr_debug("Virtual Machine Save Area (VMSA):\n");
print_hex_dump_debug("", DUMP_PREFIX_NONE, 16, 1, save, sizeof(*save), false);
@@ -2001,6 +2035,12 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
}
switch (sev_cmd.id) {
+ case KVM_SEV_SNP_INIT:
+ if (!sev_snp_enabled) {
+ r = -ENOTTY;
+ goto out;
+ }
+ fallthrough;
case KVM_SEV_ES_INIT:
if (!sev_es_enabled) {
r = -ENOTTY;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 8eb1b51e92f5..56a5c96d8a36 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -73,6 +73,9 @@ enum {
/* TPR and CR2 are always written before VMRUN */
#define VMCB_ALWAYS_DIRTY_MASK ((1U << VMCB_INTR) | (1U << VMCB_CR2))
+/* Supported init feature flags */
+#define SEV_SNP_SUPPORTED_FLAGS 0x0
+
struct kvm_sev_info {
bool active; /* SEV enabled guest */
bool es_active; /* SEV-ES enabled guest */
@@ -88,6 +91,7 @@ struct kvm_sev_info {
struct list_head mirror_entry; /* Use as a list entry of mirrors */
struct misc_cg *misc_cg; /* For misc cgroup accounting */
atomic_t migration_in_progress;
+ u64 snp_init_flags;
};
struct kvm_svm {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2fba29125ec2..499cc323f793 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1917,6 +1917,9 @@ enum sev_cmd_id {
/* Guest Migration Extension */
KVM_SEV_SEND_CANCEL,
+ /* SNP specific commands */
+ KVM_SEV_SNP_INIT,
+
KVM_SEV_NR_MAX,
};
@@ -2013,6 +2016,16 @@ struct kvm_sev_receive_update_data {
__u32 trans_len;
};
+/* enable the restricted injection */
+#define KVM_SEV_SNP_RESTRICTED_INJET (1 << 0)
+
+/* enable the restricted injection timer */
+#define KVM_SEV_SNP_RESTRICTED_TIMER_INJET (1 << 1)
+
+struct kvm_snp_init {
+ __u64 flags;
+};
+
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
#define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
--
2.25.1
From: Brijesh Singh <[email protected]>
KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
The command initializes a cryptographic digest context used to construct
the measurement of the guest. If the guest is expected to be migrated,
the command also binds a migration agent (MA) to the guest.
For more information see the SEV-SNP specification.
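
For orientation, a minimal (hypothetical) userspace sketch, following the
KVM_SNP_INIT example from the previous patch and assuming KVM_SEV_SNP_INIT
has already been issued on the VM:

        /* Minimal sketch: begin the SNP launch flow with a given guest policy. */
        static int kvm_snp_launch_start(int vm_fd, int sev_fd, __u64 policy)
        {
                struct kvm_sev_snp_launch_start start = { .policy = policy };
                struct kvm_sev_cmd cmd = {
                        .id     = KVM_SEV_SNP_LAUNCH_START,
                        .data   = (__u64)(unsigned long)&start,
                        .sev_fd = sev_fd,
                };

                return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
        }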
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
.../virt/kvm/x86/amd-memory-encryption.rst | 24 ++++
arch/x86/kvm/svm/sev.c | 121 +++++++++++++++++-
arch/x86/kvm/svm/svm.h | 1 +
include/uapi/linux/kvm.h | 10 ++
4 files changed, 153 insertions(+), 3 deletions(-)
diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index 2432213bd0ea..58971fc02a15 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -461,6 +461,30 @@ The flags bitmap is defined as::
If the specified flags are not supported, -EOPNOTSUPP is returned, and the
supported flags are returned to userspace.
+19. KVM_SNP_LAUNCH_START
+------------------------
+
+The KVM_SNP_LAUNCH_START command is used for creating the memory encryption
+context for the SEV-SNP guest. To create the encryption context, the user must
+provide a guest policy, migration agent (if any), and guest OS visible
+workarounds value as defined in the SEV-SNP specification.
+
+Parameters (in): struct kvm_snp_launch_start
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_snp_launch_start {
+ __u64 policy; /* Guest policy to use. */
+ __u64 ma_uaddr; /* userspace address of migration agent */
+ __u8 ma_en; /* 1 if the migration agent is enabled */
+ __u8 imi_en; /* set IMI to 1. */
+ __u8 gosvw[16]; /* guest OS visible workarounds */
+ };
+
+See the SEV-SNP specification for further detail on the launch input.
+
References
==========
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index a8efe1f6bf77..097bb2138360 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -22,6 +22,7 @@
#include <asm/pkru.h>
#include <asm/trapnr.h>
#include <asm/fpu/xcr.h>
+#include <asm/sev.h>
#include "mmu.h"
#include "x86.h"
@@ -75,6 +76,8 @@ static unsigned int nr_asids;
static unsigned long *sev_asid_bitmap;
static unsigned long *sev_reclaim_asid_bitmap;
+static int snp_decommission_context(struct kvm *kvm);
+
struct enc_region {
struct list_head list;
unsigned long npages;
@@ -100,12 +103,17 @@ static int sev_flush_asids(int min_asid, int max_asid)
down_write(&sev_deactivate_lock);
wbinvd_on_all_cpus();
- ret = sev_guest_df_flush(&error);
+
+ if (sev_snp_enabled)
+ ret = sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, &error);
+ else
+ ret = sev_guest_df_flush(&error);
up_write(&sev_deactivate_lock);
if (ret)
- pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error);
+ pr_err("SEV%s: DF_FLUSH failed, ret=%d, error=%#x\n",
+ sev_snp_enabled ? "-SNP" : "", ret, error);
return ret;
}
@@ -2011,6 +2019,80 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
return ret;
}
+/*
+ * The guest context contains all the information, keys and metadata
+ * associated with the guest that the firmware tracks to implement SEV
+ * and SNP features. The firmware stores the guest context in a
+ * hypervisor-provided page via the SNP_GCTX_CREATE command.
+ */
+static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct sev_data_snp_addr data = {};
+ void *context;
+ int rc;
+
+ /* Allocate memory for context page */
+ context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
+ if (!context)
+ return NULL;
+
+ data.gctx_paddr = __psp_pa(context);
+ rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
+ if (rc) {
+ snp_free_firmware_page(context);
+ return NULL;
+ }
+
+ return context;
+}
+
+static int snp_bind_asid(struct kvm *kvm, int *error)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct sev_data_snp_activate data = {0};
+
+ data.gctx_paddr = __psp_pa(sev->snp_context);
+ data.asid = sev_get_asid(kvm);
+ return sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
+}
+
+static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct sev_data_snp_launch_start start = {0};
+ struct kvm_sev_snp_launch_start params;
+ int rc;
+
+ if (!sev_snp_guest(kvm))
+ return -ENOTTY;
+
+	if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+ return -EFAULT;
+
+ sev->snp_context = snp_context_create(kvm, argp);
+ if (!sev->snp_context)
+ return -ENOTTY;
+
+ start.gctx_paddr = __psp_pa(sev->snp_context);
+ start.policy = params.policy;
+ memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw));
+ rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_START, &start, &argp->error);
+ if (rc)
+ goto e_free_context;
+
+ sev->fd = argp->sev_fd;
+ rc = snp_bind_asid(kvm, &argp->error);
+ if (rc)
+ goto e_free_context;
+
+ return 0;
+
+e_free_context:
+ snp_decommission_context(kvm);
+
+ return rc;
+}
+
int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_sev_cmd sev_cmd;
@@ -2101,6 +2183,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
case KVM_SEV_RECEIVE_FINISH:
r = sev_receive_finish(kvm, &sev_cmd);
break;
+ case KVM_SEV_SNP_LAUNCH_START:
+ r = snp_launch_start(kvm, &sev_cmd);
+ break;
default:
r = -EINVAL;
goto out;
@@ -2292,6 +2377,28 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
return ret;
}
+static int snp_decommission_context(struct kvm *kvm)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct sev_data_snp_addr data = {};
+ int ret;
+
+ /* If context is not created then do nothing */
+ if (!sev->snp_context)
+ return 0;
+
+ data.gctx_paddr = __sme_pa(sev->snp_context);
+ ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
+ if (WARN_ONCE(ret, "failed to release guest context"))
+ return ret;
+
+ /* free the context page now */
+ snp_free_firmware_page(sev->snp_context);
+ sev->snp_context = NULL;
+
+ return 0;
+}
+
void sev_vm_destroy(struct kvm *kvm)
{
struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -2333,7 +2440,15 @@ void sev_vm_destroy(struct kvm *kvm)
}
}
- sev_unbind_asid(kvm, sev->handle);
+ if (sev_snp_guest(kvm)) {
+ if (snp_decommission_context(kvm)) {
+ WARN_ONCE(1, "Failed to free SNP guest context, leaking asid!\n");
+ return;
+ }
+ } else {
+ sev_unbind_asid(kvm, sev->handle);
+ }
+
sev_asid_free(sev);
}
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 56a5c96d8a36..740969b57425 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -92,6 +92,7 @@ struct kvm_sev_info {
struct misc_cg *misc_cg; /* For misc cgroup accounting */
atomic_t migration_in_progress;
u64 snp_init_flags;
+ void *snp_context; /* SNP guest context page */
};
struct kvm_svm {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 499cc323f793..cf19799ca5ce 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1919,6 +1919,7 @@ enum sev_cmd_id {
/* SNP specific commands */
KVM_SEV_SNP_INIT,
+ KVM_SEV_SNP_LAUNCH_START,
KVM_SEV_NR_MAX,
};
@@ -2026,6 +2027,15 @@ struct kvm_snp_init {
__u64 flags;
};
+struct kvm_sev_snp_launch_start {
+ __u64 policy;
+ __u64 ma_uaddr;
+ __u8 ma_en;
+ __u8 imi_en;
+ __u8 gosvw[16];
+ __u8 pad[6];
+};
+
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
#define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
--
2.25.1
From: Brijesh Singh <[email protected]>
The next generation of SEV is called SEV-SNP (Secure Nested Paging).
SEV-SNP builds upon existing SEV and SEV-ES functionality while adding new
hardware-based security protection. SEV-SNP adds strong memory integrity
protection to help prevent malicious hypervisor-based attacks such as data
replay, memory re-mapping, and more, to create an isolated execution
environment.
The SNP feature is added incrementally; later patches add a new module
parameter that can be used to enable SEV-SNP in KVM.
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/sev.c | 10 +++++++++-
arch/x86/kvm/svm/svm.h | 8 ++++++++
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 9e9efb42a766..51db01b282eb 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -58,6 +58,9 @@ module_param_named(sev_es, sev_es_enabled, bool, 0444);
#define sev_es_enabled false
#endif /* CONFIG_KVM_AMD_SEV */
+/* enable/disable SEV-SNP support */
+static bool sev_snp_enabled;
+
#define AP_RESET_HOLD_NONE 0
#define AP_RESET_HOLD_NAE_EVENT 1
#define AP_RESET_HOLD_MSR_PROTO 2
@@ -2306,6 +2309,7 @@ void __init sev_hardware_setup(void)
{
#ifdef CONFIG_KVM_AMD_SEV
unsigned int eax, ebx, ecx, edx, sev_asid_count, sev_es_asid_count;
+ bool sev_snp_supported = false;
bool sev_es_supported = false;
bool sev_supported = false;
@@ -2385,12 +2389,16 @@ void __init sev_hardware_setup(void)
if (misc_cg_set_capacity(MISC_CG_RES_SEV_ES, sev_es_asid_count))
goto out;
- pr_info("SEV-ES supported: %u ASIDs\n", sev_es_asid_count);
sev_es_supported = true;
+ sev_snp_supported = sev_snp_enabled && cpu_feature_enabled(X86_FEATURE_SEV_SNP);
+
+ pr_info("SEV-ES %ssupported: %u ASIDs\n",
+ sev_snp_supported ? "and SEV-SNP " : "", sev_es_asid_count);
out:
sev_enabled = sev_supported;
sev_es_enabled = sev_es_supported;
+ sev_snp_enabled = sev_snp_supported;
#endif
}
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 5efcf036ccad..8eb1b51e92f5 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -76,6 +76,7 @@ enum {
struct kvm_sev_info {
bool active; /* SEV enabled guest */
bool es_active; /* SEV-ES enabled guest */
+ bool snp_active; /* SEV-SNP enabled guest */
unsigned int asid; /* ASID used for this guest */
unsigned int handle; /* SEV firmware handle */
int fd; /* SEV device fd */
@@ -323,6 +324,13 @@ static __always_inline bool sev_es_guest(struct kvm *kvm)
#endif
}
+static inline bool sev_snp_guest(struct kvm *kvm)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+
+ return sev_es_guest(kvm) && sev->snp_active;
+}
+
static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
{
vmcb->control.clean = 0;
--
2.25.1
From: Brijesh Singh <[email protected]>
The KVM_SEV_SNP_LAUNCH_FINISH command finalizes the cryptographic digest and
stores it as the measurement of the guest at launch.
While finalizing the launch flow, it also issues the LAUNCH_UPDATE command to
encrypt the VMSA pages.
For an SNP guest, the VMSA is tracked in the RMP as a guest-owned page and is
also removed from the kernel direct map, so it must be flushed only after it
is transitioned back to hypervisor state and restored in the direct map.
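
For orientation, a minimal (hypothetical) userspace sketch that finalizes
the launch without an ID block, filling only host_data; it reuses the
includes from the earlier KVM_SNP_INIT sketch plus <string.h>:

        /* Minimal sketch: finalize the launch flow; host_data is 32 bytes. */
        static int kvm_snp_launch_finish(int vm_fd, int sev_fd,
                                         const __u8 *host_data)
        {
                struct kvm_sev_snp_launch_finish finish = {};
                struct kvm_sev_cmd cmd = {
                        .id     = KVM_SEV_SNP_LAUNCH_FINISH,
                        .data   = (__u64)(unsigned long)&finish,
                        .sev_fd = sev_fd,
                };

                memcpy(finish.host_data, host_data, sizeof(finish.host_data));
                return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
        }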
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Harald Hoyer <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
.../virt/kvm/x86/amd-memory-encryption.rst | 23 ++++
arch/x86/kvm/svm/sev.c | 122 ++++++++++++++++++
include/uapi/linux/kvm.h | 14 ++
3 files changed, 159 insertions(+)
diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index c94be8e6d657..dafb0c9984f1 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -513,6 +513,29 @@ Returns: 0 on success, -negative on error
See the SEV-SNP spec for further details on how to build the VMPL permission
mask and page type.
+21. KVM_SNP_LAUNCH_FINISH
+-------------------------
+
+After completion of the SNP guest launch flow, the KVM_SNP_LAUNCH_FINISH
+command can be issued to make the guest ready for execution.
+
+Parameters (in): struct kvm_sev_snp_launch_finish
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_snp_launch_finish {
+ __u64 id_block_uaddr;
+ __u64 id_auth_uaddr;
+ __u8 id_block_en;
+ __u8 auth_key_en;
+ __u8 host_data[32];
+ __u8 pad[6];
+ };
+
+
+See the SEV-SNP specification for further details on the launch finish input parameters.
References
==========
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 03dd227f6090..515e22d0dc30 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2280,6 +2280,109 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
snp_launch_update_gfn_handler, argp);
}
+static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct sev_data_snp_launch_update data = {};
+ struct kvm_vcpu *vcpu;
+ unsigned long i;
+ int ret;
+
+ data.gctx_paddr = __psp_pa(sev->snp_context);
+ data.page_type = SNP_PAGE_TYPE_VMSA;
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
+
+ /* Perform some pre-encryption checks against the VMSA */
+ ret = sev_es_sync_vmsa(svm);
+ if (ret)
+ return ret;
+
+ /* Transition the VMSA page to a firmware state. */
+ ret = rmp_make_private(pfn, -1, PG_LEVEL_4K, sev->asid, true);
+ if (ret)
+ return ret;
+
+ /* Issue the SNP command to encrypt the VMSA */
+ data.address = __sme_pa(svm->sev_es.vmsa);
+ ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
+ &data, &argp->error);
+ if (ret) {
+ snp_page_reclaim(pfn);
+ return ret;
+ }
+
+ svm->vcpu.arch.guest_state_protected = true;
+ }
+
+ return 0;
+}
+
+static int snp_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct kvm_sev_snp_launch_finish params;
+ struct sev_data_snp_launch_finish *data;
+ void *id_block = NULL, *id_auth = NULL;
+ int ret;
+
+ if (!sev_snp_guest(kvm))
+ return -ENOTTY;
+
+ if (!sev->snp_context)
+ return -EINVAL;
+
+	if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+ return -EFAULT;
+
+ /* Measure all vCPUs using LAUNCH_UPDATE before finalizing the launch flow. */
+ ret = snp_launch_update_vmsa(kvm, argp);
+ if (ret)
+ return ret;
+
+ data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
+ if (!data)
+ return -ENOMEM;
+
+ if (params.id_block_en) {
+ id_block = psp_copy_user_blob(params.id_block_uaddr, KVM_SEV_SNP_ID_BLOCK_SIZE);
+ if (IS_ERR(id_block)) {
+ ret = PTR_ERR(id_block);
+ goto e_free;
+ }
+
+ data->id_block_en = 1;
+ data->id_block_paddr = __sme_pa(id_block);
+
+ id_auth = psp_copy_user_blob(params.id_auth_uaddr, KVM_SEV_SNP_ID_AUTH_SIZE);
+ if (IS_ERR(id_auth)) {
+ ret = PTR_ERR(id_auth);
+ goto e_free_id_block;
+ }
+
+ data->id_auth_paddr = __sme_pa(id_auth);
+
+ if (params.auth_key_en)
+ data->auth_key_en = 1;
+ }
+
+ memcpy(data->host_data, params.host_data, KVM_SEV_SNP_FINISH_DATA_SIZE);
+ data->gctx_paddr = __psp_pa(sev->snp_context);
+ ret = sev_issue_cmd(kvm, SEV_CMD_SNP_LAUNCH_FINISH, data, &argp->error);
+
+ kfree(id_auth);
+
+e_free_id_block:
+ kfree(id_block);
+
+e_free:
+ kfree(data);
+
+ return ret;
+}
+
int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_sev_cmd sev_cmd;
@@ -2376,6 +2479,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
case KVM_SEV_SNP_LAUNCH_UPDATE:
r = snp_launch_update(kvm, &sev_cmd);
break;
+ case KVM_SEV_SNP_LAUNCH_FINISH:
+ r = snp_launch_finish(kvm, &sev_cmd);
+ break;
default:
r = -EINVAL;
goto out;
@@ -2831,11 +2937,27 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
svm = to_svm(vcpu);
+ /*
+	 * If it's an SNP guest, then the VMSA was added in the RMP entry as
+ * a guest owned page. Transition the page to hypervisor state
+ * before releasing it back to the system.
+ * Also the page is removed from the kernel direct map, so flush it
+ * later after it is transitioned back to hypervisor state and
+ * restored in the direct map.
+ */
+ if (sev_snp_guest(vcpu->kvm)) {
+ u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
+
+ if (host_rmp_make_shared(pfn, PG_LEVEL_4K, true))
+ goto skip_vmsa_free;
+ }
+
if (vcpu->arch.guest_state_protected)
sev_flush_encrypted_page(vcpu, svm->sev_es.vmsa);
__free_page(virt_to_page(svm->sev_es.vmsa));
+skip_vmsa_free:
if (svm->sev_es.ghcb_sa_free)
kvfree(svm->sev_es.ghcb_sa);
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4098bba17aa4..2bab08a5b5d7 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1921,6 +1921,7 @@ enum sev_cmd_id {
KVM_SEV_SNP_INIT,
KVM_SEV_SNP_LAUNCH_START,
KVM_SEV_SNP_LAUNCH_UPDATE,
+ KVM_SEV_SNP_LAUNCH_FINISH,
KVM_SEV_NR_MAX,
};
@@ -2055,6 +2056,19 @@ struct kvm_sev_snp_launch_update {
__u8 vmpl1_perms;
};
+#define KVM_SEV_SNP_ID_BLOCK_SIZE 96
+#define KVM_SEV_SNP_ID_AUTH_SIZE 4096
+#define KVM_SEV_SNP_FINISH_DATA_SIZE 32
+
+struct kvm_sev_snp_launch_finish {
+ __u64 id_block_uaddr;
+ __u64 id_auth_uaddr;
+ __u8 id_block_en;
+ __u8 auth_key_en;
+ __u8 host_data[KVM_SEV_SNP_FINISH_DATA_SIZE];
+ __u8 pad[6];
+};
+
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
#define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
--
2.25.1
From: Brijesh Singh <[email protected]>
When running an SEV-SNP VM, the SPA (system physical address) used to index
the RMP entry is obtained through the NPT translation (gva->gpa->spa). The
NPT page level is checked against the page level programmed in the RMP entry.
If the page level does not match, then it will cause a nested page
fault with the RMP bit set to indicate the RMP violation.
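
The rule being enforced can be distilled into the hedged sketch below; the
helper is hypothetical, and the real logic, including the shared-range
check, lives in sev_adjust_mapping_level() in this patch:

        /*
         * Hypothetical distillation: the NPT level must not exceed the RMP
         * level, except that a 1GB NPT mapping over a 2MB RMP entry is
         * tolerated because hardware installs 2MB TLB entries for 1GB pages.
         */
        static int npt_level_for(int req_level, int rmp_level)
        {
                if (rmp_level == PG_LEVEL_2M && req_level == PG_LEVEL_1G)
                        return req_level;

                return min(req_level, rmp_level);
        }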
Co-developed-by: Michael Roth <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Jarkko Sakkinen <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 9 ++++++
arch/x86/kvm/svm/sev.c | 51 ++++++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.c | 2 ++
arch/x86/kvm/svm/svm.h | 1 +
6 files changed, 66 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index e116405cbb5f..87a087ec3277 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -135,6 +135,7 @@ KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
KVM_X86_OP_OPTIONAL_RET0(fault_is_private);
KVM_X86_OP_OPTIONAL_RET0(update_mem_attr)
KVM_X86_OP_OPTIONAL(invalidate_restricted_mem)
+KVM_X86_OP_OPTIONAL(adjust_mapping_level)
#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a9363a6f779d..456b42cb167b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1731,6 +1731,8 @@ struct kvm_x86_ops {
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
+
+ void (*adjust_mapping_level)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *level);
};
struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 360af0c9997e..d8e5254f314d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3081,6 +3081,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
out:
local_irq_restore(flags);
+
return level;
}
@@ -3141,6 +3142,14 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
fault->req_level = __kvm_mmu_max_mapping_level(vcpu->kvm, slot,
fault->gfn, fault->max_level,
fault->is_private);
+ if (kvm_slot_can_be_private(slot)) {
+ int req_level = fault->req_level;
+
+ static_call_cond(kvm_x86_adjust_mapping_level)(vcpu->kvm, fault->gfn, fault->pfn,
+ &req_level);
+ fault->req_level = req_level;
+ }
+
if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
return;
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 515e22d0dc30..e8740c35be39 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3749,3 +3749,54 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
return p;
}
+
+static bool is_gfn_range_shared(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ while (start++ < end)
+ if (kvm_mem_is_private(kvm, start))
+ return false;
+
+ return true;
+}
+
+void sev_adjust_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *level)
+{
+ int assigned;
+ int rmp_level = 1;
+ int level_orig = *level;
+
+ if (!sev_snp_guest(kvm))
+ return;
+
+ /* If there's an error retrieving RMP entry, stick with 4K mappings */
+ assigned = snp_lookup_rmpentry(pfn, &rmp_level);
+ if (unlikely(assigned < 0))
+ goto out_adjust;
+
+ if (!assigned) {
+ gfn_t huge_gfn;
+
+ /*
+ * If all the pages are shared then no need to keep the RMP
+ * and NPT in sync.
+ */
+ huge_gfn = gfn & ~(PTRS_PER_PMD - 1);
+ if (is_gfn_range_shared(kvm, huge_gfn, huge_gfn + PTRS_PER_PMD))
+ goto out;
+ }
+
+ /*
+	 * The hardware installs 2MB TLB entries to access 1GB pages,
+ * therefore allow NPT to use 1GB pages when pfn was added as 2MB
+ * in the RMP table.
+ */
+ if (rmp_level == PG_LEVEL_2M && (*level == PG_LEVEL_1G))
+ goto out;
+
+out_adjust:
+ /* Adjust the level to keep the NPT and RMP in sync */
+ *level = min_t(size_t, *level, rmp_level);
+out:
+ pr_debug("%s: GFN: 0x%llx, PFN: 0x%llx, level: %d, rmp_level: %d, level_orig: %d, assigned: %d\n",
+ __func__, gfn, pfn, *level, rmp_level, level_orig, assigned);
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 1061aaf66f0a..9eb750c8b04c 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4835,6 +4835,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
.alloc_apic_backing_page = svm_alloc_apic_backing_page,
+
+ .adjust_mapping_level = sev_adjust_mapping_level,
};
/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 740969b57425..cbd4594f1cca 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -706,6 +706,7 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa);
void sev_es_unmap_ghcb(struct vcpu_svm *svm);
struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
+void sev_adjust_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *level);
/* vmenter.S */
--
2.25.1
From: Brijesh Singh <[email protected]>
When SEV-SNP is enabled globally, the hardware places restrictions on all
memory accesses based on the RMP entry, whether the hypervisor or a VM
performs the accesses. When the hardware encounters an RMP access violation
during a guest access, it will cause a #VMEXIT(NPF).
See APM2 section 16.36.10 for more details.
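As a rough illustration only (not part of the patch), the new error code bits
added below could be decoded along these lines; the constants are mirrored
locally so the snippet stands alone:

#include <stdint.h>
#include <stdio.h>

/* Mirrors the PFERR_GUEST_* bit positions added in this patch. */
#define PFERR_GUEST_RMP_MASK    (1ULL << 31)
#define PFERR_GUEST_ENC_MASK    (1ULL << 34)
#define PFERR_GUEST_SIZEM_MASK  (1ULL << 35)
#define PFERR_GUEST_VMPL_MASK   (1ULL << 36)

static void decode_npf_error_code(uint64_t error_code)
{
        printf("RMP violation:    %d\n", !!(error_code & PFERR_GUEST_RMP_MASK));
        printf("encrypted access: %d\n", !!(error_code & PFERR_GUEST_ENC_MASK));
        printf("size mismatch:    %d\n", !!(error_code & PFERR_GUEST_SIZEM_MASK));
        printf("VMPL violation:   %d\n", !!(error_code & PFERR_GUEST_VMPL_MASK));
}

int main(void)
{
        /* e.g. an RMP size-mismatch fault on an encrypted guest access */
        decode_npf_error_code(PFERR_GUEST_RMP_MASK | PFERR_GUEST_ENC_MASK |
                              PFERR_GUEST_SIZEM_MASK);
        return 0;
}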
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 456b42cb167b..d2e1c109dde5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -253,9 +253,13 @@ enum x86_intercept_stage;
#define PFERR_FETCH_BIT 4
#define PFERR_PK_BIT 5
#define PFERR_SGX_BIT 15
+#define PFERR_GUEST_RMP_BIT 31
#define PFERR_GUEST_FINAL_BIT 32
#define PFERR_GUEST_PAGE_BIT 33
#define PFERR_IMPLICIT_ACCESS_BIT 48
+#define PFERR_GUEST_ENC_BIT 34
+#define PFERR_GUEST_SIZEM_BIT 35
+#define PFERR_GUEST_VMPL_BIT 36
#define PFERR_PRESENT_MASK BIT(PFERR_PRESENT_BIT)
#define PFERR_WRITE_MASK BIT(PFERR_WRITE_BIT)
@@ -267,6 +271,10 @@ enum x86_intercept_stage;
#define PFERR_GUEST_FINAL_MASK BIT_ULL(PFERR_GUEST_FINAL_BIT)
#define PFERR_GUEST_PAGE_MASK BIT_ULL(PFERR_GUEST_PAGE_BIT)
#define PFERR_IMPLICIT_ACCESS BIT_ULL(PFERR_IMPLICIT_ACCESS_BIT)
+#define PFERR_GUEST_RMP_MASK BIT_ULL(PFERR_GUEST_RMP_BIT)
+#define PFERR_GUEST_ENC_MASK BIT_ULL(PFERR_GUEST_ENC_BIT)
+#define PFERR_GUEST_SIZEM_MASK BIT_ULL(PFERR_GUEST_SIZEM_BIT)
+#define PFERR_GUEST_VMPL_MASK BIT_ULL(PFERR_GUEST_VMPL_BIT)
#define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
PFERR_WRITE_MASK | \
--
2.25.1
From: Brijesh Singh <[email protected]>
SEV-SNP guests are required to perform a GHCB GPA registration. Before
using a GHCB GPA for a vCPU the first time, a guest must register the
vCPU GHCB GPA. If the hypervisor can work with the guest-requested GPA then
it must respond back with the same GPA, otherwise it must return -1.
On VMEXIT, verify that the GHCB GPA matches the registered value. If a
mismatch is detected, abort the guest.
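For illustration, the guest side of this registration via the GHCB MSR
protocol might look like the sketch below. The response info code (0x013) is
not shown in the hunks here and is assumed from the GHCB spec:

#include <stdint.h>

#define GHCB_MSR_INFO_MASK       0xfffULL            /* GHCBInfo[11:0] */
#define GHCB_MSR_REG_GPA_REQ     0x012
#define GHCB_MSR_REG_GPA_RESP    0x013               /* assumed, per GHCB spec */
#define GHCB_MSR_GPA_VALUE_POS   12
#define GHCB_MSR_GPA_VALUE_MASK  ((1ULL << 52) - 1)

/* Build the MSR value a guest would write to request registration. */
static uint64_t ghcb_gpa_register_req(uint64_t ghcb_gpa)
{
        uint64_t gfn = ghcb_gpa >> 12;

        return ((gfn & GHCB_MSR_GPA_VALUE_MASK) << GHCB_MSR_GPA_VALUE_POS) |
               GHCB_MSR_REG_GPA_REQ;
}

/* Check that the hypervisor accepted the requested GPA. */
static int ghcb_gpa_register_ok(uint64_t resp, uint64_t ghcb_gpa)
{
        uint64_t gfn = (resp >> GHCB_MSR_GPA_VALUE_POS) & GHCB_MSR_GPA_VALUE_MASK;

        return (resp & GHCB_MSR_INFO_MASK) == GHCB_MSR_REG_GPA_RESP &&
               gfn == (ghcb_gpa >> 12);
}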
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/sev-common.h | 8 ++++++++
arch/x86/kvm/svm/sev.c | 27 +++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.h | 7 +++++++
3 files changed, 42 insertions(+)
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 539de6b93420..0a9055cdfae2 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -59,6 +59,14 @@
#define GHCB_MSR_AP_RESET_HOLD_RESULT_POS 12
#define GHCB_MSR_AP_RESET_HOLD_RESULT_MASK GENMASK_ULL(51, 0)
+/* Preferred GHCB GPA Request */
+#define GHCB_MSR_PREF_GPA_REQ 0x010
+#define GHCB_MSR_GPA_VALUE_POS 12
+#define GHCB_MSR_GPA_VALUE_MASK GENMASK_ULL(51, 0)
+
+#define GHCB_MSR_PREF_GPA_RESP 0x011
+#define GHCB_MSR_PREF_GPA_NONE 0xfffffffffffff
+
/* GHCB GPA Register */
#define GHCB_MSR_REG_GPA_REQ 0x012
#define GHCB_MSR_REG_GPA_REQ_VAL(v) \
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index e8740c35be39..2613311f4fcc 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3424,6 +3424,27 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
break;
}
+ case GHCB_MSR_PREF_GPA_REQ: {
+ set_ghcb_msr_bits(svm, GHCB_MSR_PREF_GPA_NONE, GHCB_MSR_GPA_VALUE_MASK,
+ GHCB_MSR_GPA_VALUE_POS);
+ set_ghcb_msr_bits(svm, GHCB_MSR_PREF_GPA_RESP, GHCB_MSR_INFO_MASK,
+ GHCB_MSR_INFO_POS);
+ break;
+ }
+ case GHCB_MSR_REG_GPA_REQ: {
+ u64 gfn;
+
+ gfn = get_ghcb_msr_bits(svm, GHCB_MSR_GPA_VALUE_MASK,
+ GHCB_MSR_GPA_VALUE_POS);
+
+ svm->sev_es.ghcb_registered_gpa = gfn_to_gpa(gfn);
+
+ set_ghcb_msr_bits(svm, gfn, GHCB_MSR_GPA_VALUE_MASK,
+ GHCB_MSR_GPA_VALUE_POS);
+ set_ghcb_msr_bits(svm, GHCB_MSR_REG_GPA_RESP, GHCB_MSR_INFO_MASK,
+ GHCB_MSR_INFO_POS);
+ break;
+ }
case GHCB_MSR_TERM_REQ: {
u64 reason_set, reason_code;
@@ -3490,6 +3511,12 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
exit_code = ghcb_get_sw_exit_code(ghcb);
+ /* SEV-SNP guest requires that the GHCB GPA must be registered */
+ if (sev_snp_guest(svm->vcpu.kvm) && !ghcb_gpa_is_registered(svm, ghcb_gpa)) {
+ vcpu_unimpl(&svm->vcpu, "vmgexit: GHCB GPA [%#llx] is not registered.\n", ghcb_gpa);
+ return -EINVAL;
+ }
+
ret = sev_es_validate_vmgexit(svm);
if (ret)
return ret;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index cbd4594f1cca..0c655a4d32d5 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -204,6 +204,8 @@ struct vcpu_sev_es_state {
u32 ghcb_sa_len;
bool ghcb_sa_sync;
bool ghcb_sa_free;
+
+ u64 ghcb_registered_gpa;
};
struct vcpu_svm {
@@ -336,6 +338,11 @@ static inline bool sev_snp_guest(struct kvm *kvm)
return sev_es_guest(kvm) && sev->snp_active;
}
+static inline bool ghcb_gpa_is_registered(struct vcpu_svm *svm, u64 val)
+{
+ return svm->sev_es.ghcb_registered_gpa == val;
+}
+
static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
{
vmcb->control.clean = 0;
--
2.25.1
For private memslots, GHCB page state change requests will be forwarded
to userspace for processing. Define a new KVM_EXIT_VMGEXIT for exits of
this type.
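As a minimal sketch of the VMM side (assuming headers built from this series,
since 'vmgexit' is a new kvm_run union member, and with the exact userspace
responsibilities still TODO), a run loop might consume the exit as follows:

#include <linux/kvm.h>        /* patched headers providing run->vmgexit */
#include <stdio.h>

static void handle_vmgexit_exit(struct kvm_run *run)
{
        /* GHCB MSR contents at the time of the VMGEXIT */
        fprintf(stderr, "vmgexit: ghcb_msr=0x%llx\n",
                (unsigned long long)run->vmgexit.ghcb_msr);

        /* ... perform any page-state work, e.g. via KVM_SET_MEMORY_ATTRIBUTES ... */

        run->vmgexit.error = 0;        /* report success back to the kernel */
}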
Signed-off-by: Michael Roth <[email protected]>
---
include/uapi/linux/kvm.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2bab08a5b5d7..6e684bf5f723 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -279,6 +279,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_RISCV_CSR 36
#define KVM_EXIT_NOTIFY 37
#define KVM_EXIT_MEMORY_FAULT 38
+#define KVM_EXIT_VMGEXIT 50
/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -527,6 +528,11 @@ struct kvm_run {
__u64 gpa;
__u64 size;
} memory;
+ /* KVM_EXIT_VMGEXIT */
+ struct {
+ __u64 ghcb_msr; /* GHCB MSR contents */
+ __u8 error; /* user -> kernel */
+ } vmgexit;
/* Fix the size of the union. */
char padding[256];
};
--
2.25.1
From: Brijesh Singh <[email protected]>
The KVM_SEV_SNP_LAUNCH_UPDATE command can be used to insert data into the
guest's memory. The data is encrypted with the cryptographic context
created with the KVM_SEV_SNP_LAUNCH_START.
In addition to inserting data, it can insert two special pages
into the guest's memory: the secrets page and the CPUID page.
While terminating the guest, reclaim the guest pages added in the RMP
table. If the reclaim fails, then the page is no longer safe to be
released back to the system, so leak it instead.
For more information see the SEV-SNP specification.
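A hypothetical userspace invocation (assuming headers from this series;
launch_update_region() is just an illustrative helper name) might look like:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <string.h>

static int launch_update_region(int vm_fd, int sev_fd, uint64_t start_gfn,
                                void *uaddr, uint32_t len, uint8_t page_type)
{
        struct kvm_sev_snp_launch_update update;
        struct kvm_sev_cmd cmd;

        memset(&update, 0, sizeof(update));
        update.start_gfn = start_gfn;
        update.uaddr = (uint64_t)(uintptr_t)uaddr;
        update.len = len;
        update.page_type = page_type;   /* e.g. KVM_SEV_SNP_PAGE_TYPE_NORMAL */

        memset(&cmd, 0, sizeof(cmd));
        cmd.id = KVM_SEV_SNP_LAUNCH_UPDATE;
        cmd.data = (uint64_t)(uintptr_t)&update;
        cmd.sev_fd = sev_fd;

        return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}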
Co-developed-by: Michael Roth <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
---
.../virt/kvm/x86/amd-memory-encryption.rst | 29 +++
arch/x86/kvm/svm/sev.c | 190 ++++++++++++++++++
include/uapi/linux/kvm.h | 19 ++
3 files changed, 238 insertions(+)
diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index 58971fc02a15..c94be8e6d657 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -485,6 +485,35 @@ Returns: 0 on success, -negative on error
See the SEV-SNP specification for further detail on the launch input.
+20. KVM_SNP_LAUNCH_UPDATE
+-------------------------
+
+The KVM_SNP_LAUNCH_UPDATE command is used to encrypt a memory region. It also
+calculates a measurement of the memory contents. The measurement is a signature
+of the memory contents that can be sent to the guest owner as an attestation
+that the memory was encrypted correctly by the firmware.
+
+Parameters (in): struct kvm_sev_snp_launch_update
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_snp_launch_update {
+ __u64 start_gfn; /* Guest page number to start from. */
+	__u64 uaddr;		/* userspace address of the memory to be encrypted */
+ __u32 len; /* length of memory region */
+ __u8 imi_page; /* 1 if memory is part of the IMI */
+ __u8 page_type; /* page type */
+ __u8 vmpl3_perms; /* VMPL3 permission mask */
+ __u8 vmpl2_perms; /* VMPL2 permission mask */
+ __u8 vmpl1_perms; /* VMPL1 permission mask */
+ };
+
+See the SEV-SNP spec for further details on how to build the VMPL permission
+mask and page type.
+
+
References
==========
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 097bb2138360..03dd227f6090 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -234,6 +234,37 @@ static void sev_decommission(unsigned int handle)
sev_guest_decommission(&decommission, NULL);
}
+static int snp_page_reclaim(u64 pfn)
+{
+ struct sev_data_snp_page_reclaim data = {0};
+ int err, rc;
+
+ data.paddr = __sme_set(pfn << PAGE_SHIFT);
+ rc = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
+ if (rc) {
+ /*
+ * If the reclaim failed, then page is no longer safe
+ * to use.
+ */
+ snp_mark_pages_offline(pfn,
+ page_level_size(PG_LEVEL_4K) >> PAGE_SHIFT);
+ }
+
+ return rc;
+}
+
+static int host_rmp_make_shared(u64 pfn, enum pg_level level, bool leak)
+{
+ int rc;
+
+ rc = rmp_make_shared(pfn, level);
+ if (rc && leak)
+ snp_mark_pages_offline(pfn,
+ page_level_size(level) >> PAGE_SHIFT);
+
+ return rc;
+}
+
static void sev_unbind_asid(struct kvm *kvm, unsigned int handle)
{
struct sev_data_deactivate deactivate;
@@ -2093,6 +2124,162 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
return rc;
}
+static int snp_launch_update_gfn_handler(struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ void *opaque)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct kvm_memory_slot *memslot = range->slot;
+ struct sev_data_snp_launch_update data = {0};
+ struct kvm_sev_snp_launch_update params;
+ struct kvm_sev_cmd *argp = opaque;
+ int *error = &argp->error;
+ int i, n = 0, ret = 0;
+ unsigned long npages;
+ kvm_pfn_t *pfns;
+ gfn_t gfn;
+
+ if (!kvm_slot_can_be_private(memslot)) {
+ pr_err("SEV-SNP requires restricted memory.\n");
+ return -EINVAL;
+ }
+
+ if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params))) {
+ pr_err("Failed to copy user parameters for SEV-SNP launch.\n");
+ return -EFAULT;
+ }
+
+ data.gctx_paddr = __psp_pa(sev->snp_context);
+
+ npages = range->end - range->start;
+ pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL_ACCOUNT);
+ if (!pfns)
+ return -ENOMEM;
+
+ pr_debug("%s: GFN range 0x%llx-0x%llx, type %d\n", __func__,
+ range->start, range->end, params.page_type);
+
+ for (gfn = range->start, i = 0; gfn < range->end; gfn++, i++) {
+ int order, level;
+ void *kvaddr;
+
+ ret = kvm_restrictedmem_get_pfn(memslot, gfn, &pfns[i], &order);
+ if (ret)
+ goto e_release;
+
+ n++;
+ ret = snp_lookup_rmpentry((u64)pfns[i], &level);
+ if (ret) {
+ pr_err("Failed to ensure GFN 0x%llx is in initial shared state, ret: %d\n",
+ gfn, ret);
+			ret = -EFAULT;
+			goto e_release;
+ }
+
+ kvaddr = pfn_to_kaddr(pfns[i]);
+ if (!virt_addr_valid(kvaddr)) {
+ pr_err("Invalid HVA 0x%llx for GFN 0x%llx\n", (uint64_t)kvaddr, gfn);
+ ret = -EINVAL;
+ goto e_release;
+ }
+
+ ret = kvm_read_guest_page(kvm, gfn, kvaddr, 0, PAGE_SIZE);
+ if (ret) {
+ pr_err("Guest read failed, ret: 0x%x\n", ret);
+ goto e_release;
+ }
+
+ ret = rmp_make_private(pfns[i], gfn << PAGE_SHIFT, PG_LEVEL_4K,
+ sev_get_asid(kvm), true);
+ if (ret) {
+ ret = -EFAULT;
+ goto e_release;
+ }
+
+ data.address = __sme_set(pfns[i] << PAGE_SHIFT);
+ data.page_size = X86_TO_RMP_PG_LEVEL(PG_LEVEL_4K);
+ data.page_type = params.page_type;
+ data.vmpl3_perms = params.vmpl3_perms;
+ data.vmpl2_perms = params.vmpl2_perms;
+ data.vmpl1_perms = params.vmpl1_perms;
+ ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
+ &data, error);
+ if (ret) {
+ pr_err("SEV-SNP launch update failed, ret: 0x%x, fw_error: 0x%x\n",
+ ret, *error);
+ snp_page_reclaim(pfns[i]);
+
+ /*
+ * When invalid CPUID function entries are detected, the firmware
+			 * corrects these entries for debugging purposes and leaves the
+			 * page unencrypted so it can be provided to users for debugging
+			 * and error-reporting.
+			 *
+			 * Copy the corrected CPUID page back to shared memory so
+			 * userspace can retrieve this information.
+ */
+ if (params.page_type == SNP_PAGE_TYPE_CPUID &&
+ *error == SEV_RET_INVALID_PARAM) {
+ int ret;
+
+ host_rmp_make_shared(pfns[i], PG_LEVEL_4K, true);
+
+ ret = kvm_write_guest_page(kvm, gfn, kvaddr, 0, PAGE_SIZE);
+ if (ret)
+ pr_err("Failed to write CPUID page back to userspace, ret: 0x%x\n",
+ ret);
+ }
+
+ goto e_release;
+ }
+ }
+
+ /*
+ * Memory attribute updates via KVM_SET_MEMORY_ATTRIBUTES are serialized
+ * via kvm->slots_lock, so use the same protocol for updating them here.
+ */
+ mutex_lock(&kvm->slots_lock);
+ kvm_vm_set_region_attr(kvm, range->start, range->end, KVM_MEMORY_ATTRIBUTE_PRIVATE);
+ mutex_unlock(&kvm->slots_lock);
+
+e_release:
+ /* Content of memory is updated, mark pages dirty */
+ for (i = 0; i < n; i++) {
+ set_page_dirty(pfn_to_page(pfns[i]));
+ mark_page_accessed(pfn_to_page(pfns[i]));
+
+ /*
+ * If its an error, then update RMP entry to change page ownership
+ * to the hypervisor.
+ */
+ if (ret)
+ host_rmp_make_shared(pfns[i], PG_LEVEL_4K, true);
+
+ put_page(pfn_to_page(pfns[i]));
+ }
+
+ kvfree(pfns);
+ return ret;
+}
+
+static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct kvm_sev_snp_launch_update params;
+
+ if (!sev_snp_guest(kvm))
+ return -ENOTTY;
+
+ if (!sev->snp_context)
+ return -EINVAL;
+
+ if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+ return -EFAULT;
+
+ return kvm_vm_do_hva_range_op(kvm, params.uaddr, params.uaddr + params.len,
+ snp_launch_update_gfn_handler, argp);
+}
+
int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_sev_cmd sev_cmd;
@@ -2186,6 +2373,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
case KVM_SEV_SNP_LAUNCH_START:
r = snp_launch_start(kvm, &sev_cmd);
break;
+ case KVM_SEV_SNP_LAUNCH_UPDATE:
+ r = snp_launch_update(kvm, &sev_cmd);
+ break;
default:
r = -EINVAL;
goto out;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index cf19799ca5ce..4098bba17aa4 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1920,6 +1920,7 @@ enum sev_cmd_id {
/* SNP specific commands */
KVM_SEV_SNP_INIT,
KVM_SEV_SNP_LAUNCH_START,
+ KVM_SEV_SNP_LAUNCH_UPDATE,
KVM_SEV_NR_MAX,
};
@@ -2036,6 +2037,24 @@ struct kvm_sev_snp_launch_start {
__u8 pad[6];
};
+#define KVM_SEV_SNP_PAGE_TYPE_NORMAL 0x1
+#define KVM_SEV_SNP_PAGE_TYPE_VMSA 0x2
+#define KVM_SEV_SNP_PAGE_TYPE_ZERO 0x3
+#define KVM_SEV_SNP_PAGE_TYPE_UNMEASURED 0x4
+#define KVM_SEV_SNP_PAGE_TYPE_SECRETS 0x5
+#define KVM_SEV_SNP_PAGE_TYPE_CPUID 0x6
+
+struct kvm_sev_snp_launch_update {
+ __u64 start_gfn;
+ __u64 uaddr;
+ __u32 len;
+ __u8 imi_page;
+ __u8 page_type;
+ __u8 vmpl3_perms;
+ __u8 vmpl2_perms;
+ __u8 vmpl1_perms;
+};
+
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
#define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
--
2.25.1
From: Brijesh Singh <[email protected]>
SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change NAE event
as defined in the GHCB specification version 2.
Forward these requests to userspace as KVM_EXIT_VMGEXITs, similar to how
it is done for requests that don't use a GHCB page.
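A sketch of the VMM-side conversion of one private/shared request into a
memory attribute update is below; KVM_SET_MEMORY_ATTRIBUTES and
struct kvm_memory_attributes come from the UPM base series this is built on,
so treat the exact names as assumptions:

#include <linux/kvm.h>        /* UPM-enabled headers */
#include <sys/ioctl.h>
#include <stdint.h>

static int psc_set_range(int vm_fd, uint64_t gpa, uint64_t size, int make_private)
{
        struct kvm_memory_attributes attrs = {
                .address    = gpa,
                .size       = size,
                .attributes = make_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
        };

        return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}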
Co-developed-by: Michael Roth <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
---
arch/x86/include/asm/sev-common.h | 7 +++++++
arch/x86/kvm/svm/sev.c | 20 ++++++++++++++++++++
2 files changed, 27 insertions(+)
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index ee38f7408470..1b111cde8c82 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -130,6 +130,13 @@ enum psc_op {
/* SNP Page State Change NAE event */
#define VMGEXIT_PSC_MAX_ENTRY 253
+/* The page state change hdr structure is not valid */
+#define PSC_INVALID_HDR 1
+/* The hdr.cur_entry or hdr.end_entry is not valid */
+#define PSC_INVALID_ENTRY 2
+/* Page state change encountered undefined error */
+#define PSC_UNDEF_ERR 3
+
struct psc_hdr {
u16 cur_entry;
u16 end_entry;
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index a1a2686dde7b..102966c43e28 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3152,6 +3152,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
case SVM_VMGEXIT_AP_JUMP_TABLE:
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
case SVM_VMGEXIT_HV_FEATURES:
+ case SVM_VMGEXIT_PSC:
break;
default:
reason = GHCB_ERR_INVALID_EVENT;
@@ -3363,6 +3364,19 @@ static int snp_complete_psc_msr_protocol(struct kvm_vcpu *vcpu)
return 1; /* resume */
}
+/*
+ * TODO: need to process the GHCB contents and report the proper error code
+ * instead of assuming success.
+ */
+static int snp_complete_psc(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 0);
+
+ return 1;
+}
+
static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
{
struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3606,6 +3620,12 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
ret = 1;
break;
}
+ case SVM_VMGEXIT_PSC:
+		/* Let userspace handle allocating/deallocating backing pages. */
+ vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
+ vcpu->run->vmgexit.ghcb_msr = ghcb_gpa;
+ vcpu->arch.complete_userspace_io = snp_complete_psc;
+ break;
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
vcpu_unimpl(vcpu,
"vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
--
2.25.1
From: Brijesh Singh <[email protected]>
While resolving the RMP page fault, there may be cases where the page
level between the RMP entry and TDP does not match and the 2M RMP entry
must be split into 4K RMP entries, or a 2M TDP page needs to be broken
into multiple 4K pages.
To keep the RMP and TDP page levels in sync, zap the gfn range after
splitting the pages in the RMP entry. The zap should force the TDP to
get rebuilt with the new page level.
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu.h | 2 --
arch/x86/kvm/mmu/mmu.c | 1 +
3 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d2e1c109dde5..28b01cc7f64d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1845,6 +1845,8 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
void kvm_mmu_zap_all(struct kvm *kvm);
void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
+
int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 168c46fd8dd1..0afaf8ff2bb8 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -211,8 +211,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
return -(u32)fault & errcode;
}
-void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
-
int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
int kvm_mmu_post_init_vm(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d8e5254f314d..d7847af3e177 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6615,6 +6615,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
return need_tlb_flush;
}
+EXPORT_SYMBOL_GPL(kvm_zap_gfn_range);
static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
const struct kvm_memory_slot *slot)
--
2.25.1
In some cases, like with SEV-SNP, guest memory needs to be updated in a
platform-specific manner before it can be safely freed back to the host.
Add hooks to wire up handling of this sort to the invalidation notifiers
for restricted memory.
Also issue invalidations of all allocated pages during notifier/memslot
unbinding so that the pages are not left in an unusable state when
they eventually get freed back to the host upon FD release.
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu/mmu.c | 5 +++++
include/linux/kvm_host.h | 3 +++
mm/restrictedmem.c | 12 +++++++++++-
virt/kvm/kvm_main.c | 12 +++++++++++-
6 files changed, 32 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index a8aaf532c2ab..6a885f024a00 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -133,6 +133,7 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL_RET0(fault_is_private);
KVM_X86_OP_OPTIONAL_RET0(update_mem_attr)
+KVM_X86_OP_OPTIONAL(invalidate_restricted_mem)
#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2da3fb2d5d1b..37c92412035f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1646,6 +1646,7 @@ struct kvm_x86_ops {
bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code, bool *private_fault);
int (*update_mem_attr)(struct kvm_memory_slot *slot, unsigned int attr,
gfn_t start, gfn_t end);
+ void (*invalidate_restricted_mem)(struct kvm_memory_slot *slot, gfn_t start, gfn_t end);
bool (*has_wbinvd_exit)(void);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 053bd77bbf52..360af0c9997e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7264,4 +7264,9 @@ void kvm_arch_post_set_memory_attributes(struct kvm *kvm,
pr_warn_ratelimited("Failed to update GFN range 0x%llx-0x%llx with attributes 0x%lx. Ret: %d\n",
start, end, attrs, ret);
}
+
+void kvm_arch_invalidate_restricted_mem(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
+{
+ static_call_cond(kvm_x86_invalidate_restricted_mem)(slot, start, end);
+}
#endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d200b8f45583..4d542060cd93 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2341,6 +2341,9 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
}
+
+void kvm_arch_invalidate_restricted_mem(struct kvm_memory_slot *slot, gfn_t start, gfn_t end);
+
#else
static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
{
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index fd6f3c66033f..c8353c592cfe 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -17,7 +17,7 @@ struct restrictedmem {
static int restrictedmem_release(struct inode *inode, struct file *file)
{
- struct restrictedmem *rm = inode->i_mapping->private_data;
+ struct restrictedmem *rm = file->f_mapping->private_data;
xa_destroy(&rm->bindings);
fput(rm->memfd);
@@ -305,10 +305,20 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
struct restrictedmem_notifier *notifier)
{
struct restrictedmem *rm = file->f_mapping->private_data;
+ unsigned long index;
+ pr_debug("%s: unregistering notifier, invalidating page offsets 0x%lx-0x%lx\n",
+ __func__, start, end);
down_write(&rm->lock);
+
+ xa_for_each_range(&rm->bindings, index, notifier, start, end)
+ notifier->ops->invalidate_start(notifier, start, end);
+ xa_for_each_range(&rm->bindings, index, notifier, start, end)
+ notifier->ops->invalidate_end(notifier, start, end);
+
xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
synchronize_rcu();
+
up_write(&rm->lock);
}
EXPORT_SYMBOL_GPL(restrictedmem_unbind);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8ec985f1c57d..f7e00593cc5d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -960,8 +960,15 @@ static void kvm_restrictedmem_invalidate_begin(struct restrictedmem_notifier *no
struct kvm *kvm = slot->kvm;
int idx;
- if (restrictedmem_get_gfn_range(slot, start, end, &gfn_range))
+ if (restrictedmem_get_gfn_range(slot, start, end, &gfn_range)) {
+ pr_debug("%s: Invalidation skipped, slot: %d, start: 0x%lx, end: 0x%lx, restrictedmem.index: 0x%lx\n",
+ __func__, slot->id, start, end, slot->restrictedmem.index);
return;
+ }
+
+ pr_debug("%s: slot: %d, start: 0x%lx, end: 0x%lx, restrictedmem.index: 0x%lx, gfn_start: 0x%llx, gfn_end: 0x%llx\n",
+ __func__, slot->id, start, end, slot->restrictedmem.index, gfn_range.start,
+ gfn_range.end);
idx = srcu_read_lock(&kvm->srcu);
KVM_MMU_LOCK(kvm);
@@ -972,7 +979,10 @@ static void kvm_restrictedmem_invalidate_begin(struct restrictedmem_notifier *no
kvm_flush_remote_tlbs(kvm);
KVM_MMU_UNLOCK(kvm);
+
srcu_read_unlock(&kvm->srcu, idx);
+
+ kvm_arch_invalidate_restricted_mem(slot, gfn_range.start, gfn_range.end);
}
static void kvm_restrictedmem_invalidate_end(struct restrictedmem_notifier *notifier,
--
2.25.1
From: Brijesh Singh <[email protected]>
SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change MSR protocol
as defined in the GHCB specification.
Forward these requests to userspace via KVM_EXIT_VMGEXIT so the VMM can
issue the KVM ioctls to update the page state accordingly.
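Purely as an illustration of the field layout added below, userspace could
pull the GFN and requested operation out of the forwarded GHCB MSR value
like this:

#include <stdint.h>

#define GHCB_MSR_PSC_GFN_POS   12
#define GHCB_MSR_PSC_GFN_MASK  ((1ULL << 40) - 1)
#define GHCB_MSR_PSC_OP_POS    52
#define GHCB_MSR_PSC_OP_MASK   0xfULL

static void decode_psc_req(uint64_t ghcb_msr, uint64_t *gfn, unsigned int *op)
{
        *gfn = (ghcb_msr >> GHCB_MSR_PSC_GFN_POS) & GHCB_MSR_PSC_GFN_MASK;
        *op  = (unsigned int)((ghcb_msr >> GHCB_MSR_PSC_OP_POS) &
                              GHCB_MSR_PSC_OP_MASK);
}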
Co-developed-by: Michael Roth <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
---
arch/x86/include/asm/sev-common.h | 9 ++++++++
arch/x86/kvm/svm/sev.c | 25 +++++++++++++++++++++++
arch/x86/kvm/trace.h | 34 +++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 1 +
4 files changed, 69 insertions(+)
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 0a9055cdfae2..ee38f7408470 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -93,6 +93,10 @@ enum psc_op {
};
#define GHCB_MSR_PSC_REQ 0x014
+#define GHCB_MSR_PSC_GFN_POS 12
+#define GHCB_MSR_PSC_GFN_MASK GENMASK_ULL(39, 0)
+#define GHCB_MSR_PSC_OP_POS 52
+#define GHCB_MSR_PSC_OP_MASK 0xf
#define GHCB_MSR_PSC_REQ_GFN(gfn, op) \
/* GHCBData[55:52] */ \
(((u64)((op) & 0xf) << 52) | \
@@ -102,6 +106,11 @@ enum psc_op {
GHCB_MSR_PSC_REQ)
#define GHCB_MSR_PSC_RESP 0x015
+#define GHCB_MSR_PSC_ERROR_POS 32
+#define GHCB_MSR_PSC_ERROR_MASK GENMASK_ULL(31, 0)
+#define GHCB_MSR_PSC_ERROR GENMASK_ULL(31, 0)
+#define GHCB_MSR_PSC_RSVD_POS 12
+#define GHCB_MSR_PSC_RSVD_MASK GENMASK_ULL(19, 0)
#define GHCB_MSR_PSC_RESP_VAL(val) \
/* GHCBData[63:32] */ \
(((u64)(val) & GENMASK_ULL(63, 32)) >> 32)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 2613311f4fcc..a1a2686dde7b 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -30,6 +30,7 @@
#include "svm_ops.h"
#include "cpuid.h"
#include "trace.h"
+#include "mmu.h"
#ifndef CONFIG_KVM_AMD_SEV
/*
@@ -3345,6 +3346,23 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
svm->vmcb->control.ghcb_gpa = value;
}
+/*
+ * TODO: need to get the value set by userspace in vcpu->run->vmgexit.ghcb_msr
+ * and process that here accordingly.
+ */
+static int snp_complete_psc_msr_protocol(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ set_ghcb_msr_bits(svm, 0,
+ GHCB_MSR_PSC_ERROR_MASK, GHCB_MSR_PSC_ERROR_POS);
+
+ set_ghcb_msr_bits(svm, 0, GHCB_MSR_PSC_RSVD_MASK, GHCB_MSR_PSC_RSVD_POS);
+ set_ghcb_msr_bits(svm, GHCB_MSR_PSC_RESP, GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
+
+ return 1; /* resume */
+}
+
static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
{
struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3445,6 +3463,13 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
GHCB_MSR_INFO_POS);
break;
}
+ case GHCB_MSR_PSC_REQ:
+ vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
+ vcpu->run->vmgexit.ghcb_msr = control->ghcb_gpa;
+ vcpu->arch.complete_userspace_io = snp_complete_psc_msr_protocol;
+
+ ret = -1;
+ break;
case GHCB_MSR_TERM_REQ: {
u64 reason_set, reason_code;
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 83843379813e..65861d2d086c 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -7,6 +7,7 @@
#include <asm/svm.h>
#include <asm/clocksource.h>
#include <asm/pvclock-abi.h>
+#include <asm/sev-common.h>
#undef TRACE_SYSTEM
#define TRACE_SYSTEM kvm
@@ -1831,6 +1832,39 @@ TRACE_EVENT(kvm_vmgexit_msr_protocol_exit,
__entry->vcpu_id, __entry->ghcb_gpa, __entry->result)
);
+/*
+ * Tracepoint for the SEV-SNP page state change processing
+ */
+#define psc_operation \
+ {SNP_PAGE_STATE_PRIVATE, "private"}, \
+ {SNP_PAGE_STATE_SHARED, "shared"} \
+
+TRACE_EVENT(kvm_snp_psc,
+ TP_PROTO(unsigned int vcpu_id, u64 pfn, u64 gpa, u8 op, int level),
+ TP_ARGS(vcpu_id, pfn, gpa, op, level),
+
+ TP_STRUCT__entry(
+ __field(int, vcpu_id)
+ __field(u64, pfn)
+ __field(u64, gpa)
+ __field(u8, op)
+ __field(int, level)
+ ),
+
+ TP_fast_assign(
+ __entry->vcpu_id = vcpu_id;
+ __entry->pfn = pfn;
+ __entry->gpa = gpa;
+ __entry->op = op;
+ __entry->level = level;
+ ),
+
+ TP_printk("vcpu %u, pfn %llx, gpa %llx, op %s, level %d",
+ __entry->vcpu_id, __entry->pfn, __entry->gpa,
+ __print_symbolic(__entry->op, psc_operation),
+ __entry->level)
+);
+
#endif /* _TRACE_KVM_H */
#undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 268c3d16894d..0154fc7a28c1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13515,6 +13515,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_enter);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_exit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_enter);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_snp_psc);
static int __init kvm_x86_init(void)
{
--
2.25.1
From: Tom Lendacky <[email protected]>
Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
guests to alter the register state of the APs on their own. This gives
the guest a way of simulating INIT-SIPI.
A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
so as to avoid updating the VMSA pointer while the vCPU is running.
For CREATE:
The guest supplies the GPA of the VMSA to be used for the vCPU with
the specified APIC ID. The GPA is saved in the svm struct of the
target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
to the vCPU and then the vCPU is kicked.
For CREATE_ON_INIT:
The guest supplies the GPA of the VMSA to be used for the vCPU with
the specified APIC ID the next time an INIT is performed. The GPA is
saved in the svm struct of the target vCPU.
For DESTROY:
The guest indicates it wishes to stop the vCPU. The GPA is cleared
from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is
added to vCPU and then the vCPU is kicked.
The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked
as a result of the event or as a result of an INIT. The handler sets the
vCPU to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will
leave the vCPU as not runnable. Any previous VMSA pages that were
installed as part of an SEV-SNP AP Creation NAE event are un-pinned. If
a new VMSA is to be installed, the VMSA guest page is pinned and set as
the VMSA in the vCPU VMCB and the vCPU state is set to
KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is
cleared in the vCPU VMCB and the vCPU state is left as
KVM_MP_STATE_UNINITIALIZED to prevent it from being run.
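For reference, the request as seen by the handler below is packed into the
GHCB SW_EXITINFO fields and RAX; this is just an unpacking sketch, with the
struct and field names invented for illustration:

#include <stdint.h>

struct ap_create_req {
        uint32_t request;        /* CREATE, CREATE_ON_INIT or DESTROY */
        uint32_t apic_id;        /* APIC ID of the target vCPU */
        uint64_t vmsa_gpa;       /* GPA of the new VMSA (CREATE cases) */
        uint64_t sev_features;   /* guest RAX, checked against launch features */
};

static void ap_create_unpack(uint64_t exit_info_1, uint64_t exit_info_2,
                             uint64_t rax, struct ap_create_req *req)
{
        req->request      = (uint32_t)exit_info_1;           /* lower 32 bits */
        req->apic_id      = (uint32_t)(exit_info_1 >> 32);   /* upper 32 bits */
        req->vmsa_gpa     = exit_info_2;
        req->sev_features = rax;
}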
Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
[mdr: add handling for restrictedmem]
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/asm/svm.h | 7 +-
arch/x86/kvm/svm/sev.c | 245 ++++++++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.c | 3 +
arch/x86/kvm/svm/svm.h | 7 +
arch/x86/kvm/x86.c | 9 ++
6 files changed, 271 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 28b01cc7f64d..09b36462582c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -113,6 +113,7 @@
KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_HV_TLB_FLUSH \
KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE KVM_ARCH_REQ(34)
#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index c18d78d5e505..e76ad26ba64f 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -278,7 +278,12 @@ enum avic_ipi_failure_cause {
#define AVIC_HPA_MASK ~((0xFFFULL << 52) | 0xFFF)
#define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL
-#define SVM_SEV_FEAT_SNP_ACTIVE BIT(0)
+#define SVM_SEV_FEAT_SNP_ACTIVE BIT(0)
+#define SVM_SEV_FEAT_RESTRICTED_INJECTION BIT(3)
+#define SVM_SEV_FEAT_ALTERNATE_INJECTION BIT(4)
+#define SVM_SEV_FEAT_INT_INJ_MODES \
+ (SVM_SEV_FEAT_RESTRICTED_INJECTION | \
+ SVM_SEV_FEAT_ALTERNATE_INJECTION)
struct vmcb_seg {
u16 selector;
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 6bec2712ecc6..b2f1a12685ed 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -779,6 +779,7 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
static int sev_es_sync_vmsa(struct vcpu_svm *svm)
{
+ struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
struct sev_es_save_area *save = svm->sev_es.vmsa;
/* Check some debug related fields before encrypting the VMSA */
@@ -824,6 +825,12 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm)
if (sev_snp_guest(svm->vcpu.kvm))
save->sev_features |= SVM_SEV_FEAT_SNP_ACTIVE;
+ /*
+ * Save the VMSA synced SEV features. For now, they are the same for
+ * all vCPUs, so just save each time.
+ */
+ sev->sev_features = save->sev_features;
+
pr_debug("Virtual Machine Save Area (VMSA):\n");
print_hex_dump_debug("", DUMP_PREFIX_NONE, 16, 1, save, sizeof(*save), false);
@@ -3161,6 +3168,10 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
if (!ghcb_sw_scratch_is_valid(ghcb))
goto vmgexit_err;
break;
+ case SVM_VMGEXIT_AP_CREATION:
+ if (!ghcb_rax_is_valid(ghcb))
+ goto vmgexit_err;
+ break;
case SVM_VMGEXIT_NMI_COMPLETE:
case SVM_VMGEXIT_AP_HLT_LOOP:
case SVM_VMGEXIT_AP_JUMP_TABLE:
@@ -3543,6 +3554,226 @@ static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gp
ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, rc);
}
+static kvm_pfn_t gfn_to_pfn_restricted(struct kvm *kvm, gfn_t gfn)
+{
+ struct kvm_memory_slot *slot;
+ kvm_pfn_t pfn;
+ int order = 0;
+
+ slot = gfn_to_memslot(kvm, gfn);
+ if (!kvm_slot_can_be_private(slot)) {
+ pr_err("SEV: Failure retrieving restricted memslot for GFN 0x%llx, flags 0x%x, userspace_addr: 0x%lx\n",
+ gfn, slot->flags, slot->userspace_addr);
+ return INVALID_PAGE;
+ }
+
+ if (!kvm_mem_is_private(kvm, gfn)) {
+ pr_err("SEV: Failure retrieving restricted PFN for GFN 0x%llx\n", gfn);
+ return INVALID_PAGE;
+ }
+
+ if (kvm_restrictedmem_get_pfn(slot, gfn, &pfn, &order)) {
+ pr_err("SEV: Failure retrieving restricted PFN for GFN 0x%llx\n", gfn);
+ return INVALID_PAGE;
+ }
+
+ put_page(pfn_to_page(pfn));
+
+ return pfn;
+}
+
+static int __sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ kvm_pfn_t pfn;
+ hpa_t cur_pa;
+
+ WARN_ON(!mutex_is_locked(&svm->sev_es.snp_vmsa_mutex));
+
+ /* Save off the current VMSA PA for later checks */
+ cur_pa = svm->sev_es.vmsa_pa;
+
+ /* Mark the vCPU as offline and not runnable */
+ vcpu->arch.pv.pv_unhalted = false;
+ vcpu->arch.mp_state = KVM_MP_STATE_STOPPED;
+
+ /* Clear use of the VMSA */
+ svm->sev_es.vmsa_pa = INVALID_PAGE;
+ svm->vmcb->control.vmsa_pa = INVALID_PAGE;
+
+ if (cur_pa != __pa(svm->sev_es.vmsa) && VALID_PAGE(cur_pa)) {
+ /*
+ * The svm->sev_es.vmsa_pa field holds the hypervisor physical
+ * address of the about to be replaced VMSA which will no longer
+ * be used or referenced, so un-pin it. However, restricted
+ * pages (e.g. via AP creation) should be left to the
+ * restrictedmem backend to deal with, so don't release the
+ * page in that case.
+ */
+ if (!VALID_PAGE(gfn_to_pfn_restricted(vcpu->kvm,
+ gpa_to_gfn(svm->sev_es.snp_vmsa_gpa))))
+ kvm_release_pfn_dirty(__phys_to_pfn(cur_pa));
+ }
+
+ if (VALID_PAGE(svm->sev_es.snp_vmsa_gpa)) {
+ /*
+ * The VMSA is referenced by the hypervisor physical address,
+ * so retrieve the PFN and ensure it is restricted memory.
+ */
+ pfn = gfn_to_pfn_restricted(vcpu->kvm, gpa_to_gfn(svm->sev_es.snp_vmsa_gpa));
+ if (!VALID_PAGE(pfn))
+ return pfn;
+
+ /* Use the new VMSA */
+ svm->sev_es.vmsa_pa = pfn_to_hpa(pfn);
+ svm->vmcb->control.vmsa_pa = svm->sev_es.vmsa_pa;
+
+ /* Mark the vCPU as runnable */
+ vcpu->arch.pv.pv_unhalted = false;
+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+
+ svm->sev_es.snp_vmsa_gpa = INVALID_PAGE;
+ }
+
+ /*
+ * When replacing the VMSA during SEV-SNP AP creation,
+ * mark the VMCB dirty so that full state is always reloaded.
+ */
+ vmcb_mark_all_dirty(svm->vmcb);
+
+ return 0;
+}
+
+/*
+ * Invoked as part of svm_vcpu_reset() processing of an init event.
+ */
+void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ int ret;
+
+ if (!sev_snp_guest(vcpu->kvm))
+ return;
+
+ mutex_lock(&svm->sev_es.snp_vmsa_mutex);
+
+ if (!svm->sev_es.snp_ap_create)
+ goto unlock;
+
+ svm->sev_es.snp_ap_create = false;
+
+ ret = __sev_snp_update_protected_guest_state(vcpu);
+ if (ret)
+ vcpu_unimpl(vcpu, "snp: AP state update on init failed\n");
+
+unlock:
+ mutex_unlock(&svm->sev_es.snp_vmsa_mutex);
+}
+
+static int sev_snp_ap_creation(struct vcpu_svm *svm)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct kvm_vcpu *target_vcpu;
+ struct vcpu_svm *target_svm;
+ unsigned int request;
+ unsigned int apic_id;
+ bool kick;
+ int ret;
+
+ request = lower_32_bits(svm->vmcb->control.exit_info_1);
+ apic_id = upper_32_bits(svm->vmcb->control.exit_info_1);
+
+ /* Validate the APIC ID */
+ target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id);
+ if (!target_vcpu) {
+ vcpu_unimpl(vcpu, "vmgexit: invalid AP APIC ID [%#x] from guest\n",
+ apic_id);
+ return -EINVAL;
+ }
+
+ ret = 0;
+
+ target_svm = to_svm(target_vcpu);
+
+ /*
+ * The target vCPU is valid, so the vCPU will be kicked unless the
+ * request is for CREATE_ON_INIT. For any errors at this stage, the
+	 * kick will place the vCPU in a non-runnable state.
+ */
+ kick = true;
+
+ mutex_lock(&target_svm->sev_es.snp_vmsa_mutex);
+
+ target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE;
+ target_svm->sev_es.snp_ap_create = true;
+
+ /* Interrupt injection mode shouldn't change for AP creation */
+ if (request < SVM_VMGEXIT_AP_DESTROY) {
+ u64 sev_features;
+
+ sev_features = vcpu->arch.regs[VCPU_REGS_RAX];
+ sev_features ^= sev->sev_features;
+ if (sev_features & SVM_SEV_FEAT_INT_INJ_MODES) {
+ vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n",
+ vcpu->arch.regs[VCPU_REGS_RAX]);
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+
+ switch (request) {
+ case SVM_VMGEXIT_AP_CREATE_ON_INIT:
+ kick = false;
+ fallthrough;
+ case SVM_VMGEXIT_AP_CREATE:
+ if (!page_address_valid(vcpu, svm->vmcb->control.exit_info_2)) {
+ vcpu_unimpl(vcpu, "vmgexit: invalid AP VMSA address [%#llx] from guest\n",
+ svm->vmcb->control.exit_info_2);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /*
+		 * A malicious guest can RMPADJUST a large page into a VMSA, which
+		 * will hit the SNP erratum where the CPU will incorrectly signal
+		 * an RMP violation #PF if a hugepage collides with the RMP entry
+		 * of the VMSA page. Reject the AP CREATE request if the VMSA
+		 * address from the guest is 2M-aligned.
+ */
+ if (IS_ALIGNED(svm->vmcb->control.exit_info_2, PMD_SIZE)) {
+ vcpu_unimpl(vcpu,
+ "vmgexit: AP VMSA address [%llx] from guest is unsafe as it is 2M aligned\n",
+ svm->vmcb->control.exit_info_2);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ target_svm->sev_es.snp_vmsa_gpa = svm->vmcb->control.exit_info_2;
+ break;
+ case SVM_VMGEXIT_AP_DESTROY:
+ break;
+ default:
+ vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n",
+ request);
+ ret = -EINVAL;
+ break;
+ }
+
+out:
+ if (kick) {
+ if (target_vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)
+ target_vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+
+ kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, target_vcpu);
+ kvm_vcpu_kick(target_vcpu);
+ }
+
+ mutex_unlock(&target_svm->sev_es.snp_vmsa_mutex);
+
+ return ret;
+}
+
static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
{
struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3806,6 +4037,18 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
ret = 1;
break;
}
+ case SVM_VMGEXIT_AP_CREATION:
+ ret = sev_snp_ap_creation(svm);
+ if (ret) {
+ ghcb_set_sw_exit_info_1(ghcb, 1);
+ ghcb_set_sw_exit_info_2(ghcb,
+ X86_TRAP_GP |
+ SVM_EVTINJ_TYPE_EXEPT |
+ SVM_EVTINJ_VALID);
+ }
+
+ ret = 1;
+ break;
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
vcpu_unimpl(vcpu,
"vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
@@ -3910,6 +4153,8 @@ void sev_es_vcpu_reset(struct vcpu_svm *svm)
set_ghcb_msr(svm, GHCB_MSR_SEV_INFO(GHCB_VERSION_MAX,
GHCB_VERSION_MIN,
sev_enc_bit));
+
+ mutex_init(&svm->sev_es.snp_vmsa_mutex);
}
void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 745f736d9c98..539926b07ee5 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1349,6 +1349,9 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
svm->spec_ctrl = 0;
svm->virt_spec_ctrl = 0;
+ if (init_event)
+ sev_snp_init_protected_guest_state(vcpu);
+
init_vmcb(vcpu);
if (!init_event)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index b6ca6657aa6c..37bd7b728d52 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -95,6 +95,8 @@ struct kvm_sev_info {
void *snp_context; /* SNP guest context page */
void *snp_certs_data;
struct mutex guest_req_lock; /* Lock for guest request handling */
+
+ u64 sev_features; /* Features set at VMSA creation */
};
struct kvm_svm {
@@ -209,6 +211,10 @@ struct vcpu_sev_es_state {
bool ghcb_sa_free;
u64 ghcb_registered_gpa;
+
+ struct mutex snp_vmsa_mutex;
+ gpa_t snp_vmsa_gpa;
+ bool snp_ap_create;
};
struct vcpu_svm {
@@ -718,6 +724,7 @@ void sev_es_unmap_ghcb(struct vcpu_svm *svm);
struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
void sev_adjust_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *level);
void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
+void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
/* vmenter.S */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0154fc7a28c1..9872217e3a06 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10501,6 +10501,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
+
+ if (kvm_check_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, vcpu)) {
+ kvm_vcpu_reset(vcpu, true);
+ if (vcpu->arch.mp_state != KVM_MP_STATE_RUNNABLE)
+ goto out;
+ }
}
if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
@@ -12698,6 +12704,9 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
return true;
#endif
+ if (kvm_test_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, vcpu))
+ return true;
+
if (kvm_arch_interrupt_allowed(vcpu) &&
(kvm_cpu_has_interrupt(vcpu) ||
kvm_guest_apic_has_interrupt(vcpu)))
--
2.25.1
The KVM MMU will use this to determine whether an #NPF should be serviced
with restricted memory or not.
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/sev.c | 10 ++++++++++
arch/x86/kvm/svm/svm.c | 1 +
arch/x86/kvm/svm/svm.h | 2 ++
3 files changed, 13 insertions(+)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 73d614c538da..7a74a92cb39a 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -4499,3 +4499,13 @@ int sev_update_mem_attr(struct kvm_memory_slot *slot, unsigned int attr,
return 0;
}
+
+bool sev_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 error_code, bool *private_fault)
+{
+ if (!sev_snp_guest(kvm))
+ return false;
+
+ *private_fault = (error_code & PFERR_GUEST_ENC_MASK) ? true : false;
+
+ return true;
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index e2edc4700e55..18e4a6c17d11 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4861,6 +4861,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.adjust_mapping_level = sev_adjust_mapping_level,
.update_mem_attr = sev_update_mem_attr,
+ .fault_is_private = sev_fault_is_private,
};
/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 50a2bcaf3fd7..97038afa8020 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -728,6 +728,8 @@ void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
int sev_update_mem_attr(struct kvm_memory_slot *slot, unsigned int attr,
gfn_t start, gfn_t end);
+bool sev_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 error_code, bool *private_fault);
+
/* vmenter.S */
void __svm_sev_es_vcpu_run(struct vcpu_svm *svm, bool spec_ctrl_intercepted);
--
2.25.1
Implement a platform hook to do the work of restoring the direct map
entries and cleaning up RMP table entries for restricted memory that is
being freed back to the host.
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/sev.c | 62 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.c | 1 +
arch/x86/kvm/svm/svm.h | 1 +
3 files changed, 64 insertions(+)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 7a74a92cb39a..bedec90d034f 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -4509,3 +4509,65 @@ bool sev_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 error_code, bool *priv
return true;
}
+
+void sev_invalidate_private_range(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
+{
+ gfn_t gfn = start;
+
+ if (!sev_snp_guest(slot->kvm))
+ return;
+
+ if (!kvm_slot_can_be_private(slot)) {
+ pr_warn_ratelimited("SEV: Memslot for GFN: 0x%llx is not private.\n",
+ gfn);
+ return;
+ }
+
+ while (gfn <= end) {
+ gpa_t gpa = gfn_to_gpa(gfn);
+ int level = PG_LEVEL_4K;
+ int order, rc;
+ kvm_pfn_t pfn;
+
+ rc = kvm_restrictedmem_get_pfn(slot, gfn, &pfn, &order);
+ if (rc) {
+ pr_warn_ratelimited("SEV: Failed to retrieve restricted PFN for GFN 0x%llx, rc: %d\n",
+ gfn, rc);
+ gfn++;
+ continue;
+ }
+
+ if (order) {
+ int rmp_level;
+
+ if (IS_ALIGNED(gpa, page_level_size(PG_LEVEL_2M)) &&
+ gpa + page_level_size(PG_LEVEL_2M) <= gfn_to_gpa(end))
+ level = PG_LEVEL_2M;
+ else
+ pr_debug("%s: GPA 0x%llx is not aligned to 2M, skipping 2M directmap restoration\n",
+ __func__, gpa);
+
+ /*
+ * TODO: It may still be possible to restore 2M mapping here,
+ * but keep it simple for now.
+ */
+ if (level == PG_LEVEL_2M &&
+ (!snp_lookup_rmpentry(pfn, &rmp_level) || rmp_level == PG_LEVEL_4K)) {
+ pr_debug("%s: PFN 0x%llx is not mapped as 2M private range, skipping 2M directmap restoration\n",
+ __func__, pfn);
+ level = PG_LEVEL_4K;
+ }
+ }
+
+ pr_debug("%s: GPA %llx PFN %llx order %d level %d\n",
+ __func__, gpa, pfn, order, level);
+ rc = snp_make_page_shared(slot->kvm, gpa, pfn, level);
+ if (rc)
+ pr_err("SEV: Failed to restore page to shared, GPA: 0x%llx PFN: 0x%llx order: %d rc: %d\n",
+ gpa, pfn, order, rc);
+
+ gfn += page_level_size(level) >> PAGE_SHIFT;
+ put_page(pfn_to_page(pfn));
+ cond_resched();
+ }
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 18e4a6c17d11..3fe5f13b5f3a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4862,6 +4862,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.adjust_mapping_level = sev_adjust_mapping_level,
.update_mem_attr = sev_update_mem_attr,
.fault_is_private = sev_fault_is_private,
+ .invalidate_restricted_mem = sev_invalidate_private_range,
};
/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 97038afa8020..857b674e68f0 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -727,6 +727,7 @@ void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
int sev_update_mem_attr(struct kvm_memory_slot *slot, unsigned int attr,
gfn_t start, gfn_t end);
+void sev_invalidate_private_range(struct kvm_memory_slot *slot, gfn_t start, gfn_t end);
bool sev_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 error_code, bool *private_fault);
--
2.25.1
From: Tom Lendacky <[email protected]>
In preparation for supporting SEV-SNP AP Creation, use a variable that holds
the VMSA physical address rather than converting the virtual address.
This will allow SEV-SNP AP Creation to set the new physical address that
will be used should the vCPU reset path be taken.
Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/sev.c | 5 ++---
arch/x86/kvm/svm/svm.c | 9 ++++++++-
arch/x86/kvm/svm/svm.h | 1 +
3 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 92179614102e..6bec2712ecc6 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3849,10 +3849,9 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm)
/*
* An SEV-ES guest requires a VMSA area that is a separate from the
- * VMCB page. Do not include the encryption mask on the VMSA physical
- * address since hardware will access it using the guest key.
+ * VMCB page.
*/
- svm->vmcb->control.vmsa_pa = __pa(svm->sev_es.vmsa);
+ svm->vmcb->control.vmsa_pa = svm->sev_es.vmsa_pa;
/* Can't intercept CR register access, HV can't modify CR registers */
svm_clr_intercept(svm, INTERCEPT_CR0_READ);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index f9ab4bf6d245..745f736d9c98 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1410,9 +1410,16 @@ static int svm_vcpu_create(struct kvm_vcpu *vcpu)
svm->vmcb01.pa = __sme_set(page_to_pfn(vmcb01_page) << PAGE_SHIFT);
svm_switch_vmcb(svm, &svm->vmcb01);
- if (vmsa_page)
+ if (vmsa_page) {
svm->sev_es.vmsa = page_address(vmsa_page);
+ /*
+ * Do not include the encryption mask on the VMSA physical
+ * address since hardware will access it using the guest key.
+ */
+ svm->sev_es.vmsa_pa = __pa(svm->sev_es.vmsa);
+ }
+
svm->guest_state_loaded = false;
return 0;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 4a9ffb7e5139..b6ca6657aa6c 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -198,6 +198,7 @@ struct vcpu_sev_es_state {
struct sev_es_save_area *vmsa;
struct ghcb *ghcb;
struct kvm_host_map ghcb_map;
+ hpa_t vmsa_pa;
bool received_first_sipi;
unsigned int ap_reset_hold_type;
--
2.25.1
From: Brijesh Singh <[email protected]>
When SEV-SNP is enabled in the guest, the hardware places restrictions
on all memory accesses based on the contents of the RMP table. When
the hardware encounters an RMP check failure caused by a guest memory access,
it raises a #NPF. The error code contains additional information on
the access type. See the APM volume 2 for additional information.
Page state changes are handled by userspace, so if an RMP fault is
triggered as a result of an RMP NPT fault, exit to userspace just like
with explicit page-state change requests.
RMP NPT faults can also occur if the guest pvalidates a 2M page as 4K,
in which case the RMP entries need to be PSMASH'd. Handle this case
immediately in the kernel.
Co-developed-by: Michael Roth <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
---
arch/x86/kvm/svm/sev.c | 84 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.c | 21 +++++++++--
arch/x86/kvm/svm/svm.h | 1 +
3 files changed, 102 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 102966c43e28..197b1f904567 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3347,6 +3347,13 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
svm->vmcb->control.ghcb_gpa = value;
}
+static int snp_rmptable_psmash(struct kvm *kvm, kvm_pfn_t pfn)
+{
+ pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+
+ return psmash(pfn);
+}
+
/*
* TODO: need to get the value set by userspace in vcpu->run->vmgexit.ghcb_msr
* and process that here accordingly.
@@ -3872,3 +3879,80 @@ void sev_adjust_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *le
pr_debug("%s: GFN: 0x%llx, PFN: 0x%llx, level: %d, rmp_level: %d, level_orig: %d, assigned: %d\n",
__func__, gfn, pfn, *level, rmp_level, level_orig, assigned);
}
+
+void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
+{
+ int order, rmp_level, assigned, ret;
+ struct kvm_memory_slot *slot;
+ struct kvm *kvm = vcpu->kvm;
+ kvm_pfn_t pfn;
+ gfn_t gfn;
+
+ /*
+ * Private memslots punt handling of implicit page state changes to
+	 * userspace, so the only RMP faults expected here are for
+ * PFERR_GUEST_SIZEM_MASK. Anything else suggests that the RMP table has
+ * gotten out of sync with the private memslot.
+ *
+ * TODO: However, this case has also been noticed when an access occurs
+ * to an NPT mapping that has just been split/PSMASHED, in which case
+ * PFERR_GUEST_SIZEM_MASK might not be set. In those cases it should be
+ * safe to ignore and let the guest retry, but log these just in case
+ * for now.
+ */
+ if (!(error_code & PFERR_GUEST_SIZEM_MASK)) {
+ pr_warn_ratelimited("Unexpected RMP fault for GPA 0x%llx, error_code 0x%llx",
+ gpa, error_code);
+ return;
+ }
+
+ gfn = gpa >> PAGE_SHIFT;
+
+ /*
+ * Only RMPADJUST/PVALIDATE should cause PFERR_GUEST_SIZEM.
+ *
+ * For PVALIDATE, this should only happen if a guest PVALIDATEs a 4K GFN
+ * that is backed by a huge page in the host whose RMP entry has the
+ * hugepage/assigned bits set. With UPM, that should only ever happen
+ * for private pages.
+ *
+ * For RMPADJUST, this assumption might not hold, in which case handling
+ * for obtaining the PFN from HVA-backed memory may be needed. For now,
+ * just print warnings.
+ */
+ if (!kvm_mem_is_private(kvm, gfn)) {
+ pr_warn_ratelimited("Unexpected RMP fault, size-mismatch for non-private GPA 0x%llx\n",
+ gpa);
+ return;
+ }
+
+ slot = gfn_to_memslot(kvm, gfn);
+ if (!kvm_slot_can_be_private(slot)) {
+ pr_warn_ratelimited("Unexpected RMP fault, non-private slot for GPA 0x%llx\n",
+ gpa);
+ return;
+ }
+
+ ret = kvm_restrictedmem_get_pfn(slot, gfn, &pfn, &order);
+ if (ret) {
+ pr_warn_ratelimited("Unexpected RMP fault, no private backing page for GPA 0x%llx\n",
+ gpa);
+ return;
+ }
+
+ assigned = snp_lookup_rmpentry(pfn, &rmp_level);
+ if (assigned != 1) {
+ pr_warn_ratelimited("Unexpected RMP fault, no assigned RMP entry for GPA 0x%llx\n",
+ gpa);
+ goto out;
+ }
+
+ ret = snp_rmptable_psmash(kvm, pfn);
+ if (ret)
+ pr_err_ratelimited("Unable to split RMP entries for GPA 0x%llx PFN 0x%llx ret %d\n",
+ gpa, pfn, ret);
+
+out:
+ kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
+ put_page(pfn_to_page(pfn));
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9eb750c8b04c..f9ab4bf6d245 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1976,15 +1976,28 @@ static int pf_interception(struct kvm_vcpu *vcpu)
static int npf_interception(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
+ int rc;
u64 fault_address = svm->vmcb->control.exit_info_2;
u64 error_code = svm->vmcb->control.exit_info_1;
trace_kvm_page_fault(vcpu, fault_address, error_code);
- return kvm_mmu_page_fault(vcpu, fault_address, error_code,
- static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
- svm->vmcb->control.insn_bytes : NULL,
- svm->vmcb->control.insn_len);
+ rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
+ static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
+ svm->vmcb->control.insn_bytes : NULL,
+ svm->vmcb->control.insn_len);
+
+ /*
+ * rc == 0 indicates a userspace exit is needed to handle page
+ * transitions, so do that first before updating the RMP table.
+ */
+ if (error_code & PFERR_GUEST_RMP_MASK) {
+ if (rc == 0)
+ return rc;
+ handle_rmp_page_fault(vcpu, fault_address, error_code);
+ }
+
+ return rc;
}
static int db_interception(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 0c655a4d32d5..13b00233b315 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -714,6 +714,7 @@ void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa);
void sev_es_unmap_ghcb(struct vcpu_svm *svm);
struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
void sev_adjust_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *level);
+void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
/* vmenter.S */
--
2.25.1
From: Vishal Annapurve <[email protected]>
Introduce an HVA range operator so that other KVM subsystems
can operate on an HVA range.
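As a rough usage sketch (the handler and variable names below are hypothetical,
not part of this patch), a caller provides a kvm_hva_range_op_t callback and the
helper invokes it for every memslot range that intersects the HVA range:

  static int example_gfn_handler(struct kvm *kvm, struct kvm_gfn_range *range,
                                 void *data)
  {
          gfn_t gfn;

          /* Visit each GFN of this memslot covered by the HVA range. */
          for (gfn = range->start; gfn < range->end; gfn++)
                  pr_debug("slot %d gfn 0x%llx\n", range->slot->id, gfn);

          return 0;
  }

  /* Walk all memslots intersecting [hva, hva + len). */
  ret = kvm_vm_do_hva_range_op(kvm, hva, hva + len, example_gfn_handler, NULL);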
Signed-off-by: Vishal Annapurve <[email protected]>
[mdr: minor checkpatch alignment fixups]
Signed-off-by: Michael Roth <[email protected]>
---
include/linux/kvm_host.h | 6 +++++
virt/kvm/kvm_main.c | 48 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 54 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4d542060cd93..c615650ed256 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1402,6 +1402,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm);
void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
void kvm_mmu_invalidate_end(struct kvm *kvm);
+typedef int (*kvm_hva_range_op_t)(struct kvm *kvm,
+ struct kvm_gfn_range *range, void *data);
+
+int kvm_vm_do_hva_range_op(struct kvm *kvm, unsigned long hva_start,
+ unsigned long hva_end, kvm_hva_range_op_t handler, void *data);
+
long kvm_arch_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg);
long kvm_arch_vcpu_ioctl(struct file *filp,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f7e00593cc5d..4ccd655dd5af 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -642,6 +642,54 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
return (int)ret;
}
+int kvm_vm_do_hva_range_op(struct kvm *kvm, unsigned long hva_start,
+ unsigned long hva_end, kvm_hva_range_op_t handler, void *data)
+{
+ int ret = 0;
+ struct kvm_gfn_range gfn_range;
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ int i, idx;
+
+ if (WARN_ON_ONCE(hva_end <= hva_start))
+ return -EINVAL;
+
+ idx = srcu_read_lock(&kvm->srcu);
+
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
+ struct interval_tree_node *node;
+
+ slots = __kvm_memslots(kvm, i);
+ kvm_for_each_memslot_in_hva_range(node, slots,
+ hva_start, hva_end - 1) {
+ unsigned long start, end;
+
+ slot = container_of(node, struct kvm_memory_slot,
+ hva_node[slots->node_idx]);
+ start = max(hva_start, slot->userspace_addr);
+ end = min(hva_end, slot->userspace_addr +
+ (slot->npages << PAGE_SHIFT));
+
+ /*
+ * {gfn(page) | page intersects with [hva_start, hva_end)} =
+ * {gfn_start, gfn_start+1, ..., gfn_end-1}.
+ */
+ gfn_range.start = hva_to_gfn_memslot(start, slot);
+ gfn_range.end = hva_to_gfn_memslot(end + PAGE_SIZE - 1, slot);
+ gfn_range.slot = slot;
+
+ ret = handler(kvm, &gfn_range, data);
+ if (ret)
+ goto e_ret;
+ }
+ }
+
+e_ret:
+ srcu_read_unlock(&kvm->srcu, idx);
+
+ return ret;
+}
+
static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
unsigned long start,
unsigned long end,
--
2.25.1
From: Brijesh Singh <[email protected]>
Add support to decrypt guest encrypted memory. These API interfaces can
be used for example to dump VMCBs on SNP guest exit.
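As a hedged sketch of a potential caller (not code from this series; gctx_pfn
and src_pfn stand in for the guest context page and the encrypted guest page):

  struct page *dst = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
  int fw_err = 0, rc;

  if (!dst)
          return -ENOMEM;

  rc = snp_guest_dbg_decrypt_page(gctx_pfn, src_pfn, page_to_pfn(dst), &fw_err);
  if (rc)
          pr_err("SNP_DBG_DECRYPT failed: rc=%d fw_err=%d\n", rc, fw_err);
  else
          print_hex_dump_debug("vmsa: ", DUMP_PREFIX_OFFSET, 16, 1,
                               page_address(dst), 64, false);

  __free_page(dst);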
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
[mdr: minor commit fixups]
Signed-off-by: Michael Roth <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 32 ++++++++++++++++++++++++++++++++
include/linux/psp-sev.h | 22 ++++++++++++++++++++--
2 files changed, 52 insertions(+), 2 deletions(-)
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index e65563bc8298..bf5167b2acfc 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -2017,6 +2017,38 @@ int sev_guest_df_flush(int *error)
}
EXPORT_SYMBOL_GPL(sev_guest_df_flush);
+int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn, int *error)
+{
+ struct sev_data_snp_dbg data = {0};
+ struct sev_device *sev;
+ int ret;
+
+ if (!psp_master || !psp_master->sev_data)
+ return -ENODEV;
+
+ sev = psp_master->sev_data;
+
+ if (!sev->snp_initialized)
+ return -EINVAL;
+
+ data.gctx_paddr = sme_me_mask | (gctx_pfn << PAGE_SHIFT);
+ data.src_addr = sme_me_mask | (src_pfn << PAGE_SHIFT);
+ data.dst_addr = sme_me_mask | (dst_pfn << PAGE_SHIFT);
+
+ /* The destination page must be in the firmware state. */
+ if (rmp_mark_pages_firmware(data.dst_addr, 1, false))
+ return -EIO;
+
+ ret = sev_do_cmd(SEV_CMD_SNP_DBG_DECRYPT, &data, error);
+
+ /* Restore the page state */
+ if (snp_reclaim_pages(data.dst_addr, 1, false))
+ ret = -EIO;
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt_page);
+
int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
unsigned long vaddr, unsigned long *npages, unsigned long *fw_err)
{
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index 81bafc049eca..92116e2b74fd 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -710,7 +710,6 @@ struct sev_data_snp_dbg {
u64 gctx_paddr; /* In */
u64 src_addr; /* In */
u64 dst_addr; /* In */
- u32 len; /* In */
} __packed;
/**
@@ -913,13 +912,27 @@ int sev_guest_decommission(struct sev_data_decommission *data, int *error);
* @error: SEV command return code
*
* Returns:
+ * 0 if the sev successfully processed the command
+ * -%ENODEV if the sev device is not available
+ * -%ENOTSUPP if the sev does not support SEV
+ * -%ETIMEDOUT if the sev command timed out
+ * -%EIO if the sev returned a non-zero return code
+ */
+int sev_do_cmd(int cmd, void *data, int *psp_ret);
+
+/**
+ * snp_guest_dbg_decrypt_page - perform SEV SNP_DBG_DECRYPT command
+ *
+ * @sev_ret: sev command return code
+ *
+ * Returns:
* 0 if the SEV successfully processed the command
* -%ENODEV if the SEV device is not available
* -%ENOTSUPP if the SEV does not support SEV
* -%ETIMEDOUT if the SEV command timed out
* -%EIO if the SEV returned a non-zero return code
*/
-int sev_do_cmd(int cmd, void *data, int *psp_ret);
+int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn, int *error);
void *psp_copy_user_blob(u64 uaddr, u32 len);
void *snp_alloc_firmware_page(gfp_t mask);
@@ -987,6 +1000,11 @@ static inline void *psp_copy_user_blob(u64 __user uaddr, u32 len) { return ERR_P
void snp_mark_pages_offline(unsigned long pfn, unsigned int npages) {}
+static inline int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn, int *error)
+{
+ return -ENODEV;
+}
+
static inline void *snp_alloc_firmware_page(gfp_t mask)
{
return NULL;
--
2.25.1
This will handle RMP table updates and direct map changes needed for
page state conversions requested by userspace.
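For reference, a minimal userspace sketch of requesting such a conversion
(assuming the KVM_SET_MEMORY_ATTRIBUTES ioctl and struct kvm_memory_attributes
from the UPM base series; gpa, size and vm_fd are placeholders):

  struct kvm_memory_attributes attrs = {
          .address    = gpa,   /* page-aligned guest physical address */
          .size       = size,  /* multiple of the page size */
          .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
  };

  /*
   * Convert the range to private; the update_mem_attr hook added below
   * performs the corresponding RMP updates. Passing attributes == 0
   * converts the range back to shared.
   */
  if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs) < 0)
          err(1, "KVM_SET_MEMORY_ATTRIBUTES");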
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/sev.c | 126 +++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.c | 1 +
arch/x86/kvm/svm/svm.h | 2 +
3 files changed, 129 insertions(+)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index b2f1a12685ed..73d614c538da 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3381,6 +3381,31 @@ static int snp_rmptable_psmash(struct kvm *kvm, kvm_pfn_t pfn)
return psmash(pfn);
}
+static int snp_make_page_shared(struct kvm *kvm, gpa_t gpa, kvm_pfn_t pfn, int level)
+{
+ int rc, rmp_level;
+
+ rc = snp_lookup_rmpentry(pfn, &rmp_level);
+ if (rc < 0)
+ return -EINVAL;
+
+ /* If page is not assigned then do nothing */
+ if (!rc)
+ return 0;
+
+ /*
+ * If the page is part of an existing 2MB RMP entry, split the 2MB into
+ * multiple 4K RMP entries before making the memory shared.
+ */
+ if (level == PG_LEVEL_4K && rmp_level == PG_LEVEL_2M) {
+ rc = snp_rmptable_psmash(kvm, pfn);
+ if (rc)
+ return rc;
+ }
+
+ return rmp_make_shared(pfn, level);
+}
+
/*
* TODO: need to get the value set by userspace in vcpu->run->vmgexit.ghcb_msr
* and process that here accordingly.
@@ -4373,3 +4398,104 @@ void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
put_page(pfn_to_page(pfn));
}
+
+static inline u8 order_to_level(int order)
+{
+ BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+ if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+ return PG_LEVEL_1G;
+
+ if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+ return PG_LEVEL_2M;
+
+ return PG_LEVEL_4K;
+}
+
+int sev_update_mem_attr(struct kvm_memory_slot *slot, unsigned int attr,
+ gfn_t start, gfn_t end)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(slot->kvm)->sev_info;
+ enum psc_op op = (attr & KVM_MEMORY_ATTRIBUTE_PRIVATE) ? SNP_PAGE_STATE_PRIVATE
+ : SNP_PAGE_STATE_SHARED;
+ gfn_t gfn = start;
+
+ pr_debug("%s: GFN 0x%llx - 0x%llx, op: %d\n", __func__, start, end, op);
+
+ if (!sev_snp_guest(slot->kvm))
+ return 0;
+
+ if (!kvm_slot_can_be_private(slot)) {
+ pr_err_ratelimited("%s: memslot for gfn: 0x%llx is not private.\n",
+ __func__, gfn);
+ return -EPERM;
+ }
+
+ while (gfn < end) {
+ kvm_pfn_t pfn;
+ int level = PG_LEVEL_4K; /* TODO: take actual order into account */
+ gpa_t gpa = gfn_to_gpa(gfn);
+ int npages = 1;
+ int order;
+ int rc;
+
+ /*
+ * No work to do if there was never a page allocated from private
+ * memory. If there was a page that was deallocated previously,
+ * the invalidation notifier should have restored the page to
+ * shared.
+ */
+ rc = kvm_restrictedmem_get_pfn(slot, gfn, &pfn, &order);
+ if (rc) {
+ pr_warn_ratelimited("%s: failed to retrieve gfn 0x%llx from private FD\n",
+ __func__, gfn);
+ gfn++;
+ continue;
+ }
+
+ /*
+ * TODO: The RMP entry's hugepage bit is ignored for
+ * shared/unassigned pages. Either handle looping through each
+ * sub-page as part of snp_make_page_shared(), or remove the
+ * level argument.
+ */
+ if (op == SNP_PAGE_STATE_PRIVATE && order &&
+ IS_ALIGNED(gfn, 1 << order) && (gfn + (1 << order)) <= end) {
+ level = order_to_level(order);
+ npages = 1 << order;
+ }
+
+ /*
+ * Grab the PFN from private memslot and update the RMP entry.
+ * It may be worthwhile to go ahead and map it into the TDP at
+ * this point if the guest is doing lazy acceptance, but for
+ * up-front bulk shared->private conversions it's not likely
+ * the guest will try to access the PFN any time soon, so for
+ * now just let the KVM MMU handle faulting it in on the next
+ * access.
+ */
+ switch (op) {
+ case SNP_PAGE_STATE_SHARED:
+ rc = snp_make_page_shared(slot->kvm, gpa, pfn, level);
+ break;
+ case SNP_PAGE_STATE_PRIVATE:
+ rc = rmp_make_private(pfn, gpa, level, sev->asid, false);
+ break;
+ default:
+ rc = PSC_INVALID_ENTRY;
+ break;
+ }
+
+ put_page(pfn_to_page(pfn));
+
+ if (rc) {
+ pr_err_ratelimited("%s: failed op %d gpa %llx pfn %llx level %d rc %d\n",
+ __func__, op, gpa, pfn, level, rc);
+ return -EINVAL;
+ }
+
+ gfn += npages;
+ }
+
+ return 0;
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 539926b07ee5..e2edc4700e55 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4860,6 +4860,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.alloc_apic_backing_page = svm_alloc_apic_backing_page,
.adjust_mapping_level = sev_adjust_mapping_level,
+ .update_mem_attr = sev_update_mem_attr,
};
/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 37bd7b728d52..50a2bcaf3fd7 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -725,6 +725,8 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
void sev_adjust_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *level);
void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
+int sev_update_mem_attr(struct kvm_memory_slot *slot, unsigned int attr,
+ gfn_t start, gfn_t end);
/* vmenter.S */
--
2.25.1
From: Dionna Glaze <[email protected]>
The /dev/sev device has the ability to store host-wide certificates for
the key used by the AMD-SP for SEV-SNP attestation report signing,
but for hosts that want to specify additional certificates that are
specific to the image launched in a VM, a different way is needed to
communicate those certificates.
Add two new KVM ioctls to handle this: KVM_SEV_SNP_{GET,SET}_CERTS
The certificates that are set with this command are expected to follow
the same format as the host certificates, but that format is opaque
to the kernel.
The new behavior for custom certificates is that the extended guest
request command will now return the overridden certificates if they
were installed for the instance. The error condition for a too small
data buffer is changed to return the overridden certificate data size
if there is an overridden certificate set installed.
Setting a zero-length certificate blob returns the system to its default
behavior of returning only the host certificates on an extended guest request.
Also increase SEV_FW_BLOB_MAX_SIZE by another 4K page to allow space
for an extra certificate.
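For illustration, a minimal userspace sketch of installing instance certs via
KVM_MEMORY_ENCRYPT_OP (vm_fd, sev_fd, certs_buf and certs_size are placeholders;
the blob is formatted the same way as the host certificates blob):

  struct kvm_sev_snp_set_certs certs = {
          .certs_uaddr = (__u64)(uintptr_t)certs_buf,
          .certs_len   = certs_size,  /* must not exceed SEV_FW_BLOB_MAX_SIZE */
  };
  struct kvm_sev_cmd cmd = {
          .id     = KVM_SEV_SNP_SET_CERTS,
          .data   = (__u64)(uintptr_t)&certs,
          .sev_fd = sev_fd,           /* open fd for /dev/sev */
  };

  if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd) < 0)
          err(1, "KVM_SEV_SNP_SET_CERTS");

Issuing the same sequence with certs_len set to 0 removes the override again.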
Cc: Tom Lendacky <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Dionna Glaze <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
[mdr: remove use of "we" and "this patch" in commit log]
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/sev.c | 111 ++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/svm/svm.h | 1 +
include/linux/psp-sev.h | 2 +-
include/uapi/linux/kvm.h | 12 +++++
4 files changed, 123 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 70d5650d8d95..18b64b7005e7 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2089,6 +2089,7 @@ static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
goto e_free;
sev->snp_certs_data = certs_data;
+ sev->snp_certs_len = 0;
return context;
@@ -2404,6 +2405,86 @@ static int snp_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
return ret;
}
+static int snp_get_instance_certs(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct kvm_sev_snp_get_certs params;
+
+ if (!sev_snp_guest(kvm))
+ return -ENOTTY;
+
+ if (!sev->snp_context)
+ return -EINVAL;
+
+ if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data,
+ sizeof(params)))
+ return -EFAULT;
+
+ /* No instance certs set. */
+ if (!sev->snp_certs_len)
+ return -ENOENT;
+
+ if (params.certs_len < sev->snp_certs_len) {
+ /* Output buffer too small. Return the required size. */
+ params.certs_len = sev->snp_certs_len;
+
+ if (copy_to_user((void __user *)(uintptr_t)argp->data, ¶ms,
+ sizeof(params)))
+ return -EFAULT;
+
+ return -EINVAL;
+ }
+
+ if (copy_to_user((void __user *)(uintptr_t)params.certs_uaddr,
+ sev->snp_certs_data, sev->snp_certs_len))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int snp_set_instance_certs(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ unsigned long length = SEV_FW_BLOB_MAX_SIZE;
+ void *to_certs = sev->snp_certs_data;
+ struct kvm_sev_snp_set_certs params;
+
+ if (!sev_snp_guest(kvm))
+ return -ENOTTY;
+
+ if (!sev->snp_context)
+ return -EINVAL;
+
+ if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data,
+ sizeof(params)))
+ return -EFAULT;
+
+ if (params.certs_len > SEV_FW_BLOB_MAX_SIZE)
+ return -EINVAL;
+
+ /*
+ * Setting a length of 0 is the same as "uninstalling" instance-
+ * specific certificates.
+ */
+ if (params.certs_len == 0) {
+ sev->snp_certs_len = 0;
+ return 0;
+ }
+
+ /* Page-align the length */
+ length = (params.certs_len + PAGE_SIZE - 1) & PAGE_MASK;
+
+ if (copy_from_user(to_certs,
+ (void __user *)(uintptr_t)params.certs_uaddr,
+ params.certs_len)) {
+ return -EFAULT;
+ }
+
+ sev->snp_certs_len = length;
+
+ return 0;
+}
+
int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_sev_cmd sev_cmd;
@@ -2503,6 +2584,12 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
case KVM_SEV_SNP_LAUNCH_FINISH:
r = snp_launch_finish(kvm, &sev_cmd);
break;
+ case KVM_SEV_SNP_GET_CERTS:
+ r = snp_get_instance_certs(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_SNP_SET_CERTS:
+ r = snp_set_instance_certs(kvm, &sev_cmd);
+ break;
default:
r = -EINVAL;
goto out;
@@ -3550,8 +3637,28 @@ static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gp
if (rc)
goto unlock;
- rc = snp_guest_ext_guest_request(&req, (unsigned long)sev->snp_certs_data,
- &data_npages, &err);
+ /*
+ * If the VMM has overridden the certs, then change the error message
+ * if the size is inappropriate for the override. Otherwise, use a
+ * regular guest request and copy back the instance certs.
+ */
+ if (sev->snp_certs_len) {
+ if ((data_npages << PAGE_SHIFT) < sev->snp_certs_len) {
+ rc = -EINVAL;
+ err = SNP_GUEST_REQ_INVALID_LEN;
+ goto datalen;
+ }
+ rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &req,
+ (int *)&err);
+ } else {
+ rc = snp_guest_ext_guest_request(&req,
+ (unsigned long)sev->snp_certs_data,
+ &data_npages, &err);
+ }
+datalen:
+ if (sev->snp_certs_len)
+ data_npages = sev->snp_certs_len >> PAGE_SHIFT;
+
if (rc) {
/*
* If buffer length is small then return the expected
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 221b38d3c845..dced46559508 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -94,6 +94,7 @@ struct kvm_sev_info {
u64 snp_init_flags;
void *snp_context; /* SNP guest context page */
void *snp_certs_data;
+ unsigned int snp_certs_len; /* Size of instance override for certs */
struct mutex guest_req_lock; /* Lock for guest request handling */
u64 sev_features; /* Features set at VMSA creation */
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index 92116e2b74fd..3b28b78938f6 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -22,7 +22,7 @@
#define __psp_pa(x) __pa(x)
#endif
-#define SEV_FW_BLOB_MAX_SIZE 0x4000 /* 16KB */
+#define SEV_FW_BLOB_MAX_SIZE 0x5000 /* 20KB */
/**
* SEV platform state
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6e684bf5f723..ad7e24e43547 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1928,6 +1928,8 @@ enum sev_cmd_id {
KVM_SEV_SNP_LAUNCH_START,
KVM_SEV_SNP_LAUNCH_UPDATE,
KVM_SEV_SNP_LAUNCH_FINISH,
+ KVM_SEV_SNP_GET_CERTS,
+ KVM_SEV_SNP_SET_CERTS,
KVM_SEV_NR_MAX,
};
@@ -2075,6 +2077,16 @@ struct kvm_sev_snp_launch_finish {
__u8 pad[6];
};
+struct kvm_sev_snp_get_certs {
+ __u64 certs_uaddr;
+ __u64 certs_len;
+};
+
+struct kvm_sev_snp_set_certs {
+ __u64 certs_uaddr;
+ __u64 certs_len;
+};
+
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
#define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
--
2.25.1
From: Ashish Kalra <[email protected]>
Implement a workaround for an SNP erratum where the CPU will incorrectly
signal an RMP violation #PF if a hugepage (2MB or 1GB) collides with the
RMP entry of the VMSAVE target page.
When SEV-SNP is globally enabled, the CPU marks the VMSAVE target page
as "InUse" while the VMSAVE instruction is executing. If another
CPU writes to a different page in the same 2MB region while the VMSAVE
is executing, the CPU will throw an RMP violation #PF.
Use the SNP-safe generic allocator for allocating the VMSAVE target
page, which will ensure that the page returned is not a hugepage, as it
is already being used for allocating the VMCB, VMSA and AVIC backing
pages.
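The SNP-safe allocator itself is introduced earlier in the series; as a rough
sketch of the idea (not necessarily the exact implementation), it can
over-allocate and return a page that is not 2MB-aligned:

  /*
   * Sketch: allocate two pages, keep whichever one is not aligned to a
   * 2MB boundary so the returned page can never be the base of a 2MB
   * (hugepage) RMP entry, and free the other one.
   */
  struct page *p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);

  if (!p)
          return NULL;

  split_page(p, 1);

  if (IS_ALIGNED(page_to_pfn(p), PTRS_PER_PMD)) {
          __free_page(p);
          p++;
  } else {
          __free_page(p + 1);
  }

  return p;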
Co-developed-by: Marc Orr <[email protected]>
Signed-off-by: Marc Orr <[email protected]>
Reported-by: Alper Gun <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/svm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 3fe5f13b5f3a..8bda31a61757 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -665,7 +665,7 @@ static int svm_cpu_init(int cpu)
int ret = -ENOMEM;
memset(sd, 0, sizeof(struct svm_cpu_data));
- sd->save_area = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ sd->save_area = snp_safe_alloc_page(NULL);
if (!sd->save_area)
return ret;
--
2.25.1
From: Dionna Glaze <[email protected]>
Update the KVM_MEMORY_ENCRYPT_OP documentation to include the new
commands for overriding the host certificates that the guest receives
from an extended guest request.
Cc: Thomas Lendacky <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Dionna Glaze <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
.../virt/kvm/x86/amd-memory-encryption.rst | 44 +++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index dafb0c9984f1..153003ff2c51 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -537,6 +537,50 @@ Returns: 0 on success, -negative on error
See SEV-SNP specification for further details on launch finish input parameters.
+22. KVM_SEV_SNP_GET_CERTS
+-------------------------
+
+After the SNP guest launch flow has started, the KVM_SEV_SNP_GET_CERTS command
+can be issued to request the data that has been installed with the
+KVM_SEV_SNP_SET_CERTS command.
+
+Parameters (in/out): struct kvm_sev_snp_get_certs
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_snp_get_certs {
+ __u64 certs_uaddr;
+ __u64 certs_len;
+ };
+
+If no certs have been installed, then the return value is -ENOENT.
+If the buffer specified in the struct is too small, the certs_len field will be
+overwritten with the number of bytes required to hold all the certificate data,
+and the return value will be -EINVAL.
+
+23. KVM_SEV_SNP_SET_CERTS
+-------------------------
+
+After the SNP guest launch flow has started, the KVM_SEV_SNP_SET_CERTS command
+can be issued to override the /dev/sev certs data that is returned when a
+guest issues an extended guest request. This is useful for instance-specific
+extensions to the host certificates.
+
+Parameters (in/out): struct kvm_sev_snp_set_certs
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_snp_set_certs {
+ __u64 certs_uaddr;
+ __u64 certs_len;
+ };
+
+The certs_len field may not exceed SEV_FW_BLOB_MAX_SIZE.
+
References
==========
--
2.25.1
From: Brijesh Singh <[email protected]>
Version 2 of the GHCB specification added support for two SNP Guest
Request Message NAE events. These events allow an SEV-SNP guest to
make requests to the SEV-SNP firmware through the hypervisor using the
SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification.
The SNP_EXT_GUEST_REQUEST is similar to SNP_GUEST_REQUEST with the
difference of an additional certificate blob that can be passed through
the SNP_SET_CONFIG ioctl defined in the CCP driver. The CCP driver
provides snp_guest_ext_guest_request() that is used by the KVM to get
both the report and certificate data at once.
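For context, the guest-side flow that triggers this handling looks roughly as
follows (per the GHCB v2 spec, omitting the GHCB protocol bookkeeping; req_gpa
and resp_gpa are placeholders for the shared request/response message pages):

  /* Guest side: request an SNP guest message exchange via VMGEXIT. */
  ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_GUEST_REQUEST);
  ghcb_set_sw_exit_info_1(ghcb, req_gpa);   /* SNP_GUEST_REQUEST message page */
  ghcb_set_sw_exit_info_2(ghcb, resp_gpa);  /* response page, in shared state */
  VMGEXIT();

The handler added below picks these up from exit_info_1/exit_info_2, transitions
the response page to firmware-owned state, issues SEV_CMD_SNP_GUEST_REQUEST, then
reclaims the page and reports the result back through sw_exit_info_2. For
SVM_VMGEXIT_EXT_GUEST_REQUEST the guest additionally passes the certificate
buffer GPA and page count in RAX/RBX, which the handler reads from the saved
register state.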
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/sev.c | 185 +++++++++++++++++++++++++++++++++++++++--
arch/x86/kvm/svm/svm.h | 2 +
2 files changed, 181 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 197b1f904567..92179614102e 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -327,6 +327,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
if (ret)
goto e_free;
+ mutex_init(&sev->guest_req_lock);
ret = sev_snp_init(&argp->error, false);
} else {
ret = sev_platform_init(&argp->error);
@@ -2059,23 +2060,34 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
*/
static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
struct sev_data_snp_addr data = {};
- void *context;
+ void *context, *certs_data;
int rc;
+ /* Allocate memory used for the certs data in SNP guest request */
+ certs_data = kzalloc(SEV_FW_BLOB_MAX_SIZE, GFP_KERNEL_ACCOUNT);
+ if (!certs_data)
+ return NULL;
+
/* Allocate memory for context page */
context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
if (!context)
- return NULL;
+ goto e_free;
data.gctx_paddr = __psp_pa(context);
rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
- if (rc) {
- snp_free_firmware_page(context);
- return NULL;
- }
+ if (rc)
+ goto e_free;
+
+ sev->snp_certs_data = certs_data;
return context;
+
+e_free:
+ snp_free_firmware_page(context);
+ kfree(certs_data);
+ return NULL;
}
static int snp_bind_asid(struct kvm *kvm, int *error)
@@ -2693,6 +2705,8 @@ static int snp_decommission_context(struct kvm *kvm)
snp_free_firmware_page(sev->snp_context);
sev->snp_context = NULL;
+ kfree(sev->snp_certs_data);
+
return 0;
}
@@ -3153,6 +3167,8 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
case SVM_VMGEXIT_HV_FEATURES:
case SVM_VMGEXIT_PSC:
+ case SVM_VMGEXIT_GUEST_REQUEST:
+ case SVM_VMGEXIT_EXT_GUEST_REQUEST:
break;
default:
reason = GHCB_ERR_INVALID_EVENT;
@@ -3384,6 +3400,149 @@ static int snp_complete_psc(struct kvm_vcpu *vcpu)
return 1;
}
+static unsigned long snp_setup_guest_buf(struct vcpu_svm *svm,
+ struct sev_data_snp_guest_request *data,
+ gpa_t req_gpa, gpa_t resp_gpa)
+{
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct kvm *kvm = vcpu->kvm;
+ kvm_pfn_t req_pfn, resp_pfn;
+ struct kvm_sev_info *sev;
+
+ sev = &to_kvm_svm(kvm)->sev_info;
+
+ if (!IS_ALIGNED(req_gpa, PAGE_SIZE) || !IS_ALIGNED(resp_gpa, PAGE_SIZE))
+ return SEV_RET_INVALID_PARAM;
+
+ req_pfn = gfn_to_pfn(kvm, gpa_to_gfn(req_gpa));
+ if (is_error_noslot_pfn(req_pfn))
+ return SEV_RET_INVALID_ADDRESS;
+
+ resp_pfn = gfn_to_pfn(kvm, gpa_to_gfn(resp_gpa));
+ if (is_error_noslot_pfn(resp_pfn))
+ return SEV_RET_INVALID_ADDRESS;
+
+ if (rmp_make_private(resp_pfn, 0, PG_LEVEL_4K, 0, true))
+ return SEV_RET_INVALID_ADDRESS;
+
+ data->gctx_paddr = __psp_pa(sev->snp_context);
+ data->req_paddr = __sme_set(req_pfn << PAGE_SHIFT);
+ data->res_paddr = __sme_set(resp_pfn << PAGE_SHIFT);
+
+ return 0;
+}
+
+static void snp_cleanup_guest_buf(struct sev_data_snp_guest_request *data, unsigned long *rc)
+{
+ u64 pfn = __sme_clr(data->res_paddr) >> PAGE_SHIFT;
+ int ret;
+
+ ret = snp_page_reclaim(pfn);
+ if (ret)
+ *rc = SEV_RET_INVALID_ADDRESS;
+
+ ret = rmp_make_shared(pfn, PG_LEVEL_4K);
+ if (ret)
+ *rc = SEV_RET_INVALID_ADDRESS;
+}
+
+static void snp_handle_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa)
+{
+ struct sev_data_snp_guest_request data = {0};
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_sev_info *sev;
+ unsigned long rc;
+ int err;
+
+ if (!sev_snp_guest(vcpu->kvm)) {
+ rc = SEV_RET_INVALID_GUEST;
+ goto e_fail;
+ }
+
+ sev = &to_kvm_svm(kvm)->sev_info;
+
+ mutex_lock(&sev->guest_req_lock);
+
+ rc = snp_setup_guest_buf(svm, &data, req_gpa, resp_gpa);
+ if (rc)
+ goto unlock;
+
+ rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err);
+ if (rc)
+ /* use the firmware error code */
+ rc = err;
+
+ snp_cleanup_guest_buf(&data, &rc);
+
+unlock:
+ mutex_unlock(&sev->guest_req_lock);
+
+e_fail:
+ ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, rc);
+}
+
+static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa)
+{
+ struct sev_data_snp_guest_request req = {0};
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct kvm *kvm = vcpu->kvm;
+ unsigned long data_npages;
+ struct kvm_sev_info *sev;
+ unsigned long rc, err;
+ u64 data_gpa;
+
+ if (!sev_snp_guest(vcpu->kvm)) {
+ rc = SEV_RET_INVALID_GUEST;
+ goto e_fail;
+ }
+
+ sev = &to_kvm_svm(kvm)->sev_info;
+
+ data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
+ data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
+
+ if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
+ rc = SEV_RET_INVALID_ADDRESS;
+ goto e_fail;
+ }
+
+ mutex_lock(&sev->guest_req_lock);
+
+ rc = snp_setup_guest_buf(svm, &req, req_gpa, resp_gpa);
+ if (rc)
+ goto unlock;
+
+ rc = snp_guest_ext_guest_request(&req, (unsigned long)sev->snp_certs_data,
+ &data_npages, &err);
+ if (rc) {
+ /*
+ * If buffer length is small then return the expected
+ * length in rbx.
+ */
+ if (err == SNP_GUEST_REQ_INVALID_LEN)
+ vcpu->arch.regs[VCPU_REGS_RBX] = data_npages;
+
+ /* pass the firmware error code */
+ rc = err;
+ goto cleanup;
+ }
+
+ /* Copy the certificate blob in the guest memory */
+ if (data_npages &&
+ kvm_write_guest(kvm, data_gpa, sev->snp_certs_data, data_npages << PAGE_SHIFT))
+ rc = SEV_RET_INVALID_ADDRESS;
+
+cleanup:
+ snp_cleanup_guest_buf(&req, &rc);
+
+unlock:
+ mutex_unlock(&sev->guest_req_lock);
+
+e_fail:
+ ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, rc);
+}
+
static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
{
struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3633,6 +3792,20 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
vcpu->run->vmgexit.ghcb_msr = ghcb_gpa;
vcpu->arch.complete_userspace_io = snp_complete_psc;
break;
+ case SVM_VMGEXIT_GUEST_REQUEST: {
+ snp_handle_guest_request(svm, control->exit_info_1, control->exit_info_2);
+
+ ret = 1;
+ break;
+ }
+ case SVM_VMGEXIT_EXT_GUEST_REQUEST: {
+ snp_handle_ext_guest_request(svm,
+ control->exit_info_1,
+ control->exit_info_2);
+
+ ret = 1;
+ break;
+ }
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
vcpu_unimpl(vcpu,
"vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 13b00233b315..4a9ffb7e5139 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -93,6 +93,8 @@ struct kvm_sev_info {
atomic_t migration_in_progress;
u64 snp_init_flags;
void *snp_context; /* SNP guest context page */
+ void *snp_certs_data;
+ struct mutex guest_req_lock; /* Lock for guest request handling */
};
struct kvm_svm {
--
2.25.1
From: Brijesh Singh <[email protected]>
Add a module parameter that can be used to enable or disable the SEV-SNP
feature. Now that KVM contains support for SNP, set the GHCB
hypervisor feature flag to indicate that SNP is supported.
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/sev.c | 7 ++++---
arch/x86/kvm/svm/svm.h | 2 +-
2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index bedec90d034f..70d5650d8d95 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -55,14 +55,15 @@ module_param_named(sev, sev_enabled, bool, 0444);
/* enable/disable SEV-ES support */
static bool sev_es_enabled = true;
module_param_named(sev_es, sev_es_enabled, bool, 0444);
+
+/* enable/disable SEV-SNP support */
+static bool sev_snp_enabled = true;
+module_param_named(sev_snp, sev_snp_enabled, bool, 0444);
#else
#define sev_enabled false
#define sev_es_enabled false
#endif /* CONFIG_KVM_AMD_SEV */
-/* enable/disable SEV-SNP support */
-static bool sev_snp_enabled;
-
#define AP_RESET_HOLD_NONE 0
#define AP_RESET_HOLD_NAE_EVENT 1
#define AP_RESET_HOLD_MSR_PROTO 2
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 857b674e68f0..221b38d3c845 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -694,7 +694,7 @@ void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu);
#define GHCB_VERSION_MAX 2ULL
#define GHCB_VERSION_MIN 1ULL
-#define GHCB_HV_FT_SUPPORTED 0
+#define GHCB_HV_FT_SUPPORTED (GHCB_HV_FT_SNP | GHCB_HV_FT_SNP_AP_CREATION)
extern unsigned int max_sev_asid;
--
2.25.1
From: Ashish Kalra <[email protected]>
Add a new IOMMU API interface, amd_iommu_snp_disable(), to transition
IOMMU pages from the Reclaim state to the Hypervisor state after the
SNP_SHUTDOWN_EX command. Invoke this API from the CCP driver after issuing
SNP_SHUTDOWN_EX.
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 20 ++++++++++++++
drivers/iommu/amd/init.c | 53 ++++++++++++++++++++++++++++++++++++
include/linux/amd-iommu.h | 1 +
3 files changed, 74 insertions(+)
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index bf5167b2acfc..7ded2f9111e0 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -24,6 +24,7 @@
#include <linux/cpufeature.h>
#include <linux/fs.h>
#include <linux/fs_struct.h>
+#include <linux/amd-iommu.h>
#include <asm/smp.h>
#include <asm/e820/types.h>
@@ -1503,6 +1504,25 @@ static int __sev_snp_shutdown_locked(int *error)
return ret;
}
+ /*
+ * SNP_SHUTDOWN_EX with IOMMU_SNP_SHUTDOWN set to 1 disables SNP
+ * enforcement by the IOMMU and also transitions all pages
+ * associated with the IOMMU to the Reclaim state.
+ * Firmware was transitioning the IOMMU pages to Hypervisor state
+ * before version 1.53. But, accounting for the number of assigned
+ * 4kB pages in a 2M page was done incorrectly by not transitioning
+ * to the Reclaim state. This resulted in RMP #PF when later accessing
+ * the 2M page containing those pages during kexec boot. Hence, the
+ * firmware now transitions these pages to Reclaim state and hypervisor
+ * needs to transition these pages to shared state. SNP Firmware
+ * version 1.53 and above are needed for kexec boot.
+ */
+ ret = amd_iommu_snp_disable();
+ if (ret) {
+ dev_err(sev->dev, "SNP IOMMU shutdown failed\n");
+ return ret;
+ }
+
sev->snp_initialized = false;
dev_dbg(sev->dev, "SEV-SNP firmware shutdown\n");
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 1a2d425bf568..d1270e3c5baf 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -30,6 +30,7 @@
#include <asm/io_apic.h>
#include <asm/irq_remapping.h>
#include <asm/set_memory.h>
+#include <asm/sev.h>
#include <linux/crash_dump.h>
@@ -3651,4 +3652,56 @@ int amd_iommu_snp_enable(void)
return 0;
}
+
+static int iommu_page_make_shared(void *page)
+{
+ unsigned long pfn;
+
+ pfn = iommu_virt_to_phys(page) >> PAGE_SHIFT;
+ return rmp_make_shared(pfn, PG_LEVEL_4K);
+}
+
+static int iommu_make_shared(void *va, size_t size)
+{
+ void *page;
+ int ret;
+
+ if (!va)
+ return 0;
+
+ for (page = va; page < (va + size); page += PAGE_SIZE) {
+ ret = iommu_page_make_shared(page);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+int amd_iommu_snp_disable(void)
+{
+ struct amd_iommu *iommu;
+ int ret;
+
+ if (!amd_iommu_snp_en)
+ return 0;
+
+ for_each_iommu(iommu) {
+ ret = iommu_make_shared(iommu->evt_buf, EVT_BUFFER_SIZE);
+ if (ret)
+ return ret;
+
+ ret = iommu_make_shared(iommu->ppr_log, PPR_LOG_SIZE);
+ if (ret)
+ return ret;
+
+ ret = iommu_make_shared((void *)iommu->cmd_sem, PAGE_SIZE);
+ if (ret)
+ return ret;
+ }
+
+ amd_iommu_snp_en = false;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(amd_iommu_snp_disable);
#endif
diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
index 953e6f12fa1c..a1b33b838842 100644
--- a/include/linux/amd-iommu.h
+++ b/include/linux/amd-iommu.h
@@ -208,6 +208,7 @@ struct amd_iommu *get_amd_iommu(unsigned int idx);
#ifdef CONFIG_AMD_MEM_ENCRYPT
int amd_iommu_snp_enable(void);
+int amd_iommu_snp_disable(void);
#endif
#endif /* _ASM_X86_AMD_IOMMU_H */
--
2.25.1
From: Vishal Annapurve <[email protected]>
This change adds handling of HVA ranges to copy contents
to private memory while doing SEV launch update data.
The mem_attr array is updated during LAUNCH_UPDATE_DATA to ensure
that encrypted memory is marked as private.
Signed-off-by: Vishal Annapurve <[email protected]>
[mdr: Use gfn_to_hva_memslot_prot() for shared GFN handler to deal with
read-only slots for ROMs. Split kvm_vm_set_region_attr into separate
patch.]
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/sev.c | 103 +++++++++++++++++++++++++++++++++++++----
virt/kvm/kvm_main.c | 2 +
2 files changed, 96 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 273cba809328..fad7fb34ef9e 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -494,23 +494,26 @@ static unsigned long get_num_contig_pages(unsigned long idx,
return pages;
}
-static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
+static int sev_launch_update_shared_gfn_handler(struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ struct kvm_sev_cmd *argp)
{
unsigned long vaddr, vaddr_end, next_vaddr, npages, pages, size, i;
struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
- struct kvm_sev_launch_update_data params;
struct sev_data_launch_update_data data;
struct page **inpages;
int ret;
- if (!sev_guest(kvm))
- return -ENOTTY;
-
- if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params)))
- return -EFAULT;
+ vaddr = gfn_to_hva_memslot_prot(range->slot, range->start, NULL);
+ pr_debug("%s: shared GFN: %llx, slot.id: %d, slot.base_gfn: %llx, slot.userspace_addr: %lx, slot.flags: %x, vaddr: %lx\n",
+ __func__, range->start, range->slot->id, range->slot->base_gfn,
+ range->slot->userspace_addr, range->slot->flags, vaddr);
+ if (kvm_is_error_hva(vaddr)) {
+ pr_err("vaddr is erroneous 0x%lx\n", vaddr);
+ return -EINVAL;
+ }
- vaddr = params.uaddr;
- size = params.len;
+ size = (range->end - range->start) << PAGE_SHIFT;
vaddr_end = vaddr + size;
/* Lock the user memory. */
@@ -562,6 +565,88 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
return ret;
}
+static int sev_launch_update_priv_gfn_handler(struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ struct kvm_sev_cmd *argp)
+{
+ struct sev_data_launch_update_data data;
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ gfn_t gfn;
+ kvm_pfn_t pfn;
+ struct kvm_memory_slot *memslot = range->slot;
+ int ret = 0;
+
+ data.reserved = 0;
+ data.handle = sev->handle;
+
+ for (gfn = range->start; gfn < range->end; gfn++) {
+ int order;
+ void *kvaddr;
+
+ ret = kvm_restrictedmem_get_pfn(memslot, gfn, &pfn, &order);
+ if (ret)
+ goto e_ret;
+
+ kvaddr = pfn_to_kaddr(pfn);
+ if (!virt_addr_valid(kvaddr)) {
+ pr_debug("%s: Invalid kvaddr 0x%llx\n", __func__, (uint64_t)kvaddr);
+ ret = -EINVAL;
+ goto e_ret;
+ }
+
+ ret = kvm_read_guest_page(kvm, gfn, kvaddr, 0, PAGE_SIZE);
+ if (ret) {
+ pr_debug("%s: Guest read failed 0x%x\n", __func__, ret);
+ goto e_ret;
+ }
+
+ if (!cpu_feature_enabled(X86_FEATURE_SME_COHERENT))
+ clflush_cache_range(kvaddr, PAGE_SIZE);
+
+ data.len = PAGE_SIZE;
+ data.address = __sme_set(pfn << PAGE_SHIFT);
+ ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_DATA, &data, &argp->error);
+ if (ret)
+ goto e_ret;
+ kvm_release_pfn_clean(pfn);
+ }
+
+ /*
+ * Memory attribute updates via KVM_SET_MEMORY_ATTRIBUTES are serialized
+ * via kvm->slots_lock, so use the same protocol for updating them here.
+ */
+ mutex_lock(&kvm->slots_lock);
+ kvm_vm_set_region_attr(kvm, range->start, range->end, KVM_MEMORY_ATTRIBUTE_PRIVATE);
+ mutex_unlock(&kvm->slots_lock);
+e_ret:
+ return ret;
+}
+
+static int sev_launch_update_gfn_handler(struct kvm *kvm, struct kvm_gfn_range *range,
+ void *data)
+{
+ struct kvm_sev_cmd *argp = (struct kvm_sev_cmd *)data;
+
+ if (kvm_slot_can_be_private(range->slot))
+ return sev_launch_update_priv_gfn_handler(kvm, range, argp);
+
+ return sev_launch_update_shared_gfn_handler(kvm, range, argp);
+}
+
+static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_launch_update_data params;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+ if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+ return -EFAULT;
+
+ return kvm_vm_do_hva_range_op(kvm, params.uaddr, params.uaddr + params.len,
+ sev_launch_update_gfn_handler, argp);
+}
+
static int sev_es_sync_vmsa(struct vcpu_svm *svm)
{
struct sev_es_save_area *save = svm->sev_es.vmsa;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c740b56d6ba4..003cb199ba4b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -689,6 +689,7 @@ int kvm_vm_do_hva_range_op(struct kvm *kvm, unsigned long hva_start,
return ret;
}
+EXPORT_SYMBOL_GPL(kvm_vm_do_hva_range_op);
static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
unsigned long start,
@@ -2850,6 +2851,7 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot,
return hva;
}
+EXPORT_SYMBOL_GPL(gfn_to_hva_memslot_prot);
unsigned long gfn_to_hva_prot(struct kvm *kvm, gfn_t gfn, bool *writable)
{
--
2.25.1
AMD_MEM_ENCRYPT implies SEV support, which now relies on support
provided by the KVM_PROTECTED_VM config option.
An argument can be made that SEV running in non-protected-VM-mode is
still possible, and so this should be configurable, but AMD_MEM_ENCRYPT
will also imply SEV-SNP, for which KVM_PROTECTED_VM is required in all
cases.
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 67745ceab0db..f0d8f6bbc1a7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1546,6 +1546,7 @@ config AMD_MEM_ENCRYPT
select INSTRUCTION_DECODER
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
+ select KVM_PROTECTED_VM
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
--
2.25.1
From: Nikunj A Dadhania <[email protected]>
Rename sev_{pin|unpin}_memory to sev_memory_{get|put}_pages. Apart
from pinning the pages, sev_pin_memory also populates the pages array,
which is used by its callers. SEV guests using a restricted memfd do not
need to pin the memory but still require the pages array to be populated.
Rename the functions appropriately.
No functional change intended.
Signed-off-by: Nikunj A Dadhania <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/sev.c | 62 ++++++++++++++++++++++--------------------
1 file changed, 33 insertions(+), 29 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index fad7fb34ef9e..523c78bbff3f 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -383,9 +383,13 @@ static int sev_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
return ret;
}
-static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
- unsigned long ulen, unsigned long *n,
- int write)
+/*
+ * Legacy SEV guests pin the pages and return an array populated with the
+ * pinned pages.
+ */
+static struct page **sev_memory_get_pages(struct kvm *kvm, unsigned long uaddr,
+ unsigned long ulen, unsigned long *n,
+ int write)
{
struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
unsigned long npages, size;
@@ -446,8 +450,8 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
return ERR_PTR(ret);
}
-static void sev_unpin_memory(struct kvm *kvm, struct page **pages,
- unsigned long npages)
+static void sev_memory_put_pages(struct kvm *kvm, struct page **pages,
+ unsigned long npages)
{
struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -517,7 +521,7 @@ static int sev_launch_update_shared_gfn_handler(struct kvm *kvm,
vaddr_end = vaddr + size;
/* Lock the user memory. */
- inpages = sev_pin_memory(kvm, vaddr, size, &npages, 1);
+ inpages = sev_memory_get_pages(kvm, vaddr, size, &npages, 1);
if (IS_ERR(inpages))
return PTR_ERR(inpages);
@@ -548,20 +552,20 @@ static int sev_launch_update_shared_gfn_handler(struct kvm *kvm,
data.address = __sme_page_pa(inpages[i]) + offset;
ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_DATA, &data, &argp->error);
if (ret)
- goto e_unpin;
+ goto e_put_pages;
size -= len;
next_vaddr = vaddr + len;
}
-e_unpin:
+e_put_pages:
/* content of memory is updated, mark pages dirty */
for (i = 0; i < npages; i++) {
set_page_dirty_lock(inpages[i]);
mark_page_accessed(inpages[i]);
}
/* unlock the user pages */
- sev_unpin_memory(kvm, inpages, npages);
+ sev_memory_put_pages(kvm, inpages, npages);
return ret;
}
@@ -1028,13 +1032,13 @@ static int sev_dbg_crypt(struct kvm *kvm, struct kvm_sev_cmd *argp, bool dec)
int len, s_off, d_off;
/* lock userspace source and destination page */
- src_p = sev_pin_memory(kvm, vaddr & PAGE_MASK, PAGE_SIZE, &n, 0);
+ src_p = sev_memory_get_pages(kvm, vaddr & PAGE_MASK, PAGE_SIZE, &n, 0);
if (IS_ERR(src_p))
return PTR_ERR(src_p);
- dst_p = sev_pin_memory(kvm, dst_vaddr & PAGE_MASK, PAGE_SIZE, &n, 1);
+ dst_p = sev_memory_get_pages(kvm, dst_vaddr & PAGE_MASK, PAGE_SIZE, &n, 1);
if (IS_ERR(dst_p)) {
- sev_unpin_memory(kvm, src_p, n);
+ sev_memory_put_pages(kvm, src_p, n);
return PTR_ERR(dst_p);
}
@@ -1068,8 +1072,8 @@ static int sev_dbg_crypt(struct kvm *kvm, struct kvm_sev_cmd *argp, bool dec)
(void __user *)dst_vaddr,
len, &argp->error);
- sev_unpin_memory(kvm, src_p, n);
- sev_unpin_memory(kvm, dst_p, n);
+ sev_memory_put_pages(kvm, src_p, n);
+ sev_memory_put_pages(kvm, dst_p, n);
if (ret)
goto err;
@@ -1098,7 +1102,7 @@ static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params)))
return -EFAULT;
- pages = sev_pin_memory(kvm, params.guest_uaddr, params.guest_len, &n, 1);
+ pages = sev_memory_get_pages(kvm, params.guest_uaddr, params.guest_len, &n, 1);
if (IS_ERR(pages))
return PTR_ERR(pages);
@@ -1114,7 +1118,7 @@ static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
*/
if (get_num_contig_pages(0, pages, n) != n) {
ret = -EINVAL;
- goto e_unpin_memory;
+ goto e_put_pages;
}
memset(&data, 0, sizeof(data));
@@ -1126,7 +1130,7 @@ static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
blob = psp_copy_user_blob(params.trans_uaddr, params.trans_len);
if (IS_ERR(blob)) {
ret = PTR_ERR(blob);
- goto e_unpin_memory;
+ goto e_put_pages;
}
data.trans_address = __psp_pa(blob);
@@ -1147,13 +1151,13 @@ static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
e_free_blob:
kfree(blob);
-e_unpin_memory:
+e_put_pages:
/* content of memory is updated, mark pages dirty */
for (i = 0; i < n; i++) {
set_page_dirty_lock(pages[i]);
mark_page_accessed(pages[i]);
}
- sev_unpin_memory(kvm, pages, n);
+ sev_memory_put_pages(kvm, pages, n);
return ret;
}
@@ -1383,8 +1387,8 @@ static int sev_send_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
return -EINVAL;
/* Pin guest memory */
- guest_page = sev_pin_memory(kvm, params.guest_uaddr & PAGE_MASK,
- PAGE_SIZE, &n, 0);
+ guest_page = sev_memory_get_pages(kvm, params.guest_uaddr & PAGE_MASK,
+ PAGE_SIZE, &n, 0);
if (IS_ERR(guest_page))
return PTR_ERR(guest_page);
@@ -1392,7 +1396,7 @@ static int sev_send_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
ret = -ENOMEM;
hdr = kzalloc(params.hdr_len, GFP_KERNEL_ACCOUNT);
if (!hdr)
- goto e_unpin;
+ goto e_put_pages;
trans_data = kzalloc(params.trans_len, GFP_KERNEL_ACCOUNT);
if (!trans_data)
@@ -1431,8 +1435,8 @@ static int sev_send_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
kfree(trans_data);
e_free_hdr:
kfree(hdr);
-e_unpin:
- sev_unpin_memory(kvm, guest_page, n);
+e_put_pages:
+ sev_memory_put_pages(kvm, guest_page, n);
return ret;
}
@@ -1579,8 +1583,8 @@ static int sev_receive_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
data.trans_len = params.trans_len;
/* Pin guest memory */
- guest_page = sev_pin_memory(kvm, params.guest_uaddr & PAGE_MASK,
- PAGE_SIZE, &n, 1);
+ guest_page = sev_memory_get_pages(kvm, params.guest_uaddr & PAGE_MASK,
+ PAGE_SIZE, &n, 1);
if (IS_ERR(guest_page)) {
ret = PTR_ERR(guest_page);
goto e_free_trans;
@@ -1602,7 +1606,7 @@ static int sev_receive_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
ret = sev_issue_cmd(kvm, SEV_CMD_RECEIVE_UPDATE_DATA, &data,
&argp->error);
- sev_unpin_memory(kvm, guest_page, n);
+ sev_memory_put_pages(kvm, guest_page, n);
e_free_trans:
kfree(trans);
@@ -2037,7 +2041,7 @@ int sev_mem_enc_register_region(struct kvm *kvm,
return -ENOMEM;
mutex_lock(&kvm->lock);
- region->pages = sev_pin_memory(kvm, range->addr, range->size, ®ion->npages, 1);
+ region->pages = sev_memory_get_pages(kvm, range->addr, range->size, ®ion->npages, 1);
if (IS_ERR(region->pages)) {
ret = PTR_ERR(region->pages);
mutex_unlock(&kvm->lock);
@@ -2084,7 +2088,7 @@ find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
static void __unregister_enc_region_locked(struct kvm *kvm,
struct enc_region *region)
{
- sev_unpin_memory(kvm, region->pages, region->npages);
+ sev_memory_put_pages(kvm, region->pages, region->npages);
list_del(®ion->list);
kfree(region);
}
--
2.25.1
This will be useful to other callers that need to update memory
attributes for things like setting up the initial private memory payload
for a guest.
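A minimal sketch of the intended call pattern (mirroring the later SEV
launch-update usage; gfn_start and gfn_end are placeholders):

  /* Serialize with KVM_SET_MEMORY_ATTRIBUTES updates via slots_lock. */
  mutex_lock(&kvm->slots_lock);
  kvm_vm_set_region_attr(kvm, gfn_start, gfn_end, KVM_MEMORY_ATTRIBUTE_PRIVATE);
  mutex_unlock(&kvm->slots_lock);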
Signed-off-by: Michael Roth <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 26 ++++++++++++++++++--------
2 files changed, 19 insertions(+), 8 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c615650ed256..57d56cd09a61 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -993,6 +993,7 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module);
void kvm_exit(void);
void kvm_get_kvm(struct kvm *kvm);
+int kvm_vm_set_region_attr(struct kvm *kvm, gfn_t start, gfn_t end, u64 attributes);
bool kvm_get_kvm_safe(struct kvm *kvm);
void kvm_put_kvm(struct kvm *kvm);
bool file_is_kvm(struct file *file);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4ccd655dd5af..c740b56d6ba4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2645,12 +2645,28 @@ static void kvm_post_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
}
}
+int kvm_vm_set_region_attr(struct kvm *kvm, gfn_t start, gfn_t end,
+ u64 attributes)
+{
+ gfn_t index;
+ void *entry;
+
+ entry = attributes ? xa_mk_value(attributes) : NULL;
+
+ for (index = start; index < end; index++)
+ if (xa_err(xa_store(&kvm->mem_attr_array, index, entry,
+ GFP_KERNEL_ACCOUNT)))
+ break;
+
+ return index;
+}
+EXPORT_SYMBOL_GPL(kvm_vm_set_region_attr);
+
static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
struct kvm_memory_attributes *attrs)
{
gfn_t start, end;
unsigned long i;
- void *entry;
/* flags is currently not used. */
if (attrs->flags)
@@ -2665,8 +2681,6 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
start = attrs->address >> PAGE_SHIFT;
end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
- entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
-
mutex_lock(&kvm->slots_lock);
KVM_MMU_LOCK(kvm);
@@ -2674,11 +2688,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
kvm_mmu_invalidate_range_add(kvm, start, end);
KVM_MMU_UNLOCK(kvm);
- for (i = start; i < end; i++)
- if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
- GFP_KERNEL_ACCOUNT)))
- break;
-
+ i = kvm_vm_set_region_attr(kvm, start, end, attrs->attributes);
KVM_MMU_LOCK(kvm);
if (i > start)
--
2.25.1
On Mon, 20 Feb 2023, Michael Roth wrote:
> From: Hugh Dickins <[email protected]>
No.
>
> When the address is backed by a memfd, the code to split the page does
> nothing more than remove the PMD from the page tables. So immediately
> install a PTE to ensure that any other pages in that 2MB region are
> brought back as in 4K pages.
>
> Signed-off-by: Hugh Dickins <[email protected]>
No. Suggested-by would be okay.
> Cc: Hugh Dickins <[email protected]>
Thanks. I'm really sorry to be such a jobsworth,
and have nothing more constructive to say than I did before in
https://lore.kernel.org/linux-mm/[email protected]/
(please re-read), but adding a Signed-off-by where none was given is wrong;
and I'm not ever going to comprehend enough of the context to give it.
Best of luck for the series,
Hugh
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> mm/memory.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e68da7e403c6..33c9020ba1f8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4999,6 +4999,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> static int handle_split_page_fault(struct vm_fault *vmf)
> {
> __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
> + /*
> + * Install a PTE immediately to ensure that any other pages in
> + * this 2MB region are brought back in as 4K pages.
> + */
> + __pte_alloc(vmf->vma->vm_mm, vmf->pmd);
> return 0;
> }
>
> --
> 2.25.1
On Mon, 20 Feb 2023 12:38:02 -0600
Michael Roth <[email protected]> wrote:
> From: Brijesh Singh <[email protected]>
>
> The memory integrity guarantees of SEV-SNP are enforced through a new
> structure called the Reverse Map Table (RMP). The RMP is a single data
> structure shared across the system that contains one entry for every 4K
> page of DRAM that may be used by SEV-SNP VMs. The goal of RMP is to
> track the owner of each page of memory. Pages of memory can be owned by
> the hypervisor, owned by a specific VM or owned by the AMD-SP. See APM2
> section 15.36.3 for more detail on RMP.
>
> The RMP table is used to enforce access control to memory. The table
> itself is not directly writable by the software. New CPU instructions
> (RMPUPDATE, PVALIDATE, RMPADJUST) are used to manipulate the RMP
> entries.
>
> Based on the platform configuration, the BIOS reserves the memory used
> for the RMP table. The start and end address of the RMP table must be
> queried by reading the RMP_BASE and RMP_END MSRs. If the RMP_BASE and
> RMP_END are not set then disable the SEV-SNP feature.
>
> The SEV-SNP feature is enabled only after the RMP table is successfully
> initialized.
>
> Also set SYSCFG.MFMD when enabling SNP as SEV-SNP FW >= 1.51 requires
> that SYSCFG.MFMD must be se
>
^ unfinished sentence.
> RMP table entry format is non-architectural and it can vary by processor
> and is defined by the PPR. Restrict SNP support on the known CPU model
> and family for which the RMP table entry format is currently defined
> for.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/include/asm/disabled-features.h | 8 +-
> arch/x86/include/asm/msr-index.h | 11 +-
> arch/x86/kernel/sev.c | 175 +++++++++++++++++++++++
> 3 files changed, 192 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
> index 33d2cd04d254..9b5a2cc8064a 100644
> --- a/arch/x86/include/asm/disabled-features.h
> +++ b/arch/x86/include/asm/disabled-features.h
> @@ -87,6 +87,12 @@
> # define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
> #endif
>
> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +# define DISABLE_SEV_SNP 0
> +#else
> +# define DISABLE_SEV_SNP (1 << (X86_FEATURE_SEV_SNP & 31))
> +#endif
> +
> /*
> * Make sure to add features to the correct mask
> */
> @@ -110,7 +116,7 @@
> DISABLE_ENQCMD)
> #define DISABLED_MASK17 0
> #define DISABLED_MASK18 0
> -#define DISABLED_MASK19 0
> +#define DISABLED_MASK19 (DISABLE_SEV_SNP)
> #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 20)
>
> #endif /* _ASM_X86_DISABLED_FEATURES_H */
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 10ac52705892..35100c630617 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -565,6 +565,8 @@
> #define MSR_AMD64_SEV_ENABLED BIT_ULL(MSR_AMD64_SEV_ENABLED_BIT)
> #define MSR_AMD64_SEV_ES_ENABLED BIT_ULL(MSR_AMD64_SEV_ES_ENABLED_BIT)
> #define MSR_AMD64_SEV_SNP_ENABLED BIT_ULL(MSR_AMD64_SEV_SNP_ENABLED_BIT)
> +#define MSR_AMD64_RMP_BASE 0xc0010132
> +#define MSR_AMD64_RMP_END 0xc0010133
>
> #define MSR_AMD64_VIRT_SPEC_CTRL 0xc001011f
>
> @@ -649,7 +651,14 @@
> #define MSR_K8_TOP_MEM2 0xc001001d
> #define MSR_AMD64_SYSCFG 0xc0010010
> #define MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT 23
> -#define MSR_AMD64_SYSCFG_MEM_ENCRYPT BIT_ULL(MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT)
> +#define MSR_AMD64_SYSCFG_MEM_ENCRYPT BIT_ULL(MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT)
> +#define MSR_AMD64_SYSCFG_SNP_EN_BIT 24
> +#define MSR_AMD64_SYSCFG_SNP_EN BIT_ULL(MSR_AMD64_SYSCFG_SNP_EN_BIT)
> +#define MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT 25
> +#define MSR_AMD64_SYSCFG_SNP_VMPL_EN BIT_ULL(MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT)
> +#define MSR_AMD64_SYSCFG_MFDM_BIT 19
> +#define MSR_AMD64_SYSCFG_MFDM BIT_ULL(MSR_AMD64_SYSCFG_MFDM_BIT)
^ an extra tab?
> +
> #define MSR_K8_INT_PENDING_MSG 0xc0010055
> /* C1E active bits in int pending message */
> #define K8_INTP_C1E_ACTIVE_MASK 0x18000000
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index a428c62330d3..e54e412c9916 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -22,6 +22,9 @@
> #include <linux/efi.h>
> #include <linux/platform_device.h>
> #include <linux/io.h>
> +#include <linux/cpumask.h>
> +#include <linux/iommu.h>
> +#include <linux/amd-iommu.h>
>
> #include <asm/cpu_entry_area.h>
> #include <asm/stacktrace.h>
> @@ -38,6 +41,7 @@
> #include <asm/apic.h>
> #include <asm/cpuid.h>
> #include <asm/cmdline.h>
> +#include <asm/iommu.h>
>
> #define DR7_RESET_VALUE 0x400
>
> @@ -57,6 +61,12 @@
> #define AP_INIT_CR0_DEFAULT 0x60000010
> #define AP_INIT_MXCSR_DEFAULT 0x1f80
>
> +/*
> + * The first 16KB from the RMP_BASE is used by the processor for the
> + * bookkeeping, the range needs to be added during the RMP entry lookup.
> + */
> +#define RMPTABLE_CPU_BOOKKEEPING_SZ 0x4000
> +
> /* For early boot hypervisor communication in SEV-ES enabled guests */
> static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>
> @@ -69,6 +79,9 @@ static struct ghcb *boot_ghcb __section(".data");
> /* Bitmap of SEV features supported by the hypervisor */
> static u64 sev_hv_features __ro_after_init;
>
> +static unsigned long rmptable_start __ro_after_init;
> +static unsigned long rmptable_end __ro_after_init;
> +
> /* #VC handler runtime per-CPU data */
> struct sev_es_runtime_data {
> struct ghcb ghcb_page;
> @@ -2260,3 +2273,165 @@ static int __init snp_init_platform_device(void)
> return 0;
> }
> device_initcall(snp_init_platform_device);
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt) "SEV-SNP: " fmt
> +
> +static int __mfd_enable(unsigned int cpu)
> +{
> + u64 val;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> + return 0;
> +
> + rdmsrl(MSR_AMD64_SYSCFG, val);
> +
> + val |= MSR_AMD64_SYSCFG_MFDM;
> +
> + wrmsrl(MSR_AMD64_SYSCFG, val);
> +
> + return 0;
> +}
> +
> +static __init void mfd_enable(void *arg)
> +{
> + __mfd_enable(smp_processor_id());
> +}
> +
> +static int __snp_enable(unsigned int cpu)
> +{
> + u64 val;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> + return 0;
> +
> + rdmsrl(MSR_AMD64_SYSCFG, val);
> +
> + val |= MSR_AMD64_SYSCFG_SNP_EN;
> + val |= MSR_AMD64_SYSCFG_SNP_VMPL_EN;
> +
> + wrmsrl(MSR_AMD64_SYSCFG, val);
> +
> + return 0;
> +}
> +
> +static __init void snp_enable(void *arg)
> +{
> + __snp_enable(smp_processor_id());
> +}
> +
> +static bool get_rmptable_info(u64 *start, u64 *len)
> +{
> + u64 calc_rmp_sz, rmp_sz, rmp_base, rmp_end;
> +
> + rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
> + rdmsrl(MSR_AMD64_RMP_END, rmp_end);
> +
> + if (!rmp_base || !rmp_end) {
> + pr_err("Memory for the RMP table has not been reserved by BIOS\n");
> + return false;
> + }
> +
> + rmp_sz = rmp_end - rmp_base + 1;
> +
> + /*
> + * Calculate the amount the memory that must be reserved by the BIOS to
> + * address the whole RAM. The reserved memory should also cover the
> + * RMP table itself.
> + */
> + calc_rmp_sz = (((rmp_sz >> PAGE_SHIFT) + totalram_pages()) << 4)
> + + RMPTABLE_CPU_BOOKKEEPING_SZ;
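(Sanity-checking this math, assuming the 16-byte-per-4K-page RMP entry size
that the << 4 shift implies: a host with 64 GiB of DRAM has 16M 4K pages, so
the entries alone need roughly 16M * 16 bytes = 256 MiB, plus entries covering
the RMP region itself and the 16 KiB bookkeeping area, which is what this
check expects the BIOS to have reserved.)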
> +
> + if (calc_rmp_sz > rmp_sz) {
> + pr_err("Memory reserved for the RMP table does not cover full system RAM (expected 0x%llx got 0x%llx)\n",
> + calc_rmp_sz, rmp_sz);
> + return false;
> + }
> +
> + *start = rmp_base;
> + *len = rmp_sz;
> +
> + pr_info("RMP table physical address [0x%016llx - 0x%016llx]\n", rmp_base, rmp_end);
> +
> + return true;
> +}
> +
> +static __init int snp_rmptable_init(void)
> +{
> + u64 rmp_base, sz;
> + void *start;
> + u64 val;
> +
> + if (!get_rmptable_info(&rmp_base, &sz))
> + return 1;
> +
> + start = memremap(rmp_base, sz, MEMREMAP_WB);
> + if (!start) {
> + pr_err("Failed to map RMP table addr 0x%llx size 0x%llx\n", rmp_base, sz);
> + return 1;
> + }
> +
> + /*
> + * Check if SEV-SNP is already enabled, this can happen in case of
> + * kexec boot.
> + */
> + rdmsrl(MSR_AMD64_SYSCFG, val);
> + if (val & MSR_AMD64_SYSCFG_SNP_EN)
> + goto skip_enable;
> +
> + memset(start, 0, sz);
> +
> + /* Flush the caches to ensure that data is written before SNP is enabled. */
> + wbinvd_on_all_cpus();
> +
> + /* MFDM must be enabled on all the CPUs prior to enabling SNP. */
> + on_each_cpu(mfd_enable, NULL, 1);
> +
> + /* Enable SNP on all CPUs. */
> + on_each_cpu(snp_enable, NULL, 1);
> +
> +skip_enable:
> + rmptable_start = (unsigned long)start;
> + rmptable_end = rmptable_start + sz - 1;
> +
> + return 0;
> +}
> +
> +static int __init snp_host_init(void)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> + return 0;
> +
> + /*
> + * RMP table entry format is not architectural and it can vary by processor and
> + * is defined by the per-processor PPR. Restrict SNP support on the known CPU
> + * model and family for which the RMP table entry format is currently defined for.
> + */
> + if (boot_cpu_data.x86 != 0x19 || boot_cpu_data.x86_model > 0xaf)
> + goto nosnp;
> +
> + if (amd_iommu_snp_enable())
> + goto nosnp;
> +
> + if (snp_rmptable_init())
> + goto nosnp;
> +
> + cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/rmptable_init:online", __snp_enable, NULL);
> +
> + return 0;
> +
> +nosnp:
> + setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
> + return -ENODEV;
> +}
> +
> +/*
> + * This must be called after the PCI subsystem. This is because amd_iommu_snp_enable()
> + * is called to ensure the IOMMU supports the SEV-SNP feature, which can only be
> + * called after subsys_initcall().
> + *
> + * NOTE: IOMMU is enforced by SNP to ensure that hypervisor cannot program DMA
> + * directly into guest private memory. In case of SNP, the IOMMU ensures that
> + * the page(s) used for DMA are hypervisor owned.
> + */
> +fs_initcall(snp_host_init);
On Mon, Feb 20, 2023 at 11:57:59AM -0800, Hugh Dickins wrote:
> On Mon, 20 Feb 2023, Michael Roth wrote:
>
> > From: Hugh Dickins <[email protected]>
>
> No.
>
> >
> > When the address is backed by a memfd, the code to split the page does
> > nothing more than remove the PMD from the page tables. So immediately
> > install a PTE to ensure that any other pages in that 2MB region are
> > brought back as in 4K pages.
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
>
> No. Suggested-by would be okay.
>
> > Cc: Hugh Dickins <[email protected]>
>
> Thanks. I'm really sorry to be such a jobsworth,
> and have nothing more constructive to say than I did before in
> https://lore.kernel.org/linux-mm/[email protected]/
> (please re-read), but adding a Signed-off-by where none was given is wrong;
> and I'm not ever going to comprehend enough of the context to give it.
Hi Hugh,
Sorry for the mix-up, I'd intended to remove this patch before submitting. I'll
make sure to remove it from future postings.
>
> Best of luck for the series,
Thank you!
-Mike
> Hugh
>
> > Signed-off-by: Ashish Kalra <[email protected]>
> > Signed-off-by: Michael Roth <[email protected]>
> > ---
> > mm/memory.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index e68da7e403c6..33c9020ba1f8 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4999,6 +4999,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> > static int handle_split_page_fault(struct vm_fault *vmf)
> > {
> > __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
> > + /*
> > + * Install a PTE immediately to ensure that any other pages in
> > + * this 2MB region are brought back in as 4K pages.
> > + */
> > + __pte_alloc(vmf->vma->vm_mm, vmf->pmd);
> > return 0;
> > }
> >
> > --
> > 2.25.1
On Mon, 20 Feb 2023 12:38:10 -0600
Michael Roth <[email protected]> wrote:
> From: Ashish Kalra <[email protected]>
>
> Return pfn from dump_pagetable() to do SEV-specific
> fault handling. Used for handling SNP RMP page fault.
>
It would be better to merge this patch with PATCH 13.
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/mm/fault.c | 15 +++++++++++----
> 1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index afd4cde17001..f2b16dcfbd9a 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -311,7 +311,7 @@ static bool low_pfn(unsigned long pfn)
> return pfn < max_low_pfn;
> }
>
> -static void dump_pagetable(unsigned long address)
> +static unsigned long dump_pagetable(unsigned long address)
> {
> pgd_t *base = __va(read_cr3_pa());
> pgd_t *pgd = &base[pgd_index(address)];
> @@ -345,8 +345,10 @@ static void dump_pagetable(unsigned long address)
>
> pte = pte_offset_kernel(pmd, address);
> pr_cont("*pte = %0*Lx ", sizeof(*pte) * 2, (u64)pte_val(*pte));
> + return 0;
> out:
> pr_cont("\n");
> + return 0;
> }
>
> #else /* CONFIG_X86_64: */
> @@ -367,10 +369,11 @@ static int bad_address(void *p)
> return get_kernel_nofault(dummy, (unsigned long *)p);
> }
>
> -static void dump_pagetable(unsigned long address)
> +static unsigned long dump_pagetable(unsigned long address)
> {
> pgd_t *base = __va(read_cr3_pa());
> pgd_t *pgd = base + pgd_index(address);
> + unsigned long pfn;
> p4d_t *p4d;
> pud_t *pud;
> pmd_t *pmd;
> @@ -388,6 +391,7 @@ static void dump_pagetable(unsigned long address)
> if (bad_address(p4d))
> goto bad;
>
> + pfn = p4d_pfn(*p4d);
> pr_cont("P4D %lx ", p4d_val(*p4d));
> if (!p4d_present(*p4d) || p4d_large(*p4d))
> goto out;
> @@ -396,6 +400,7 @@ static void dump_pagetable(unsigned long address)
> if (bad_address(pud))
> goto bad;
>
> + pfn = pud_pfn(*pud);
> pr_cont("PUD %lx ", pud_val(*pud));
> if (!pud_present(*pud) || pud_large(*pud))
> goto out;
> @@ -404,6 +409,7 @@ static void dump_pagetable(unsigned long address)
> if (bad_address(pmd))
> goto bad;
>
> + pfn = pmd_pfn(*pmd);
> pr_cont("PMD %lx ", pmd_val(*pmd));
> if (!pmd_present(*pmd) || pmd_large(*pmd))
> goto out;
> @@ -412,13 +418,14 @@ static void dump_pagetable(unsigned long address)
> if (bad_address(pte))
> goto bad;
>
> + pfn = pte_pfn(*pte);
> pr_cont("PTE %lx", pte_val(*pte));
> out:
> pr_cont("\n");
> -
> - return;
> + return pfn;
> bad:
> pr_info("BAD\n");
> + return -1;
> }
>
> #endif /* CONFIG_X86_64 */
On Mon, 20 Feb 2023 12:37:55 -0600
Michael Roth <[email protected]> wrote:
> From: Vishal Annapurve <[email protected]>
>
> Introduce HVA range operator so that other KVM subsystems
> can operate on HVA range.
>
> Signed-off-by: Vishal Annapurve <[email protected]>
> [mdr: minor checkpatch alignment fixups]
> Signed-off-by: Michael Roth <[email protected]>
> ---
> include/linux/kvm_host.h | 6 +++++
> virt/kvm/kvm_main.c | 48 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 54 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4d542060cd93..c615650ed256 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1402,6 +1402,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm);
> void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
> void kvm_mmu_invalidate_end(struct kvm *kvm);
>
> +typedef int (*kvm_hva_range_op_t)(struct kvm *kvm,
> + struct kvm_gfn_range *range, void *data);
> +
> +int kvm_vm_do_hva_range_op(struct kvm *kvm, unsigned long hva_start,
> + unsigned long hva_end, kvm_hva_range_op_t handler, void *data);
> +
> long kvm_arch_dev_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg);
> long kvm_arch_vcpu_ioctl(struct file *filp,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f7e00593cc5d..4ccd655dd5af 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -642,6 +642,54 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> return (int)ret;
> }
>
The function below looks like a reduced duplicate of __kvm_handle_hva_range()
in virt/kvm/kvm_main.c. It would be nice to factor out the common memslot walk
from __kvm_handle_hva_range() and share it, as sketched below.
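Something along these lines is roughly what I have in mind (the helper name is
invented here, SRCU/locking is left to the callers, and this is untested):

typedef int (*kvm_gfn_range_fn_t)(struct kvm *kvm,
                                  struct kvm_gfn_range *range, void *data);

/*
 * Shared memslot walk that both __kvm_handle_hva_range() and the new
 * kvm_vm_do_hva_range_op() could be built on top of.
 */
static int kvm_for_each_gfn_range_in_hva(struct kvm *kvm,
                                         unsigned long hva_start,
                                         unsigned long hva_end,
                                         kvm_gfn_range_fn_t fn, void *data)
{
        struct kvm_gfn_range gfn_range;
        struct kvm_memory_slot *slot;
        struct kvm_memslots *slots;
        int i, ret = 0;

        for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
                struct interval_tree_node *node;

                slots = __kvm_memslots(kvm, i);
                kvm_for_each_memslot_in_hva_range(node, slots,
                                                  hva_start, hva_end - 1) {
                        unsigned long start, end;

                        slot = container_of(node, struct kvm_memory_slot,
                                            hva_node[slots->node_idx]);
                        start = max(hva_start, slot->userspace_addr);
                        end = min(hva_end, slot->userspace_addr +
                                  (slot->npages << PAGE_SHIFT));

                        /* gfns for pages intersecting [hva_start, hva_end) */
                        gfn_range.start = hva_to_gfn_memslot(start, slot);
                        gfn_range.end = hva_to_gfn_memslot(end + PAGE_SIZE - 1, slot);
                        gfn_range.slot = slot;

                        ret = fn(kvm, &gfn_range, data);
                        if (ret)
                                break;
                }
                if (ret)
                        break;
        }

        return ret;
}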
> +int kvm_vm_do_hva_range_op(struct kvm *kvm, unsigned long hva_start,
> + unsigned long hva_end, kvm_hva_range_op_t handler, void *data)
> +{
> + int ret = 0;
> + struct kvm_gfn_range gfn_range;
> + struct kvm_memory_slot *slot;
> + struct kvm_memslots *slots;
> + int i, idx;
> +
> + if (WARN_ON_ONCE(hva_end <= hva_start))
> + return -EINVAL;
> +
> + idx = srcu_read_lock(&kvm->srcu);
> +
> + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> + struct interval_tree_node *node;
> +
> + slots = __kvm_memslots(kvm, i);
> + kvm_for_each_memslot_in_hva_range(node, slots,
> + hva_start, hva_end - 1) {
> + unsigned long start, end;
> +
> + slot = container_of(node, struct kvm_memory_slot,
> + hva_node[slots->node_idx]);
> + start = max(hva_start, slot->userspace_addr);
> + end = min(hva_end, slot->userspace_addr +
> + (slot->npages << PAGE_SHIFT));
> +
> + /*
> + * {gfn(page) | page intersects with [hva_start, hva_end)} =
> + * {gfn_start, gfn_start+1, ..., gfn_end-1}.
> + */
> + gfn_range.start = hva_to_gfn_memslot(start, slot);
> + gfn_range.end = hva_to_gfn_memslot(end + PAGE_SIZE - 1, slot);
> + gfn_range.slot = slot;
> +
> + ret = handler(kvm, &gfn_range, data);
> + if (ret)
> + goto e_ret;
> + }
> + }
> +
> +e_ret:
> + srcu_read_unlock(&kvm->srcu, idx);
> +
> + return ret;
> +}
> +
> static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
> unsigned long start,
> unsigned long end,
On Mon, 20 Feb 2023 12:38:15 -0600
Michael Roth <[email protected]> wrote:
> From: Brijesh Singh <[email protected]>
>
> The behavior and requirement for the SEV-legacy command is altered when
> the SNP firmware is in the INIT state. See SEV-SNP firmware specification
> for more details.
>
> Allocate the Trusted Memory Region (TMR) as a 2mb sized/aligned region
> when SNP is enabled to satisfy new requirements for the SNP. Continue
> allocating a 1mb region for !SNP configuration.
>
> While at it, provide an API that can be used by others to allocate a page
> that can be used by the firmware. The immediate user for this API will
> be the KVM driver, which needs to allocate a firmware context page
> during guest creation. The context page needs to be updated by the
> firmware. See the SEV-SNP specification for further details.
>
> Co-developed-by: Ashish Kalra <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> drivers/crypto/ccp/sev-dev.c | 148 +++++++++++++++++++++++++++++++++--
> include/linux/psp-sev.h | 9 +++
> 2 files changed, 149 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index eca4e59b0f44..4c12e98a1219 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -94,6 +94,13 @@ static void *sev_init_ex_buffer;
> */
> struct sev_data_range_list *snp_range_list;
>
> +/* When SEV-SNP is enabled the TMR needs to be 2MB aligned and 2MB size. */
> +#define SEV_SNP_ES_TMR_SIZE (2 * 1024 * 1024)
It would be better to reuse the kernel size definition macros from <linux/sizes.h>, e.g. SZ_2M.
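i.e., assuming <linux/sizes.h> is (or gets) included here:

#define SEV_SNP_ES_TMR_SIZE	SZ_2M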
> +
> +static size_t sev_es_tmr_size = SEV_ES_TMR_SIZE;
> +
> +static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret);
> +
> static inline bool sev_version_greater_or_equal(u8 maj, u8 min)
> {
> struct sev_device *sev = psp_master->sev_data;
> @@ -216,11 +223,134 @@ void snp_mark_pages_offline(unsigned long pfn, unsigned int npages)
> }
> EXPORT_SYMBOL_GPL(snp_mark_pages_offline);
>
> +static int snp_reclaim_pages(unsigned long paddr, unsigned int npages, bool locked)
> +{
> + /* Cbit maybe set in the paddr */
This is confusing.
I suppose the C-bit is treated as an attribute of a PTE in the kernel, not as
part of the PA. That means only a PTE might carry a C-bit.
The paddr here comes from __pa(page_address()); it is not extracted from a PTE,
so the returned value should never have the C-bit set.
BTW: wouldn't it be better to have a pfn as the input param instead of a paddr?
The caller has a struct page, so calling snp_reclaim_pages(page_to_pfn(page), xxxxx)
would be much clearer than the current conversion chain:
page_address() (struct page converted to a VA) and __pa() (VA converted to a PA)
in the caller, and then the PA converted back to a pfn here.
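Roughly like this (hypothetical and untested, just to illustrate the idea):

static int snp_reclaim_pages(unsigned long pfn, unsigned int npages, bool locked);

static int example_caller(struct page *page, unsigned int npages, bool locked)
{
	/* no page_address()/__pa()/__sme_clr() round trip needed */
	return snp_reclaim_pages(page_to_pfn(page), npages, locked);
}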
> + unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT;
> + int ret, err, i, n = 0;
> +
These should be unsigned int i, n; since the input param npages is unsigned int.
> + if (!pfn_valid(pfn)) {
> + pr_err("%s: Invalid PFN %lx\n", __func__, pfn);
> + return 0;
> + }
> +
> + for (i = 0; i < npages; i++, pfn++, n++) {
> + paddr = pfn << PAGE_SHIFT;
> +
> + if (locked)
> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &paddr, &err);
> + else
> + ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &paddr, &err);
> +
> + if (ret)
> + goto cleanup;
> +
> + ret = rmp_make_shared(pfn, PG_LEVEL_4K);
> + if (ret)
> + goto cleanup;
> + }
> +
> + return 0;
> +
> +cleanup:
> + /*
> + * If failed to reclaim the page then page is no longer safe to
> + * be release back to the system, leak it.
> + */
> + snp_mark_pages_offline(pfn, npages - n);
> + return ret;
> +}
> +
> +static int rmp_mark_pages_firmware(unsigned long paddr, unsigned int npages, bool locked)
Same comment as above: it would be better to take a pfn or a struct page
instead of a paddr and avoid the redundant conversions.
> +{
> + /* Cbit maybe set in the paddr */
> + unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT;
> + int rc, n = 0, i;
> +
> + for (i = 0; i < npages; i++, n++, pfn++) {
> + rc = rmp_make_private(pfn, 0, PG_LEVEL_4K, 0, true);
> + if (rc)
> + goto cleanup;
> + }
> +
> + return 0;
> +
> +cleanup:
> + /*
> + * Try unrolling the firmware state changes by
> + * reclaiming the pages which were already changed to the
> + * firmware state.
> + */
> + snp_reclaim_pages(paddr, n, locked);
> +
> + return rc;
> +}
> +
> +static struct page *__snp_alloc_firmware_pages(gfp_t gfp_mask, int order, bool locked)
> +{
> + unsigned long npages = 1ul << order, paddr;
> + struct sev_device *sev;
> + struct page *page;
> +
> + if (!psp_master || !psp_master->sev_data)
> + return NULL;
> +
> + page = alloc_pages(gfp_mask, order);
> + if (!page)
> + return NULL;
> +
> + /* If SEV-SNP is initialized then add the page in RMP table. */
> + sev = psp_master->sev_data;
> + if (!sev->snp_initialized)
> + return page;
> +
> + paddr = __pa((unsigned long)page_address(page));
> + if (rmp_mark_pages_firmware(paddr, npages, locked))
> + return NULL;
> +
> + return page;
> +}
> +
> +void *snp_alloc_firmware_page(gfp_t gfp_mask)
> +{
> + struct page *page;
> +
> + page = __snp_alloc_firmware_pages(gfp_mask, 0, false);
> +
> + return page ? page_address(page) : NULL;
> +}
> +EXPORT_SYMBOL_GPL(snp_alloc_firmware_page);
> +
> +static void __snp_free_firmware_pages(struct page *page, int order, bool locked)
> +{
> + struct sev_device *sev = psp_master->sev_data;
> + unsigned long paddr, npages = 1ul << order;
> +
> + if (!page)
> + return;
> +
> + paddr = __pa((unsigned long)page_address(page));
> + if (sev->snp_initialized &&
> + snp_reclaim_pages(paddr, npages, locked))
> + return;
> +
> + __free_pages(page, order);
> +}
> +
> +void snp_free_firmware_page(void *addr)
> +{
> + if (!addr)
> + return;
> +
> + __snp_free_firmware_pages(virt_to_page(addr), 0, false);
> +}
> +EXPORT_SYMBOL_GPL(snp_free_firmware_page);
> +
> static void *sev_fw_alloc(unsigned long len)
> {
> struct page *page;
>
> - page = alloc_pages(GFP_KERNEL, get_order(len));
> + page = __snp_alloc_firmware_pages(GFP_KERNEL, get_order(len), false);
> if (!page)
> return NULL;
>
> @@ -468,7 +598,7 @@ static int __sev_init_locked(int *error)
> data.tmr_address = __pa(sev_es_tmr);
>
> data.flags |= SEV_INIT_FLAGS_SEV_ES;
> - data.tmr_len = SEV_ES_TMR_SIZE;
> + data.tmr_len = sev_es_tmr_size;
> }
>
> return __sev_do_cmd_locked(SEV_CMD_INIT, &data, error);
> @@ -491,7 +621,7 @@ static int __sev_init_ex_locked(int *error)
> data.tmr_address = __pa(sev_es_tmr);
>
> data.flags |= SEV_INIT_FLAGS_SEV_ES;
> - data.tmr_len = SEV_ES_TMR_SIZE;
> + data.tmr_len = sev_es_tmr_size;
> }
>
> return __sev_do_cmd_locked(SEV_CMD_INIT_EX, &data, error);
> @@ -982,6 +1112,8 @@ static int __sev_snp_init_locked(int *error)
> sev->snp_initialized = true;
> dev_dbg(sev->dev, "SEV-SNP firmware initialized\n");
>
> + sev_es_tmr_size = SEV_SNP_ES_TMR_SIZE;
> +
> return rc;
> }
>
> @@ -1499,8 +1631,9 @@ static void sev_firmware_shutdown(struct sev_device *sev)
> /* The TMR area was encrypted, flush it from the cache */
> wbinvd_on_all_cpus();
>
> - free_pages((unsigned long)sev_es_tmr,
> - get_order(SEV_ES_TMR_SIZE));
> + __snp_free_firmware_pages(virt_to_page(sev_es_tmr),
> + get_order(sev_es_tmr_size),
> + false);
> sev_es_tmr = NULL;
> }
>
> @@ -1511,8 +1644,7 @@ static void sev_firmware_shutdown(struct sev_device *sev)
> }
>
> if (snp_range_list) {
> - free_pages((unsigned long)snp_range_list,
> - get_order(PAGE_SIZE));
> + snp_free_firmware_page(snp_range_list);
> snp_range_list = NULL;
> }
>
> @@ -1593,7 +1725,7 @@ void sev_pci_init(void)
> }
>
> /* Obtain the TMR memory area for SEV-ES use */
> - sev_es_tmr = sev_fw_alloc(SEV_ES_TMR_SIZE);
> + sev_es_tmr = sev_fw_alloc(sev_es_tmr_size);
> if (!sev_es_tmr)
> dev_warn(sev->dev,
> "SEV: TMR allocation failed, SEV-ES support unavailable\n");
> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
> index 8edf5c548fbf..d19744807471 100644
> --- a/include/linux/psp-sev.h
> +++ b/include/linux/psp-sev.h
> @@ -922,6 +922,8 @@ int sev_guest_decommission(struct sev_data_decommission *data, int *error);
> int sev_do_cmd(int cmd, void *data, int *psp_ret);
>
> void *psp_copy_user_blob(u64 uaddr, u32 len);
> +void *snp_alloc_firmware_page(gfp_t mask);
> +void snp_free_firmware_page(void *addr);
>
> /**
> * sev_mark_pages_offline - insert non-reclaimed firmware/guest pages
> @@ -959,6 +961,13 @@ static inline void *psp_copy_user_blob(u64 __user uaddr, u32 len) { return ERR_P
>
> void snp_mark_pages_offline(unsigned long pfn, unsigned int npages) {}
>
> +static inline void *snp_alloc_firmware_page(gfp_t mask)
> +{
> + return NULL;
> +}
> +
> +static inline void snp_free_firmware_page(void *addr) { }
> +
> #endif /* CONFIG_CRYPTO_DEV_SP_PSP */
>
> #endif /* __PSP_SEV_H__ */
Hi Mike,
On 20/02/2023 20:38, Michael Roth wrote:
> From: Dionna Glaze <[email protected]>
>
> The /dev/sev device has the ability to store host-wide certificates for
> the key used by the AMD-SP for SEV-SNP attestation report signing,
> but for hosts that want to specify additional certificates that are
> specific to the image launched in a VM, a different way is needed to
> communicate those certificates.
>
> Add two new KVM ioctl to handle this: KVM_SEV_SNP_{GET,SET}_CERTS
>
> The certificates that are set with this command are expected to follow
> the same format as the host certificates, but that format is opaque
> to the kernel.
>
> The new behavior for custom certificates is that the extended guest
> request command will now return the overridden certificates if they
> were installed for the instance. The error condition for a too small
> data buffer is changed to return the overridden certificate data size
> if there is an overridden certificate set installed.
>
> Setting a 0 length certificate returns the system state to only return
> the host certificates on an extended guest request.
>
> Also increase the SEV_FW_BLOB_MAX_SIZE another 4K page to allow space
> for an extra certificate.
>
> Cc: Tom Lendacky <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
>
> Signed-off-by: Dionna Glaze <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> [mdr: remove used of "we" and "this patch" in commit log]
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 111 ++++++++++++++++++++++++++++++++++++++-
> arch/x86/kvm/svm/svm.h | 1 +
> include/linux/psp-sev.h | 2 +-
> include/uapi/linux/kvm.h | 12 +++++
> 4 files changed, 123 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 70d5650d8d95..18b64b7005e7 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2089,6 +2089,7 @@ static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
> goto e_free;
>
> sev->snp_certs_data = certs_data;
> + sev->snp_certs_len = 0;
>
> return context;
>
> @@ -2404,6 +2405,86 @@ static int snp_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
> return ret;
> }
>
> +static int snp_get_instance_certs(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct kvm_sev_snp_get_certs params;
> +
> + if (!sev_snp_guest(kvm))
> + return -ENOTTY;
> +
> + if (!sev->snp_context)
> + return -EINVAL;
> +
> + if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data,
> + sizeof(params)))
> + return -EFAULT;
> +
> + /* No instance certs set. */
> + if (!sev->snp_certs_len)
> + return -ENOENT;
> +
> + if (params.certs_len < sev->snp_certs_len) {
> + /* Output buffer too small. Return the required size. */
> + params.certs_len = sev->snp_certs_len;
> +
> + if (copy_to_user((void __user *)(uintptr_t)argp->data, ¶ms,
> + sizeof(params)))
> + return -EFAULT;
> +
> + return -EINVAL;
> + }
> +
> + if (copy_to_user((void __user *)(uintptr_t)params.certs_uaddr,
> + sev->snp_certs_data, sev->snp_certs_len))
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +static int snp_set_instance_certs(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + unsigned long length = SEV_FW_BLOB_MAX_SIZE;
> + void *to_certs = sev->snp_certs_data;
> + struct kvm_sev_snp_set_certs params;
> +
> + if (!sev_snp_guest(kvm))
> + return -ENOTTY;
> +
> + if (!sev->snp_context)
> + return -EINVAL;
> +
> + if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data,
> + sizeof(params)))
> + return -EFAULT;
> +
> + if (params.certs_len > SEV_FW_BLOB_MAX_SIZE)
> + return -EINVAL;
> +
> + /*
> + * Setting a length of 0 is the same as "uninstalling" instance-
> + * specific certificates.
> + */
> + if (params.certs_len == 0) {
> + sev->snp_certs_len = 0;
> + return 0;
> + }
> +
> + /* Page-align the length */
> + length = (params.certs_len + PAGE_SIZE - 1) & PAGE_MASK;
> +
In the comments on v7 [1] Dionna agreed to add a cleanup here:
/* The size could shrink and leave garbage at the end. */
memset(sev->snp_certs_data, 0, SEV_FW_BLOB_MAX_SIZE);
(we can use 'to_certs' in the first argument)
but it wasn't added in v8.
-Dov
[1] https://lore.kernel.org/linux-coco/CAAH4kHYOtzgqSTZQFcRiZwPLCkLAThjsCMdjUCdsBTiP=W0Vxw@mail.gmail.com/
> + if (copy_from_user(to_certs,
> + (void __user *)(uintptr_t)params.certs_uaddr,
> + params.certs_len)) {
> + return -EFAULT;
> + }
> +
> + sev->snp_certs_len = length;
> +
> + return 0;
> +}
> +
> int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_sev_cmd sev_cmd;
> @@ -2503,6 +2584,12 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> case KVM_SEV_SNP_LAUNCH_FINISH:
> r = snp_launch_finish(kvm, &sev_cmd);
> break;
> + case KVM_SEV_SNP_GET_CERTS:
> + r = snp_get_instance_certs(kvm, &sev_cmd);
> + break;
> + case KVM_SEV_SNP_SET_CERTS:
> + r = snp_set_instance_certs(kvm, &sev_cmd);
> + break;
> default:
> r = -EINVAL;
> goto out;
> @@ -3550,8 +3637,28 @@ static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gp
> if (rc)
> goto unlock;
>
> - rc = snp_guest_ext_guest_request(&req, (unsigned long)sev->snp_certs_data,
> - &data_npages, &err);
> + /*
> + * If the VMM has overridden the certs, then change the error message
> + * if the size is inappropriate for the override. Otherwise, use a
> + * regular guest request and copy back the instance certs.
> + */
> + if (sev->snp_certs_len) {
> + if ((data_npages << PAGE_SHIFT) < sev->snp_certs_len) {
> + rc = -EINVAL;
> + err = SNP_GUEST_REQ_INVALID_LEN;
> + goto datalen;
> + }
> + rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &req,
> + (int *)&err);
> + } else {
> + rc = snp_guest_ext_guest_request(&req,
> + (unsigned long)sev->snp_certs_data,
> + &data_npages, &err);
> + }
> +datalen:
> + if (sev->snp_certs_len)
> + data_npages = sev->snp_certs_len >> PAGE_SHIFT;
> +
> if (rc) {
> /*
> * If buffer length is small then return the expected
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 221b38d3c845..dced46559508 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -94,6 +94,7 @@ struct kvm_sev_info {
> u64 snp_init_flags;
> void *snp_context; /* SNP guest context page */
> void *snp_certs_data;
> + unsigned int snp_certs_len; /* Size of instance override for certs */
> struct mutex guest_req_lock; /* Lock for guest request handling */
>
> u64 sev_features; /* Features set at VMSA creation */
> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
> index 92116e2b74fd..3b28b78938f6 100644
> --- a/include/linux/psp-sev.h
> +++ b/include/linux/psp-sev.h
> @@ -22,7 +22,7 @@
> #define __psp_pa(x) __pa(x)
> #endif
>
> -#define SEV_FW_BLOB_MAX_SIZE 0x4000 /* 16KB */
> +#define SEV_FW_BLOB_MAX_SIZE 0x5000 /* 20KB */
>
> /**
> * SEV platform state
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6e684bf5f723..ad7e24e43547 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1928,6 +1928,8 @@ enum sev_cmd_id {
> KVM_SEV_SNP_LAUNCH_START,
> KVM_SEV_SNP_LAUNCH_UPDATE,
> KVM_SEV_SNP_LAUNCH_FINISH,
> + KVM_SEV_SNP_GET_CERTS,
> + KVM_SEV_SNP_SET_CERTS,
>
> KVM_SEV_NR_MAX,
> };
> @@ -2075,6 +2077,16 @@ struct kvm_sev_snp_launch_finish {
> __u8 pad[6];
> };
>
> +struct kvm_sev_snp_get_certs {
> + __u64 certs_uaddr;
> + __u64 certs_len;
> +};
> +
> +struct kvm_sev_snp_set_certs {
> + __u64 certs_uaddr;
> + __u64 certs_len;
> +};
> +
> #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
> #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
> #define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
>> +static int snp_reclaim_pages(unsigned long paddr, unsigned int npages, bool locked)
>> +{
>> + /* Cbit maybe set in the paddr */
>
> This is confusing.
>
> I suppose C-bit is treated as a attribute of PTE in the kernel not part of the
> PA. It means only a PTE might carry a C-bit.
>
snp_reclaim_pages() is also called for reclaiming guest memory, in which
case the (guest) paddr will have the C-bit set. Hence this C-bit
handling is done within snp_reclaim_pages() so that the callers don't
need to handle it explicitly.
> The paddr is from __pa(page_address()). It is not extracted from a PTE. Thus, the
> return from them should never have a C-bit.
>
> BTW: Wouldn't it be better to have pfn as input param instead of paddr?
>
> The caller has struct page, calling snp_reclaim_pages(page_to_pfn(page), xxxxx)
> would be much clearer than the current conversion:
> page_address() (struct page is converted to VA), __pa() (VA is converted to PA)
> in the caller and then PA is converted to pfn here.
>
>> + unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT;
>> + int ret, err, i, n = 0;
>> +
>
> should be unsigned int i, n; as the input param npage is unsigned int.
>
>> + if (!pfn_valid(pfn)) {
>> + pr_err("%s: Invalid PFN %lx\n", __func__, pfn);
>> + return 0;
>> + }
>> +
>> + for (i = 0; i < npages; i++, pfn++, n++) {
>> + paddr = pfn << PAGE_SHIFT;
>> +
>> + if (locked)
>> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &paddr, &err);
>> + else
>> + ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &paddr, &err);
>> +
>> + if (ret)
>> + goto cleanup;
>> +
>> + ret = rmp_make_shared(pfn, PG_LEVEL_4K);
>> + if (ret)
>> + goto cleanup;
>> + }
>> +
>> + return 0;
>> +
>> +cleanup:
>> + /*
>> + * If failed to reclaim the page then page is no longer safe to
>> + * be release back to the system, leak it.
>> + */
>> + snp_mark_pages_offline(pfn, npages - n);
>> + return ret;
>> +}
>> +
>> +static int rmp_mark_pages_firmware(unsigned long paddr, unsigned int npages, bool locked)
>
> The same comment as above. Better take pfn or page instead of paddr with
> redundant conversions.
>
Again, the paddr can point to guest memory so it can have C-bit set.
Thanks,
Ashish
>> +{
>> + /* Cbit maybe set in the paddr */
>> + unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT;
>> + int rc, n = 0, i;
>> +
>> + for (i = 0; i < npages; i++, n++, pfn++) {
>> + rc = rmp_make_private(pfn, 0, PG_LEVEL_4K, 0, true);
>> + if (rc)
>> + goto cleanup;
>> + }
>> +
>> + return 0;
>> +
>> +cleanup:
>> + /*
>> + * Try unrolling the firmware state changes by
>> + * reclaiming the pages which were already changed to the
>> + * firmware state.
>> + */
>> + snp_reclaim_pages(paddr, n, locked);
>> +
>> + return rc;
>> +}
>> +
On Tue, 21 Feb 2023 09:31:01 -0600
"Kalra, Ashish" <[email protected]> wrote:
> >> +static int snp_reclaim_pages(unsigned long paddr, unsigned int npages, bool locked)
> >> +{
> >> + /* Cbit maybe set in the paddr */
> >
> > This is confusing.
> >
> > I suppose C-bit is treated as a attribute of PTE in the kernel not part of the
> > PA. It means only a PTE might carry a C-bit.
> >
>
> snp_reclaim_pages() is also called for reclaiming guest memory, in which
> case the (guest) paddr will have the C-bit set. Hence this C-bit
> handling is done within snp_reclaim_pages() so that the callers don't
> need to handle it explicitly.
Thanks for the explanation.
Do you mean it will be used like that in a later patch? Sorry if it is; I have
been making progress through the series slowly, as it is quite a big patch set.
At least, I don't see that kind of usage in the current patch. Feel free to
correct me if I am wrong.
The call chains:
__snp_free_firmware_page()
snp_reclaim_pages();
As __snp_free_firmware_page() takes a struct page *, none of the following
conversions from it would carry the C-bit.
__snp_alloc_firmware_pages()
rmp_mark_pages_firmware()
snp_reclaim_pages()
As __snp_alloc_firmware_page() allocates the page and works with a struct
page *, the same conclusion as above applies.
>
>
> > The paddr is from __pa(page_address()). It is not extracted from a PTE. Thus, the
> > return from them should never have a C-bit.
> >
> > BTW: Wouldn't it be better to have pfn as input param instead of paddr?
> >
> > The caller has struct page, calling snp_reclaim_pages(page_to_pfn(page), xxxxx)
> > would be much clearer than the current conversion:
> > page_address() (struct page is converted to VA), __pa() (VA is converted to PA)
> > in the caller and then PA is converted to pfn here.
> >
> >> + unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT;
> >> + int ret, err, i, n = 0;
> >> +
> >
> > should be unsigned int i, n; as the input param npage is unsigned int.
> >
> >> + if (!pfn_valid(pfn)) {
> >> + pr_err("%s: Invalid PFN %lx\n", __func__, pfn);
> >> + return 0;
> >> + }
> >> +
> >> + for (i = 0; i < npages; i++, pfn++, n++) {
> >> + paddr = pfn << PAGE_SHIFT;
> >> +
> >> + if (locked)
> >> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &paddr, &err);
> >> + else
> >> + ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &paddr, &err);
> >> +
> >> + if (ret)
> >> + goto cleanup;
> >> +
> >> + ret = rmp_make_shared(pfn, PG_LEVEL_4K);
> >> + if (ret)
> >> + goto cleanup;
> >> + }
> >> +
> >> + return 0;
> >> +
> >> +cleanup:
> >> + /*
> >> + * If failed to reclaim the page then page is no longer safe to
> >> + * be release back to the system, leak it.
> >> + */
> >> + snp_mark_pages_offline(pfn, npages - n);
> >> + return ret;
> >> +}
> >> +
> >> +static int rmp_mark_pages_firmware(unsigned long paddr, unsigned int npages, bool locked)
> >
> > The same comment as above. Better take pfn or page instead of paddr with
> > redundant conversions.
> >
>
> Again, the paddr can point to guest memory so it can have C-bit set.
>
> Thanks,
> Ashish
>
> >> +{
> >> + /* Cbit maybe set in the paddr */
> >> + unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT;
> >> + int rc, n = 0, i;
> >> +
> >> + for (i = 0; i < npages; i++, n++, pfn++) {
> >> + rc = rmp_make_private(pfn, 0, PG_LEVEL_4K, 0, true);
> >> + if (rc)
> >> + goto cleanup;
> >> + }
> >> +
> >> + return 0;
> >> +
> >> +cleanup:
> >> + /*
> >> + * Try unrolling the firmware state changes by
> >> + * reclaiming the pages which were already changed to the
> >> + * firmware state.
> >> + */
> >> + snp_reclaim_pages(paddr, n, locked);
> >> +
> >> + return rc;
> >> +}
> >> +
On 2/21/2023 3:15 PM, Zhi Wang wrote:
> On Tue, 21 Feb 2023 09:31:01 -0600
> "Kalra, Ashish" <[email protected]> wrote:
>
>>>> +static int snp_reclaim_pages(unsigned long paddr, unsigned int npages, bool locked)
>>>> +{
>>>> + /* Cbit maybe set in the paddr */
>>>
>>> This is confusing.
>>>
>>> I suppose C-bit is treated as a attribute of PTE in the kernel not part of the
>>> PA. It means only a PTE might carry a C-bit.
>>>
>>
>> snp_reclaim_pages() is also called for reclaiming guest memory, in which
>> case the (guest) paddr will have the C-bit set. Hence this C-bit
>> handling is done within snp_reclaim_pages() so that the callers don't
>> need to handle it explicitly.
>
> Thanks for the explanation.
>
> Do you mean it will be used like that in the later patch? Sorry if it is in the
> later patch as I was making progress slowly. It is quite a big patch set.
>
Yes, these are callers in later patches, like the following code path in
patch 25:
static int unmap_firmware_writeable(u64 *paddr, u32 len, bool guest,
struct snp_host_map *map)
{
unsigned int npages = PAGE_ALIGN(len) >> PAGE_SHIFT;
...
/* If paddr points to a guest memory then restore the page
state to hypervisor. */
if (guest) {
if (snp_reclaim_pages(*paddr, npages, true))
return -EFAULT;
goto done;
}
...
...
Or, the following as part of patch 52:
int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn,
int *error)
{
...
data.gctx_paddr = sme_me_mask | (gctx_pfn << PAGE_SHIFT);
data.src_addr = sme_me_mask | (src_pfn << PAGE_SHIFT);
data.dst_addr = sme_me_mask | (dst_pfn << PAGE_SHIFT);
/* The destination page must be in the firmware state. */
if (rmp_mark_pages_firmware(data.dst_addr, 1, false))
return -EIO;
ret = sev_do_cmd(SEV_CMD_SNP_DBG_DECRYPT, &data, error);
/* Restore the page state */
if (snp_reclaim_pages(data.dst_addr, 1, false))
...
...
Thanks,
Ashish
> At least, I don't see that kind of usage in the current patch. Feel free to
> correct me if I am wrong.
>
> The call chains:
>
> __snp_free_firmware_page()
> snp_reclaim_pages();
>
> As __snp_free_firmware_page() takes struct page*, all the follwing coversion
> from it would not carry C-bit.
>
> __snp_alloc_firmware_pages()
> rmp_mark_pages_firmware()
> snp_reclaim_pages()
>
> As __snp_alloc_firmware_page() allocates page with struct page*, the same
> conclusion as above.
>
>>
>>
>>> The paddr is from __pa(page_address()). It is not extracted from a PTE. Thus, the
>>> return from them should never have a C-bit.
>>>
>>> BTW: Wouldn't it be better to have pfn as input param instead of paddr?
>>>
>>> The caller has struct page, calling snp_reclaim_pages(page_to_pfn(page), xxxxx)
>>> would be much clearer than the current conversion:
>>> page_address() (struct page is converted to VA), __pa() (VA is converted to PA)
>>> in the caller and then PA is converted to pfn here.
>>>
>>>> + unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT;
>>>> + int ret, err, i, n = 0;
>>>> +
>>>
>>> should be unsigned int i, n; as the input param npage is unsigned int.
>>>
>>>> + if (!pfn_valid(pfn)) {
>>>> + pr_err("%s: Invalid PFN %lx\n", __func__, pfn);
>>>> + return 0;
>>>> + }
>>>> +
>>>> + for (i = 0; i < npages; i++, pfn++, n++) {
>>>> + paddr = pfn << PAGE_SHIFT;
>>>> +
>>>> + if (locked)
>>>> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &paddr, &err);
>>>> + else
>>>> + ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &paddr, &err);
>>>> +
>>>> + if (ret)
>>>> + goto cleanup;
>>>> +
>>>> + ret = rmp_make_shared(pfn, PG_LEVEL_4K);
>>>> + if (ret)
>>>> + goto cleanup;
>>>> + }
>>>> +
>>>> + return 0;
>>>> +
>>>> +cleanup:
>>>> + /*
>>>> + * If failed to reclaim the page then page is no longer safe to
>>>> + * be release back to the system, leak it.
>>>> + */
>>>> + snp_mark_pages_offline(pfn, npages - n);
>>>> + return ret;
>>>> +}
>>>> +
>>>> +static int rmp_mark_pages_firmware(unsigned long paddr, unsigned int npages, bool locked)
>>>
>>> The same comment as above. Better take pfn or page instead of paddr with
>>> redundant conversions.
>>>
>>
>> Again, the paddr can point to guest memory so it can have C-bit set.
>>
>> Thanks,
>> Ashish
>>
>>>> +{
>>>> + /* Cbit maybe set in the paddr */
>>>> + unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT;
>>>> + int rc, n = 0, i;
>>>> +
>>>> + for (i = 0; i < npages; i++, n++, pfn++) {
>>>> + rc = rmp_make_private(pfn, 0, PG_LEVEL_4K, 0, true);
>>>> + if (rc)
>>>> + goto cleanup;
>>>> + }
>>>> +
>>>> + return 0;
>>>> +
>>>> +cleanup:
>>>> + /*
>>>> + * Try unrolling the firmware state changes by
>>>> + * reclaiming the pages which were already changed to the
>>>> + * firmware state.
>>>> + */
>>>> + snp_reclaim_pages(paddr, n, locked);
>>>> +
>>>> + return rc;
>>>> +}
>>>> +
>
On Mon, 20 Feb 2023 12:38:18 -0600
Michael Roth <[email protected]> wrote:
> From: Brijesh Singh <[email protected]>
>
> The SEV-SNP firmware provides the SNP_CONFIG command used to set the
> system-wide configuration value for SNP guests. The information includes
> the TCB version string to be reported in guest attestation reports.
>
> Version 2 of the GHCB specification adds an NAE (SNP extended guest
> request) that a guest can use to query the reports that include additional
> certificates.
>
> In both cases, userspace provided additional data is included in the
> attestation reports. The userspace will use the SNP_SET_EXT_CONFIG
> command to give the certificate blob and the reported TCB version string
> at once. Note that the specification defines certificate blob with a
> specific GUID format; the userspace is responsible for building the
> proper certificate blob. The ioctl treats it an opaque blob.
>
> While it is not defined in the spec, but let's add SNP_GET_EXT_CONFIG
> command that can be used to obtain the data programmed through the
> SNP_SET_EXT_CONFIG.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> Documentation/virt/coco/sev-guest.rst | 27 ++++++
> drivers/crypto/ccp/sev-dev.c | 123 ++++++++++++++++++++++++++
> drivers/crypto/ccp/sev-dev.h | 4 +
> include/uapi/linux/psp-sev.h | 17 ++++
> 4 files changed, 171 insertions(+)
>
> diff --git a/Documentation/virt/coco/sev-guest.rst b/Documentation/virt/coco/sev-guest.rst
> index 11ea67c944df..6cad4226c348 100644
> --- a/Documentation/virt/coco/sev-guest.rst
> +++ b/Documentation/virt/coco/sev-guest.rst
> @@ -145,6 +145,33 @@ The SNP_PLATFORM_STATUS command is used to query the SNP platform status. The
> status includes API major, minor version and more. See the SEV-SNP
> specification for further details.
>
> +2.5 SNP_SET_EXT_CONFIG
> +----------------------
> +:Technology: sev-snp
> +:Type: hypervisor ioctl cmd
> +:Parameters (in): struct sev_data_snp_ext_config
> +:Returns (out): 0 on success, -negative on error
> +
> +The SNP_SET_EXT_CONFIG is used to set the system-wide configuration such as
> +reported TCB version in the attestation report. The command is similar to
> +SNP_CONFIG command defined in the SEV-SNP spec. The main difference is the
> +command also accepts an additional certificate blob defined in the GHCB
> +specification.
> +
> +If the certs_address is zero, then the previous certificate blob will be deleted.
> +For more information on the certificate blob layout, see the GHCB spec
> +(extended guest request message).
> +
> +2.6 SNP_GET_EXT_CONFIG
> +----------------------
> +:Technology: sev-snp
> +:Type: hypervisor ioctl cmd
> +:Parameters (in): struct sev_data_snp_ext_config
> +:Returns (out): 0 on success, -negative on error
> +
> +The SNP_GET_EXT_CONFIG is used to query the system-wide configuration set
> +through the SNP_SET_EXT_CONFIG.
> +
> 3. SEV-SNP CPUID Enforcement
> ============================
>
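As an aside, since the new commands go through the existing /dev/sev
SEV_ISSUE_CMD interface, a short usage sketch in this document might save
readers a trip to the header. Something like the following (hypothetical and
untested; the struct and command names are the ones added by this patch):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/psp-sev.h>

static int snp_set_ext_config(int sev_fd, void *certs, uint32_t certs_len,
			      struct sev_user_data_snp_config *config)
{
	struct sev_user_data_ext_snp_config ext = {
		.config_address = (uint64_t)(uintptr_t)config,
		.certs_address  = (uint64_t)(uintptr_t)certs,
		/* must be non-zero and page-aligned per this patch */
		.certs_len      = certs_len,
	};
	struct sev_issue_cmd cmd = {
		.cmd  = SNP_SET_EXT_CONFIG,
		.data = (uint64_t)(uintptr_t)&ext,
	};

	/* sev_fd is an open fd for /dev/sev */
	return ioctl(sev_fd, SEV_ISSUE_CMD, &cmd);
}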
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index 65e13a562f3b..b56b00ca2cd4 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -1481,6 +1481,10 @@ static int __sev_snp_shutdown_locked(int *error)
> data.length = sizeof(data);
> data.iommu_snp_shutdown = 1;
>
> + /* Free the memory used for caching the certificate data */
> + kfree(sev->snp_certs_data);
> + sev->snp_certs_data = NULL;
> +
> wbinvd_on_all_cpus();
>
> retry:
> @@ -1793,6 +1797,118 @@ static int sev_ioctl_snp_platform_status(struct sev_issue_cmd *argp)
> return ret;
> }
>
> +static int sev_ioctl_snp_get_config(struct sev_issue_cmd *argp)
> +{
> + struct sev_device *sev = psp_master->sev_data;
> + struct sev_user_data_ext_snp_config input;
> + int ret;
> +
> + if (!sev->snp_initialized || !argp->data)
> + return -EINVAL;
> +
> + memset(&input, 0, sizeof(input));
> +
> + if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
> + return -EFAULT;
> +
> + /* Copy the TCB version programmed through the SET_CONFIG to userspace */
> + if (input.config_address) {
> + if (copy_to_user((void * __user)input.config_address,
> + &sev->snp_config, sizeof(struct sev_user_data_snp_config)))
> + return -EFAULT;
> + }
> +
> + /* Copy the extended certs programmed through the SNP_SET_CONFIG */
> + if (input.certs_address && sev->snp_certs_data) {
> + if (input.certs_len < sev->snp_certs_len) {
> + /* Return the certs length to userspace */
> + input.certs_len = sev->snp_certs_len;
> +
> + ret = -ENOSR;
> + goto e_done;
> + }
> +
What if input.certs_len > sev->snp_certs_len? Is it possible for userspace to
know the length of the data in the buffer? (I guess it might be able to
determine the certs length from the blob data itself, but a comment here would
be nice.)
> + if (copy_to_user((void * __user)input.certs_address,
> + sev->snp_certs_data, sev->snp_certs_len))
> + return -EFAULT;
> + }
> +
> + ret = 0;
> +
> +e_done:
> + if (copy_to_user((void __user *)argp->data, &input, sizeof(input)))
> + ret = -EFAULT;
> +
> + return ret;
> +}
> +
> +static int sev_ioctl_snp_set_config(struct sev_issue_cmd *argp, bool writable)
> +{
> + struct sev_device *sev = psp_master->sev_data;
> + struct sev_user_data_ext_snp_config input;
> + struct sev_user_data_snp_config config;
> + void *certs = NULL;
> + int ret = 0;
> +
> + if (!sev->snp_initialized || !argp->data)
> + return -EINVAL;
> +
> + if (!writable)
> + return -EPERM;
> +
> + memset(&input, 0, sizeof(input));
> +
> + if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
> + return -EFAULT;
> +
> + /* Copy the certs from userspace */
> + if (input.certs_address) {
> + if (!input.certs_len || !IS_ALIGNED(input.certs_len, PAGE_SIZE))
> + return -EINVAL;
> +
> + certs = psp_copy_user_blob(input.certs_address, input.certs_len);
> + if (IS_ERR(certs))
> + return PTR_ERR(certs);
> + }
> +
> + /* Issue the PSP command to update the TCB version using the SNP_CONFIG. */
> + if (input.config_address) {
> + memset(&config, 0, sizeof(config));
> + if (copy_from_user(&config,
> + (void __user *)input.config_address, sizeof(config))) {
> + ret = -EFAULT;
> + goto e_free;
> + }
> +
> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_CONFIG, &config, &argp->error);
> + if (ret)
> + goto e_free;
> +
> + memcpy(&sev->snp_config, &config, sizeof(config));
> + }
> +
> + /*
> + * If the new certs are passed then cache it else free the old certs.
> + */
> + mutex_lock(&sev->snp_certs_lock);
> + if (certs) {
> + kfree(sev->snp_certs_data);
> + sev->snp_certs_data = certs;
> + sev->snp_certs_len = input.certs_len;
> + } else {
> + kfree(sev->snp_certs_data);
> + sev->snp_certs_data = NULL;
> + sev->snp_certs_len = 0;
> + }
> + mutex_unlock(&sev->snp_certs_lock);
> +
> + return 0;
> +
> +e_free:
> + kfree(certs);
> + return ret;
> +}
> +
> static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
> {
> void __user *argp = (void __user *)arg;
> @@ -1847,6 +1963,12 @@ static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
> case SNP_PLATFORM_STATUS:
> ret = sev_ioctl_snp_platform_status(&input);
> break;
> + case SNP_SET_EXT_CONFIG:
> + ret = sev_ioctl_snp_set_config(&input, writable);
> + break;
> + case SNP_GET_EXT_CONFIG:
> + ret = sev_ioctl_snp_get_config(&input);
> + break;
> default:
> ret = -EINVAL;
> goto out;
> @@ -1962,6 +2084,7 @@ int sev_dev_init(struct psp_device *psp)
> goto e_sev;
>
> sev->cmd_buf_backup = (uint8_t *)sev->cmd_buf + PAGE_SIZE;
> + mutex_init(&sev->snp_certs_lock);
>
> psp->sev_data = sev;
>
> diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
> index 19d79f9d4212..41d5353d5bab 100644
> --- a/drivers/crypto/ccp/sev-dev.h
> +++ b/drivers/crypto/ccp/sev-dev.h
> @@ -66,6 +66,10 @@ struct sev_device {
>
> bool snp_initialized;
> struct snp_host_map snp_host_map[MAX_SNP_HOST_MAP_BUFS];
> + void *snp_certs_data;
> + u32 snp_certs_len;
> + struct mutex snp_certs_lock;
> + struct sev_user_data_snp_config snp_config;
> };
>
> int sev_dev_init(struct psp_device *psp);
> diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
> index 5adfaea7df97..c20d37586d21 100644
> --- a/include/uapi/linux/psp-sev.h
> +++ b/include/uapi/linux/psp-sev.h
> @@ -29,6 +29,8 @@ enum {
> SEV_GET_ID, /* This command is deprecated, use SEV_GET_ID2 */
> SEV_GET_ID2,
> SNP_PLATFORM_STATUS,
> + SNP_SET_EXT_CONFIG,
> + SNP_GET_EXT_CONFIG,
>
> SEV_MAX,
> };
> @@ -192,6 +194,21 @@ struct sev_user_data_snp_config {
> __u8 rsvd1[52];
> } __packed;
>
> +/**
> + * struct sev_data_snp_ext_config - system wide configuration value for SNP.
> + *
> + * @config_address: address of the struct sev_user_data_snp_config or 0 when
> + * reported_tcb does not need to be updated.
> + * @certs_address: address of extended guest request certificate chain or
> + * 0 when previous certificate should be removed on SNP_SET_EXT_CONFIG.
> + * @certs_len: length of the certs
> + */
> +struct sev_user_data_ext_snp_config {
> + __u64 config_address; /* In */
> + __u64 certs_address; /* In */
> + __u32 certs_len; /* In */
> +};
> +
> /**
> * struct sev_issue_cmd - SEV ioctl parameters
> *
On 2/22/23 06:32, Zhi Wang wrote:
> On Mon, 20 Feb 2023 12:38:18 -0600
> Michael Roth <[email protected]> wrote:
>
>> From: Brijesh Singh <[email protected]>
>>
>> The SEV-SNP firmware provides the SNP_CONFIG command used to set the
>> system-wide configuration value for SNP guests. The information includes
>> the TCB version string to be reported in guest attestation reports.
>>
>> Version 2 of the GHCB specification adds an NAE (SNP extended guest
>> request) that a guest can use to query the reports that include additional
>> certificates.
>>
>> In both cases, userspace provided additional data is included in the
>> attestation reports. The userspace will use the SNP_SET_EXT_CONFIG
>> command to give the certificate blob and the reported TCB version string
>> at once. Note that the specification defines certificate blob with a
>> specific GUID format; the userspace is responsible for building the
>> proper certificate blob. The ioctl treats it an opaque blob.
>>
>> While it is not defined in the spec, but let's add SNP_GET_EXT_CONFIG
>> command that can be used to obtain the data programmed through the
>> SNP_SET_EXT_CONFIG.
>>
>> Signed-off-by: Brijesh Singh <[email protected]>
>> Signed-off-by: Ashish Kalra <[email protected]>
>> Signed-off-by: Michael Roth <[email protected]>
>> ---
>> Documentation/virt/coco/sev-guest.rst | 27 ++++++
>> drivers/crypto/ccp/sev-dev.c | 123 ++++++++++++++++++++++++++
>> drivers/crypto/ccp/sev-dev.h | 4 +
>> include/uapi/linux/psp-sev.h | 17 ++++
>> 4 files changed, 171 insertions(+)
>>
>> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
>> index 65e13a562f3b..b56b00ca2cd4 100644
>> --- a/drivers/crypto/ccp/sev-dev.c
>> +++ b/drivers/crypto/ccp/sev-dev.c
>> @@ -1481,6 +1481,10 @@ static int __sev_snp_shutdown_locked(int *error)
>> data.length = sizeof(data);
>> data.iommu_snp_shutdown = 1;
>>
>> + /* Free the memory used for caching the certificate data */
>> + kfree(sev->snp_certs_data);
>> + sev->snp_certs_data = NULL;
>> +
>> wbinvd_on_all_cpus();
>>
>> retry:
>> @@ -1793,6 +1797,118 @@ static int sev_ioctl_snp_platform_status(struct sev_issue_cmd *argp)
>> return ret;
>> }
>>
>> +static int sev_ioctl_snp_get_config(struct sev_issue_cmd *argp)
>> +{
>> + struct sev_device *sev = psp_master->sev_data;
>> + struct sev_user_data_ext_snp_config input;
>> + int ret;
>> +
>> + if (!sev->snp_initialized || !argp->data)
>> + return -EINVAL;
>> +
>> + memset(&input, 0, sizeof(input));
>> +
>> + if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
>> + return -EFAULT;
>> +
>> + /* Copy the TCB version programmed through the SET_CONFIG to userspace */
>> + if (input.config_address) {
>> + if (copy_to_user((void * __user)input.config_address,
>> + &sev->snp_config, sizeof(struct sev_user_data_snp_config)))
>> + return -EFAULT;
>> + }
>> +
>> + /* Copy the extended certs programmed through the SNP_SET_CONFIG */
>> + if (input.certs_address && sev->snp_certs_data) {
>> + if (input.certs_len < sev->snp_certs_len) {
>> + /* Return the certs length to userspace */
>> + input.certs_len = sev->snp_certs_len;
>> +
>> + ret = -ENOSR;
We should be consistent with the other SEV ioctls that return required
lengths and return -EIO here instead of -ENOSR.
Thanks,
Tom
>> + goto e_done;
>> + }
>> +
>
> What about if input.certs_len > sev->snp_certs_len? Is it possbile for the
> userspace to know the length of data in the buffer? (I guess it might be able
> to know the certs len through the blob data, but a comment here would be nice)
>
>> + if (copy_to_user((void * __user)input.certs_address,
>> + sev->snp_certs_data, sev->snp_certs_len))
>> + return -EFAULT;
>> + }
>> +
>> + ret = 0;
>> +
>> +e_done:
>> + if (copy_to_user((void __user *)argp->data, &input, sizeof(input)))
>> + ret = -EFAULT;
>> +
>> + return ret;
>> +}
On Mon, 20 Feb 2023 12:38:19 -0600
Michael Roth <[email protected]> wrote:
It seems from the discussion at:
https://lore.kernel.org/lkml/[email protected]/,
this API is going to be removed. Will that change land in this patch series or not?
If not, it would be better to mention it in the commit message of this patch
or of patch 45.
If yes, I guess this patch is not needed.
> From: Brijesh Singh <[email protected]>
>
> Version 2 of the GHCB specification defines VMGEXIT that is used to get
> the extended attestation report. The extended attestation report includes
> the certificate blobs provided through the SNP_SET_EXT_CONFIG.
>
> The snp_guest_ext_guest_request() will be used by the hypervisor to get
> the extended attestation report. See the GHCB specification for more
> details.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> drivers/crypto/ccp/sev-dev.c | 47 ++++++++++++++++++++++++++++++++++++
> include/linux/psp-sev.h | 33 +++++++++++++++++++++++++
> 2 files changed, 80 insertions(+)
>
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index b56b00ca2cd4..e65563bc8298 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -2017,6 +2017,53 @@ int sev_guest_df_flush(int *error)
> }
> EXPORT_SYMBOL_GPL(sev_guest_df_flush);
>
> +int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
> + unsigned long vaddr, unsigned long *npages, unsigned long *fw_err)
> +{
> + unsigned long expected_npages;
> + struct sev_device *sev;
> + int rc;
> +
> + if (!psp_master || !psp_master->sev_data)
> + return -ENODEV;
> +
> + sev = psp_master->sev_data;
> +
> + if (!sev->snp_initialized)
> + return -EINVAL;
> +
> + mutex_lock(&sev->snp_certs_lock);
> + /*
> + * Check if there is enough space to copy the certificate chain. Otherwise
> + * return ERROR code defined in the GHCB specification.
> + */
> + expected_npages = sev->snp_certs_len >> PAGE_SHIFT;
> + if (*npages < expected_npages) {
> + *npages = expected_npages;
> + *fw_err = SNP_GUEST_REQ_INVALID_LEN;
> + mutex_unlock(&sev->snp_certs_lock);
> + return -EINVAL;
> + }
> +
> + rc = sev_do_cmd(SEV_CMD_SNP_GUEST_REQUEST, data, (int *)fw_err);
> + if (rc) {
> + mutex_unlock(&sev->snp_certs_lock);
> + return rc;
> + }
> +
> + /* Copy the certificate blob */
> + if (sev->snp_certs_data) {
> + *npages = expected_npages;
> + memcpy((void *)vaddr, sev->snp_certs_data, *npages << PAGE_SHIFT);
> + } else {
> + *npages = 0;
> + }
> +
> + mutex_unlock(&sev->snp_certs_lock);
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_ext_guest_request);
> +
> static void sev_exit(struct kref *ref)
> {
> misc_deregister(&misc_dev->misc);
> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
> index d19744807471..81bafc049eca 100644
> --- a/include/linux/psp-sev.h
> +++ b/include/linux/psp-sev.h
> @@ -931,6 +931,32 @@ void snp_free_firmware_page(void *addr);
> */
> void snp_mark_pages_offline(unsigned long pfn, unsigned int npages);
>
> +/**
> + * snp_guest_ext_guest_request - perform the SNP extended guest request command
> + * defined in the GHCB specification.
> + *
> + * @data: the input guest request structure
> + * @vaddr: address where the certificate blob need to be copied.
> + * @npages: number of pages for the certificate blob.
> + * If the specified page count is less than the certificate blob size, then the
> + * required page count is returned with error code defined in the GHCB spec.
> + * If the specified page count is more than the certificate blob size, then
> + * page count is updated to reflect the amount of valid data copied in the
> + * vaddr.
> + *
> + * @sev_ret: sev command return code
> + *
> + * Returns:
> + * 0 if the sev successfully processed the command
> + * -%ENODEV if the sev device is not available
> + * -%ENOTSUPP if the sev does not support SEV
> + * -%ETIMEDOUT if the sev command timed out
> + * -%EIO if the sev returned a non-zero return code
> + */
> +int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
> + unsigned long vaddr, unsigned long *npages,
> + unsigned long *error);
> +
> #else /* !CONFIG_CRYPTO_DEV_SP_PSP */
>
> static inline int
> @@ -968,6 +994,13 @@ static inline void *snp_alloc_firmware_page(gfp_t mask)
>
> static inline void snp_free_firmware_page(void *addr) { }
>
> +static inline int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
> + unsigned long vaddr, unsigned long *n,
> + unsigned long *error)
> +{
> + return -ENODEV;
> +}
> +
> #endif /* CONFIG_CRYPTO_DEV_SP_PSP */
>
> #endif /* __PSP_SEV_H__ */
On Mon, 20 Feb 2023 12:38:22 -0600
Michael Roth <[email protected]> wrote:
> From: Brijesh Singh <[email protected]>
>
> Implement a workaround for an SNP erratum where the CPU will incorrectly
> signal an RMP violation #PF if a hugepage (2mb or 1gb) collides with the
> RMP entry of a VMCB, VMSA or AVIC backing page.
>
> When SEV-SNP is globally enabled, the CPU marks the VMCB, VMSA, and AVIC
> backing pages as "in-use" in the RMP after a successful VMRUN. This
Is this "in-use" bit part of an RMP entry? If yes, better list its name
in APM.
> is done for _all_ VMs, not just SNP-Active VMs.
_All_ VMs? Do you mean SEV VMs and SEV-SNP VMs? I guess legacy VMs are not
affected, right?
>
> If the hypervisor accesses an in-use page through a writable
> translation, the CPU will throw an RMP violation #PF. On early SNP
> hardware, if an in-use page is 2mb aligned and software accesses any
> part of the associated 2mb region with a hupage, the CPU will
^hugepage
> incorrectly treat the entire 2mb region as in-use and signal a spurious
> RMP violation #PF.
>
> The recommended is to not use the hugepage for the VMCB, VMSA or
> AVIC backing page. Add a generic allocator that will ensure that the
> page returns is not hugepage (2mb or 1gb) and is safe to be used when
> SEV-SNP is enabled.
>
> Co-developed-by: Marc Orr <[email protected]>
> Signed-off-by: Marc Orr <[email protected]>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/lapic.c | 5 ++++-
> arch/x86/kvm/svm/sev.c | 33 ++++++++++++++++++++++++++++++
> arch/x86/kvm/svm/svm.c | 15 ++++++++++++--
> arch/x86/kvm/svm/svm.h | 1 +
> 6 files changed, 54 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 6a885f024a00..e116405cbb5f 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -131,6 +131,7 @@ KVM_X86_OP(msr_filter_changed)
> KVM_X86_OP(complete_emulated_msr)
> KVM_X86_OP(vcpu_deliver_sipi_vector)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> +KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
> KVM_X86_OP_OPTIONAL_RET0(fault_is_private);
> KVM_X86_OP_OPTIONAL_RET0(update_mem_attr)
> KVM_X86_OP_OPTIONAL(invalidate_restricted_mem)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 37c92412035f..a9363a6f779d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1729,6 +1729,8 @@ struct kvm_x86_ops {
> * Returns vCPU specific APICv inhibit reasons
> */
> unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
> +
> + void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 80f92cbc4029..72e46d5b4201 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2740,7 +2740,10 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
>
> vcpu->arch.apic = apic;
>
> - apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
> + if (kvm_x86_ops.alloc_apic_backing_page)
> + apic->regs = static_call(kvm_x86_alloc_apic_backing_page)(vcpu);
> + else
> + apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
> if (!apic->regs) {
> printk(KERN_ERR "malloc apic regs error for vcpu %x\n",
> vcpu->vcpu_id);
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index c1f0d4898ce3..9e9efb42a766 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3241,3 +3241,36 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
> break;
> }
> }
> +
> +struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
> +{
> + unsigned long pfn;
> + struct page *p;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> + return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> +
> + /*
> + * Allocate an SNP safe page to workaround the SNP erratum where
> + * the CPU will incorrectly signal an RMP violation #PF if a
> + * hugepage (2mb or 1gb) collides with the RMP entry of VMCB, VMSA
> + * or AVIC backing page. The recommeded workaround is to not use the
> + * hugepage.
> + *
> + * Allocate one extra page, use a page which is not 2mb aligned
> + * and free the other.
> + */
> + p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
> + if (!p)
> + return NULL;
> +
> + split_page(p, 1);
> +
> + pfn = page_to_pfn(p);
> + if (IS_ALIGNED(pfn, PTRS_PER_PMD))
> + __free_page(p++);
> + else
> + __free_page(p + 1);
> +
> + return p;
> +}
The duplicate allocation routine in snp_alloc_vmsa_page() in sev.c can
be replaced with snp_safe_alloc_page().
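A rough sketch of that consolidation (assumptions: snp_safe_alloc_page()
would need to be reachable from where snp_alloc_vmsa_page() lives, e.g. via an
export or a move to common code; the signature shown below is illustrative;
the vcpu argument is unused by the allocator, hence NULL):

        /* Hypothetical consolidation: reuse the SNP-safe allocator instead of
         * duplicating the "allocate two pages, keep the non-2MB-aligned one"
         * logic. */
        static void *snp_alloc_vmsa_page(void)
        {
                struct page *p = snp_safe_alloc_page(NULL);

                return p ? page_address(p) : NULL;
        }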
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 213593dbd7a1..1061aaf66f0a 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1372,7 +1372,7 @@ static int svm_vcpu_create(struct kvm_vcpu *vcpu)
> svm = to_svm(vcpu);
>
> err = -ENOMEM;
> - vmcb01_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> + vmcb01_page = snp_safe_alloc_page(vcpu);
> if (!vmcb01_page)
> goto out;
>
> @@ -1381,7 +1381,7 @@ static int svm_vcpu_create(struct kvm_vcpu *vcpu)
> * SEV-ES guests require a separate VMSA page used to contain
> * the encrypted register state of the guest.
> */
> - vmsa_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> + vmsa_page = snp_safe_alloc_page(vcpu);
> if (!vmsa_page)
> goto error_free_vmcb_page;
>
> @@ -4696,6 +4696,16 @@ static int svm_vm_init(struct kvm *kvm)
> return 0;
> }
>
> +static void *svm_alloc_apic_backing_page(struct kvm_vcpu *vcpu)
> +{
> + struct page *page = snp_safe_alloc_page(vcpu);
> +
> + if (!page)
> + return NULL;
> +
> + return page_address(page);
> +}
> +
> static struct kvm_x86_ops svm_x86_ops __initdata = {
> .name = KBUILD_MODNAME,
>
> @@ -4824,6 +4834,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>
> .vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
> .vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
> + .alloc_apic_backing_page = svm_alloc_apic_backing_page,
> };
>
> /*
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index c249c360fe36..5efcf036ccad 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -692,6 +692,7 @@ void sev_es_vcpu_reset(struct vcpu_svm *svm);
> void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
> void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa);
> void sev_es_unmap_ghcb(struct vcpu_svm *svm);
> +struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
>
> /* vmenter.S */
>
On 2/22/2023 2:24 PM, Zhi Wang wrote:
> On Mon, 20 Feb 2023 12:38:19 -0600
> Michael Roth <[email protected]> wrote:
>
> It seems in the discussion:
> https://lore.kernel.org/lkml/[email protected]/,
> this API is going to be removed. Will that fix land in this patch series or not?
> If not, It would be better to mention it in the comment message of this one
> or patch 45.
> If yes, I guess this patch is not needed.
>
This API is definitely not going to be removed.
There will be some fixes and optimizations added to the API
implementation (as per the discussions) and that will be included in v9.
Thanks,
Ashish
>> From: Brijesh Singh <[email protected]>
>>
>> Version 2 of the GHCB specification defines VMGEXIT that is used to get
>> the extended attestation report. The extended attestation report includes
>> the certificate blobs provided through the SNP_SET_EXT_CONFIG.
>>
>> The snp_guest_ext_guest_request() will be used by the hypervisor to get
>> the extended attestation report. See the GHCB specification for more
>> details.
>>
>> Signed-off-by: Brijesh Singh <[email protected]>
>> Signed-off-by: Ashish Kalra <[email protected]>
>> Signed-off-by: Michael Roth <[email protected]>
>> ---
>> drivers/crypto/ccp/sev-dev.c | 47 ++++++++++++++++++++++++++++++++++++
>> include/linux/psp-sev.h | 33 +++++++++++++++++++++++++
>> 2 files changed, 80 insertions(+)
>>
>> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
>> index b56b00ca2cd4..e65563bc8298 100644
>> --- a/drivers/crypto/ccp/sev-dev.c
>> +++ b/drivers/crypto/ccp/sev-dev.c
>> @@ -2017,6 +2017,53 @@ int sev_guest_df_flush(int *error)
>> }
>> EXPORT_SYMBOL_GPL(sev_guest_df_flush);
>>
>> +int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
>> + unsigned long vaddr, unsigned long *npages, unsigned long *fw_err)
>> +{
>> + unsigned long expected_npages;
>> + struct sev_device *sev;
>> + int rc;
>> +
>> + if (!psp_master || !psp_master->sev_data)
>> + return -ENODEV;
>> +
>> + sev = psp_master->sev_data;
>> +
>> + if (!sev->snp_initialized)
>> + return -EINVAL;
>> +
>> + mutex_lock(&sev->snp_certs_lock);
>> + /*
>> + * Check if there is enough space to copy the certificate chain. Otherwise
>> + * return ERROR code defined in the GHCB specification.
>> + */
>> + expected_npages = sev->snp_certs_len >> PAGE_SHIFT;
>> + if (*npages < expected_npages) {
>> + *npages = expected_npages;
>> + *fw_err = SNP_GUEST_REQ_INVALID_LEN;
>> + mutex_unlock(&sev->snp_certs_lock);
>> + return -EINVAL;
>> + }
>> +
>> + rc = sev_do_cmd(SEV_CMD_SNP_GUEST_REQUEST, data, (int *)fw_err);
>> + if (rc) {
>> + mutex_unlock(&sev->snp_certs_lock);
>> + return rc;
>> + }
>> +
>> + /* Copy the certificate blob */
>> + if (sev->snp_certs_data) {
>> + *npages = expected_npages;
>> + memcpy((void *)vaddr, sev->snp_certs_data, *npages << PAGE_SHIFT);
>> + } else {
>> + *npages = 0;
>> + }
>> +
>> + mutex_unlock(&sev->snp_certs_lock);
>> + return rc;
>> +}
>> +EXPORT_SYMBOL_GPL(snp_guest_ext_guest_request);
>> +
>> static void sev_exit(struct kref *ref)
>> {
>> misc_deregister(&misc_dev->misc);
>> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
>> index d19744807471..81bafc049eca 100644
>> --- a/include/linux/psp-sev.h
>> +++ b/include/linux/psp-sev.h
>> @@ -931,6 +931,32 @@ void snp_free_firmware_page(void *addr);
>> */
>> void snp_mark_pages_offline(unsigned long pfn, unsigned int npages);
>>
>> +/**
>> + * snp_guest_ext_guest_request - perform the SNP extended guest request command
>> + * defined in the GHCB specification.
>> + *
>> + * @data: the input guest request structure
>> + * @vaddr: address where the certificate blob need to be copied.
>> + * @npages: number of pages for the certificate blob.
>> + * If the specified page count is less than the certificate blob size, then the
>> + * required page count is returned with error code defined in the GHCB spec.
>> + * If the specified page count is more than the certificate blob size, then
>> + * page count is updated to reflect the amount of valid data copied in the
>> + * vaddr.
>> + *
>> + * @sev_ret: sev command return code
>> + *
>> + * Returns:
>> + * 0 if the sev successfully processed the command
>> + * -%ENODEV if the sev device is not available
>> + * -%ENOTSUPP if the sev does not support SEV
>> + * -%ETIMEDOUT if the sev command timed out
>> + * -%EIO if the sev returned a non-zero return code
>> + */
>> +int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
>> + unsigned long vaddr, unsigned long *npages,
>> + unsigned long *error);
>> +
>> #else /* !CONFIG_CRYPTO_DEV_SP_PSP */
>>
>> static inline int
>> @@ -968,6 +994,13 @@ static inline void *snp_alloc_firmware_page(gfp_t mask)
>>
>> static inline void snp_free_firmware_page(void *addr) { }
>>
>> +static inline int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
>> + unsigned long vaddr, unsigned long *n,
>> + unsigned long *error)
>> +{
>> + return -ENODEV;
>> +}
>> +
>> #endif /* CONFIG_CRYPTO_DEV_SP_PSP */
>>
>> #endif /* __PSP_SEV_H__ */
>
On 2/22/2023 6:32 AM, Zhi Wang wrote:
> On Mon, 20 Feb 2023 12:38:18 -0600
> Michael Roth <[email protected]> wrote:
>
>> From: Brijesh Singh <[email protected]>
>>
>> The SEV-SNP firmware provides the SNP_CONFIG command used to set the
>> system-wide configuration value for SNP guests. The information includes
>> the TCB version string to be reported in guest attestation reports.
>>
>> Version 2 of the GHCB specification adds an NAE (SNP extended guest
>> request) that a guest can use to query the reports that include additional
>> certificates.
>>
>> In both cases, userspace provided additional data is included in the
>> attestation reports. The userspace will use the SNP_SET_EXT_CONFIG
>> command to give the certificate blob and the reported TCB version string
>> at once. Note that the specification defines certificate blob with a
>> specific GUID format; the userspace is responsible for building the
>> proper certificate blob. The ioctl treats it an opaque blob.
>>
>> While it is not defined in the spec, but let's add SNP_GET_EXT_CONFIG
>> command that can be used to obtain the data programmed through the
>> SNP_SET_EXT_CONFIG.
>>
>> Signed-off-by: Brijesh Singh <[email protected]>
>> Signed-off-by: Ashish Kalra <[email protected]>
>> Signed-off-by: Michael Roth <[email protected]>
>> ---
>> Documentation/virt/coco/sev-guest.rst | 27 ++++++
>> drivers/crypto/ccp/sev-dev.c | 123 ++++++++++++++++++++++++++
>> drivers/crypto/ccp/sev-dev.h | 4 +
>> include/uapi/linux/psp-sev.h | 17 ++++
>> 4 files changed, 171 insertions(+)
>>
>> diff --git a/Documentation/virt/coco/sev-guest.rst b/Documentation/virt/coco/sev-guest.rst
>> index 11ea67c944df..6cad4226c348 100644
>> --- a/Documentation/virt/coco/sev-guest.rst
>> +++ b/Documentation/virt/coco/sev-guest.rst
>> @@ -145,6 +145,33 @@ The SNP_PLATFORM_STATUS command is used to query the SNP platform status. The
>> status includes API major, minor version and more. See the SEV-SNP
>> specification for further details.
>>
>> +2.5 SNP_SET_EXT_CONFIG
>> +----------------------
>> +:Technology: sev-snp
>> +:Type: hypervisor ioctl cmd
>> +:Parameters (in): struct sev_data_snp_ext_config
>> +:Returns (out): 0 on success, -negative on error
>> +
>> +The SNP_SET_EXT_CONFIG is used to set the system-wide configuration such as
>> +reported TCB version in the attestation report. The command is similar to
>> +SNP_CONFIG command defined in the SEV-SNP spec. The main difference is the
>> +command also accepts an additional certificate blob defined in the GHCB
>> +specification.
>> +
>> +If the certs_address is zero, then the previous certificate blob will deleted.
>> +For more information on the certificate blob layout, see the GHCB spec
>> +(extended guest request message).
>> +
>> +2.6 SNP_GET_EXT_CONFIG
>> +----------------------
>> +:Technology: sev-snp
>> +:Type: hypervisor ioctl cmd
>> +:Parameters (in): struct sev_data_snp_ext_config
>> +:Returns (out): 0 on success, -negative on error
>> +
>> +The SNP_GET_EXT_CONFIG is used to query the system-wide configuration set
>> +through the SNP_SET_EXT_CONFIG.
>> +
>> 3. SEV-SNP CPUID Enforcement
>> ============================
>>
>> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
>> index 65e13a562f3b..b56b00ca2cd4 100644
>> --- a/drivers/crypto/ccp/sev-dev.c
>> +++ b/drivers/crypto/ccp/sev-dev.c
>> @@ -1481,6 +1481,10 @@ static int __sev_snp_shutdown_locked(int *error)
>> data.length = sizeof(data);
>> data.iommu_snp_shutdown = 1;
>>
>> + /* Free the memory used for caching the certificate data */
>> + kfree(sev->snp_certs_data);
>> + sev->snp_certs_data = NULL;
>> +
>> wbinvd_on_all_cpus();
>>
>> retry:
>> @@ -1793,6 +1797,118 @@ static int sev_ioctl_snp_platform_status(struct sev_issue_cmd *argp)
>> return ret;
>> }
>>
>> +static int sev_ioctl_snp_get_config(struct sev_issue_cmd *argp)
>> +{
>> + struct sev_device *sev = psp_master->sev_data;
>> + struct sev_user_data_ext_snp_config input;
>> + int ret;
>> +
>> + if (!sev->snp_initialized || !argp->data)
>> + return -EINVAL;
>> +
>> + memset(&input, 0, sizeof(input));
>> +
>> + if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
>> + return -EFAULT;
>> +
>> + /* Copy the TCB version programmed through the SET_CONFIG to userspace */
>> + if (input.config_address) {
>> + if (copy_to_user((void * __user)input.config_address,
>> + &sev->snp_config, sizeof(struct sev_user_data_snp_config)))
>> + return -EFAULT;
>> + }
>> +
>> + /* Copy the extended certs programmed through the SNP_SET_CONFIG */
>> + if (input.certs_address && sev->snp_certs_data) {
>> + if (input.certs_len < sev->snp_certs_len) {
>> + /* Return the certs length to userspace */
>> + input.certs_len = sev->snp_certs_len;
>> +
>> + ret = -ENOSR;
>> + goto e_done;
>> + }
>> +
>
> What about if input.certs_len > sev->snp_certs_len? Is it possbile for the
> userspace to know the length of data in the buffer? (I guess it might be able
> to know the certs len through the blob data, but a comment here would be nice)
>
If userspace provides an input buffer/length smaller than snp_certs_len,
then the above returns the "required" certs length back to userspace.
And what is the issue if input.certs_len > sev->snp_certs_len? The data
copied back to userspace is sev->snp_certs_len bytes, as below.
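To illustrate that flow, here is an untested userspace sketch of the
query/retry pattern. It assumes a uapi header that already carries the
SNP_GET_EXT_CONFIG additions from this patch; the existing SEV_ISSUE_CMD
ioctl and struct sev_issue_cmd are reused as-is.

        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <linux/psp-sev.h>

        static void *get_ext_certs(int sev_fd, unsigned int *buf_len)
        {
                struct sev_user_data_ext_snp_config ext = {};
                struct sev_issue_cmd cmd = {};
                unsigned int len = 4096;        /* initial guess */
                void *certs;

        retry:
                certs = malloc(len);
                if (!certs)
                        return NULL;

                ext.config_address = 0;         /* only interested in the certs */
                ext.certs_address = (unsigned long)certs;
                ext.certs_len = len;

                cmd.cmd = SNP_GET_EXT_CONFIG;
                cmd.data = (unsigned long)&ext;

                if (ioctl(sev_fd, SEV_ISSUE_CMD, &cmd) < 0) {
                        if (ext.certs_len > len) {
                                /* Too small: the kernel wrote back the required length. */
                                len = ext.certs_len;
                                free(certs);
                                goto retry;
                        }
                        free(certs);
                        return NULL;
                }

                /*
                 * Note: on success certs_len is not updated by the kernel, so
                 * the actual data length has to be derived from the cert blob
                 * format itself (see the discussion later in the thread).
                 */
                *buf_len = len;
                return certs;
        }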
Thanks,
Ashish
>> + if (copy_to_user((void * __user)input.certs_address,
>> + sev->snp_certs_data, sev->snp_certs_len))
>> + return -EFAULT;
>> + }
>> +
>> + ret = 0;
>> +
>> +e_done:
>> + if (copy_to_user((void __user *)argp->data, &input, sizeof(input)))
>> + ret = -EFAULT;
>> +
>> + return ret;
>> +}
>> +
>> +static int sev_ioctl_snp_set_config(struct sev_issue_cmd *argp, bool writable)
>> +{
>> + struct sev_device *sev = psp_master->sev_data;
>> + struct sev_user_data_ext_snp_config input;
>> + struct sev_user_data_snp_config config;
>> + void *certs = NULL;
>> + int ret = 0;
>> +
>> + if (!sev->snp_initialized || !argp->data)
>> + return -EINVAL;
>> +
>> + if (!writable)
>> + return -EPERM;
>> +
>> + memset(&input, 0, sizeof(input));
>> +
>> + if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
>> + return -EFAULT;
>> +
>> + /* Copy the certs from userspace */
>> + if (input.certs_address) {
>> + if (!input.certs_len || !IS_ALIGNED(input.certs_len, PAGE_SIZE))
>> + return -EINVAL;
>> +
>> + certs = psp_copy_user_blob(input.certs_address, input.certs_len);
>> + if (IS_ERR(certs))
>> + return PTR_ERR(certs);
>> + }
>> +
>> + /* Issue the PSP command to update the TCB version using the SNP_CONFIG. */
>> + if (input.config_address) {
>> + memset(&config, 0, sizeof(config));
>> + if (copy_from_user(&config,
>> + (void __user *)input.config_address, sizeof(config))) {
>> + ret = -EFAULT;
>> + goto e_free;
>> + }
>> +
>> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_CONFIG, &config, &argp->error);
>> + if (ret)
>> + goto e_free;
>> +
>> + memcpy(&sev->snp_config, &config, sizeof(config));
>> + }
>> +
>> + /*
>> + * If the new certs are passed then cache it else free the old certs.
>> + */
>> + mutex_lock(&sev->snp_certs_lock);
>> + if (certs) {
>> + kfree(sev->snp_certs_data);
>> + sev->snp_certs_data = certs;
>> + sev->snp_certs_len = input.certs_len;
>> + } else {
>> + kfree(sev->snp_certs_data);
>> + sev->snp_certs_data = NULL;
>> + sev->snp_certs_len = 0;
>> + }
>> + mutex_unlock(&sev->snp_certs_lock);
>> +
>> + return 0;
>> +
>> +e_free:
>> + kfree(certs);
>> + return ret;
>> +}
>> +
>> static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
>> {
>> void __user *argp = (void __user *)arg;
>> @@ -1847,6 +1963,12 @@ static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
>> case SNP_PLATFORM_STATUS:
>> ret = sev_ioctl_snp_platform_status(&input);
>> break;
>> + case SNP_SET_EXT_CONFIG:
>> + ret = sev_ioctl_snp_set_config(&input, writable);
>> + break;
>> + case SNP_GET_EXT_CONFIG:
>> + ret = sev_ioctl_snp_get_config(&input);
>> + break;
>> default:
>> ret = -EINVAL;
>> goto out;
>> @@ -1962,6 +2084,7 @@ int sev_dev_init(struct psp_device *psp)
>> goto e_sev;
>>
>> sev->cmd_buf_backup = (uint8_t *)sev->cmd_buf + PAGE_SIZE;
>> + mutex_init(&sev->snp_certs_lock);
>>
>> psp->sev_data = sev;
>>
>> diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
>> index 19d79f9d4212..41d5353d5bab 100644
>> --- a/drivers/crypto/ccp/sev-dev.h
>> +++ b/drivers/crypto/ccp/sev-dev.h
>> @@ -66,6 +66,10 @@ struct sev_device {
>>
>> bool snp_initialized;
>> struct snp_host_map snp_host_map[MAX_SNP_HOST_MAP_BUFS];
>> + void *snp_certs_data;
>> + u32 snp_certs_len;
>> + struct mutex snp_certs_lock;
>> + struct sev_user_data_snp_config snp_config;
>> };
>>
>> int sev_dev_init(struct psp_device *psp);
>> diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
>> index 5adfaea7df97..c20d37586d21 100644
>> --- a/include/uapi/linux/psp-sev.h
>> +++ b/include/uapi/linux/psp-sev.h
>> @@ -29,6 +29,8 @@ enum {
>> SEV_GET_ID, /* This command is deprecated, use SEV_GET_ID2 */
>> SEV_GET_ID2,
>> SNP_PLATFORM_STATUS,
>> + SNP_SET_EXT_CONFIG,
>> + SNP_GET_EXT_CONFIG,
>>
>> SEV_MAX,
>> };
>> @@ -192,6 +194,21 @@ struct sev_user_data_snp_config {
>> __u8 rsvd1[52];
>> } __packed;
>>
>> +/**
>> + * struct sev_data_snp_ext_config - system wide configuration value for SNP.
>> + *
>> + * @config_address: address of the struct sev_user_data_snp_config or 0 when
>> + * reported_tcb does not need to be updated.
>> + * @certs_address: address of extended guest request certificate chain or
>> + * 0 when previous certificate should be removed on SNP_SET_EXT_CONFIG.
>> + * @certs_len: length of the certs
>> + */
>> +struct sev_user_data_ext_snp_config {
>> + __u64 config_address; /* In */
>> + __u64 certs_address; /* In */
>> + __u32 certs_len; /* In */
>> +};
>> +
>> /**
>> * struct sev_issue_cmd - SEV ioctl parameters
>> *
>
On Wed, 22 Feb 2023 16:43:54 -0600
"Kalra, Ashish" <[email protected]> wrote:
> On 2/22/2023 6:32 AM, Zhi Wang wrote:
> > On Mon, 20 Feb 2023 12:38:18 -0600
> > Michael Roth <[email protected]> wrote:
> >
> >> From: Brijesh Singh <[email protected]>
> >>
> >> The SEV-SNP firmware provides the SNP_CONFIG command used to set the
> >> system-wide configuration value for SNP guests. The information includes
> >> the TCB version string to be reported in guest attestation reports.
> >>
> >> Version 2 of the GHCB specification adds an NAE (SNP extended guest
> >> request) that a guest can use to query the reports that include additional
> >> certificates.
> >>
> >> In both cases, userspace provided additional data is included in the
> >> attestation reports. The userspace will use the SNP_SET_EXT_CONFIG
> >> command to give the certificate blob and the reported TCB version string
> >> at once. Note that the specification defines certificate blob with a
> >> specific GUID format; the userspace is responsible for building the
> >> proper certificate blob. The ioctl treats it an opaque blob.
> >>
> >> While it is not defined in the spec, but let's add SNP_GET_EXT_CONFIG
> >> command that can be used to obtain the data programmed through the
> >> SNP_SET_EXT_CONFIG.
> >>
> >> Signed-off-by: Brijesh Singh <[email protected]>
> >> Signed-off-by: Ashish Kalra <[email protected]>
> >> Signed-off-by: Michael Roth <[email protected]>
> >> ---
> >> Documentation/virt/coco/sev-guest.rst | 27 ++++++
> >> drivers/crypto/ccp/sev-dev.c | 123 ++++++++++++++++++++++++++
> >> drivers/crypto/ccp/sev-dev.h | 4 +
> >> include/uapi/linux/psp-sev.h | 17 ++++
> >> 4 files changed, 171 insertions(+)
> >>
> >> diff --git a/Documentation/virt/coco/sev-guest.rst b/Documentation/virt/coco/sev-guest.rst
> >> index 11ea67c944df..6cad4226c348 100644
> >> --- a/Documentation/virt/coco/sev-guest.rst
> >> +++ b/Documentation/virt/coco/sev-guest.rst
> >> @@ -145,6 +145,33 @@ The SNP_PLATFORM_STATUS command is used to query the SNP platform status. The
> >> status includes API major, minor version and more. See the SEV-SNP
> >> specification for further details.
> >>
> >> +2.5 SNP_SET_EXT_CONFIG
> >> +----------------------
> >> +:Technology: sev-snp
> >> +:Type: hypervisor ioctl cmd
> >> +:Parameters (in): struct sev_data_snp_ext_config
> >> +:Returns (out): 0 on success, -negative on error
> >> +
> >> +The SNP_SET_EXT_CONFIG is used to set the system-wide configuration such as
> >> +reported TCB version in the attestation report. The command is similar to
> >> +SNP_CONFIG command defined in the SEV-SNP spec. The main difference is the
> >> +command also accepts an additional certificate blob defined in the GHCB
> >> +specification.
> >> +
> >> +If the certs_address is zero, then the previous certificate blob will deleted.
> >> +For more information on the certificate blob layout, see the GHCB spec
> >> +(extended guest request message).
> >> +
> >> +2.6 SNP_GET_EXT_CONFIG
> >> +----------------------
> >> +:Technology: sev-snp
> >> +:Type: hypervisor ioctl cmd
> >> +:Parameters (in): struct sev_data_snp_ext_config
> >> +:Returns (out): 0 on success, -negative on error
> >> +
> >> +The SNP_GET_EXT_CONFIG is used to query the system-wide configuration set
> >> +through the SNP_SET_EXT_CONFIG.
> >> +
> >> 3. SEV-SNP CPUID Enforcement
> >> ============================
> >>
> >> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> >> index 65e13a562f3b..b56b00ca2cd4 100644
> >> --- a/drivers/crypto/ccp/sev-dev.c
> >> +++ b/drivers/crypto/ccp/sev-dev.c
> >> @@ -1481,6 +1481,10 @@ static int __sev_snp_shutdown_locked(int *error)
> >> data.length = sizeof(data);
> >> data.iommu_snp_shutdown = 1;
> >>
> >> + /* Free the memory used for caching the certificate data */
> >> + kfree(sev->snp_certs_data);
> >> + sev->snp_certs_data = NULL;
> >> +
> >> wbinvd_on_all_cpus();
> >>
> >> retry:
> >> @@ -1793,6 +1797,118 @@ static int sev_ioctl_snp_platform_status(struct sev_issue_cmd *argp)
> >> return ret;
> >> }
> >>
> >> +static int sev_ioctl_snp_get_config(struct sev_issue_cmd *argp)
> >> +{
> >> + struct sev_device *sev = psp_master->sev_data;
> >> + struct sev_user_data_ext_snp_config input;
> >> + int ret;
> >> +
> >> + if (!sev->snp_initialized || !argp->data)
> >> + return -EINVAL;
> >> +
> >> + memset(&input, 0, sizeof(input));
> >> +
> >> + if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
> >> + return -EFAULT;
> >> +
> >> + /* Copy the TCB version programmed through the SET_CONFIG to userspace */
> >> + if (input.config_address) {
> >> + if (copy_to_user((void * __user)input.config_address,
> >> + &sev->snp_config, sizeof(struct sev_user_data_snp_config)))
> >> + return -EFAULT;
> >> + }
> >> +
> >> + /* Copy the extended certs programmed through the SNP_SET_CONFIG */
> >> + if (input.certs_address && sev->snp_certs_data) {
> >> + if (input.certs_len < sev->snp_certs_len) {
> >> + /* Return the certs length to userspace */
> >> + input.certs_len = sev->snp_certs_len;
> >> +
> >> + ret = -ENOSR;
> >> + goto e_done;
> >> + }
> >> +
> >
> > What about if input.certs_len > sev->snp_certs_len? Is it possbile for the
> > userspace to know the length of data in the buffer? (I guess it might be able
> > to know the certs len through the blob data, but a comment here would be nice)
> >
>
> If userspace provides an input buffer/length smaller then snp_certs_len,
> then the above returns the "required" certs length back to userspace.
>
> And what is the issue if input.certs_len > sev->snp_certs_len, the
> buffer returned back to userspace is sev->snp_certs_len as below.
>
My point is: how can userspace know that the length of the returned data is
shorter than input.certs_len when input.certs_len > sev->snp_certs_len? The
length is only returned when input.certs_len < sev->snp_certs_len.
> Thanks,
> Ashish
>
> >> + if (copy_to_user((void * __user)input.certs_address,
> >> + sev->snp_certs_data, sev->snp_certs_len))
> >> + return -EFAULT;
> >> + }
> >> +
> >> + ret = 0;
> >> +
> >> +e_done:
> >> + if (copy_to_user((void __user *)argp->data, &input, sizeof(input)))
> >> + ret = -EFAULT;
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +static int sev_ioctl_snp_set_config(struct sev_issue_cmd *argp, bool writable)
> >> +{
> >> + struct sev_device *sev = psp_master->sev_data;
> >> + struct sev_user_data_ext_snp_config input;
> >> + struct sev_user_data_snp_config config;
> >> + void *certs = NULL;
> >> + int ret = 0;
> >> +
> >> + if (!sev->snp_initialized || !argp->data)
> >> + return -EINVAL;
> >> +
> >> + if (!writable)
> >> + return -EPERM;
> >> +
> >> + memset(&input, 0, sizeof(input));
> >> +
> >> + if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
> >> + return -EFAULT;
> >> +
> >> + /* Copy the certs from userspace */
> >> + if (input.certs_address) {
> >> + if (!input.certs_len || !IS_ALIGNED(input.certs_len, PAGE_SIZE))
> >> + return -EINVAL;
> >> +
> >> + certs = psp_copy_user_blob(input.certs_address, input.certs_len);
> >> + if (IS_ERR(certs))
> >> + return PTR_ERR(certs);
> >> + }
> >> +
> >> + /* Issue the PSP command to update the TCB version using the SNP_CONFIG. */
> >> + if (input.config_address) {
> >> + memset(&config, 0, sizeof(config));
> >> + if (copy_from_user(&config,
> >> + (void __user *)input.config_address, sizeof(config))) {
> >> + ret = -EFAULT;
> >> + goto e_free;
> >> + }
> >> +
> >> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_CONFIG, &config, &argp->error);
> >> + if (ret)
> >> + goto e_free;
> >> +
> >> + memcpy(&sev->snp_config, &config, sizeof(config));
> >> + }
> >> +
> >> + /*
> >> + * If the new certs are passed then cache it else free the old certs.
> >> + */
> >> + mutex_lock(&sev->snp_certs_lock);
> >> + if (certs) {
> >> + kfree(sev->snp_certs_data);
> >> + sev->snp_certs_data = certs;
> >> + sev->snp_certs_len = input.certs_len;
> >> + } else {
> >> + kfree(sev->snp_certs_data);
> >> + sev->snp_certs_data = NULL;
> >> + sev->snp_certs_len = 0;
> >> + }
> >> + mutex_unlock(&sev->snp_certs_lock);
> >> +
> >> + return 0;
> >> +
> >> +e_free:
> >> + kfree(certs);
> >> + return ret;
> >> +}
> >> +
> >> static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
> >> {
> >> void __user *argp = (void __user *)arg;
> >> @@ -1847,6 +1963,12 @@ static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
> >> case SNP_PLATFORM_STATUS:
> >> ret = sev_ioctl_snp_platform_status(&input);
> >> break;
> >> + case SNP_SET_EXT_CONFIG:
> >> + ret = sev_ioctl_snp_set_config(&input, writable);
> >> + break;
> >> + case SNP_GET_EXT_CONFIG:
> >> + ret = sev_ioctl_snp_get_config(&input);
> >> + break;
> >> default:
> >> ret = -EINVAL;
> >> goto out;
> >> @@ -1962,6 +2084,7 @@ int sev_dev_init(struct psp_device *psp)
> >> goto e_sev;
> >>
> >> sev->cmd_buf_backup = (uint8_t *)sev->cmd_buf + PAGE_SIZE;
> >> + mutex_init(&sev->snp_certs_lock);
> >>
> >> psp->sev_data = sev;
> >>
> >> diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
> >> index 19d79f9d4212..41d5353d5bab 100644
> >> --- a/drivers/crypto/ccp/sev-dev.h
> >> +++ b/drivers/crypto/ccp/sev-dev.h
> >> @@ -66,6 +66,10 @@ struct sev_device {
> >>
> >> bool snp_initialized;
> >> struct snp_host_map snp_host_map[MAX_SNP_HOST_MAP_BUFS];
> >> + void *snp_certs_data;
> >> + u32 snp_certs_len;
> >> + struct mutex snp_certs_lock;
> >> + struct sev_user_data_snp_config snp_config;
> >> };
> >>
> >> int sev_dev_init(struct psp_device *psp);
> >> diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
> >> index 5adfaea7df97..c20d37586d21 100644
> >> --- a/include/uapi/linux/psp-sev.h
> >> +++ b/include/uapi/linux/psp-sev.h
> >> @@ -29,6 +29,8 @@ enum {
> >> SEV_GET_ID, /* This command is deprecated, use SEV_GET_ID2 */
> >> SEV_GET_ID2,
> >> SNP_PLATFORM_STATUS,
> >> + SNP_SET_EXT_CONFIG,
> >> + SNP_GET_EXT_CONFIG,
> >>
> >> SEV_MAX,
> >> };
> >> @@ -192,6 +194,21 @@ struct sev_user_data_snp_config {
> >> __u8 rsvd1[52];
> >> } __packed;
> >>
> >> +/**
> >> + * struct sev_data_snp_ext_config - system wide configuration value for SNP.
> >> + *
> >> + * @config_address: address of the struct sev_user_data_snp_config or 0 when
> >> + * reported_tcb does not need to be updated.
> >> + * @certs_address: address of extended guest request certificate chain or
> >> + * 0 when previous certificate should be removed on SNP_SET_EXT_CONFIG.
> >> + * @certs_len: length of the certs
> >> + */
> >> +struct sev_user_data_ext_snp_config {
> >> + __u64 config_address; /* In */
> >> + __u64 certs_address; /* In */
> >> + __u32 certs_len; /* In */
> >> +};
> >> +
> >> /**
> >> * struct sev_issue_cmd - SEV ioctl parameters
> >> *
> >
On Wed, 22 Feb 2023 16:35:43 -0600
"Kalra, Ashish" <[email protected]> wrote:
> On 2/22/2023 2:24 PM, Zhi Wang wrote:
> > On Mon, 20 Feb 2023 12:38:19 -0600
> > Michael Roth <[email protected]> wrote:
> >
> > It seems in the discussion:
> > https://lore.kernel.org/lkml/[email protected]/,
> > this API is going to be removed. Will that fix land in this patch series or not?
> > If not, It would be better to mention it in the comment message of this one
> > or patch 45.
> > If yes, I guess this patch is not needed.
> >
>
> This API is definitely not going to be removed.
>
> There will be some fixes and optimizations added to the API
> implementation (as per the discussions) and that will be included in v9.
>
Thanks.
I should have used the term "this API is going to be refined", as
snp_guest_ext_guest_request() is going to be renamed and refined. I gave
this comment because, while digging into this patch, I found that this API was
going to be changed in the discussion based on v7. It would be really nice to
mention that in v8 so that some review effort can be saved.
For example, some people might choose to skip reviewing this one in v8 and get
back to it in the next version when it is ready. Or people can also evaluate
the possible changes in v9 when reviewing this part.
> Thanks,
> Ashish
>
> >> From: Brijesh Singh <[email protected]>
> >>
> >> Version 2 of the GHCB specification defines VMGEXIT that is used to get
> >> the extended attestation report. The extended attestation report includes
> >> the certificate blobs provided through the SNP_SET_EXT_CONFIG.
> >>
> >> The snp_guest_ext_guest_request() will be used by the hypervisor to get
> >> the extended attestation report. See the GHCB specification for more
> >> details.
> >>
> >> Signed-off-by: Brijesh Singh <[email protected]>
> >> Signed-off-by: Ashish Kalra <[email protected]>
> >> Signed-off-by: Michael Roth <[email protected]>
> >> ---
> >> drivers/crypto/ccp/sev-dev.c | 47 ++++++++++++++++++++++++++++++++++++
> >> include/linux/psp-sev.h | 33 +++++++++++++++++++++++++
> >> 2 files changed, 80 insertions(+)
> >>
> >> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> >> index b56b00ca2cd4..e65563bc8298 100644
> >> --- a/drivers/crypto/ccp/sev-dev.c
> >> +++ b/drivers/crypto/ccp/sev-dev.c
> >> @@ -2017,6 +2017,53 @@ int sev_guest_df_flush(int *error)
> >> }
> >> EXPORT_SYMBOL_GPL(sev_guest_df_flush);
> >>
> >> +int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
> >> + unsigned long vaddr, unsigned long *npages, unsigned long *fw_err)
> >> +{
> >> + unsigned long expected_npages;
> >> + struct sev_device *sev;
> >> + int rc;
> >> +
> >> + if (!psp_master || !psp_master->sev_data)
> >> + return -ENODEV;
> >> +
> >> + sev = psp_master->sev_data;
> >> +
> >> + if (!sev->snp_initialized)
> >> + return -EINVAL;
> >> +
> >> + mutex_lock(&sev->snp_certs_lock);
> >> + /*
> >> + * Check if there is enough space to copy the certificate chain. Otherwise
> >> + * return ERROR code defined in the GHCB specification.
> >> + */
> >> + expected_npages = sev->snp_certs_len >> PAGE_SHIFT;
> >> + if (*npages < expected_npages) {
> >> + *npages = expected_npages;
> >> + *fw_err = SNP_GUEST_REQ_INVALID_LEN;
> >> + mutex_unlock(&sev->snp_certs_lock);
> >> + return -EINVAL;
> >> + }
> >> +
> >> + rc = sev_do_cmd(SEV_CMD_SNP_GUEST_REQUEST, data, (int *)fw_err);
> >> + if (rc) {
> >> + mutex_unlock(&sev->snp_certs_lock);
> >> + return rc;
> >> + }
> >> +
> >> + /* Copy the certificate blob */
> >> + if (sev->snp_certs_data) {
> >> + *npages = expected_npages;
> >> + memcpy((void *)vaddr, sev->snp_certs_data, *npages << PAGE_SHIFT);
> >> + } else {
> >> + *npages = 0;
> >> + }
> >> +
> >> + mutex_unlock(&sev->snp_certs_lock);
> >> + return rc;
> >> +}
> >> +EXPORT_SYMBOL_GPL(snp_guest_ext_guest_request);
> >> +
> >> static void sev_exit(struct kref *ref)
> >> {
> >> misc_deregister(&misc_dev->misc);
> >> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
> >> index d19744807471..81bafc049eca 100644
> >> --- a/include/linux/psp-sev.h
> >> +++ b/include/linux/psp-sev.h
> >> @@ -931,6 +931,32 @@ void snp_free_firmware_page(void *addr);
> >> */
> >> void snp_mark_pages_offline(unsigned long pfn, unsigned int npages);
> >>
> >> +/**
> >> + * snp_guest_ext_guest_request - perform the SNP extended guest request command
> >> + * defined in the GHCB specification.
> >> + *
> >> + * @data: the input guest request structure
> >> + * @vaddr: address where the certificate blob need to be copied.
> >> + * @npages: number of pages for the certificate blob.
> >> + * If the specified page count is less than the certificate blob size, then the
> >> + * required page count is returned with error code defined in the GHCB spec.
> >> + * If the specified page count is more than the certificate blob size, then
> >> + * page count is updated to reflect the amount of valid data copied in the
> >> + * vaddr.
> >> + *
> >> + * @sev_ret: sev command return code
> >> + *
> >> + * Returns:
> >> + * 0 if the sev successfully processed the command
> >> + * -%ENODEV if the sev device is not available
> >> + * -%ENOTSUPP if the sev does not support SEV
> >> + * -%ETIMEDOUT if the sev command timed out
> >> + * -%EIO if the sev returned a non-zero return code
> >> + */
> >> +int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
> >> + unsigned long vaddr, unsigned long *npages,
> >> + unsigned long *error);
> >> +
> >> #else /* !CONFIG_CRYPTO_DEV_SP_PSP */
> >>
> >> static inline int
> >> @@ -968,6 +994,13 @@ static inline void *snp_alloc_firmware_page(gfp_t mask)
> >>
> >> static inline void snp_free_firmware_page(void *addr) { }
> >>
> >> +static inline int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
> >> + unsigned long vaddr, unsigned long *n,
> >> + unsigned long *error)
> >> +{
> >> + return -ENODEV;
> >> +}
> >> +
> >> #endif /* CONFIG_CRYPTO_DEV_SP_PSP */
> >>
> >> #endif /* __PSP_SEV_H__ */
> >
On 2/23/23 00:38, Zhi Wang wrote:
> On Wed, 22 Feb 2023 16:43:54 -0600
> "Kalra, Ashish" <[email protected]> wrote:
>
>> On 2/22/2023 6:32 AM, Zhi Wang wrote:
>>> On Mon, 20 Feb 2023 12:38:18 -0600
>>> Michael Roth <[email protected]> wrote:
>>>
>>>> From: Brijesh Singh <[email protected]>
>>>>
>>>> The SEV-SNP firmware provides the SNP_CONFIG command used to set the
>>>> system-wide configuration value for SNP guests. The information includes
>>>> the TCB version string to be reported in guest attestation reports.
>>>>
>>>> Version 2 of the GHCB specification adds an NAE (SNP extended guest
>>>> request) that a guest can use to query the reports that include additional
>>>> certificates.
>>>>
>>>> In both cases, userspace provided additional data is included in the
>>>> attestation reports. The userspace will use the SNP_SET_EXT_CONFIG
>>>> command to give the certificate blob and the reported TCB version string
>>>> at once. Note that the specification defines certificate blob with a
>>>> specific GUID format; the userspace is responsible for building the
>>>> proper certificate blob. The ioctl treats it an opaque blob.
>>>>
>>>> While it is not defined in the spec, but let's add SNP_GET_EXT_CONFIG
>>>> command that can be used to obtain the data programmed through the
>>>> SNP_SET_EXT_CONFIG.
>>>>
>>>> Signed-off-by: Brijesh Singh <[email protected]>
>>>> Signed-off-by: Ashish Kalra <[email protected]>
>>>> Signed-off-by: Michael Roth <[email protected]>
>>>> ---
>>>> Documentation/virt/coco/sev-guest.rst | 27 ++++++
>>>> drivers/crypto/ccp/sev-dev.c | 123 ++++++++++++++++++++++++++
>>>> drivers/crypto/ccp/sev-dev.h | 4 +
>>>> include/uapi/linux/psp-sev.h | 17 ++++
>>>> 4 files changed, 171 insertions(+)
>>>>
>>>> diff --git a/Documentation/virt/coco/sev-guest.rst b/Documentation/virt/coco/sev-guest.rst
>>>> index 11ea67c944df..6cad4226c348 100644
>>>> --- a/Documentation/virt/coco/sev-guest.rst
>>>> +++ b/Documentation/virt/coco/sev-guest.rst
>>>> @@ -145,6 +145,33 @@ The SNP_PLATFORM_STATUS command is used to query the SNP platform status. The
>>>> status includes API major, minor version and more. See the SEV-SNP
>>>> specification for further details.
>>>>
>>>> +2.5 SNP_SET_EXT_CONFIG
>>>> +----------------------
>>>> +:Technology: sev-snp
>>>> +:Type: hypervisor ioctl cmd
>>>> +:Parameters (in): struct sev_data_snp_ext_config
>>>> +:Returns (out): 0 on success, -negative on error
>>>> +
>>>> +The SNP_SET_EXT_CONFIG is used to set the system-wide configuration such as
>>>> +reported TCB version in the attestation report. The command is similar to
>>>> +SNP_CONFIG command defined in the SEV-SNP spec. The main difference is the
>>>> +command also accepts an additional certificate blob defined in the GHCB
>>>> +specification.
>>>> +
>>>> +If the certs_address is zero, then the previous certificate blob will deleted.
>>>> +For more information on the certificate blob layout, see the GHCB spec
>>>> +(extended guest request message).
>>>> +
>>>> +2.6 SNP_GET_EXT_CONFIG
>>>> +----------------------
>>>> +:Technology: sev-snp
>>>> +:Type: hypervisor ioctl cmd
>>>> +:Parameters (in): struct sev_data_snp_ext_config
>>>> +:Returns (out): 0 on success, -negative on error
>>>> +
>>>> +The SNP_GET_EXT_CONFIG is used to query the system-wide configuration set
>>>> +through the SNP_SET_EXT_CONFIG.
>>>> +
>>>> 3. SEV-SNP CPUID Enforcement
>>>> ============================
>>>>
>>>> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
>>>> index 65e13a562f3b..b56b00ca2cd4 100644
>>>> --- a/drivers/crypto/ccp/sev-dev.c
>>>> +++ b/drivers/crypto/ccp/sev-dev.c
>>>> @@ -1481,6 +1481,10 @@ static int __sev_snp_shutdown_locked(int *error)
>>>> data.length = sizeof(data);
>>>> data.iommu_snp_shutdown = 1;
>>>>
>>>> + /* Free the memory used for caching the certificate data */
>>>> + kfree(sev->snp_certs_data);
>>>> + sev->snp_certs_data = NULL;
>>>> +
>>>> wbinvd_on_all_cpus();
>>>>
>>>> retry:
>>>> @@ -1793,6 +1797,118 @@ static int sev_ioctl_snp_platform_status(struct sev_issue_cmd *argp)
>>>> return ret;
>>>> }
>>>>
>>>> +static int sev_ioctl_snp_get_config(struct sev_issue_cmd *argp)
>>>> +{
>>>> + struct sev_device *sev = psp_master->sev_data;
>>>> + struct sev_user_data_ext_snp_config input;
>>>> + int ret;
>>>> +
>>>> + if (!sev->snp_initialized || !argp->data)
>>>> + return -EINVAL;
>>>> +
>>>> + memset(&input, 0, sizeof(input));
>>>> +
>>>> + if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
>>>> + return -EFAULT;
>>>> +
>>>> + /* Copy the TCB version programmed through the SET_CONFIG to userspace */
>>>> + if (input.config_address) {
>>>> + if (copy_to_user((void * __user)input.config_address,
>>>> + &sev->snp_config, sizeof(struct sev_user_data_snp_config)))
>>>> + return -EFAULT;
>>>> + }
>>>> +
>>>> + /* Copy the extended certs programmed through the SNP_SET_CONFIG */
>>>> + if (input.certs_address && sev->snp_certs_data) {
>>>> + if (input.certs_len < sev->snp_certs_len) {
>>>> + /* Return the certs length to userspace */
>>>> + input.certs_len = sev->snp_certs_len;
>>>> +
>>>> + ret = -ENOSR;
>>>> + goto e_done;
>>>> + }
>>>> +
>>>
>>> What about if input.certs_len > sev->snp_certs_len? Is it possbile for the
>>> userspace to know the length of data in the buffer? (I guess it might be able
>>> to know the certs len through the blob data, but a comment here would be nice)
>>>
>>
>> If userspace provides an input buffer/length smaller then snp_certs_len,
>> then the above returns the "required" certs length back to userspace.
>>
>> And what is the issue if input.certs_len > sev->snp_certs_len, the
>> buffer returned back to userspace is sev->snp_certs_len as below.
>>
>
> My point is: How can the userspace know the length of return data is shorter
> than input.certs_len when input.certs_len > sev->snp_serts_len? as the length
> is only returned when input.certs_len < sev->snp_certs_len.
The returned data has a defined format that can be used to calculate the
overall length.
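For reference, a sketch of how userspace might derive that length, assuming
the certificate-table layout from the GHCB spec (entries of 16-byte GUID,
32-bit offset, 32-bit length, terminated by an all-zero GUID entry, with
offsets relative to the start of the returned blob). Check the spec before
relying on this layout.

        #include <stdint.h>
        #include <stddef.h>
        #include <string.h>

        struct cert_table_entry {
                uint8_t  guid[16];
                uint32_t offset;
                uint32_t length;
        } __attribute__((packed));

        static size_t ext_certs_data_len(const void *blob)
        {
                static const uint8_t zero_guid[16];
                const struct cert_table_entry *e = blob;
                size_t end = 0;

                /* Walk the table until the zero-GUID terminator and track the
                 * furthest byte covered by any entry. */
                for (; memcmp(e->guid, zero_guid, sizeof(zero_guid)); e++) {
                        size_t entry_end = (size_t)e->offset + e->length;

                        if (entry_end > end)
                                end = entry_end;
                }

                return end;     /* 0 means an empty table */
        }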
Thanks,
Tom
>
>> Thanks,
>> Ashish
>>
>>>> + if (copy_to_user((void * __user)input.certs_address,
>>>> + sev->snp_certs_data, sev->snp_certs_len))
>>>> + return -EFAULT;
>>>> + }
>>>> +
>>>> + ret = 0;
>>>> +
>>>> +e_done:
>>>> + if (copy_to_user((void __user *)argp->data, &input, sizeof(input)))
>>>> + ret = -EFAULT;
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +static int sev_ioctl_snp_set_config(struct sev_issue_cmd *argp, bool writable)
>>>> +{
>>>> + struct sev_device *sev = psp_master->sev_data;
>>>> + struct sev_user_data_ext_snp_config input;
>>>> + struct sev_user_data_snp_config config;
>>>> + void *certs = NULL;
>>>> + int ret = 0;
>>>> +
>>>> + if (!sev->snp_initialized || !argp->data)
>>>> + return -EINVAL;
>>>> +
>>>> + if (!writable)
>>>> + return -EPERM;
>>>> +
>>>> + memset(&input, 0, sizeof(input));
>>>> +
>>>> + if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
>>>> + return -EFAULT;
>>>> +
>>>> + /* Copy the certs from userspace */
>>>> + if (input.certs_address) {
>>>> + if (!input.certs_len || !IS_ALIGNED(input.certs_len, PAGE_SIZE))
>>>> + return -EINVAL;
>>>> +
>>>> + certs = psp_copy_user_blob(input.certs_address, input.certs_len);
>>>> + if (IS_ERR(certs))
>>>> + return PTR_ERR(certs);
>>>> + }
>>>> +
>>>> + /* Issue the PSP command to update the TCB version using the SNP_CONFIG. */
>>>> + if (input.config_address) {
>>>> + memset(&config, 0, sizeof(config));
>>>> + if (copy_from_user(&config,
>>>> + (void __user *)input.config_address, sizeof(config))) {
>>>> + ret = -EFAULT;
>>>> + goto e_free;
>>>> + }
>>>> +
>>>> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_CONFIG, &config, &argp->error);
>>>> + if (ret)
>>>> + goto e_free;
>>>> +
>>>> + memcpy(&sev->snp_config, &config, sizeof(config));
>>>> + }
>>>> +
>>>> + /*
>>>> + * If the new certs are passed then cache it else free the old certs.
>>>> + */
>>>> + mutex_lock(&sev->snp_certs_lock);
>>>> + if (certs) {
>>>> + kfree(sev->snp_certs_data);
>>>> + sev->snp_certs_data = certs;
>>>> + sev->snp_certs_len = input.certs_len;
>>>> + } else {
>>>> + kfree(sev->snp_certs_data);
>>>> + sev->snp_certs_data = NULL;
>>>> + sev->snp_certs_len = 0;
>>>> + }
>>>> + mutex_unlock(&sev->snp_certs_lock);
>>>> +
>>>> + return 0;
>>>> +
>>>> +e_free:
>>>> + kfree(certs);
>>>> + return ret;
>>>> +}
>>>> +
>>>> static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
>>>> {
>>>> void __user *argp = (void __user *)arg;
>>>> @@ -1847,6 +1963,12 @@ static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
>>>> case SNP_PLATFORM_STATUS:
>>>> ret = sev_ioctl_snp_platform_status(&input);
>>>> break;
>>>> + case SNP_SET_EXT_CONFIG:
>>>> + ret = sev_ioctl_snp_set_config(&input, writable);
>>>> + break;
>>>> + case SNP_GET_EXT_CONFIG:
>>>> + ret = sev_ioctl_snp_get_config(&input);
>>>> + break;
>>>> default:
>>>> ret = -EINVAL;
>>>> goto out;
>>>> @@ -1962,6 +2084,7 @@ int sev_dev_init(struct psp_device *psp)
>>>> goto e_sev;
>>>>
>>>> sev->cmd_buf_backup = (uint8_t *)sev->cmd_buf + PAGE_SIZE;
>>>> + mutex_init(&sev->snp_certs_lock);
>>>>
>>>> psp->sev_data = sev;
>>>>
>>>> diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
>>>> index 19d79f9d4212..41d5353d5bab 100644
>>>> --- a/drivers/crypto/ccp/sev-dev.h
>>>> +++ b/drivers/crypto/ccp/sev-dev.h
>>>> @@ -66,6 +66,10 @@ struct sev_device {
>>>>
>>>> bool snp_initialized;
>>>> struct snp_host_map snp_host_map[MAX_SNP_HOST_MAP_BUFS];
>>>> + void *snp_certs_data;
>>>> + u32 snp_certs_len;
>>>> + struct mutex snp_certs_lock;
>>>> + struct sev_user_data_snp_config snp_config;
>>>> };
>>>>
>>>> int sev_dev_init(struct psp_device *psp);
>>>> diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
>>>> index 5adfaea7df97..c20d37586d21 100644
>>>> --- a/include/uapi/linux/psp-sev.h
>>>> +++ b/include/uapi/linux/psp-sev.h
>>>> @@ -29,6 +29,8 @@ enum {
>>>> SEV_GET_ID, /* This command is deprecated, use SEV_GET_ID2 */
>>>> SEV_GET_ID2,
>>>> SNP_PLATFORM_STATUS,
>>>> + SNP_SET_EXT_CONFIG,
>>>> + SNP_GET_EXT_CONFIG,
>>>>
>>>> SEV_MAX,
>>>> };
>>>> @@ -192,6 +194,21 @@ struct sev_user_data_snp_config {
>>>> __u8 rsvd1[52];
>>>> } __packed;
>>>>
>>>> +/**
>>>> + * struct sev_data_snp_ext_config - system wide configuration value for SNP.
>>>> + *
>>>> + * @config_address: address of the struct sev_user_data_snp_config or 0 when
>>>> + * reported_tcb does not need to be updated.
>>>> + * @certs_address: address of extended guest request certificate chain or
>>>> + * 0 when previous certificate should be removed on SNP_SET_EXT_CONFIG.
>>>> + * @certs_len: length of the certs
>>>> + */
>>>> +struct sev_user_data_ext_snp_config {
>>>> + __u64 config_address; /* In */
>>>> + __u64 certs_address; /* In */
>>>> + __u32 certs_len; /* In */
>>>> +};
>>>> +
>>>> /**
>>>> * struct sev_issue_cmd - SEV ioctl parameters
>>>> *
>>>
>
On Mon, 20 Feb 2023 12:38:23 -0600
Michael Roth <[email protected]> wrote:
> From: Brijesh Singh <[email protected]>
>
> The next generation of SEV is called SEV-SNP (Secure Nested Paging).
> SEV-SNP builds upon existing SEV and SEV-ES functionality while adding new
> hardware based security protection. SEV-SNP adds strong memory encryption
> integrity protection to help prevent malicious hypervisor-based attacks
> such as data replay, memory re-mapping, and more, to create an isolated
> execution environment.
>
> The SNP feature is added incrementally, the later patches adds a new module
> parameters that can be used to enabled SEV-SNP in the KVM.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 10 +++++++++-
> arch/x86/kvm/svm/svm.h | 8 ++++++++
> 2 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 9e9efb42a766..51db01b282eb 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -58,6 +58,9 @@ module_param_named(sev_es, sev_es_enabled, bool, 0444);
> #define sev_es_enabled false
> #endif /* CONFIG_KVM_AMD_SEV */
>
> +/* enable/disable SEV-SNP support */
> +static bool sev_snp_enabled;
> +
> #define AP_RESET_HOLD_NONE 0
> #define AP_RESET_HOLD_NAE_EVENT 1
> #define AP_RESET_HOLD_MSR_PROTO 2
> @@ -2306,6 +2309,7 @@ void __init sev_hardware_setup(void)
> {
> #ifdef CONFIG_KVM_AMD_SEV
> unsigned int eax, ebx, ecx, edx, sev_asid_count, sev_es_asid_count;
> + bool sev_snp_supported = false;
> bool sev_es_supported = false;
> bool sev_supported = false;
>
> @@ -2385,12 +2389,16 @@ void __init sev_hardware_setup(void)
> if (misc_cg_set_capacity(MISC_CG_RES_SEV_ES, sev_es_asid_count))
> goto out;
>
> - pr_info("SEV-ES supported: %u ASIDs\n", sev_es_asid_count);
> sev_es_supported = true;
> + sev_snp_supported = sev_snp_enabled && cpu_feature_enabled(X86_FEATURE_SEV_SNP);
> +
> + pr_info("SEV-ES %ssupported: %u ASIDs\n",
> + sev_snp_supported ? "and SEV-SNP " : "", sev_es_asid_count);
>
> out:
> sev_enabled = sev_supported;
> sev_es_enabled = sev_es_supported;
> + sev_snp_enabled = sev_snp_supported;
> #endif
> }
>
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 5efcf036ccad..8eb1b51e92f5 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -76,6 +76,7 @@ enum {
> struct kvm_sev_info {
> bool active; /* SEV enabled guest */
> bool es_active; /* SEV-ES enabled guest */
> + bool snp_active; /* SEV-SNP enabled guest */
> unsigned int asid; /* ASID used for this guest */
> unsigned int handle; /* SEV firmware handle */
> int fd; /* SEV device fd */
> @@ -323,6 +324,13 @@ static __always_inline bool sev_es_guest(struct kvm *kvm)
> #endif
> }
>
> +static inline bool sev_snp_guest(struct kvm *kvm)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +
> + return sev_es_guest(kvm) && sev->snp_active;
> +}
> +
Maybe also use __always_inline like sev_es_guest() above?
It seems that resolved some warnings before:
https://lore.kernel.org/all/[email protected]/
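For reference, the suggested form would simply mirror sev_es_guest(); a sketch
of the quoted helper with the attribute applied:

        static __always_inline bool sev_snp_guest(struct kvm *kvm)
        {
                struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;

                return sev_es_guest(kvm) && sev->snp_active;
        }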
> static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
> {
> vmcb->control.clean = 0;
On Mon, 20 Feb 2023 12:38:25 -0600
Michael Roth <[email protected]> wrote:
> From: Brijesh Singh <[email protected]>
>
> KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
> The command initializes a cryptographic digest context used to construct
> the measurement of the guest. If the guest is expected to be migrated,
> the command also binds a migration agent (MA) to the guest.
>
> For more information see the SEV-SNP specification.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> .../virt/kvm/x86/amd-memory-encryption.rst | 24 ++++
> arch/x86/kvm/svm/sev.c | 121 +++++++++++++++++-
> arch/x86/kvm/svm/svm.h | 1 +
> include/uapi/linux/kvm.h | 10 ++
> 4 files changed, 153 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> index 2432213bd0ea..58971fc02a15 100644
> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> @@ -461,6 +461,30 @@ The flags bitmap is defined as::
> If the specified flags is not supported then return -EOPNOTSUPP, and the supported
> flags are returned.
>
> +19. KVM_SNP_LAUNCH_START
> +------------------------
> +
> +The KVM_SNP_LAUNCH_START command is used for creating the memory encryption
> +context for the SEV-SNP guest. To create the encryption context, user must
> +provide a guest policy, migration agent (if any) and guest OS visible
> +workarounds value as defined SEV-SNP specification.
> +
> +Parameters (in): struct kvm_snp_launch_start
> +
> +Returns: 0 on success, -negative on error
> +
> +::
> +
> + struct kvm_sev_snp_launch_start {
> + __u64 policy; /* Guest policy to use. */
> + __u64 ma_uaddr; /* userspace address of migration agent */
> + __u8 ma_en; /* 1 if the migration agent is enabled */
> + __u8 imi_en; /* set IMI to 1. */
> + __u8 gosvw[16]; /* guest OS visible workarounds */
> + };
> +
> +See the SEV-SNP specification for further detail on the launch input.
> +
> References
> ==========
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index a8efe1f6bf77..097bb2138360 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -22,6 +22,7 @@
> #include <asm/pkru.h>
> #include <asm/trapnr.h>
> #include <asm/fpu/xcr.h>
> +#include <asm/sev.h>
>
> #include "mmu.h"
> #include "x86.h"
> @@ -75,6 +76,8 @@ static unsigned int nr_asids;
> static unsigned long *sev_asid_bitmap;
> static unsigned long *sev_reclaim_asid_bitmap;
>
> +static int snp_decommission_context(struct kvm *kvm);
> +
> struct enc_region {
> struct list_head list;
> unsigned long npages;
> @@ -100,12 +103,17 @@ static int sev_flush_asids(int min_asid, int max_asid)
> down_write(&sev_deactivate_lock);
>
> wbinvd_on_all_cpus();
> - ret = sev_guest_df_flush(&error);
> +
> + if (sev_snp_enabled)
> + ret = sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, &error);
> + else
> + ret = sev_guest_df_flush(&error);
>
> up_write(&sev_deactivate_lock);
>
> if (ret)
> - pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error);
> + pr_err("SEV%s: DF_FLUSH failed, ret=%d, error=%#x\n",
> + sev_snp_enabled ? "-SNP" : "", ret, error);
>
> return ret;
> }
> @@ -2011,6 +2019,80 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
> return ret;
> }
>
> +/*
> + * The guest context contains all the information, keys and metadata
> + * associated with the guest that the firmware tracks to implement SEV
> + * and SNP features. The firmware stores the guest context in a
> + * hypervisor-provided page via the SNP_GCTX_CREATE command.
> + */
> +static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct sev_data_snp_addr data = {};
> + void *context;
> + int rc;
> +
> + /* Allocate memory for context page */
> + context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
> + if (!context)
> + return NULL;
> +
> + data.gctx_paddr = __psp_pa(context);
> + rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
> + if (rc) {
> + snp_free_firmware_page(context);
> + return NULL;
> + }
> +
> + return context;
> +}
> +
> +static int snp_bind_asid(struct kvm *kvm, int *error)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_activate data = {0};
> +
> + data.gctx_paddr = __psp_pa(sev->snp_context);
> + data.asid = sev_get_asid(kvm);
> + return sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
According to the SNP ABI specification[1] 8.10 SNP_ACTIVATE:
"The firmware checks that a DF_FLUSH is not required. If a DF_FLUSH is
required, the firmware returns DFFLUSH_REQUIRED. Note that all ASIDs are
marked to require a DF_FLUSH at reset."
Do we need an SNP_DF_FLUSH here before calling SNP_ACTIVATE, or to handle the
situation where the PSP firmware returns DFFLUSH_REQUIRED?
[1] https://www.amd.com/system/files/TechDocs/56860.pdf
> +}
> +
> +static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_launch_start start = {0};
> + struct kvm_sev_snp_launch_start params;
> + int rc;
> +
> + if (!sev_snp_guest(kvm))
> + return -ENOTTY;
> +
> + if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params)))
> + return -EFAULT;
> +
> + sev->snp_context = snp_context_create(kvm, argp);
> + if (!sev->snp_context)
> + return -ENOTTY;
> +
> + start.gctx_paddr = __psp_pa(sev->snp_context);
> + start.policy = params.policy;
> + memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw));
> + rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_START, &start, &argp->error);
> + if (rc)
> + goto e_free_context;
> +
> + sev->fd = argp->sev_fd;
> + rc = snp_bind_asid(kvm, &argp->error);
> + if (rc)
> + goto e_free_context;
> +
> + return 0;
> +
> +e_free_context:
> + snp_decommission_context(kvm);
> +
> + return rc;
> +}
> +
> int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_sev_cmd sev_cmd;
> @@ -2101,6 +2183,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> case KVM_SEV_RECEIVE_FINISH:
> r = sev_receive_finish(kvm, &sev_cmd);
> break;
> + case KVM_SEV_SNP_LAUNCH_START:
> + r = snp_launch_start(kvm, &sev_cmd);
> + break;
> default:
> r = -EINVAL;
> goto out;
> @@ -2292,6 +2377,28 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
> return ret;
> }
>
> +static int snp_decommission_context(struct kvm *kvm)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_addr data = {};
> + int ret;
> +
> + /* If context is not created then do nothing */
> + if (!sev->snp_context)
> + return 0;
> +
> + data.gctx_paddr = __sme_pa(sev->snp_context);
> + ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
> + if (WARN_ONCE(ret, "failed to release guest context"))
> + return ret;
> +
> + /* free the context page now */
> + snp_free_firmware_page(sev->snp_context);
> + sev->snp_context = NULL;
> +
> + return 0;
> +}
> +
> void sev_vm_destroy(struct kvm *kvm)
> {
> struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> @@ -2333,7 +2440,15 @@ void sev_vm_destroy(struct kvm *kvm)
> }
> }
>
> - sev_unbind_asid(kvm, sev->handle);
> + if (sev_snp_guest(kvm)) {
> + if (snp_decommission_context(kvm)) {
> + WARN_ONCE(1, "Failed to free SNP guest context, leaking asid!\n");
> + return;
> + }
> + } else {
> + sev_unbind_asid(kvm, sev->handle);
> + }
> +
> sev_asid_free(sev);
> }
>
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 56a5c96d8a36..740969b57425 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -92,6 +92,7 @@ struct kvm_sev_info {
> struct misc_cg *misc_cg; /* For misc cgroup accounting */
> atomic_t migration_in_progress;
> u64 snp_init_flags;
> + void *snp_context; /* SNP guest context page */
> };
>
> struct kvm_svm {
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 499cc323f793..cf19799ca5ce 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1919,6 +1919,7 @@ enum sev_cmd_id {
>
> /* SNP specific commands */
> KVM_SEV_SNP_INIT,
> + KVM_SEV_SNP_LAUNCH_START,
>
> KVM_SEV_NR_MAX,
> };
> @@ -2026,6 +2027,15 @@ struct kvm_snp_init {
> __u64 flags;
> };
>
> +struct kvm_sev_snp_launch_start {
> + __u64 policy;
> + __u64 ma_uaddr;
> + __u8 ma_en;
> + __u8 imi_en;
> + __u8 gosvw[16];
> + __u8 pad[6];
> +};
> +
> #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
> #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
> #define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
On 20.02.23 19:38, Michael Roth wrote:
> From: Brijesh Singh <[email protected]>
>
> Version 2 of the GHCB specification added support for two SNP Guest
> Request Message NAE events. These events allow an SEV-SNP guest to
> make requests to the SEV-SNP firmware through the hypervisor using the
> SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification.
>
> The SNP_EXT_GUEST_REQUEST is similar to SNP_GUEST_REQUEST, with the
> addition of a certificate blob that can be passed through the
> SNP_SET_CONFIG ioctl defined in the CCP driver. The CCP driver
> provides snp_guest_ext_guest_request(), which is used by KVM to get
> both the report and certificate data at once.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 185 +++++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/svm/svm.h | 2 +
> 2 files changed, 181 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 197b1f904567..92179614102e 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -327,6 +327,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
> if (ret)
> goto e_free;
>
> + mutex_init(&sev->guest_req_lock);
> ret = sev_snp_init(&argp->error, false);
> } else {
> ret = sev_platform_init(&argp->error);
> @@ -2059,23 +2060,34 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
> */
> static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
> {
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> struct sev_data_snp_addr data = {};
> - void *context;
> + void *context, *certs_data;
> int rc;
>
> + /* Allocate memory used for the certs data in SNP guest request */
> + certs_data = kzalloc(SEV_FW_BLOB_MAX_SIZE, GFP_KERNEL_ACCOUNT);
> + if (!certs_data)
> + return NULL;
I don't understand why this is part of the context creation, which again
is part of the KVM_SEV_SNP_LAUNCH_START op. Would you mind creating a
separate op for this and then checking later on, when you use the buffer,
whether it was ever allocated?
Alex
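For illustration, a rough sketch of the check-at-use alternative (hypothetical placement in snp_handle_ext_guest_request(), names reused from this series; the error code choice is illustrative):

        /* Allocate the certs buffer lazily, on the first extended guest request. */
        if (!sev->snp_certs_data) {
                sev->snp_certs_data = kzalloc(SEV_FW_BLOB_MAX_SIZE, GFP_KERNEL_ACCOUNT);
                if (!sev->snp_certs_data) {
                        rc = SEV_RET_INVALID_ADDRESS;
                        goto unlock;
                }
        }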
On Mon, 20 Feb 2023 12:38:26 -0600
Michael Roth <[email protected]> wrote:
> From: Brijesh Singh <[email protected]>
>
> The KVM_SEV_SNP_LAUNCH_UPDATE command can be used to insert data into the
> guest's memory. The data is encrypted with the cryptographic context
> created with the KVM_SEV_SNP_LAUNCH_START.
>
> In addition to inserting data, it can insert two special pages
> into the guest's memory: the secrets page and the CPUID page.
>
> While terminating the guest, reclaim the guest pages added to the RMP
> table. If the reclaim fails, the page is no longer safe to release back
> to the system, so leak it.
>
> For more information see the SEV-SNP specification.
>
> Co-developed-by: Michael Roth <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> ---
> .../virt/kvm/x86/amd-memory-encryption.rst | 29 +++
> arch/x86/kvm/svm/sev.c | 190 ++++++++++++++++++
> include/uapi/linux/kvm.h | 19 ++
> 3 files changed, 238 insertions(+)
>
> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> index 58971fc02a15..c94be8e6d657 100644
> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> @@ -485,6 +485,35 @@ Returns: 0 on success, -negative on error
>
> See the SEV-SNP specification for further detail on the launch input.
>
> +20. KVM_SNP_LAUNCH_UPDATE
> +-------------------------
> +
> +The KVM_SNP_LAUNCH_UPDATE is used for encrypting a memory region. It also
> +calculates a measurement of the memory contents. The measurement is a signature
> +of the memory contents that can be sent to the guest owner as an attestation
> +that the memory was encrypted correctly by the firmware.
> +
> +Parameters (in): struct kvm_snp_launch_update
> +
> +Returns: 0 on success, -negative on error
> +
> +::
> +
> + struct kvm_sev_snp_launch_update {
> + __u64 start_gfn; /* Guest page number to start from. */
> + __u64 uaddr; /* userspace address need to be encrypted */
> + __u32 len; /* length of memory region */
> + __u8 imi_page; /* 1 if memory is part of the IMI */
> + __u8 page_type; /* page type */
> + __u8 vmpl3_perms; /* VMPL3 permission mask */
> + __u8 vmpl2_perms; /* VMPL2 permission mask */
> + __u8 vmpl1_perms; /* VMPL1 permission mask */
> + };
> +
> +See the SEV-SNP spec for further details on how to build the VMPL permission
> +mask and page type.
> +
> +
> References
> ==========
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 097bb2138360..03dd227f6090 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -234,6 +234,37 @@ static void sev_decommission(unsigned int handle)
> sev_guest_decommission(&decommission, NULL);
> }
>
> +static int snp_page_reclaim(u64 pfn)
> +{
> + struct sev_data_snp_page_reclaim data = {0};
> + int err, rc;
> +
> + data.paddr = __sme_set(pfn << PAGE_SHIFT);
> + rc = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
> + if (rc) {
> + /*
> + * If the reclaim failed, then page is no longer safe
> + * to use.
> + */
> + snp_mark_pages_offline(pfn,
> + page_level_size(PG_LEVEL_4K) >> PAGE_SHIFT);
> + }
> +
> + return rc;
> +}
> +
> +static int host_rmp_make_shared(u64 pfn, enum pg_level level, bool leak)
> +{
> + int rc;
> +
> + rc = rmp_make_shared(pfn, level);
> + if (rc && leak)
> + snp_mark_pages_offline(pfn,
> + page_level_size(level) >> PAGE_SHIFT);
> +
> + return rc;
> +}
> +
PATCH 24 has similar functions. It would be better to expose them.
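e.g. by dropping the static qualifiers and adding shared declarations that both call sites can use (sketch only):

/* arch/x86/kvm/svm/svm.h (sketch) */
int snp_page_reclaim(u64 pfn);
int host_rmp_make_shared(u64 pfn, enum pg_level level, bool leak);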
> static void sev_unbind_asid(struct kvm *kvm, unsigned int handle)
> {
> struct sev_data_deactivate deactivate;
> @@ -2093,6 +2124,162 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
> return rc;
> }
>
> +static int snp_launch_update_gfn_handler(struct kvm *kvm,
> + struct kvm_gfn_range *range,
> + void *opaque)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct kvm_memory_slot *memslot = range->slot;
> + struct sev_data_snp_launch_update data = {0};
> + struct kvm_sev_snp_launch_update params;
> + struct kvm_sev_cmd *argp = opaque;
> + int *error = &argp->error;
> + int i, n = 0, ret = 0;
> + unsigned long npages;
> + kvm_pfn_t *pfns;
> + gfn_t gfn;
> +
> + if (!kvm_slot_can_be_private(memslot)) {
> + pr_err("SEV-SNP requires restricted memory.\n");
> + return -EINVAL;
> + }
> +
> + if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params))) {
> + pr_err("Failed to copy user parameters for SEV-SNP launch.\n");
> + return -EFAULT;
> + }
> +
> + data.gctx_paddr = __psp_pa(sev->snp_context);
> +
> + npages = range->end - range->start;
> + pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL_ACCOUNT);
> + if (!pfns)
> + return -ENOMEM;
> +
> + pr_debug("%s: GFN range 0x%llx-0x%llx, type %d\n", __func__,
> + range->start, range->end, params.page_type);
> +
> + for (gfn = range->start, i = 0; gfn < range->end; gfn++, i++) {
> + int order, level;
> + void *kvaddr;
> +
> + ret = kvm_restrictedmem_get_pfn(memslot, gfn, &pfns[i], &order);
> + if (ret)
> + goto e_release;
> +
> + n++;
> + ret = snp_lookup_rmpentry((u64)pfns[i], &level);
> + if (ret) {
> + pr_err("Failed to ensure GFN 0x%llx is in initial shared state, ret: %d\n",
> + gfn, ret);
> + return -EFAULT;
> + }
> +
> + kvaddr = pfn_to_kaddr(pfns[i]);
> + if (!virt_addr_valid(kvaddr)) {
> + pr_err("Invalid HVA 0x%llx for GFN 0x%llx\n", (uint64_t)kvaddr, gfn);
> + ret = -EINVAL;
> + goto e_release;
> + }
> +
> + ret = kvm_read_guest_page(kvm, gfn, kvaddr, 0, PAGE_SIZE);
> + if (ret) {
> + pr_err("Guest read failed, ret: 0x%x\n", ret);
> + goto e_release;
> + }
> +
> + ret = rmp_make_private(pfns[i], gfn << PAGE_SHIFT, PG_LEVEL_4K,
> + sev_get_asid(kvm), true);
> + if (ret) {
> + ret = -EFAULT;
> + goto e_release;
> + }
> +
> + data.address = __sme_set(pfns[i] << PAGE_SHIFT);
> + data.page_size = X86_TO_RMP_PG_LEVEL(PG_LEVEL_4K);
> + data.page_type = params.page_type;
> + data.vmpl3_perms = params.vmpl3_perms;
> + data.vmpl2_perms = params.vmpl2_perms;
> + data.vmpl1_perms = params.vmpl1_perms;
> + ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
> + &data, error);
> + if (ret) {
> + pr_err("SEV-SNP launch update failed, ret: 0x%x, fw_error: 0x%x\n",
> + ret, *error);
> + snp_page_reclaim(pfns[i]);
> +
> + /*
> + * When invalid CPUID function entries are detected, the firmware
> + * corrects these entries for debugging purposes and leaves the
> + * page unencrypted so it can be provided to users for debugging
> + * and error-reporting.
> + *
> + * Copy the corrected CPUID page back to shared memory so
> + * userspace can retrieve this information.
> + */
> + if (params.page_type == SNP_PAGE_TYPE_CPUID &&
> + *error == SEV_RET_INVALID_PARAM) {
> + int ret;
> +
> + host_rmp_make_shared(pfns[i], PG_LEVEL_4K, true);
> +
> + ret = kvm_write_guest_page(kvm, gfn, kvaddr, 0, PAGE_SIZE);
> + if (ret)
> + pr_err("Failed to write CPUID page back to userspace, ret: 0x%x\n",
> + ret);
> + }
> +
> +
> + goto e_release;
> + }
> + }
> +
> + /*
> + * Memory attribute updates via KVM_SET_MEMORY_ATTRIBUTES are serialized
> + * via kvm->slots_lock, so use the same protocol for updating them here.
> + */
> + mutex_lock(&kvm->slots_lock);
> + kvm_vm_set_region_attr(kvm, range->start, range->end, KVM_MEMORY_ATTRIBUTE_PRIVATE);
> + mutex_unlock(&kvm->slots_lock);
> +
> +e_release:
> + /* Content of memory is updated, mark pages dirty */
> + for (i = 0; i < n; i++) {
> + set_page_dirty(pfn_to_page(pfns[i]));
> + mark_page_accessed(pfn_to_page(pfns[i]));
> +
> + /*
> + * If its an error, then update RMP entry to change page ownership
> + * to the hypervisor.
> + */
> + if (ret)
> + host_rmp_make_shared(pfns[i], PG_LEVEL_4K, true);
> +
> + put_page(pfn_to_page(pfns[i]));
> + }
> +
> + kvfree(pfns);
> + return ret;
> +}
> +
> +static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct kvm_sev_snp_launch_update params;
> +
> + if (!sev_snp_guest(kvm))
> + return -ENOTTY;
> +
> + if (!sev->snp_context)
> + return -EINVAL;
> +
> + if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data, sizeof(params)))
> + return -EFAULT;
> +
> + return kvm_vm_do_hva_range_op(kvm, params.uaddr, params.uaddr + params.len,
> + snp_launch_update_gfn_handler, argp);
> +}
> +
> int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_sev_cmd sev_cmd;
> @@ -2186,6 +2373,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> case KVM_SEV_SNP_LAUNCH_START:
> r = snp_launch_start(kvm, &sev_cmd);
> break;
> + case KVM_SEV_SNP_LAUNCH_UPDATE:
> + r = snp_launch_update(kvm, &sev_cmd);
> + break;
> default:
> r = -EINVAL;
> goto out;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index cf19799ca5ce..4098bba17aa4 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1920,6 +1920,7 @@ enum sev_cmd_id {
> /* SNP specific commands */
> KVM_SEV_SNP_INIT,
> KVM_SEV_SNP_LAUNCH_START,
> + KVM_SEV_SNP_LAUNCH_UPDATE,
>
> KVM_SEV_NR_MAX,
> };
> @@ -2036,6 +2037,24 @@ struct kvm_sev_snp_launch_start {
> __u8 pad[6];
> };
>
> +#define KVM_SEV_SNP_PAGE_TYPE_NORMAL 0x1
> +#define KVM_SEV_SNP_PAGE_TYPE_VMSA 0x2
> +#define KVM_SEV_SNP_PAGE_TYPE_ZERO 0x3
> +#define KVM_SEV_SNP_PAGE_TYPE_UNMEASURED 0x4
> +#define KVM_SEV_SNP_PAGE_TYPE_SECRETS 0x5
> +#define KVM_SEV_SNP_PAGE_TYPE_CPUID 0x6
> +
> +struct kvm_sev_snp_launch_update {
> + __u64 start_gfn;
> + __u64 uaddr;
> + __u32 len;
> + __u8 imi_page;
> + __u8 page_type;
> + __u8 vmpl3_perms;
> + __u8 vmpl2_perms;
> + __u8 vmpl1_perms;
> +};
> +
> #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
> #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
> #define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
On 20.02.23 19:38, Michael Roth wrote:
> From: Tom Lendacky <[email protected]>
>
> Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
> guests to alter the register state of the APs on their own. This allows
> the guest a way of simulating INIT-SIPI.
>
> A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
> so as to avoid updating the VMSA pointer while the vCPU is running.
>
> For CREATE
> The guest supplies the GPA of the VMSA to be used for the vCPU with
> the specified APIC ID. The GPA is saved in the svm struct of the
> target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
> to the vCPU and then the vCPU is kicked.
>
> For CREATE_ON_INIT:
> The guest supplies the GPA of the VMSA to be used for the vCPU with
> the specified APIC ID the next time an INIT is performed. The GPA is
> saved in the svm struct of the target vCPU.
>
> For DESTROY:
> The guest indicates it wishes to stop the vCPU. The GPA is cleared
> from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is
> added to vCPU and then the vCPU is kicked.
>
> The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked
> as a result of the event or as a result of an INIT. The handler sets the
> vCPU to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will
> leave the vCPU as not runnable. Any previous VMSA pages that were
> installed as part of an SEV-SNP AP Creation NAE event are un-pinned. If
> a new VMSA is to be installed, the VMSA guest page is pinned and set as
> the VMSA in the vCPU VMCB and the vCPU state is set to
> KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is
> cleared in the vCPU VMCB and the vCPU state is left as
> KVM_MP_STATE_UNINITIALIZED to prevent it from being run.
>
> Signed-off-by: Tom Lendacky <[email protected]>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> [mdr: add handling for restrictedmem]
> Signed-off-by: Michael Roth <[email protected]>
What is the intended boot sequence for SEV-SNP guests? FWIW with this
interface in place, guests will typically use in-guest VMSA pages to
hold secondary vcpu state. But that means we're now allocating 4kb of
memory for every vcpu we create that will be superfluous for most of the
guest's lifetime.
Wouldn't it make more sense to have a model where we only allocate the
VMSA for the boot CPU and leave secondary allocation to the guest? We
already need firmware changes for SEV-SNP - may as well make this one more.
[...]
> +
> +static int sev_snp_ap_creation(struct vcpu_svm *svm)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
> + struct kvm_vcpu *vcpu = &svm->vcpu;
> + struct kvm_vcpu *target_vcpu;
> + struct vcpu_svm *target_svm;
> + unsigned int request;
> + unsigned int apic_id;
> + bool kick;
> + int ret;
> +
> + request = lower_32_bits(svm->vmcb->control.exit_info_1);
> + apic_id = upper_32_bits(svm->vmcb->control.exit_info_1);
> +
> + /* Validate the APIC ID */
> + target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id);
Out of curiosity: The target CPU can be my own vCPU, right?
> + if (!target_vcpu) {
> + vcpu_unimpl(vcpu, "vmgexit: invalid AP APIC ID [%#x] from guest\n",
> + apic_id);
> + return -EINVAL;
> + }
> +
> + ret = 0;
> +
> + target_svm = to_svm(target_vcpu);
> +
> + /*
> + * The target vCPU is valid, so the vCPU will be kicked unless the
> + * request is for CREATE_ON_INIT. For any errors at this stage, the
> + * kick will place the vCPU in a non-runnable state.
> + */
> + kick = true;
> +
> + mutex_lock(&target_svm->sev_es.snp_vmsa_mutex);
> +
> + target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE;
> + target_svm->sev_es.snp_ap_create = true;
> +
> + /* Interrupt injection mode shouldn't change for AP creation */
> + if (request < SVM_VMGEXIT_AP_DESTROY) {
> + u64 sev_features;
> +
> + sev_features = vcpu->arch.regs[VCPU_REGS_RAX];
> + sev_features ^= sev->sev_features;
> + if (sev_features & SVM_SEV_FEAT_INT_INJ_MODES) {
> + vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n",
> + vcpu->arch.regs[VCPU_REGS_RAX]);
> + ret = -EINVAL;
> + goto out;
> + }
> + }
> +
> + switch (request) {
> + case SVM_VMGEXIT_AP_CREATE_ON_INIT:
> + kick = false;
> + fallthrough;
> + case SVM_VMGEXIT_AP_CREATE:
> + if (!page_address_valid(vcpu, svm->vmcb->control.exit_info_2)) {
> + vcpu_unimpl(vcpu, "vmgexit: invalid AP VMSA address [%#llx] from guest\n",
> + svm->vmcb->control.exit_info_2);
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /*
> + * A malicious guest can RMPADJUST a large page into a VMSA, which
> + * hits the SNP erratum where the CPU incorrectly signals an RMP
> + * violation #PF if a hugepage collides with the RMP entry of the
> + * VMSA page. Reject the AP CREATE request if the VMSA address from
> + * the guest is 2M-aligned.
This will break genuine current Linux kernels that just happen to
allocate a guest page, no? In fact, given enough vCPUs you're almost
guaranteed to hit an aligned structure somewhere. What is the guest
supposed to do in that situation?
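For reference, the guest-side mitigation used for this erratum is to over-allocate so the VMSA page handed to the hypervisor can never be 2M-aligned; a sketch of that approach (guest code, not part of this series):

static void *snp_alloc_vmsa_page(void)
{
        struct page *p;

        /*
         * Allocate an 8K (order-1) block, which is 8K-aligned, then keep
         * only the second 4K page: since the block starts on an even PFN,
         * the second page can never sit on a 2M boundary.
         */
        p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
        if (!p)
                return NULL;

        split_page(p, 1);
        __free_page(p);

        return page_address(p + 1);
}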
> + */
> + if (IS_ALIGNED(svm->vmcb->control.exit_info_2, PMD_SIZE)) {
> + vcpu_unimpl(vcpu,
> + "vmgexit: AP VMSA address [%llx] from guest is unsafe as it is 2M aligned\n",
> + svm->vmcb->control.exit_info_2);
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + target_svm->sev_es.snp_vmsa_gpa = svm->vmcb->control.exit_info_2;
> + break;
> + case SVM_VMGEXIT_AP_DESTROY:
I don't understand the destroy path. Why does this case destroy anything?
> + break;
> + default:
> + vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n",
> + request);
> + ret = -EINVAL;
> + break;
> + }
> +
> +out:
> + if (kick) {
> + if (target_vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)
> + target_vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
What if the guest AP goes through a create -> destroy -> create cycle?
Will it stay runnable while destroyed?
Alex
> +
> + kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, target_vcpu);
> + kvm_vcpu_kick(target_vcpu);
> + }
> +
> + mutex_unlock(&target_svm->sev_es.snp_vmsa_mutex);
> +
> + return ret;
> +}
> +
> static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
> {
> struct vmcb_control_area *control = &svm->vmcb->control;
On Mon, 20 Feb 2023 12:38:32 -0600
Michael Roth <[email protected]> wrote:
> From: Brijesh Singh <[email protected]>
>
> SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
> table to be private or shared using the Page State Change MSR protocol
> as defined in the GHCB specification.
>
> Forward these requests to userspace via KVM_EXIT_VMGEXIT so the VMM can
> issue the KVM ioctls to update the page state accordingly.
>
It would be better to describe the design rationale, e.g. why the
page state change VMGEXIT should be forwarded to userspace instead of being
handled in the kernel.
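To make the split concrete, here is a minimal sketch of the userspace side decoding a forwarded PSC MSR request (this assumes the kvm_run::vmgexit field proposed by this series; the constants mirror the GHCB_MSR_PSC_* definitions in the hunk below):

#include <stdint.h>
#include <stdio.h>

#define GHCB_MSR_PSC_GFN_POS    12
#define GHCB_MSR_PSC_GFN_MASK   0xffffffffffULL         /* GENMASK_ULL(39, 0) */
#define GHCB_MSR_PSC_OP_POS     52
#define GHCB_MSR_PSC_OP_MASK    0xfULL

/* Called when kvm_run->exit_reason == KVM_EXIT_VMGEXIT. */
static void handle_psc_msr_request(uint64_t ghcb_msr)
{
        uint64_t gfn = (ghcb_msr >> GHCB_MSR_PSC_GFN_POS) & GHCB_MSR_PSC_GFN_MASK;
        uint64_t op  = (ghcb_msr >> GHCB_MSR_PSC_OP_POS)  & GHCB_MSR_PSC_OP_MASK;

        /*
         * A VMM would flip the private/shared attribute for this GFN (e.g.
         * via KVM_SET_MEMORY_ATTRIBUTES) before re-entering the vCPU, so
         * snp_complete_psc_msr_protocol() can report success to the guest.
         */
        printf("PSC request: gfn=%#llx op=%llu (1=private, 2=shared)\n",
               (unsigned long long)gfn, (unsigned long long)op);
}

int main(void)
{
        /* Example: guest asks to flip GFN 0x7f000 to private (op 1). */
        handle_psc_msr_request(((uint64_t)1 << 52) | ((uint64_t)0x7f000 << 12) | 0x014);
        return 0;
}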
> Co-developed-by: Michael Roth <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> ---
> arch/x86/include/asm/sev-common.h | 9 ++++++++
> arch/x86/kvm/svm/sev.c | 25 +++++++++++++++++++++++
> arch/x86/kvm/trace.h | 34 +++++++++++++++++++++++++++++++
> arch/x86/kvm/x86.c | 1 +
> 4 files changed, 69 insertions(+)
>
> diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
> index 0a9055cdfae2..ee38f7408470 100644
> --- a/arch/x86/include/asm/sev-common.h
> +++ b/arch/x86/include/asm/sev-common.h
> @@ -93,6 +93,10 @@ enum psc_op {
> };
>
> #define GHCB_MSR_PSC_REQ 0x014
> +#define GHCB_MSR_PSC_GFN_POS 12
> +#define GHCB_MSR_PSC_GFN_MASK GENMASK_ULL(39, 0)
> +#define GHCB_MSR_PSC_OP_POS 52
> +#define GHCB_MSR_PSC_OP_MASK 0xf
> #define GHCB_MSR_PSC_REQ_GFN(gfn, op) \
> /* GHCBData[55:52] */ \
> (((u64)((op) & 0xf) << 52) | \
> @@ -102,6 +106,11 @@ enum psc_op {
> GHCB_MSR_PSC_REQ)
>
> #define GHCB_MSR_PSC_RESP 0x015
> +#define GHCB_MSR_PSC_ERROR_POS 32
> +#define GHCB_MSR_PSC_ERROR_MASK GENMASK_ULL(31, 0)
> +#define GHCB_MSR_PSC_ERROR GENMASK_ULL(31, 0)
> +#define GHCB_MSR_PSC_RSVD_POS 12
> +#define GHCB_MSR_PSC_RSVD_MASK GENMASK_ULL(19, 0)
> #define GHCB_MSR_PSC_RESP_VAL(val) \
> /* GHCBData[63:32] */ \
> (((u64)(val) & GENMASK_ULL(63, 32)) >> 32)
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 2613311f4fcc..a1a2686dde7b 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -30,6 +30,7 @@
> #include "svm_ops.h"
> #include "cpuid.h"
> #include "trace.h"
> +#include "mmu.h"
>
> #ifndef CONFIG_KVM_AMD_SEV
> /*
> @@ -3345,6 +3346,23 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
> svm->vmcb->control.ghcb_gpa = value;
> }
>
> +/*
> + * TODO: need to get the value set by userspace in vcpu->run->vmgexit.ghcb_msr
> + * and process that here accordingly.
> + */
> +static int snp_complete_psc_msr_protocol(struct kvm_vcpu *vcpu)
> +{
> + struct vcpu_svm *svm = to_svm(vcpu);
> +
> + set_ghcb_msr_bits(svm, 0,
> + GHCB_MSR_PSC_ERROR_MASK, GHCB_MSR_PSC_ERROR_POS);
> +
> + set_ghcb_msr_bits(svm, 0, GHCB_MSR_PSC_RSVD_MASK, GHCB_MSR_PSC_RSVD_POS);
> + set_ghcb_msr_bits(svm, GHCB_MSR_PSC_RESP, GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
> +
> + return 1; /* resume */
> +}
> +
> static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
> {
> struct vmcb_control_area *control = &svm->vmcb->control;
> @@ -3445,6 +3463,13 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
> GHCB_MSR_INFO_POS);
> break;
> }
> + case GHCB_MSR_PSC_REQ:
> + vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
> + vcpu->run->vmgexit.ghcb_msr = control->ghcb_gpa;
> + vcpu->arch.complete_userspace_io = snp_complete_psc_msr_protocol;
> +
> + ret = -1;
> + break;
> case GHCB_MSR_TERM_REQ: {
> u64 reason_set, reason_code;
>
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index 83843379813e..65861d2d086c 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -7,6 +7,7 @@
> #include <asm/svm.h>
> #include <asm/clocksource.h>
> #include <asm/pvclock-abi.h>
> +#include <asm/sev-common.h>
>
> #undef TRACE_SYSTEM
> #define TRACE_SYSTEM kvm
> @@ -1831,6 +1832,39 @@ TRACE_EVENT(kvm_vmgexit_msr_protocol_exit,
> __entry->vcpu_id, __entry->ghcb_gpa, __entry->result)
> );
>
> +/*
> + * Tracepoint for the SEV-SNP page state change processing
> + */
> +#define psc_operation \
> + {SNP_PAGE_STATE_PRIVATE, "private"}, \
> + {SNP_PAGE_STATE_SHARED, "shared"} \
> +
> +TRACE_EVENT(kvm_snp_psc,
> + TP_PROTO(unsigned int vcpu_id, u64 pfn, u64 gpa, u8 op, int level),
> + TP_ARGS(vcpu_id, pfn, gpa, op, level),
> +
> + TP_STRUCT__entry(
> + __field(int, vcpu_id)
> + __field(u64, pfn)
> + __field(u64, gpa)
> + __field(u8, op)
> + __field(int, level)
> + ),
> +
> + TP_fast_assign(
> + __entry->vcpu_id = vcpu_id;
> + __entry->pfn = pfn;
> + __entry->gpa = gpa;
> + __entry->op = op;
> + __entry->level = level;
> + ),
> +
> + TP_printk("vcpu %u, pfn %llx, gpa %llx, op %s, level %d",
> + __entry->vcpu_id, __entry->pfn, __entry->gpa,
> + __print_symbolic(__entry->op, psc_operation),
> + __entry->level)
> +);
> +
> #endif /* _TRACE_KVM_H */
>
> #undef TRACE_INCLUDE_PATH
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 268c3d16894d..0154fc7a28c1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13515,6 +13515,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_enter);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_exit);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_enter);
> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_snp_psc);
>
> static int __init kvm_x86_init(void)
> {
On 2/23/23 15:41, Zhi Wang wrote:
> On Mon, 20 Feb 2023 12:38:25 -0600
> Michael Roth <[email protected]> wrote:
>
>> [...]
>> +static int snp_bind_asid(struct kvm *kvm, int *error)
>> +{
>> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> + struct sev_data_snp_activate data = {0};
>> +
>> + data.gctx_paddr = __psp_pa(sev->snp_context);
>> + data.asid = sev_get_asid(kvm);
>> + return sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
>
> According to the SNP ABI specification[1] 8.10 SNP_ACTIVATE:
>
> "The firmware checks that a DF_FLUSH is not required. If a DF_FLUSH is
> required, the firmware returns DFFLUSH_REQUIRED. Note that all ASIDs are
> marked to require a DF_FLUSH at reset."
>
> Do we need an SNP_DF_FLUSH here before calling SNP_ACTIVATE, or to handle the
> situation where the PSP firmware returns DFFLUSH_REQUIRED?
>
> [1] https://www.amd.com/system/files/TechDocs/56860.pdf
This is related to ASID use. An initial DF_FLUSH is done which allows any
SNP ASID to be used once without requiring a DF_FLUSH. Once an ASID has
been used, it cannot be re-used until a DF_FLUSH is performed. The ASID
recycling code takes care of that.
Thanks,
Tom
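In other words, the flush is handled on the ASID allocation path rather than at SNP_ACTIVATE time. Roughly (a simplified sketch of the existing recycling logic, not the literal sev.c code):

static int sev_asid_new_simplified(int min_asid, int max_asid)
{
        int asid = find_next_zero_bit(sev_asid_bitmap, max_asid + 1, min_asid);

        if (asid > max_asid) {
                /*
                 * No never-used ASIDs left: DF_FLUSH (SNP_DF_FLUSH when SNP
                 * is enabled, see sev_flush_asids() above), then move the
                 * previously-used ASIDs back into the free pool and retry.
                 */
                if (sev_flush_asids(min_asid, max_asid))
                        return -EBUSY;

                bitmap_andnot(sev_asid_bitmap, sev_asid_bitmap,
                              sev_reclaim_asid_bitmap, nr_asids);
                bitmap_zero(sev_reclaim_asid_bitmap, nr_asids);

                asid = find_next_zero_bit(sev_asid_bitmap, max_asid + 1, min_asid);
        }

        __set_bit(asid, sev_asid_bitmap);
        return asid;
}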
On Mon, Feb 20, 2023 at 12:38:10PM -0600, Michael Roth wrote:
> From: Ashish Kalra <[email protected]>
>
> Return pfn from dump_pagetable() to do SEV-specific
> fault handling. Used for handling SNP RMP page fault.
>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/mm/fault.c | 15 +++++++++++----
> 1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index afd4cde17001..f2b16dcfbd9a 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -311,7 +311,7 @@ static bool low_pfn(unsigned long pfn)
> return pfn < max_low_pfn;
> }
>
> -static void dump_pagetable(unsigned long address)
> +static unsigned long dump_pagetable(unsigned long address)
> {
> pgd_t *base = __va(read_cr3_pa());
> pgd_t *pgd = &base[pgd_index(address)];
> @@ -345,8 +345,10 @@ static void dump_pagetable(unsigned long address)
>
> pte = pte_offset_kernel(pmd, address);
> pr_cont("*pte = %0*Lx ", sizeof(*pte) * 2, (u64)pte_val(*pte));
> + return 0;
> out:
> pr_cont("\n");
> + return 0;
> }
>
> #else /* CONFIG_X86_64: */
> @@ -367,10 +369,11 @@ static int bad_address(void *p)
> return get_kernel_nofault(dummy, (unsigned long *)p);
> }
>
> -static void dump_pagetable(unsigned long address)
> +static unsigned long dump_pagetable(unsigned long address)
> {
> pgd_t *base = __va(read_cr3_pa());
> pgd_t *pgd = base + pgd_index(address);
> + unsigned long pfn;
pfn should be initialized otherwise this function may return an
uninitialized value.
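i.e. something along the lines of:

        pgd_t *base = __va(read_cr3_pa());
        pgd_t *pgd = base + pgd_index(address);
        unsigned long pfn = 0;  /* so the early pgd-not-present exit doesn't return garbage */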
> p4d_t *p4d;
> pud_t *pud;
> pmd_t *pmd;
> @@ -388,6 +391,7 @@ static void dump_pagetable(unsigned long address)
> if (bad_address(p4d))
> goto bad;
>
> + pfn = p4d_pfn(*p4d);
> pr_cont("P4D %lx ", p4d_val(*p4d));
> if (!p4d_present(*p4d) || p4d_large(*p4d))
> goto out;
> @@ -396,6 +400,7 @@ static void dump_pagetable(unsigned long address)
> if (bad_address(pud))
> goto bad;
>
> + pfn = pud_pfn(*pud);
> pr_cont("PUD %lx ", pud_val(*pud));
> if (!pud_present(*pud) || pud_large(*pud))
> goto out;
> @@ -404,6 +409,7 @@ static void dump_pagetable(unsigned long address)
> if (bad_address(pmd))
> goto bad;
>
> + pfn = pmd_pfn(*pmd);
> pr_cont("PMD %lx ", pmd_val(*pmd));
> if (!pmd_present(*pmd) || pmd_large(*pmd))
> goto out;
> @@ -412,13 +418,14 @@ static void dump_pagetable(unsigned long address)
> if (bad_address(pte))
> goto bad;
>
> + pfn = pte_pfn(*pte);
> pr_cont("PTE %lx", pte_val(*pte));
> out:
> pr_cont("\n");
> -
> - return;
> + return pfn;
> bad:
> pr_info("BAD\n");
> + return -1;
> }
>
> #endif /* CONFIG_X86_64 */
> --
> 2.25.1
>
>
>
>
On Mon, 20 Feb 2023 12:38:35 -0600
Michael Roth <[email protected]> wrote:
> From: Brijesh Singh <[email protected]>
>
> When SEV-SNP is enabled in the guest, the hardware places restrictions
> on all memory accesses based on the contents of the RMP table. When
> the hardware encounters an RMP check failure caused by a guest memory access,
> it raises a #NPF. The error code contains additional information on
> the access type. See the APM volume 2 for additional information.
>
> Page state changes are handled by userspace, so if an RMP fault is
> triggered as a result of an RMP NPT fault, exit to userspace just like
> with explicit page-state change requests.
>
> RMP NPT faults can also occur if the guest pvalidates a 2M page as 4K,
> in which case the RMP entries need to be PSMASH'd. Handle this case
> immediately in the kernel.
>
> Co-developed-by: Michael Roth <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 84 ++++++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/svm/svm.c | 21 +++++++++--
> arch/x86/kvm/svm/svm.h | 1 +
> 3 files changed, 102 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 102966c43e28..197b1f904567 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3347,6 +3347,13 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
> svm->vmcb->control.ghcb_gpa = value;
> }
>
> +static int snp_rmptable_psmash(struct kvm *kvm, kvm_pfn_t pfn)
> +{
> + pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
> +
> + return psmash(pfn);
> +}
> +
> /*
> * TODO: need to get the value set by userspace in vcpu->run->vmgexit.ghcb_msr
> * and process that here accordingly.
> @@ -3872,3 +3879,80 @@ void sev_adjust_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *le
> pr_debug("%s: GFN: 0x%llx, PFN: 0x%llx, level: %d, rmp_level: %d, level_orig: %d, assigned: %d\n",
> __func__, gfn, pfn, *level, rmp_level, level_orig, assigned);
> }
> +
> +void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
> +{
> + int order, rmp_level, assigned, ret;
> + struct kvm_memory_slot *slot;
> + struct kvm *kvm = vcpu->kvm;
> + kvm_pfn_t pfn;
> + gfn_t gfn;
> +
> + /*
> + * Private memslots punt handling of implicit page state changes to
^put
> + * userspace, so the only RMP faults expected here are for
> + * PFERR_GUEST_SIZEM_MASK. Anything else suggests that the RMP table has
> + * gotten out of sync with the private memslot.
> + *
> + * TODO: However, this case has also been noticed when an access occurs
> + * to an NPT mapping that has just been split/PSMASHED, in which case
> + * PFERR_GUEST_SIZEM_MASK might not be set. In those cases it should be
> + * safe to ignore and let the guest retry, but log these just in case
> + * for now.
> + */
> + if (!(error_code & PFERR_GUEST_SIZEM_MASK)) {
> + pr_warn_ratelimited("Unexpected RMP fault for GPA 0x%llx, error_code 0x%llx",
> + gpa, error_code);
> + return;
> + }
> +
> + gfn = gpa >> PAGE_SHIFT;
> +
> + /*
> + * Only RMPADJUST/PVALIDATE should cause PFERR_GUEST_SIZEM.
> + *
> + * For PVALIDATE, this should only happen if a guest PVALIDATEs a 4K GFN
> + * that is backed by a huge page in the host whose RMP entry has the
> + * hugepage/assigned bits set. With UPM, that should only ever happen
> + * for private pages.
> + *
> + * For RMPADJUST, this assumption might not hold, in which case handling
> + * for obtaining the PFN from HVA-backed memory may be needed. For now,
> + * just print warnings.
> + */
> + if (!kvm_mem_is_private(kvm, gfn)) {
> + pr_warn_ratelimited("Unexpected RMP fault, size-mismatch for non-private GPA 0x%llx\n",
> + gpa);
> + return;
> + }
> +
> + slot = gfn_to_memslot(kvm, gfn);
> + if (!kvm_slot_can_be_private(slot)) {
> + pr_warn_ratelimited("Unexpected RMP fault, non-private slot for GPA 0x%llx\n",
> + gpa);
> + return;
> + }
> +
> + ret = kvm_restrictedmem_get_pfn(slot, gfn, &pfn, &order);
> + if (ret) {
> + pr_warn_ratelimited("Unexpected RMP fault, no private backing page for GPA 0x%llx\n",
> + gpa);
> + return;
> + }
> +
> + assigned = snp_lookup_rmpentry(pfn, &rmp_level);
> + if (assigned != 1) {
> + pr_warn_ratelimited("Unexpected RMP fault, no assigned RMP entry for GPA 0x%llx\n",
> + gpa);
> + goto out;
> + }
> +
> + ret = snp_rmptable_psmash(kvm, pfn);
> + if (ret)
> + pr_err_ratelimited("Unable to split RMP entries for GPA 0x%llx PFN 0x%llx ret %d\n",
> + gpa, pfn, ret);
> +
> +out:
> + kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
> + put_page(pfn_to_page(pfn));
> +}
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 9eb750c8b04c..f9ab4bf6d245 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1976,15 +1976,28 @@ static int pf_interception(struct kvm_vcpu *vcpu)
> static int npf_interception(struct kvm_vcpu *vcpu)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
> + int rc;
>
> u64 fault_address = svm->vmcb->control.exit_info_2;
> u64 error_code = svm->vmcb->control.exit_info_1;
>
> trace_kvm_page_fault(vcpu, fault_address, error_code);
> - return kvm_mmu_page_fault(vcpu, fault_address, error_code,
> - static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
> - svm->vmcb->control.insn_bytes : NULL,
> - svm->vmcb->control.insn_len);
> + rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
> + static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
> + svm->vmcb->control.insn_bytes : NULL,
> + svm->vmcb->control.insn_len);
> +
> + /*
> + * rc == 0 indicates a userspace exit is needed to handle page
> + * transitions, so do that first before updating the RMP table.
> + */
> + if (error_code & PFERR_GUEST_RMP_MASK) {
> + if (rc == 0)
> + return rc;
> + handle_rmp_page_fault(vcpu, fault_address, error_code);
> + }
> +
> + return rc;
> }
>
> static int db_interception(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 0c655a4d32d5..13b00233b315 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -714,6 +714,7 @@ void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa);
> void sev_es_unmap_ghcb(struct vcpu_svm *svm);
> struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
> void sev_adjust_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *level);
> +void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
>
> /* vmenter.S */
>
On Mon, 20 Feb 2023 12:38:36 -0600
Michael Roth <[email protected]> wrote:
> From: Brijesh Singh <[email protected]>
>
> Version 2 of the GHCB specification added support for two SNP Guest
> Request Message NAE events. These events allow an SEV-SNP guest to
> make requests to the SEV-SNP firmware through the hypervisor using the
> SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification.
>
> The SNP_EXT_GUEST_REQUEST is similar to SNP_GUEST_REQUEST, with the
> addition of a certificate blob that can be passed through the
> SNP_SET_CONFIG ioctl defined in the CCP driver. The CCP driver
> provides snp_guest_ext_guest_request(), which is used by KVM to get
> both the report and certificate data at once.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 185 +++++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/svm/svm.h | 2 +
> 2 files changed, 181 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 197b1f904567..92179614102e 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -327,6 +327,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
> if (ret)
> goto e_free;
>
> + mutex_init(&sev->guest_req_lock);
> ret = sev_snp_init(&argp->error, false);
> } else {
> ret = sev_platform_init(&argp->error);
> @@ -2059,23 +2060,34 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
> */
> static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
> {
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> struct sev_data_snp_addr data = {};
> - void *context;
> + void *context, *certs_data;
> int rc;
>
> + /* Allocate memory used for the certs data in SNP guest request */
> + certs_data = kzalloc(SEV_FW_BLOB_MAX_SIZE, GFP_KERNEL_ACCOUNT);
> + if (!certs_data)
> + return NULL;
> +
> /* Allocate memory for context page */
> context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
> if (!context)
> - return NULL;
> + goto e_free;
>
> data.gctx_paddr = __psp_pa(context);
> rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
> - if (rc) {
> - snp_free_firmware_page(context);
> - return NULL;
> - }
> + if (rc)
> + goto e_free;
> +
> + sev->snp_certs_data = certs_data;
>
> return context;
> +
> +e_free:
> + snp_free_firmware_page(context);
> + kfree(certs_data);
> + return NULL;
> }
>
> static int snp_bind_asid(struct kvm *kvm, int *error)
> @@ -2693,6 +2705,8 @@ static int snp_decommission_context(struct kvm *kvm)
> snp_free_firmware_page(sev->snp_context);
> sev->snp_context = NULL;
>
> + kfree(sev->snp_certs_data);
> +
> return 0;
> }
>
> @@ -3153,6 +3167,8 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
> case SVM_VMGEXIT_UNSUPPORTED_EVENT:
> case SVM_VMGEXIT_HV_FEATURES:
> case SVM_VMGEXIT_PSC:
> + case SVM_VMGEXIT_GUEST_REQUEST:
> + case SVM_VMGEXIT_EXT_GUEST_REQUEST:
> break;
> default:
> reason = GHCB_ERR_INVALID_EVENT;
> @@ -3384,6 +3400,149 @@ static int snp_complete_psc(struct kvm_vcpu *vcpu)
> return 1;
> }
>
> +static unsigned long snp_setup_guest_buf(struct vcpu_svm *svm,
> + struct sev_data_snp_guest_request *data,
> + gpa_t req_gpa, gpa_t resp_gpa)
> +{
> + struct kvm_vcpu *vcpu = &svm->vcpu;
> + struct kvm *kvm = vcpu->kvm;
> + kvm_pfn_t req_pfn, resp_pfn;
> + struct kvm_sev_info *sev;
> +
> + sev = &to_kvm_svm(kvm)->sev_info;
> +
> + if (!IS_ALIGNED(req_gpa, PAGE_SIZE) || !IS_ALIGNED(resp_gpa, PAGE_SIZE))
> + return SEV_RET_INVALID_PARAM;
> +
> + req_pfn = gfn_to_pfn(kvm, gpa_to_gfn(req_gpa));
> + if (is_error_noslot_pfn(req_pfn))
> + return SEV_RET_INVALID_ADDRESS;
> +
> + resp_pfn = gfn_to_pfn(kvm, gpa_to_gfn(resp_gpa));
> + if (is_error_noslot_pfn(resp_pfn))
> + return SEV_RET_INVALID_ADDRESS;
> +
> + if (rmp_make_private(resp_pfn, 0, PG_LEVEL_4K, 0, true))
> + return SEV_RET_INVALID_ADDRESS;
> +
> + data->gctx_paddr = __psp_pa(sev->snp_context);
> + data->req_paddr = __sme_set(req_pfn << PAGE_SHIFT);
> + data->res_paddr = __sme_set(resp_pfn << PAGE_SHIFT);
> +
> + return 0;
> +}
> +
> +static void snp_cleanup_guest_buf(struct sev_data_snp_guest_request *data, unsigned long *rc)
> +{
> + u64 pfn = __sme_clr(data->res_paddr) >> PAGE_SHIFT;
> + int ret;
> +
> + ret = snp_page_reclaim(pfn);
> + if (ret)
> + *rc = SEV_RET_INVALID_ADDRESS;
> +
> + ret = rmp_make_shared(pfn, PG_LEVEL_4K);
> + if (ret)
> + *rc = SEV_RET_INVALID_ADDRESS;
> +}
> +
> +static void snp_handle_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa)
> +{
> + struct sev_data_snp_guest_request data = {0};
> + struct kvm_vcpu *vcpu = &svm->vcpu;
> + struct kvm *kvm = vcpu->kvm;
> + struct kvm_sev_info *sev;
> + unsigned long rc;
> + int err;
> +
> + if (!sev_snp_guest(vcpu->kvm)) {
> + rc = SEV_RET_INVALID_GUEST;
> + goto e_fail;
> + }
> +
> + sev = &to_kvm_svm(kvm)->sev_info;
> +
> + mutex_lock(&sev->guest_req_lock);
> +
> + rc = snp_setup_guest_buf(svm, &data, req_gpa, resp_gpa);
> + if (rc)
> + goto unlock;
> +
> + rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err);
> + if (rc)
> + /* use the firmware error code */
> + rc = err;
> +
> + snp_cleanup_guest_buf(&data, &rc);
> +
I am curious about the reason for the shared->private and private->shared
conversions before and after issuing the command to the firmware.
Is it because the firmware requires the resp page to be a private page,
while the req page does not need to be? (I understand that the req/resp pages
should be shared before returning to the guest, per the GHCB spec.)
> +unlock:
> + mutex_unlock(&sev->guest_req_lock);
> +
> +e_fail:
> + ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, rc);
> +}
> +
> +static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa)
> +{
> + struct sev_data_snp_guest_request req = {0};
> + struct kvm_vcpu *vcpu = &svm->vcpu;
> + struct kvm *kvm = vcpu->kvm;
> + unsigned long data_npages;
> + struct kvm_sev_info *sev;
> + unsigned long rc, err;
> + u64 data_gpa;
> +
> + if (!sev_snp_guest(vcpu->kvm)) {
> + rc = SEV_RET_INVALID_GUEST;
> + goto e_fail;
> + }
> +
> + sev = &to_kvm_svm(kvm)->sev_info;
> +
> + data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
> + data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
> +
> + if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
> + rc = SEV_RET_INVALID_ADDRESS;
> + goto e_fail;
> + }
> +
> + mutex_lock(&sev->guest_req_lock);
> +
> + rc = snp_setup_guest_buf(svm, &req, req_gpa, resp_gpa);
> + if (rc)
> + goto unlock;
> +
> + rc = snp_guest_ext_guest_request(&req, (unsigned long)sev->snp_certs_data,
> + &data_npages, &err);
> + if (rc) {
> + /*
> + * If buffer length is small then return the expected
> + * length in rbx.
> + */
> + if (err == SNP_GUEST_REQ_INVALID_LEN)
> + vcpu->arch.regs[VCPU_REGS_RBX] = data_npages;
> +
> + /* pass the firmware error code */
> + rc = err;
> + goto cleanup;
> + }
> +
> + /* Copy the certificate blob in the guest memory */
> + if (data_npages &&
> + kvm_write_guest(kvm, data_gpa, sev->snp_certs_data, data_npages << PAGE_SHIFT))
> + rc = SEV_RET_INVALID_ADDRESS;
> +
> +cleanup:
> + snp_cleanup_guest_buf(&req, &rc);
> +
> +unlock:
> + mutex_unlock(&sev->guest_req_lock);
> +
> +e_fail:
> + ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, rc);
> +}
> +
> static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
> {
> struct vmcb_control_area *control = &svm->vmcb->control;
> @@ -3633,6 +3792,20 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
> vcpu->run->vmgexit.ghcb_msr = ghcb_gpa;
> vcpu->arch.complete_userspace_io = snp_complete_psc;
> break;
> + case SVM_VMGEXIT_GUEST_REQUEST: {
> + snp_handle_guest_request(svm, control->exit_info_1, control->exit_info_2);
> +
> + ret = 1;
> + break;
> + }
> + case SVM_VMGEXIT_EXT_GUEST_REQUEST: {
> + snp_handle_ext_guest_request(svm,
> + control->exit_info_1,
> + control->exit_info_2);
> +
> + ret = 1;
> + break;
> + }
> case SVM_VMGEXIT_UNSUPPORTED_EVENT:
> vcpu_unimpl(vcpu,
> "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 13b00233b315..4a9ffb7e5139 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -93,6 +93,8 @@ struct kvm_sev_info {
> atomic_t migration_in_progress;
> u64 snp_init_flags;
> void *snp_context; /* SNP guest context page */
> + void *snp_certs_data;
> + struct mutex guest_req_lock; /* Lock for guest request handling */
> };
>
> struct kvm_svm {
On Fri, 24 Feb 2023 13:37:48 +0100
Alexander Graf <[email protected]> wrote:
>
> On 20.02.23 19:38, Michael Roth wrote:
> > From: Tom Lendacky <[email protected]>
> >
> > Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
> > guests to alter the register state of the APs on their own. This allows
> > the guest a way of simulating INIT-SIPI.
> >
> > A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
> > so as to avoid updating the VMSA pointer while the vCPU is running.
> >
> > For CREATE
> > The guest supplies the GPA of the VMSA to be used for the vCPU with
> > the specified APIC ID. The GPA is saved in the svm struct of the
> > target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
> > to the vCPU and then the vCPU is kicked.
> >
> > For CREATE_ON_INIT:
> > The guest supplies the GPA of the VMSA to be used for the vCPU with
> > the specified APIC ID the next time an INIT is performed. The GPA is
> > saved in the svm struct of the target vCPU.
> >
> > For DESTROY:
> > The guest indicates it wishes to stop the vCPU. The GPA is cleared
> > from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is
> > added to vCPU and then the vCPU is kicked.
> >
> > The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked
> > as a result of the event or as a result of an INIT. The handler sets the
> > vCPU to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will
> > leave the vCPU as not runnable. Any previous VMSA pages that were
> > installed as part of an SEV-SNP AP Creation NAE event are un-pinned. If
> > a new VMSA is to be installed, the VMSA guest page is pinned and set as
> > the VMSA in the vCPU VMCB and the vCPU state is set to
> > KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is
> > cleared in the vCPU VMCB and the vCPU state is left as
> > KVM_MP_STATE_UNINITIALIZED to prevent it from being run.
> >
> > Signed-off-by: Tom Lendacky <[email protected]>
> > Signed-off-by: Brijesh Singh <[email protected]>
> > Signed-off-by: Ashish Kalra <[email protected]>
> > [mdr: add handling for restrictedmem]
> > Signed-off-by: Michael Roth <[email protected]>
>
>
> What is the intended boot sequence for SEV-SNP guests? FWIW with this
> interface in place, guests will typically use in-guest VMSA pages to
> hold secondary vcpu state. But that means we're now allocating 4kb of
> memory for every vcpu that we create that will be for most of the
> guest's lifetime superfluous.
>
> Wouldn't it make more sense to have a model where we only allocate the
> VMSA for the boot CPU and leave secondary allocation to the guest? We
> already need firmware changes for SEV-SNP - may as well make this one more.
>
> [...]
>
> > +
> > +static int sev_snp_ap_creation(struct vcpu_svm *svm)
> > +{
> > + struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
> > + struct kvm_vcpu *vcpu = &svm->vcpu;
> > + struct kvm_vcpu *target_vcpu;
> > + struct vcpu_svm *target_svm;
> > + unsigned int request;
> > + unsigned int apic_id;
> > + bool kick;
> > + int ret;
> > +
> > + request = lower_32_bits(svm->vmcb->control.exit_info_1);
> > + apic_id = upper_32_bits(svm->vmcb->control.exit_info_1);
> > +
> > + /* Validate the APIC ID */
> > + target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id);
>
>
> Out of curiosity: The target CPU can be my own vCPU, right?
>
>
> > + if (!target_vcpu) {
> > + vcpu_unimpl(vcpu, "vmgexit: invalid AP APIC ID [%#x] from guest\n",
> > + apic_id);
> > + return -EINVAL;
> > + }
> > +
> > + ret = 0;
> > +
> > + target_svm = to_svm(target_vcpu);
> > +
> > + /*
> > + * The target vCPU is valid, so the vCPU will be kicked unless the
> > + * request is for CREATE_ON_INIT. For any errors at this stage, the
> > + * kick will place the vCPU in an non-runnable state.
> > + */
> > + kick = true;
> > +
> > + mutex_lock(&target_svm->sev_es.snp_vmsa_mutex);
> > +
> > + target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE;
> > + target_svm->sev_es.snp_ap_create = true;
> > +
> > + /* Interrupt injection mode shouldn't change for AP creation */
> > + if (request < SVM_VMGEXIT_AP_DESTROY) {
> > + u64 sev_features;
> > +
> > + sev_features = vcpu->arch.regs[VCPU_REGS_RAX];
> > + sev_features ^= sev->sev_features;
> > + if (sev_features & SVM_SEV_FEAT_INT_INJ_MODES) {
> > + vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n",
> > + vcpu->arch.regs[VCPU_REGS_RAX]);
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > + }
> > +
> > + switch (request) {
> > + case SVM_VMGEXIT_AP_CREATE_ON_INIT:
> > + kick = false;
> > + fallthrough;
> > + case SVM_VMGEXIT_AP_CREATE:
> > + if (!page_address_valid(vcpu, svm->vmcb->control.exit_info_2)) {
> > + vcpu_unimpl(vcpu, "vmgexit: invalid AP VMSA address [%#llx] from guest\n",
> > + svm->vmcb->control.exit_info_2);
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + /*
> > + * Malicious guest can RMPADJUST a large page into VMSA which
> > + * will hit the SNP erratum where the CPU will incorrectly signal
> > + * an RMP violation #PF if a hugepage collides with the RMP entry
> > + * of VMSA page, reject the AP CREATE request if VMSA address from
> > + * guest is 2M aligned.
>
>
> This will break genuine current Linux kernels that just happen to
> allocate a guest page, no? In fact, given enough vCPUs you're almost
> guaranteed to hit an aligned structure somewhere. What is the guest
> supposed to do in that situation?
>
>
> > + */
> > + if (IS_ALIGNED(svm->vmcb->control.exit_info_2, PMD_SIZE)) {
> > + vcpu_unimpl(vcpu,
> > + "vmgexit: AP VMSA address [%llx] from guest is unsafe as it is 2M aligned\n",
> > + svm->vmcb->control.exit_info_2);
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + target_svm->sev_es.snp_vmsa_gpa = svm->vmcb->control.exit_info_2;
> > + break;
> > + case SVM_VMGEXIT_AP_DESTROY:
>
>
> I don't understand the destroy path. Why does this case destroy anything?
>
>
> > + break;
> > + default:
> > + vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n",
> > + request);
> > + ret = -EINVAL;
> > + break;
> > + }
> > +
> > +out:
> > + if (kick) {
> > + if (target_vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)
> > + target_vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>
>
> What if the guest AP goes through a create -> destroy -> create cycle?
> Will it stay runnable while destroyed?
The code is not very straightforward.
1) target_svm->sev_es.snp_vmsa_gpa is set to INVALID_PAGE at the beginning of this function.
2) If a DESTROY request is handled in this function, target_svm->sev_es.snp_vmsa_gpa is
left as INVALID_PAGE.
3) At the end of this function, kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE) is called.
4) In vcpu_enter_guest(), kvm_vcpu_reset()->sev_snp_init_protected_guest_state()
->__sev_snp_init_protected_guest_state() is called.
5) The mp_state is set to KVM_MP_STATE_STOPPED by default and the runtime VMSA is
cleared. It is then re-initialized according to the guest's configuration.
6) Because snp_vmsa_gpa was set to INVALID_PAGE in 1), the mp_state is left as
KVM_MP_STATE_STOPPED.
7) With this code piece:
+ kvm_vcpu_reset(vcpu, true);
+ if (vcpu->arch.mp_state != KVM_MP_STATE_RUNNABLE)
+ goto out;
vcpu_enter_guest() bails out, so the destroyed vCPU does not stay runnable.
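To make the flow concrete, here is a rough sketch of the relevant part of
__sev_snp_init_protected_guest_state() as paraphrased from the steps above
(this is illustrative only, not the literal patch code):

	/*
	 * Sketch of steps 5)-6): the vCPU is stopped by default and only
	 * becomes runnable again if a valid VMSA GPA was staged by a CREATE
	 * request; after DESTROY, snp_vmsa_gpa stays INVALID_PAGE.
	 */
	static void __sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu)
	{
		struct vcpu_svm *svm = to_svm(vcpu);

		vcpu->arch.mp_state = KVM_MP_STATE_STOPPED;
		/* ... clear/unpin the current runtime VMSA ... */

		if (VALID_PAGE(svm->sev_es.snp_vmsa_gpa)) {
			/* ... pin and install the new VMSA ... */
			vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
		}
	}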
>
>
> Alex
>
> > +
> > + kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, target_vcpu);
> > + kvm_vcpu_kick(target_vcpu);
> > + }
> > +
> > + mutex_unlock(&target_svm->sev_es.snp_vmsa_mutex);
> > +
> > + return ret;
> > +}
> > +
> > static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
> > {
> > struct vmcb_control_area *control = &svm->vmcb->control;
>
>
>
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
>
>
On Mon, 20 Feb 2023 12:37:52 -0600
Michael Roth <[email protected]> wrote:
Basically, I don't think kvm_mmu_fault_is_private() is promising after going
through both the SNP and TDX patches:
1) The fault path is performance-critical. kvm_mmu_fault_is_private() always does a
gfn_to_memslot() regardless of whether SNP/TDX is enabled. It will mostly hit
slots->last_used_slot, but the worst case is a full RB-tree search.
Adding extra overhead to the generic fault path needs to be reconsidered
carefully; at the very least, check whether the guest is a CC (SNP/TDX) guest first.
2) Just after the gfn_to_memslot() in kvm_mmu_fault_is_private(), there is
another gfn_to_memslot():
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
					u64 err, bool prefetch)
{
	struct kvm_page_fault fault = {
		.addr = cr2_or_gpa,
		.error_code = lower_32_bits(err),
		.exec = err & PFERR_FETCH_MASK,
		.write = err & PFERR_WRITE_MASK,
		.present = err & PFERR_PRESENT_MASK,
		.rsvd = err & PFERR_RSVD_MASK,
		.user = err & PFERR_USER_MASK,
		.prefetch = prefetch,
		.is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
		.nx_huge_page_workaround_enabled =
			is_nx_huge_page_enabled(vcpu->kvm),
		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
		.req_level = PG_LEVEL_4K,
		.goal_level = PG_LEVEL_4K,
		.is_private = kvm_mmu_fault_is_private(vcpu->kvm, cr2_or_gpa, err),
	};
	int r;

	if (vcpu->arch.mmu->root_role.direct) {
		fault.gfn = fault.addr >> PAGE_SHIFT;
		/* here */
		fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
	}
I was wondering whether checking for a private slot via kvm_slot_can_be_private()
is really necessary in kvm_mmu_fault_is_private().
The TDP MMU expects fault.is_private to indicate whether the CPU thinks the fault
is private (for SNP it is in the PF error code, for TDX it is the shared
bit in the fault GPA). The TDP MMU will then check whether the slot is a private
slot and leave it to userspace to handle the case where the two disagree.
My points:
1) Resolving the PFERR bits in kvm_x86_ops.fault_is_private and setting
fault.is_private is enough; the rest can be handled by the TDP MMU (see the
sketch below).
2) Put kvm_x86_ops.fault_is_private in a separate patch so that the TDX series
can include it. (The 64-bit error-code part can stay in another patch.)
> This callback is used by the KVM MMU to check whether a #NPF was for a
> private GPA or not.
>
> In some cases the full 64-bit error code for the #NPF will be needed to
> make this determination, so also update kvm_mmu_do_page_fault() to
> accept the full 64-bit value so it can be plumbed through to the
> callback.
>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 3 +--
> arch/x86/kvm/mmu/mmu_internal.h | 37 +++++++++++++++++++++++++++---
> 4 files changed, 37 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 8dc345cc6318..72183da010b8 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -131,6 +131,7 @@ KVM_X86_OP(msr_filter_changed)
> KVM_X86_OP(complete_emulated_msr)
> KVM_X86_OP(vcpu_deliver_sipi_vector)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> +KVM_X86_OP_OPTIONAL_RET0(fault_is_private);
>
> #undef KVM_X86_OP
> #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e552374f2357..f856d689dda0 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1643,6 +1643,7 @@ struct kvm_x86_ops {
>
> void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> int root_level);
> + bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code, bool *private_fault);
>
> bool (*has_wbinvd_exit)(void);
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index eda615f3951c..fb3f34b7391c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5724,8 +5724,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> }
>
> if (r == RET_PF_INVALID) {
> - r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
> - lower_32_bits(error_code), false);
> + r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false);
> if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
> return -EIO;
> }
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index e642d431df4b..557a001210df 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -231,6 +231,37 @@ struct kvm_page_fault {
>
> int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>
> +static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
> +{
> + struct kvm_memory_slot *slot;
> + bool private_fault = false;
> + gfn_t gfn = gpa_to_gfn(gpa);
> +
> + slot = gfn_to_memslot(kvm, gfn);
> + if (!slot) {
> + pr_debug("%s: no slot, GFN: 0x%llx\n", __func__, gfn);
> + goto out;
> + }
> +
> + if (!kvm_slot_can_be_private(slot)) {
> + pr_debug("%s: slot is not private, GFN: 0x%llx\n", __func__, gfn);
> + goto out;
> + }
> +
> + if (static_call(kvm_x86_fault_is_private)(kvm, gpa, err, &private_fault))
> + goto out;
> +
> + /*
> + * Handling below is for UPM self-tests and guests that treat userspace
> + * as the authority on whether a fault should be private or not.
> + */
> + private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> +
> +out:
> + pr_debug("%s: GFN: 0x%llx, private: %d\n", __func__, gfn, private_fault);
> + return private_fault;
> +}
> +
> /*
> * Return values of handle_mmio_page_fault(), mmu.page_fault(), fast_page_fault(),
> * and of course kvm_mmu_do_page_fault().
> @@ -262,11 +293,11 @@ enum {
> };
>
> static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> - u32 err, bool prefetch)
> + u64 err, bool prefetch)
> {
> struct kvm_page_fault fault = {
> .addr = cr2_or_gpa,
> - .error_code = err,
> + .error_code = lower_32_bits(err),
> .exec = err & PFERR_FETCH_MASK,
> .write = err & PFERR_WRITE_MASK,
> .present = err & PFERR_PRESENT_MASK,
> @@ -280,7 +311,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> .max_level = KVM_MAX_HUGEPAGE_LEVEL,
> .req_level = PG_LEVEL_4K,
> .goal_level = PG_LEVEL_4K,
> - .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
> + .is_private = kvm_mmu_fault_is_private(vcpu->kvm, cr2_or_gpa, err),
> };
> int r;
>
On Mon, 20 Feb 2023 12:38:41 -0600
Michael Roth <[email protected]> wrote:
> Implement a platform hook to do the work of restoring the direct map
> entries and cleaning up RMP table entries for restricted memory that is
> being freed back to the host.
>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 62 ++++++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/svm/svm.c | 1 +
> arch/x86/kvm/svm/svm.h | 1 +
> 3 files changed, 64 insertions(+)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 7a74a92cb39a..bedec90d034f 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -4509,3 +4509,65 @@ bool sev_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 error_code, bool *priv
>
> return true;
> }
> +
> +void sev_invalidate_private_range(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
> +{
> + gfn_t gfn = start;
> +
> + if (!sev_snp_guest(slot->kvm))
> + return;
> +
> + if (!kvm_slot_can_be_private(slot)) {
> + pr_warn_ratelimited("SEV: Memslot for GFN: 0x%llx is not private.\n",
> + gfn);
> + return;
> + }
> +
This is a generic check for both SNP and TDX; it should be moved to
kvm_restrictedmem_invalidate_begin().
> + while (gfn <= end) {
> + gpa_t gpa = gfn_to_gpa(gfn);
> + int level = PG_LEVEL_4K;
> + int order, rc;
> + kvm_pfn_t pfn;
> +
> + rc = kvm_restrictedmem_get_pfn(slot, gfn, &pfn, &order);
> + if (rc) {
> + pr_warn_ratelimited("SEV: Failed to retrieve restricted PFN for GFN 0x%llx, rc: %d\n",
> + gfn, rc);
> + gfn++;
> + continue;
> + }
> +
> + if (order) {
> + int rmp_level;
> +
> + if (IS_ALIGNED(gpa, page_level_size(PG_LEVEL_2M)) &&
> + gpa + page_level_size(PG_LEVEL_2M) <= gfn_to_gpa(end))
> + level = PG_LEVEL_2M;
> + else
> + pr_debug("%s: GPA 0x%llx is not aligned to 2M, skipping 2M directmap restoration\n",
> + __func__, gpa);
> +
> + /*
> + * TODO: It may still be possible to restore 2M mapping here,
> + * but keep it simple for now.
> + */
> + if (level == PG_LEVEL_2M &&
> + (!snp_lookup_rmpentry(pfn, &rmp_level) || rmp_level == PG_LEVEL_4K)) {
> + pr_debug("%s: PFN 0x%llx is not mapped as 2M private range, skipping 2M directmap restoration\n",
> + __func__, pfn);
> + level = PG_LEVEL_4K;
> + }
> + }
> +
> + pr_debug("%s: GPA %llx PFN %llx order %d level %d\n",
> + __func__, gpa, pfn, order, level);
> + rc = snp_make_page_shared(slot->kvm, gpa, pfn, level);
> + if (rc)
> + pr_err("SEV: Failed to restore page to shared, GPA: 0x%llx PFN: 0x%llx order: %d rc: %d\n",
> + gpa, pfn, order, rc);
> +
> + gfn += page_level_size(level) >> PAGE_SHIFT;
> + put_page(pfn_to_page(pfn));
> + cond_resched();
> + }
> +}
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 18e4a6c17d11..3fe5f13b5f3a 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4862,6 +4862,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> .adjust_mapping_level = sev_adjust_mapping_level,
> .update_mem_attr = sev_update_mem_attr,
> .fault_is_private = sev_fault_is_private,
> + .invalidate_restricted_mem = sev_invalidate_private_range,
> };
>
> /*
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 97038afa8020..857b674e68f0 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -727,6 +727,7 @@ void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
> void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
> int sev_update_mem_attr(struct kvm_memory_slot *slot, unsigned int attr,
> gfn_t start, gfn_t end);
> +void sev_invalidate_private_range(struct kvm_memory_slot *slot, gfn_t start, gfn_t end);
>
> bool sev_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 error_code, bool *private_fault);
>
On Mon, 20 Feb 2023 12:38:42 -0600
Michael Roth <[email protected]> wrote:
> From: Brijesh Singh <[email protected]>
>
> Add a module parameter than can be used to enable or disable the SEV-SNP
> feature. Now that KVM contains the support for the SNP set the GHCB
> hypervisor feature flag to indicate that SNP is supported.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 7 ++++---
> arch/x86/kvm/svm/svm.h | 2 +-
> 2 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index bedec90d034f..70d5650d8d95 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -55,14 +55,15 @@ module_param_named(sev, sev_enabled, bool, 0444);
> /* enable/disable SEV-ES support */
> static bool sev_es_enabled = true;
> module_param_named(sev_es, sev_es_enabled, bool, 0444);
> +
> +/* enable/disable SEV-SNP support */
> +static bool sev_snp_enabled = true;
> +module_param_named(sev_snp, sev_snp_enabled, bool, 0444);
> #else
> #define sev_enabled false
> #define sev_es_enabled false
I guess we also need a #define sev_snp_enabled false here.
> #endif /* CONFIG_KVM_AMD_SEV */
>
> -/* enable/disable SEV-SNP support */
> -static bool sev_snp_enabled;
> -
> #define AP_RESET_HOLD_NONE 0
> #define AP_RESET_HOLD_NAE_EVENT 1
> #define AP_RESET_HOLD_MSR_PROTO 2
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 857b674e68f0..221b38d3c845 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -694,7 +694,7 @@ void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu);
> #define GHCB_VERSION_MAX 2ULL
> #define GHCB_VERSION_MIN 1ULL
>
> -#define GHCB_HV_FT_SUPPORTED 0
> +#define GHCB_HV_FT_SUPPORTED (GHCB_HV_FT_SNP | GHCB_HV_FT_SNP_AP_CREATION)
This is not related to the topic of this patch; it should be merged into the
related patches.
>
> extern unsigned int max_sev_asid;
>
On Mon, Feb 20, 2023 at 12:38:06PM -0600, Michael Roth wrote:
> From: Brijesh Singh <[email protected]>
>
> The integrity guarantee of SEV-SNP is enforced through the RMP table.
> The RMP is used with standard x86 and IOMMU page tables to enforce
> memory restrictions and page access rights. The RMP check is enforced as
> soon as SEV-SNP is enabled globally in the system. When hardware
> encounters an RMP-check failure, it raises a page-fault exception.
>
> The rmp_make_private() and rmp_make_shared() helpers are used to add
> or remove the pages from the RMP table. Improve the rmp_make_private()
> to invalidate state so that pages cannot be used in the direct-map after
> they are added the RMP table, and restored to their default valid
> permission after the pages are removed from the RMP table.
>
> Co-developed-by: Ashish Kalra <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/kernel/sev.c | 57 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 57 insertions(+)
>
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index a49f30c10dc1..3e5ff5934e83 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -2595,6 +2595,37 @@ int psmash(u64 pfn)
> }
> EXPORT_SYMBOL_GPL(psmash);
>
> +static int restore_direct_map(u64 pfn, int npages)
> +{
> + int i, ret = 0;
> +
> + for (i = 0; i < npages; i++) {
> + ret = set_direct_map_default_noflush(pfn_to_page(pfn + i));
> + if (ret)
> + goto cleanup;
> + }
> +
> +cleanup:
> + WARN(ret > 0, "Failed to restore direct map for pfn 0x%llx\n", pfn + i);
> + return ret;
> +}
> +
> +static int invalidate_direct_map(u64 pfn, int npages)
> +{
> + int i, ret = 0;
> +
> + for (i = 0; i < npages; i++) {
> + ret = set_direct_map_invalid_noflush(pfn_to_page(pfn + i));
> + if (ret)
> + goto cleanup;
> + }
> +
> +cleanup:
> + WARN(ret > 0, "Failed to invalidate direct map for pfn 0x%llx\n", pfn + i);
> + restore_direct_map(pfn, i);
This immediately restores the direct map after invalidating it. It
probably needs to be put behind if (ret).
Regards, Tom
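For illustration, the suggested fix would look roughly like this (a sketch of
the reviewed function with the cleanup made conditional, not the actual
posted code):

	static int invalidate_direct_map(u64 pfn, int npages)
	{
		int i, ret = 0;

		for (i = 0; i < npages; i++) {
			ret = set_direct_map_invalid_noflush(pfn_to_page(pfn + i));
			if (ret)
				break;
		}

		if (ret) {
			WARN(1, "Failed to invalidate direct map for pfn 0x%llx\n", pfn + i);
			/* Only undo the partial invalidation on failure. */
			restore_direct_map(pfn, i);
		}

		return ret;
	}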
> + return ret;
> +}
> +
> static int rmpupdate(u64 pfn, struct rmp_state *val)
> {
> int max_attempts = 4 * num_present_cpus();
> @@ -2605,6 +2636,21 @@ static int rmpupdate(u64 pfn, struct rmp_state *val)
> if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> return -ENXIO;
>
> + level = RMP_TO_X86_PG_LEVEL(val->pagesize);
> + npages = page_level_size(level) / PAGE_SIZE;
> +
> + /*
> + * If page is getting assigned in the RMP table then unmap it from the
> + * direct map.
> + */
> + if (val->assigned) {
> + if (invalidate_direct_map(pfn, npages)) {
> + pr_err("Failed to unmap %d pages at pfn 0x%llx from the direct_map\n",
> + npages, pfn);
> + return -EFAULT;
> + }
> + }
> +
> do {
> /* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
> asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
> @@ -2630,6 +2676,17 @@ static int rmpupdate(u64 pfn, struct rmp_state *val)
> attempts, val->asid, ret, pfn, npages);
> }
>
> + /*
> + * Restore the direct map after the page is removed from the RMP table.
> + */
> + if (!val->assigned) {
> + if (restore_direct_map(pfn, npages)) {
> + pr_err("Failed to map %d pages at pfn 0x%llx into the direct_map\n",
> + npages, pfn);
> + return -EFAULT;
> + }
> + }
> +
> return 0;
> }
>
> --
> 2.25.1
>
On 2/20/23 10:38, Michael Roth wrote:
> From: Brijesh Singh <[email protected]>
>
> The integrity guarantee of SEV-SNP is enforced through the RMP table.
> The RMP is used with standard x86 and IOMMU page tables to enforce
> memory restrictions and page access rights. The RMP check is enforced as
> soon as SEV-SNP is enabled globally in the system. When hardware
> encounters an RMP-check failure, it raises a page-fault exception.
>
> The rmp_make_private() and rmp_make_shared() helpers are used to add
> or remove the pages from the RMP table. Improve the rmp_make_private()
> to invalidate state so that pages cannot be used in the direct-map after
> they are added the RMP table, and restored to their default valid
> permission after the pages are removed from the RMP table.
This is a purely "what" changelog. It doesn't explain the "why" at all.
Could you please elaborate on why this unmapping operation is necessary?
On 2/20/23 10:38, Michael Roth wrote:
> +static int handle_split_page_fault(struct vm_fault *vmf)
> +{
> + __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
> + return 0;
> +}
> +
> /*
> * By the time we get here, we already hold the mm semaphore
> *
> @@ -5078,6 +5084,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> pmd_migration_entry_wait(mm, vmf.pmd);
> return 0;
> }
> +
> + if (flags & FAULT_FLAG_PAGE_SPLIT)
> + return handle_split_page_fault(&vmf);
I asked this long ago, but how do you prevent these faults from
occurring on hugetlbfs mappings that can't be split?
On 2/20/23 10:37, Michael Roth wrote:
> The RMP check is enforced as soon as SEV-SNP is enabled. Not every memory
> access requires an RMP check. In particular, the read accesses from the
> hypervisor do not require RMP checks because the data confidentiality is
> already protected via memory encryption. When hardware encounters an RMP
> checks failure, it raises a page-fault exception. If RMP check failure
> is due to the page-size mismatch, then split the large page to resolve
> the fault.
What does this all _mean_?
When does the kernel need to care about a "page-size mismatch"?
On 28.02.23 21:47, Zhi Wang wrote:
> On Fri, 24 Feb 2023 13:37:48 +0100
> Alexander Graf <[email protected]> wrote:
>
>> On 20.02.23 19:38, Michael Roth wrote:
>>> From: Tom Lendacky <[email protected]>
>>>
>>> Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
>>> guests to alter the register state of the APs on their own. This allows
>>> the guest a way of simulating INIT-SIPI.
>>>
>>> A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
>>> so as to avoid updating the VMSA pointer while the vCPU is running.
>>>
>>> For CREATE
>>> The guest supplies the GPA of the VMSA to be used for the vCPU with
>>> the specified APIC ID. The GPA is saved in the svm struct of the
>>> target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
>>> to the vCPU and then the vCPU is kicked.
>>>
>>> For CREATE_ON_INIT:
>>> The guest supplies the GPA of the VMSA to be used for the vCPU with
>>> the specified APIC ID the next time an INIT is performed. The GPA is
>>> saved in the svm struct of the target vCPU.
>>>
>>> For DESTROY:
>>> The guest indicates it wishes to stop the vCPU. The GPA is cleared
>>> from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is
>>> added to vCPU and then the vCPU is kicked.
>>>
>>> The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked
>>> as a result of the event or as a result of an INIT. The handler sets the
>>> vCPU to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will
>>> leave the vCPU as not runnable. Any previous VMSA pages that were
>>> installed as part of an SEV-SNP AP Creation NAE event are un-pinned. If
>>> a new VMSA is to be installed, the VMSA guest page is pinned and set as
>>> the VMSA in the vCPU VMCB and the vCPU state is set to
>>> KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is
>>> cleared in the vCPU VMCB and the vCPU state is left as
>>> KVM_MP_STATE_UNINITIALIZED to prevent it from being run.
>>>
>>> Signed-off-by: Tom Lendacky <[email protected]>
>>> Signed-off-by: Brijesh Singh <[email protected]>
>>> Signed-off-by: Ashish Kalra <[email protected]>
>>> [mdr: add handling for restrictedmem]
>>> Signed-off-by: Michael Roth <[email protected]>
>>
>> What is the intended boot sequence for SEV-SNP guests? FWIW with this
>> interface in place, guests will typically use in-guest VMSA pages to
>> hold secondary vcpu state. But that means we're now allocating 4kb of
>> memory for every vcpu that we create that will be for most of the
>> guest's lifetime superfluous.
>>
>> Wouldn't it make more sense to have a model where we only allocate the
>> VMSA for the boot CPU and leave secondary allocation to the guest? We
>> already need firmware changes for SEV-SNP - may as well make this one more.
>>
>> [...]
>>
>>> +
>>> +static int sev_snp_ap_creation(struct vcpu_svm *svm)
>>> +{
>>> + struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
>>> + struct kvm_vcpu *vcpu = &svm->vcpu;
>>> + struct kvm_vcpu *target_vcpu;
>>> + struct vcpu_svm *target_svm;
>>> + unsigned int request;
>>> + unsigned int apic_id;
>>> + bool kick;
>>> + int ret;
>>> +
>>> + request = lower_32_bits(svm->vmcb->control.exit_info_1);
>>> + apic_id = upper_32_bits(svm->vmcb->control.exit_info_1);
>>> +
>>> + /* Validate the APIC ID */
>>> + target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id);
>>
>> Out of curiosity: The target CPU can be my own vCPU, right?
>>
>>
>>> + if (!target_vcpu) {
>>> + vcpu_unimpl(vcpu, "vmgexit: invalid AP APIC ID [%#x] from guest\n",
>>> + apic_id);
>>> + return -EINVAL;
>>> + }
>>> +
>>> + ret = 0;
>>> +
>>> + target_svm = to_svm(target_vcpu);
>>> +
>>> + /*
>>> + * The target vCPU is valid, so the vCPU will be kicked unless the
>>> + * request is for CREATE_ON_INIT. For any errors at this stage, the
>>> + * kick will place the vCPU in an non-runnable state.
>>> + */
>>> + kick = true;
>>> +
>>> + mutex_lock(&target_svm->sev_es.snp_vmsa_mutex);
>>> +
>>> + target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE;
>>> + target_svm->sev_es.snp_ap_create = true;
>>> +
>>> + /* Interrupt injection mode shouldn't change for AP creation */
>>> + if (request < SVM_VMGEXIT_AP_DESTROY) {
>>> + u64 sev_features;
>>> +
>>> + sev_features = vcpu->arch.regs[VCPU_REGS_RAX];
>>> + sev_features ^= sev->sev_features;
>>> + if (sev_features & SVM_SEV_FEAT_INT_INJ_MODES) {
>>> + vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n",
>>> + vcpu->arch.regs[VCPU_REGS_RAX]);
>>> + ret = -EINVAL;
>>> + goto out;
>>> + }
>>> + }
>>> +
>>> + switch (request) {
>>> + case SVM_VMGEXIT_AP_CREATE_ON_INIT:
>>> + kick = false;
>>> + fallthrough;
>>> + case SVM_VMGEXIT_AP_CREATE:
>>> + if (!page_address_valid(vcpu, svm->vmcb->control.exit_info_2)) {
>>> + vcpu_unimpl(vcpu, "vmgexit: invalid AP VMSA address [%#llx] from guest\n",
>>> + svm->vmcb->control.exit_info_2);
>>> + ret = -EINVAL;
>>> + goto out;
>>> + }
>>> +
>>> + /*
>>> + * Malicious guest can RMPADJUST a large page into VMSA which
>>> + * will hit the SNP erratum where the CPU will incorrectly signal
>>> + * an RMP violation #PF if a hugepage collides with the RMP entry
>>> + * of VMSA page, reject the AP CREATE request if VMSA address from
>>> + * guest is 2M aligned.
>>
>> This will break genuine current Linux kernels that just happen to
>> allocate a guest page, no? In fact, given enough vCPUs you're almost
>> guaranteed to hit an aligned structure somewhere. What is the guest
>> supposed to do in that situation?
>>
>>
>>> + */
>>> + if (IS_ALIGNED(svm->vmcb->control.exit_info_2, PMD_SIZE)) {
>>> + vcpu_unimpl(vcpu,
>>> + "vmgexit: AP VMSA address [%llx] from guest is unsafe as it is 2M aligned\n",
>>> + svm->vmcb->control.exit_info_2);
>>> + ret = -EINVAL;
>>> + goto out;
>>> + }
>>> +
>>> + target_svm->sev_es.snp_vmsa_gpa = svm->vmcb->control.exit_info_2;
>>> + break;
>>> + case SVM_VMGEXIT_AP_DESTROY:
>>
>> I don't understand the destroy path. Why does this case destroy anything?
>>
>>
>>> + break;
>>> + default:
>>> + vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n",
>>> + request);
>>> + ret = -EINVAL;
>>> + break;
>>> + }
>>> +
>>> +out:
>>> + if (kick) {
>>> + if (target_vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)
>>> + target_vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>>
>> What if the guest AP goes through a create -> destroy -> create cycle?
>> Will it stay runnable while destroyed?
> The code is not very straightforward.
>
> 1) target_svm->sev_es.snp_vmsa_gpa is set as INVALID_PAGE in the beginning of this function.
>
> 2) If a DESTROY is hit in this function, target_svm->sev_es.snp_vmsa_gpa will be
> left as INVALID_PAGE.
>
> 3) At the end of this function, it calls kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE).
>
> 4) In the vcpu_enter_guest(), the kvm_vcpu_reset()->sev_snp_init_protected_guest_state()
> ->__sev_snp_init_protected_guest_state() is called.
>
> 5) The mp_state is set to KVM_MP_STATE_STOPPED by default and the runtime VMSA is
> cleared. Then the it will be initialized according to the guest's
> configuration.
>
> 6) As the snp_vmsa_gpa is set as INVALID_PAGE in 1, the mp_state will be left as
> KVM_MP_STATE_STOPPED.
>
> 7) With this code piece:
>
> + kvm_vcpu_reset(vcpu, true);
> + if (vcpu->arch.mp_state != KVM_MP_STATE_RUNNABLE)
> + goto out;
>
> vcpu_enter_guest() bails out.
Thanks a lot Zhi for the detailed explanation! I think this code flow
wants to become slightly more obvious. For example, if we just said

	case SVM_VMGEXIT_AP_DESTROY:
		/*
		 * This will tell __sev_snp_update_protected_guest_state
		 * to unmap the VMSA.
		 */
		target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE;
		break;
We'd get a big win in readability with little effort. It makes it
immediately obvious where to look for the destroy operation.
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On Mon, 20 Feb 2023 12:38:43 -0600
Michael Roth <[email protected]> wrote:
> From: Brijesh Singh <[email protected]>
>
> Add support to decrypt guest encrypted memory. These API interfaces can
> be used for example to dump VMCBs on SNP guest exit.
>
What kind of checks does the firmware apply when the VMM decrypts this
page? I suppose there has to be some mechanism to prevent the VMM from
decrypting arbitrary guest pages. It would be nice to have some introduction
about it in the comments.
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> [mdr: minor commit fixups]
> Signed-off-by: Michael Roth <[email protected]>
> ---
> drivers/crypto/ccp/sev-dev.c | 32 ++++++++++++++++++++++++++++++++
> include/linux/psp-sev.h | 22 ++++++++++++++++++++--
> 2 files changed, 52 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index e65563bc8298..bf5167b2acfc 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -2017,6 +2017,38 @@ int sev_guest_df_flush(int *error)
> }
> EXPORT_SYMBOL_GPL(sev_guest_df_flush);
>
> +int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn, int *error)
> +{
> + struct sev_data_snp_dbg data = {0};
> + struct sev_device *sev;
> + int ret;
> +
> + if (!psp_master || !psp_master->sev_data)
> + return -ENODEV;
> +
> + sev = psp_master->sev_data;
> +
> + if (!sev->snp_initialized)
> + return -EINVAL;
> +
> + data.gctx_paddr = sme_me_mask | (gctx_pfn << PAGE_SHIFT);
> + data.src_addr = sme_me_mask | (src_pfn << PAGE_SHIFT);
> + data.dst_addr = sme_me_mask | (dst_pfn << PAGE_SHIFT);
> +
> + /* The destination page must be in the firmware state. */
> + if (rmp_mark_pages_firmware(data.dst_addr, 1, false))
> + return -EIO;
> +
> + ret = sev_do_cmd(SEV_CMD_SNP_DBG_DECRYPT, &data, error);
> +
> + /* Restore the page state */
> + if (snp_reclaim_pages(data.dst_addr, 1, false))
> + ret = -EIO;
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt_page);
> +
> int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
> unsigned long vaddr, unsigned long *npages, unsigned long *fw_err)
> {
> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
> index 81bafc049eca..92116e2b74fd 100644
> --- a/include/linux/psp-sev.h
> +++ b/include/linux/psp-sev.h
> @@ -710,7 +710,6 @@ struct sev_data_snp_dbg {
> u64 gctx_paddr; /* In */
> u64 src_addr; /* In */
> u64 dst_addr; /* In */
> - u32 len; /* In */
> } __packed;
>
> /**
> @@ -913,13 +912,27 @@ int sev_guest_decommission(struct sev_data_decommission *data, int *error);
> * @error: SEV command return code
> *
> * Returns:
> + * 0 if the sev successfully processed the command
> + * -%ENODEV if the sev device is not available
> + * -%ENOTSUPP if the sev does not support SEV
> + * -%ETIMEDOUT if the sev command timed out
> + * -%EIO if the sev returned a non-zero return code
> + */
> +int sev_do_cmd(int cmd, void *data, int *psp_ret);
> +
> +/**
> + * snp_guest_dbg_decrypt_page - perform SEV SNP_DBG_DECRYPT command
> + *
> + * @sev_ret: sev command return code
> + *
> + * Returns:
> * 0 if the SEV successfully processed the command
> * -%ENODEV if the SEV device is not available
> * -%ENOTSUPP if the SEV does not support SEV
> * -%ETIMEDOUT if the SEV command timed out
> * -%EIO if the SEV returned a non-zero return code
> */
> -int sev_do_cmd(int cmd, void *data, int *psp_ret);
> +int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn, int *error);
>
> void *psp_copy_user_blob(u64 uaddr, u32 len);
> void *snp_alloc_firmware_page(gfp_t mask);
> @@ -987,6 +1000,11 @@ static inline void *psp_copy_user_blob(u64 __user uaddr, u32 len) { return ERR_P
>
> void snp_mark_pages_offline(unsigned long pfn, unsigned int npages) {}
>
> +static inline int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn, int *error)
> +{
> + return -ENODEV;
> +}
> +
> static inline void *snp_alloc_firmware_page(gfp_t mask)
> {
> return NULL;
On Mon, 20 Feb 2023 12:38:44 -0600
Michael Roth <[email protected]> wrote:
> From: Ashish Kalra <[email protected]>
>
> Implement a workaround for an SNP erratum where the CPU will incorrectly
> signal an RMP violation #PF if a hugepage (2mb or 1gb) collides with the
> RMP entry of the VMSAVE target page.
>
> When SEV-SNP is globally enabled, the CPU marks the VMSAVE target page
> as "InUse" while the VMSAVE instruction is executing. If another
> CPU writes to a different page in the same 2MB region while the VMSAVE
> is executing, the CPU will throw an RMP violation #PF.
>
> Use the snp safe generic allocator for allocating the VMSA target
> page which will ensure that the page returned is not a hugepage, as it
> is already being used for the allocating the VMCB, VMSA and AVIC backing
> page.
>
This should be merged with the patch that implements snp_safe_alloc_page().
> Co-developed-by: Marc Orr <[email protected]>
> Signed-off-by: Marc Orr <[email protected]>
> Reported-by: Alper Gun <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/kvm/svm/svm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 3fe5f13b5f3a..8bda31a61757 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -665,7 +665,7 @@ static int svm_cpu_init(int cpu)
> int ret = -ENOMEM;
>
> memset(sd, 0, sizeof(struct svm_cpu_data));
> - sd->save_area = alloc_page(GFP_KERNEL | __GFP_ZERO);
> + sd->save_area = snp_safe_alloc_page(NULL);
> if (!sd->save_area)
> return ret;
>
On Wed, 1 Mar 2023 08:56:05 -0800
Dave Hansen <[email protected]> wrote:
> On 2/20/23 10:37, Michael Roth wrote:
> > The RMP check is enforced as soon as SEV-SNP is enabled. Not every memory
> > access requires an RMP check. In particular, the read accesses from the
> > hypervisor do not require RMP checks because the data confidentiality is
> > already protected via memory encryption. When hardware encounters an RMP
> > checks failure, it raises a page-fault exception. If RMP check failure
> > is due to the page-size mismatch, then split the large page to resolve
> > the fault.
>
> What does this all _mean_?
>
Unlike TDX, which implements a secure EPT to hold the restricted memory mapping,
SEV-SNP still uses a single NPT (similar to Intel's EPT) while adding another
level of HW-enforced checks controlled by the "RMP" table. Similar to TDX,
there are firmware calls to modify the attributes in the RMP table to achieve
isolation and shared/private memory conversion.
The purpose and structure of the RMP is quite similar to the PAMT table in TDX
from the perspective of managing per-page attributes: each system page has a
collection of attribute bits. But TDX only uses the PAMT as metadata, since it
has a separate secure EPT to provide the HW-enforced checks.
The RMP memory access checks have their own semantics. E.g. data writes and
page-table accesses from the VMM are checked, but data reads are not, mostly,
I guess, for performance reasons. More details can be found in
Table 15-39 "RMP Memory Access Checks" in [1].
> When does the kernel need to care about a "page-size mismatch"?
The RMP table is able to describe a large page (similar to a large-page
PAMT entry in TDX that can be demoted via TDX SEAMCALLs), e.g. a 2MB page.
When userspace sets the memory attribute of a GFN range through the
restricted memory ioctl, the SEV logic (sev_update_mem_attr() in PATCH 48, to
be precise) will try to build a large-page description in the RMP table if the
PFNs are contiguous. When the kernel mm breaks the large page due to THP, KVM
updates the NPT accordingly.
Then there is a page-size mismatch between the NPT and the RMP. It is
resolved by an RMP fault later, a kind of lazy sync.
[1] https://www.amd.com/system/files/TechDocs/24593.pdf
On 2/20/23 10:38, Michael Roth wrote:
> + /*
> + * TODO: The RMP entry's hugepage bit is ignored for
> + * shared/unassigned pages. Either handle looping through each
> + * sub-page as part of snp_make_page_shared(), or remove the
> + * level argument.
> + */
> + if (op == SNP_PAGE_STATE_PRIVATE && order &&
> + IS_ALIGNED(gfn, 1 << order) && (gfn + (1 << order)) <= end) {
> + level = order_to_level(order);
> + npages = 1 << order;
> + }
That's a wee bit obtuse.
First of all, I assume that the 'RFC' is because of these TODOs and they
won't survive to the point when you ask for this to be merged.
BTW, what keeps the restrictedmem_get_page() offset and the gfn aligned?
Let's start with this:
> +static inline u8 order_to_level(int order)
> +{
> + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> +
> + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> + return PG_LEVEL_1G;
> +
> + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> + return PG_LEVEL_2M;
> +
> + return PG_LEVEL_4K;
> +}
Right now, 'order' comes only from restrictedmem_get_page(), which I dug
out of:
> https://github.com/mdroth/linux/commit/c6792672cd11737fd255dff10b2d5b6bccc626a8
That order is *only* filled in by THPs. That makes the PG_LEVEL_1G
stuff here kinda silly. I guess it might be seen as thorough, but it's
dead code. I'd probably just make this work on order==9 || order==0 and
warn on anything else.
I'd also highly recommend some comments about how racy this all is. I
guess it probably works, but it would be good to add some comments about
page splits and collapsing.
It's also not obvious why this only cares about private pages.
Anyway, this is the exact kind of thing where I really like a
well-commented helper:
bool can_install_large_rmp_entry(gfn, order)
{
	// small pages, blah blah
	if (!order)
		return false;

	// The region being updated must be aligned
	if (!IS_ALIGNED(gfn, 1 << order))
		return false;

	// ... and fit
	if (gfn + (1 << order) > end)
		return false;

	return true;
}
Which gets used like this:
	if (op == SNP_PAGE_STATE_PRIVATE &&
	    can_install_large_rmp_entry(gfn, order)) {
		level = ...
	}
On 3/1/23 14:59, Zhi Wang wrote:
> When the userspace sets the memory attribute of a GFN range through the
> restricted memory ioctl, the sev logic (sev_update_mem_attr() in PATCH 48, to
> be precise) will try to build a large page description in the RMP table if the
> PFNs are continuous. When kernel mm breaks the the large page due to THP, KVM
> updates the NPT accordingly.
Gah, this really confused me.
It's *NOT* looking for contiguous PFNs. It's looking for a
restrictedmem THP, which really is something different. Restrictedmem
THPs have contiguous PFNs, but not all contiguous PFNs will result in
trying to build a large page.
Anyway, I'll reply over to the other patch.
But, either way, I'd appreciate this kind of summary in the changelogs
and probably a comment or two:
The RMP needs to be consistent with the contents of the NPT.
KVM updates the NPT but will neglect to update the RMP. It is
updated in response to faults when RMP and NPT get out of sync.
Right?
BTW, why doesn't KVM just update the RMP? Why bother taking the fault?
On Mon, 20 Feb 2023 12:38:45 -0600
Michael Roth <[email protected]> wrote:
For installing/uninstalling the host-wide certificates, a lock is used to
prevent races between the cert upload/download path and the handling of the
guest request path.
I guess we need a similar lock to prevent races between
snp_{set,get}_instance_certs() and snp_handle_ext_guest_request() (see the
sketch below).
This patch and PATCH 27 handle certs from different sources. It would be
better to abstract that: with common functions and a common data structure
for the certs, the installing/uninstalling, the checks, and the expectations
on the output buffer could be unified. Then we wouldn't need mostly-duplicated
routines in two different paths.
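For illustration, a minimal sketch of serializing the set path against guest
requests, assuming the existing guest_req_lock is reused (hypothetical; the
__snp_set_instance_certs() helper is made up for the example):

	static int snp_set_instance_certs(struct kvm *kvm, struct kvm_sev_cmd *argp)
	{
		struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
		int ret;

		/* Serialize against snp_handle_ext_guest_request() reading the certs. */
		mutex_lock(&sev->guest_req_lock);
		ret = __snp_set_instance_certs(kvm, argp);	/* hypothetical helper */
		mutex_unlock(&sev->guest_req_lock);

		return ret;
	}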
> From: Dionna Glaze <[email protected]>
>
> The /dev/sev device has the ability to store host-wide certificates for
> the key used by the AMD-SP for SEV-SNP attestation report signing,
> but for hosts that want to specify additional certificates that are
> specific to the image launched in a VM, a different way is needed to
> communicate those certificates.
>
> Add two new KVM ioctl to handle this: KVM_SEV_SNP_{GET,SET}_CERTS
>
> The certificates that are set with this command are expected to follow
> the same format as the host certificates, but that format is opaque
> to the kernel.
>
> The new behavior for custom certificates is that the extended guest
> request command will now return the overridden certificates if they
> were installed for the instance. The error condition for a too small
> data buffer is changed to return the overridden certificate data size
> if there is an overridden certificate set installed.
>
> Setting a 0 length certificate returns the system state to only return
> the host certificates on an extended guest request.
>
> Also increase the SEV_FW_BLOB_MAX_SIZE another 4K page to allow space
> for an extra certificate.
>
> Cc: Tom Lendacky <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
>
> Signed-off-by: Dionna Glaze <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> [mdr: remove used of "we" and "this patch" in commit log]
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 111 ++++++++++++++++++++++++++++++++++++++-
> arch/x86/kvm/svm/svm.h | 1 +
> include/linux/psp-sev.h | 2 +-
> include/uapi/linux/kvm.h | 12 +++++
> 4 files changed, 123 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 70d5650d8d95..18b64b7005e7 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2089,6 +2089,7 @@ static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
> goto e_free;
>
> sev->snp_certs_data = certs_data;
> + sev->snp_certs_len = 0;
>
> return context;
>
Better to move the fix to PATCH 45.
> @@ -2404,6 +2405,86 @@ static int snp_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
> return ret;
> }
>
> +static int snp_get_instance_certs(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct kvm_sev_snp_get_certs params;
> +
> + if (!sev_snp_guest(kvm))
> + return -ENOTTY;
> +
> + if (!sev->snp_context)
> + return -EINVAL;
> +
> + if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data,
> + sizeof(params)))
> + return -EFAULT;
> +
> + /* No instance certs set. */
> + if (!sev->snp_certs_len)
> + return -ENOENT;
> +
> + if (params.certs_len < sev->snp_certs_len) {
> + /* Output buffer too small. Return the required size. */
> + params.certs_len = sev->snp_certs_len;
> +
> + if (copy_to_user((void __user *)(uintptr_t)argp->data, ¶ms,
> + sizeof(params)))
> + return -EFAULT;
> +
> + return -EINVAL;
> + }
> +
> + if (copy_to_user((void __user *)(uintptr_t)params.certs_uaddr,
> + sev->snp_certs_data, sev->snp_certs_len))
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +static int snp_set_instance_certs(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + unsigned long length = SEV_FW_BLOB_MAX_SIZE;
> + void *to_certs = sev->snp_certs_data;
> + struct kvm_sev_snp_set_certs params;
> +
> + if (!sev_snp_guest(kvm))
> + return -ENOTTY;
> +
> + if (!sev->snp_context)
> + return -EINVAL;
> +
> + if (copy_from_user(¶ms, (void __user *)(uintptr_t)argp->data,
> + sizeof(params)))
> + return -EFAULT;
> +
> + if (params.certs_len > SEV_FW_BLOB_MAX_SIZE)
> + return -EINVAL;
> +
> + /*
> + * Setting a length of 0 is the same as "uninstalling" instance-
> + * specific certificates.
> + */
> + if (params.certs_len == 0) {
> + sev->snp_certs_len = 0;
> + return 0;
> + }
> +
> + /* Page-align the length */
> + length = (params.certs_len + PAGE_SIZE - 1) & PAGE_MASK;
> +
> + if (copy_from_user(to_certs,
> + (void __user *)(uintptr_t)params.certs_uaddr,
> + params.certs_len)) {
> + return -EFAULT;
> + }
> +
> + sev->snp_certs_len = length;
> +
> + return 0;
> +}
> +
> int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_sev_cmd sev_cmd;
> @@ -2503,6 +2584,12 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> case KVM_SEV_SNP_LAUNCH_FINISH:
> r = snp_launch_finish(kvm, &sev_cmd);
> break;
> + case KVM_SEV_SNP_GET_CERTS:
> + r = snp_get_instance_certs(kvm, &sev_cmd);
> + break;
> + case KVM_SEV_SNP_SET_CERTS:
> + r = snp_set_instance_certs(kvm, &sev_cmd);
> + break;
> default:
> r = -EINVAL;
> goto out;
> @@ -3550,8 +3637,28 @@ static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gp
> if (rc)
> goto unlock;
>
> - rc = snp_guest_ext_guest_request(&req, (unsigned long)sev->snp_certs_data,
> - &data_npages, &err);
> + /*
> + * If the VMM has overridden the certs, then change the error message
> + * if the size is inappropriate for the override. Otherwise, use a
> + * regular guest request and copy back the instance certs.
> + */
> + if (sev->snp_certs_len) {
> + if ((data_npages << PAGE_SHIFT) < sev->snp_certs_len) {
> + rc = -EINVAL;
> + err = SNP_GUEST_REQ_INVALID_LEN;
> + goto datalen;
> + }
> + rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &req,
> + (int *)&err);
> + } else {
> + rc = snp_guest_ext_guest_request(&req,
> + (unsigned long)sev->snp_certs_data,
> + &data_npages, &err);
> + }
> +datalen:
> + if (sev->snp_certs_len)
> + data_npages = sev->snp_certs_len >> PAGE_SHIFT;
> +
> if (rc) {
> /*
> * If buffer length is small then return the expected
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 221b38d3c845..dced46559508 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -94,6 +94,7 @@ struct kvm_sev_info {
> u64 snp_init_flags;
> void *snp_context; /* SNP guest context page */
> void *snp_certs_data;
> + unsigned int snp_certs_len; /* Size of instance override for certs */
> struct mutex guest_req_lock; /* Lock for guest request handling */
>
> u64 sev_features; /* Features set at VMSA creation */
> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
> index 92116e2b74fd..3b28b78938f6 100644
> --- a/include/linux/psp-sev.h
> +++ b/include/linux/psp-sev.h
> @@ -22,7 +22,7 @@
> #define __psp_pa(x) __pa(x)
> #endif
>
> -#define SEV_FW_BLOB_MAX_SIZE 0x4000 /* 16KB */
> +#define SEV_FW_BLOB_MAX_SIZE 0x5000 /* 20KB */
>
> /**
> * SEV platform state
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6e684bf5f723..ad7e24e43547 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1928,6 +1928,8 @@ enum sev_cmd_id {
> KVM_SEV_SNP_LAUNCH_START,
> KVM_SEV_SNP_LAUNCH_UPDATE,
> KVM_SEV_SNP_LAUNCH_FINISH,
> + KVM_SEV_SNP_GET_CERTS,
> + KVM_SEV_SNP_SET_CERTS,
>
> KVM_SEV_NR_MAX,
> };
> @@ -2075,6 +2077,16 @@ struct kvm_sev_snp_launch_finish {
> __u8 pad[6];
> };
>
> +struct kvm_sev_snp_get_certs {
> + __u64 certs_uaddr;
> + __u64 certs_len;
> +};
> +
> +struct kvm_sev_snp_set_certs {
> + __u64 certs_uaddr;
> + __u64 certs_len;
> +};
> +
> #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
> #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
> #define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
> > @@ -2089,6 +2089,7 @@ static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
> > goto e_free;
> >
> > sev->snp_certs_data = certs_data;
> > + sev->snp_certs_len = 0;
> >
> > return context;
> >
>
> Better to move the fix to PATCH 45.
>
This part isn't a fix, but part of the implementation, since
snp_certs_len is added in this patch here:
> > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > index 221b38d3c845..dced46559508 100644
> > --- a/arch/x86/kvm/svm/svm.h
> > +++ b/arch/x86/kvm/svm/svm.h
> > @@ -94,6 +94,7 @@ struct kvm_sev_info {
> > u64 snp_init_flags;
> > void *snp_context; /* SNP guest context page */
> > void *snp_certs_data;
> > + unsigned int snp_certs_len; /* Size of instance override for certs */
> > struct mutex guest_req_lock; /* Lock for guest request handling */
> >
> > u64 sev_features; /* Features set at VMSA creation */
--
-Dionna Glaze, PhD (she/her)
Hi Mike, Zhi,
On 01/03/2023 23:20, Zhi Wang wrote:
> On Mon, 20 Feb 2023 12:38:43 -0600
> Michael Roth <[email protected]> wrote:
>
>> From: Brijesh Singh <[email protected]>
>>
>> Add support to decrypt guest encrypted memory. These API interfaces can
>> be used for example to dump VMCBs on SNP guest exit.
>>
>
> What kinds of check will be applied from firmware when VMM decrypts this
> page? I suppose there has to be kinda mechanism to prevent VMM to decrypt
> any page in the guest. It would be nice to have some introduction about
> it in the comments.
>
The SNP ABI spec says (section 8.27.2 SNP_DBG_DECRYPT):
The firmware checks that the guest's policy allows debugging. If not,
the firmware returns POLICY_FAILURE.
and in the Guest Policy (section 4.3):
Bit 19 - DEBUG
0: Debugging is disallowed.
1: Debugging is allowed.
In the kernel, that firmware error code is defined as
SEV_RET_POLICY_FAILURE.
>> Signed-off-by: Brijesh Singh <[email protected]>
>> Signed-off-by: Ashish Kalra <[email protected]>
>> [mdr: minor commit fixups]
>> Signed-off-by: Michael Roth <[email protected]>
>> ---
>> drivers/crypto/ccp/sev-dev.c | 32 ++++++++++++++++++++++++++++++++
>> include/linux/psp-sev.h | 22 ++++++++++++++++++++--
>> 2 files changed, 52 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
>> index e65563bc8298..bf5167b2acfc 100644
>> --- a/drivers/crypto/ccp/sev-dev.c
>> +++ b/drivers/crypto/ccp/sev-dev.c
>> @@ -2017,6 +2017,38 @@ int sev_guest_df_flush(int *error)
>> }
>> EXPORT_SYMBOL_GPL(sev_guest_df_flush);
>>
>> +int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn, int *error)
>> +{
>> + struct sev_data_snp_dbg data = {0};
>> + struct sev_device *sev;
>> + int ret;
>> +
>> + if (!psp_master || !psp_master->sev_data)
>> + return -ENODEV;
>> +
>> + sev = psp_master->sev_data;
>> +
>> + if (!sev->snp_initialized)
>> + return -EINVAL;
>> +
>> + data.gctx_paddr = sme_me_mask | (gctx_pfn << PAGE_SHIFT);
>> + data.src_addr = sme_me_mask | (src_pfn << PAGE_SHIFT);
>> + data.dst_addr = sme_me_mask | (dst_pfn << PAGE_SHIFT);
I guess this works, but I wonder why we need to turn on sme_me_mask on
the dst_addr. I thought that the firmware decrypts the guest page
(src_addr) to a plaintext page. Couldn't find this requirement in the
SNP spec.
>> +
>> + /* The destination page must be in the firmware state. */
>> + if (rmp_mark_pages_firmware(data.dst_addr, 1, false))
>> + return -EIO;
>> +
>> + ret = sev_do_cmd(SEV_CMD_SNP_DBG_DECRYPT, &data, error);
>> +
>> + /* Restore the page state */
>> + if (snp_reclaim_pages(data.dst_addr, 1, false))
>> + ret = -EIO;
>> +
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt_page);
>> +
>> int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
>> unsigned long vaddr, unsigned long *npages, unsigned long *fw_err)
>> {
>> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
>> index 81bafc049eca..92116e2b74fd 100644
>> --- a/include/linux/psp-sev.h
>> +++ b/include/linux/psp-sev.h
>> @@ -710,7 +710,6 @@ struct sev_data_snp_dbg {
>> u64 gctx_paddr; /* In */
>> u64 src_addr; /* In */
>> u64 dst_addr; /* In */
>> - u32 len; /* In */
>> } __packed;
The comment above this ^^^ struct still lists the 'len' field, and also
calls the first field 'handle' instead of 'gctx_paddr'.
Also - why is this change happening in this patch? Why was the incorrect
'len' field added in the first place in "[PATCH RFC v8 20/56]
crypto:ccp: Define the SEV-SNP commands" ? (the comment fixes should
probably go there too).
>>
>> /**
>> @@ -913,13 +912,27 @@ int sev_guest_decommission(struct sev_data_decommission *data, int *error);
>> * @error: SEV command return code
>> *
>> * Returns:
>> + * 0 if the sev successfully processed the command
>> + * -%ENODEV if the sev device is not available
>> + * -%ENOTSUPP if the sev does not support SEV
>> + * -%ETIMEDOUT if the sev command timed out
>> + * -%EIO if the sev returned a non-zero return code
>> + */
I think that if the word 'sev' were 'SEV' in this comment, the diff
would be a bit less misleading (basically this patch should not introduce
changes to sev_do_cmd).
-Dov
>> +int sev_do_cmd(int cmd, void *data, int *psp_ret);
>> +
>> +/**
>> + * snp_guest_dbg_decrypt_page - perform SEV SNP_DBG_DECRYPT command
>> + *
>> + * @sev_ret: sev command return code
>> + *
>> + * Returns:
>> * 0 if the SEV successfully processed the command
>> * -%ENODEV if the SEV device is not available
>> * -%ENOTSUPP if the SEV does not support SEV
>> * -%ETIMEDOUT if the SEV command timed out
>> * -%EIO if the SEV returned a non-zero return code
>> */
>> -int sev_do_cmd(int cmd, void *data, int *psp_ret);
>> +int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn, int *error);
>>
>> void *psp_copy_user_blob(u64 uaddr, u32 len);
>> void *snp_alloc_firmware_page(gfp_t mask);
>> @@ -987,6 +1000,11 @@ static inline void *psp_copy_user_blob(u64 __user uaddr, u32 len) { return ERR_P
>>
>> void snp_mark_pages_offline(unsigned long pfn, unsigned int npages) {}
>>
>> +static inline int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn, int *error)
>> +{
>> + return -ENODEV;
>> +}
>> +
>> static inline void *snp_alloc_firmware_page(gfp_t mask)
>> {
>> return NULL;
>
On Wed, 1 Mar 2023 17:41:11 -0800
Dionna Amalie Glaze <[email protected]> wrote:
> > > @@ -2089,6 +2089,7 @@ static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
> > > goto e_free;
> > >
> > > sev->snp_certs_data = certs_data;
> > > + sev->snp_certs_len = 0;
> > >
> > > return context;
> > >
> >
> > Better to move the fix to PATCH 45.
> >
>
> This part isn't a fix, but part of the implementation since
> snp_certs_len is added in this patch here
>
I see. My bad. I was thinking it was the snp_certs_len in the global sev as
they have the same name.
> > > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > > index 221b38d3c845..dced46559508 100644
> > > --- a/arch/x86/kvm/svm/svm.h
> > > +++ b/arch/x86/kvm/svm/svm.h
> > > @@ -94,6 +94,7 @@ struct kvm_sev_info {
> > > u64 snp_init_flags;
> > > void *snp_context; /* SNP guest context page */
> > > void *snp_certs_data;
> > > + unsigned int snp_certs_len; /* Size of instance override for certs */
> > > struct mutex guest_req_lock; /* Lock for guest request handling */
> > >
> > > u64 sev_features; /* Features set at VMSA creation */
>
>
On 20/02/2023 20:38, Michael Roth wrote:
> From: Dionna Glaze <[email protected]>
>
> The /dev/sev device has the ability to store host-wide certificates for
> the key used by the AMD-SP for SEV-SNP attestation report signing,
> but for hosts that want to specify additional certificates that are
> specific to the image launched in a VM, a different way is needed to
> communicate those certificates.
>
> Add two new KVM ioctl to handle this: KVM_SEV_SNP_{GET,SET}_CERTS
>
> The certificates that are set with this command are expected to follow
> the same format as the host certificates, but that format is opaque
> to the kernel.
>
> The new behavior for custom certificates is that the extended guest
> request command will now return the overridden certificates if they
> were installed for the instance. The error condition for a too small
> data buffer is changed to return the overridden certificate data size
> if there is an overridden certificate set installed.
>
> Setting a 0 length certificate returns the system state to only return
> the host certificates on an extended guest request.
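To make the intended usage concrete, here is a rough userspace sketch on my
side (error handling and the usual sev_fd plumbing omitted, buffer names made
up), issuing the new command via KVM_MEMORY_ENCRYPT_OP on the VM fd:

	struct kvm_sev_snp_set_certs certs = {
		.certs_uaddr = (__u64)(uintptr_t)certs_buf,
		.certs_len = certs_buf_len,
	};
	struct kvm_sev_cmd cmd = {
		.id = KVM_SEV_SNP_SET_CERTS,
		.data = (__u64)(uintptr_t)&certs,
	};

	/* Passing certs_len = 0 clears the instance override again. */
	ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);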
>
> Also increase the SEV_FW_BLOB_MAX_SIZE another 4K page to allow space
> for an extra certificate.
>
> Cc: Tom Lendacky <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
>
> Signed-off-by: Dionna Glaze <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> [mdr: remove used of "we" and "this patch" in commit log]
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 111 ++++++++++++++++++++++++++++++++++++++-
> arch/x86/kvm/svm/svm.h | 1 +
> include/linux/psp-sev.h | 2 +-
> include/uapi/linux/kvm.h | 12 +++++
> 4 files changed, 123 insertions(+), 3 deletions(-)
>
[...]
> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
> index 92116e2b74fd..3b28b78938f6 100644
> --- a/include/linux/psp-sev.h
> +++ b/include/linux/psp-sev.h
> @@ -22,7 +22,7 @@
> #define __psp_pa(x) __pa(x)
> #endif
>
> -#define SEV_FW_BLOB_MAX_SIZE 0x4000 /* 16KB */
> +#define SEV_FW_BLOB_MAX_SIZE 0x5000 /* 20KB */
This change should be removed (it was also discussed in v7). If I
understand correctly, 16KB is a limit of the PSP.
-Dov
>
> /**
> * SEV platform state
On 3/1/23 23:59, Dov Murik wrote:
> Hi Mike, Zhi,
>
> On 01/03/2023 23:20, Zhi Wang wrote:
>> On Mon, 20 Feb 2023 12:38:43 -0600
>> Michael Roth <[email protected]> wrote:
>>
>>> From: Brijesh Singh <[email protected]>
>>>
>>> Add support to decrypt guest encrypted memory. These API interfaces can
>>> be used for example to dump VMCBs on SNP guest exit.
>>>
>>
>> What kinds of checks will be applied by the firmware when the VMM decrypts
>> this page? I suppose there has to be some kind of mechanism to prevent the
>> VMM from decrypting arbitrary pages in the guest. It would be nice to have
>> some introduction about it in the comments.
>>
>
> The SNP ABI spec says (section 8.27.2 SNP_DBG_DECRYPT):
>
> The firmware checks that the guest's policy allows debugging. If not,
> the firmware returns POLICY_FAILURE.
>
> and in the Guest Policy (section 4.3):
>
> Bit 19 - DEBUG
> 0: Debugging is disallowed.
> 1: Debugging is allowed.
>
> In the kernel, that firmware error code is defined as
> SEV_RET_POLICY_FAILURE.
>
>
>>> Signed-off-by: Brijesh Singh <[email protected]>
>>> Signed-off-by: Ashish Kalra <[email protected]>
>>> [mdr: minor commit fixups]
>>> Signed-off-by: Michael Roth <[email protected]>
>>> ---
>>> drivers/crypto/ccp/sev-dev.c | 32 ++++++++++++++++++++++++++++++++
>>> include/linux/psp-sev.h | 22 ++++++++++++++++++++--
>>> 2 files changed, 52 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
>>> index e65563bc8298..bf5167b2acfc 100644
>>> --- a/drivers/crypto/ccp/sev-dev.c
>>> +++ b/drivers/crypto/ccp/sev-dev.c
>>> @@ -2017,6 +2017,38 @@ int sev_guest_df_flush(int *error)
>>> }
>>> EXPORT_SYMBOL_GPL(sev_guest_df_flush);
>>>
>>> +int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn, int *error)
>>> +{
>>> + struct sev_data_snp_dbg data = {0};
>>> + struct sev_device *sev;
>>> + int ret;
>>> +
>>> + if (!psp_master || !psp_master->sev_data)
>>> + return -ENODEV;
>>> +
>>> + sev = psp_master->sev_data;
>>> +
>>> + if (!sev->snp_initialized)
>>> + return -EINVAL;
>>> +
>>> + data.gctx_paddr = sme_me_mask | (gctx_pfn << PAGE_SHIFT);
>>> + data.src_addr = sme_me_mask | (src_pfn << PAGE_SHIFT);
>>> + data.dst_addr = sme_me_mask | (dst_pfn << PAGE_SHIFT);
>
> I guess this works, but I wonder why we need to turn on sme_me_mask on
> the dst_addr. I thought that the firmware decrypts the guest page
> (src_addr) to a plaintext page. Couldn't find this requirement in the
> SNP spec.
This sme_me_mask tells the firmware how to access the host memory (similar
to how DMA uses sme_me_mask when supplying addresses to devices under
SME). This needs to match the pagetable mapping being used by the host,
otherwise the contents will appear as ciphertext to the host if they are
not in sync. Since the default pagetable mapping is encrypted, the
sme_me_mask bit must be provided on the destination address. So it is not
a spec requirement, but an SME implementation requirement.
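FWIW, the same thing could also be written with the existing helper, which
might read a little clearer (just a suggestion, not required):

	data.gctx_paddr = __sme_set(gctx_pfn << PAGE_SHIFT);
	data.src_addr   = __sme_set(src_pfn << PAGE_SHIFT);
	data.dst_addr   = __sme_set(dst_pfn << PAGE_SHIFT);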
Thanks,
Tom
>
>
>>> +
>>> + /* The destination page must be in the firmware state. */
>>> + if (rmp_mark_pages_firmware(data.dst_addr, 1, false))
>>> + return -EIO;
>>> +
>>> + ret = sev_do_cmd(SEV_CMD_SNP_DBG_DECRYPT, &data, error);
>>> +
>>> + /* Restore the page state */
>>> + if (snp_reclaim_pages(data.dst_addr, 1, false))
>>> + ret = -EIO;
>>> +
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt_page);
>>> +
>>> int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
>>> unsigned long vaddr, unsigned long *npages, unsigned long *fw_err)
>>> {
>>> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
>>> index 81bafc049eca..92116e2b74fd 100644
>>> --- a/include/linux/psp-sev.h
>>> +++ b/include/linux/psp-sev.h
>>> @@ -710,7 +710,6 @@ struct sev_data_snp_dbg {
>>> u64 gctx_paddr; /* In */
>>> u64 src_addr; /* In */
>>> u64 dst_addr; /* In */
>>> - u32 len; /* In */
>>> } __packed;
>
> The comment above this ^^^ struct still lists the 'len' field, and also
> calls the first field 'handle' instead of 'gctx_paddr'.
>
> Also - why is this change happening in this patch? Why was the incorrect
> 'len' field added in the first place in "[PATCH RFC v8 20/56]
> crypto:ccp: Define the SEV-SNP commands" ? (the comment fixes should
> probably go there too).
>
>
>
>>>
>>> /**
>>> @@ -913,13 +912,27 @@ int sev_guest_decommission(struct sev_data_decommission *data, int *error);
>>> * @error: SEV command return code
>>> *
>>> * Returns:
>>> + * 0 if the sev successfully processed the command
>>> + * -%ENODEV if the sev device is not available
>>> + * -%ENOTSUPP if the sev does not support SEV
>>> + * -%ETIMEDOUT if the sev command timed out
>>> + * -%EIO if the sev returned a non-zero return code
>>> + */
>
> I think that if the word 'sev' would be 'SEV' in this comment, the diff
> will be a bit less misleading (basically this patch should not introduce
> changes to sev_do_cmd).
>
> -Dov
>
>>> +int sev_do_cmd(int cmd, void *data, int *psp_ret);
>>> +
>>> +/**
>>> + * snp_guest_dbg_decrypt_page - perform SEV SNP_DBG_DECRYPT command
>>> + *
>>> + * @sev_ret: sev command return code
>>> + *
>>> + * Returns:
>>> * 0 if the SEV successfully processed the command
>>> * -%ENODEV if the SEV device is not available
>>> * -%ENOTSUPP if the SEV does not support SEV
>>> * -%ETIMEDOUT if the SEV command timed out
>>> * -%EIO if the SEV returned a non-zero return code
>>> */
>>> -int sev_do_cmd(int cmd, void *data, int *psp_ret);
>>> +int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn, int *error);
>>>
>>> void *psp_copy_user_blob(u64 uaddr, u32 len);
>>> void *snp_alloc_firmware_page(gfp_t mask);
>>> @@ -987,6 +1000,11 @@ static inline void *psp_copy_user_blob(u64 __user uaddr, u32 len) { return ERR_P
>>>
>>> void snp_mark_pages_offline(unsigned long pfn, unsigned int npages) {}
>>>
>>> +static inline int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn, int *error)
>>> +{
>>> + return -ENODEV;
>>> +}
>>> +
>>> static inline void *snp_alloc_firmware_page(gfp_t mask)
>>> {
>>> return NULL;
>>
On 02/03/2023 16:33, Tom Lendacky wrote:
> On 3/1/23 23:59, Dov Murik wrote:
>> Hi Mike, Zhi,
>>
>> On 01/03/2023 23:20, Zhi Wang wrote:
>>> On Mon, 20 Feb 2023 12:38:43 -0600
>>> Michael Roth <[email protected]> wrote:
>>>
>>>> From: Brijesh Singh <[email protected]>
>>>>
>>>> Add support to decrypt guest encrypted memory. These API interfaces can
>>>> be used for example to dump VMCBs on SNP guest exit.
>>>>
>>>
>>> What kinds of checks will be applied by the firmware when the VMM decrypts
>>> this page? I suppose there has to be some kind of mechanism to prevent the
>>> VMM from decrypting arbitrary pages in the guest. It would be nice to have
>>> some introduction about it in the comments.
>>>
>>
>> The SNP ABI spec says (section 8.27.2 SNP_DBG_DECRYPT):
>>
>>    The firmware checks that the guest's policy allows debugging. If not,
>>    the firmware returns POLICY_FAILURE.
>>
>> and in the Guest Policy (section 4.3):
>>
>>    Bit 19 - DEBUG
>>    0: Debugging is disallowed.
>>    1: Debugging is allowed.
>>
>> In the kernel, that firmware error code is defined as
>> SEV_RET_POLICY_FAILURE.
>>
>>
>>>> Signed-off-by: Brijesh Singh <[email protected]>
>>>> Signed-off-by: Ashish Kalra <[email protected]>
>>>> [mdr: minor commit fixups]
>>>> Signed-off-by: Michael Roth <[email protected]>
>>>> ---
>>>>  drivers/crypto/ccp/sev-dev.c | 32 ++++++++++++++++++++++++++++++++
>>>>  include/linux/psp-sev.h      | 22 ++++++++++++++++++++--
>>>>  2 files changed, 52 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/crypto/ccp/sev-dev.c
>>>> b/drivers/crypto/ccp/sev-dev.c
>>>> index e65563bc8298..bf5167b2acfc 100644
>>>> --- a/drivers/crypto/ccp/sev-dev.c
>>>> +++ b/drivers/crypto/ccp/sev-dev.c
>>>> @@ -2017,6 +2017,38 @@ int sev_guest_df_flush(int *error)
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(sev_guest_df_flush);
>>>> +int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64
>>>> dst_pfn, int *error)
>>>> +{
>>>> +    struct sev_data_snp_dbg data = {0};
>>>> +    struct sev_device *sev;
>>>> +    int ret;
>>>> +
>>>> +    if (!psp_master || !psp_master->sev_data)
>>>> +        return -ENODEV;
>>>> +
>>>> +    sev = psp_master->sev_data;
>>>> +
>>>> +    if (!sev->snp_initialized)
>>>> +        return -EINVAL;
>>>> +
>>>> +    data.gctx_paddr = sme_me_mask | (gctx_pfn << PAGE_SHIFT);
>>>> +    data.src_addr = sme_me_mask | (src_pfn << PAGE_SHIFT);
>>>> +    data.dst_addr = sme_me_mask | (dst_pfn << PAGE_SHIFT);
>>
>> I guess this works, but I wonder why we need to turn on sme_me_mask on
>> the dst_addr. I thought that the firmware decrypts the guest page
>> (src_addr) to a plaintext page. Couldn't find this requirement in the
>> SNP spec.
>
> This sme_me_mask tells the firmware how to access the host memory
> (similar to how DMA uses sme_me_mask when supplying addresses to devices
> under SME). This needs to match the pagetable mapping being used by the
> host, otherwise the contents will appear as ciphertext to the host if
> they are not in sync. Since the default pagetable mapping is encrypted,
> the sme_me_mask bit must be provided on the destination address. So it
> is not a spec requirement, but an SME implementation requirement.
>
Ah, OK, that's clear now. Thanks Tom.
-Dov
On 2/20/23 19:37, Michael Roth wrote:
> From: Nikunj A Dadhania <[email protected]>
>
> Rename sev_{pin|unpin}_memory to sev_memory_{get|put}_pages. Apart
> from pinning the pages, sev_pin_memory also populates the pages array
> which is used by its callers. SEV guest using restricted memfd do not
> to pin the memory but will require the pages array to be populated.
^need to?
> Rename the function appropriately.
>
> No functional change intended.
>
> Signed-off-by: Nikunj A Dadhania <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
On 2/20/23 19:38, Michael Roth wrote:
> From: Brijesh Singh <[email protected]>
>
> The snp_lookup_page_in_rmptable() can be used by the host to read the RMP
> entry for a given page. The RMP entry format is documented in AMD PPR, see
> https://bugzilla.kernel.org/attachment.cgi?id=296015.
>
> Co-developed-by: Ashish Kalra <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> +/*
> + * Return 1 if the RMP entry is assigned, 0 if it exists but is not assigned,
> + * and -errno if there is no corresponding RMP entry.
> + */
Hmm, IMHO the kernel's idiomatic way is to return 0 on "success", and I'd
assume the more intuitive notion of success here is that the entry is
assigned? The various callers seem to differ though, so I guess it depends on
context. Some, however, don't distinguish their "failure" from an ERR, and
maybe they should, at least for the purposes of the various printks?
> +int snp_lookup_rmpentry(u64 pfn, int *level)
> +{
> + struct rmpentry *e;
> +
> + e = __snp_lookup_rmpentry(pfn, level);
> + if (IS_ERR(e))
> + return PTR_ERR(e);
> +
> + return !!rmpentry_assigned(e);
> +}
> +EXPORT_SYMBOL_GPL(snp_lookup_rmpentry);
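Just to spell out the three outcomes a caller currently has to handle with
this convention (rough sketch, not from the patch):

	int rc = snp_lookup_rmpentry(pfn, &level);

	if (rc < 0)
		pr_err("no RMP entry for pfn 0x%llx: %d\n", pfn, rc);
	else if (rc)
		pr_debug("pfn 0x%llx is assigned\n", pfn);	/* guest/firmware owned */
	else
		pr_debug("pfn 0x%llx is not assigned\n", pfn);	/* hypervisor/shared */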
On 2/20/23 19:38, Michael Roth wrote:
> +static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
> + unsigned long address)
> +{
> + int rmp_level, level;
> + pgd_t *pgd;
> + pte_t *pte;
> + u64 pfn;
> +
> + pgd = __va(read_cr3_pa());
> + pgd += pgd_index(address);
> +
> + pte = lookup_address_in_pgd(pgd, address, &level);
> +
> + /*
> + * It can happen if there was a race between an unmap event and
> + * the RMP fault delivery.
> + */
> + if (!pte || !pte_present(*pte))
> + return RMP_PF_UNMAP;
> +
> + /*
> + * RMP page fault handler follows this algorithm:
> + * 1. Compute the pfn for the 4kb page being accessed
> + * 2. Read that RMP entry -- If it is assigned then kill the process
> + * 3. Otherwise, check the level from the host page table
> + * If level=PG_LEVEL_4K then the page is already smashed
> + * so just retry the instruction
> + * 4. If level=PG_LEVEL_2M/1G, then the host page needs to be split
> + */
> +
> + pfn = pte_pfn(*pte);
> +
> + /* If its large page then calculte the fault pfn */
> + if (level > PG_LEVEL_4K)
> + pfn = pfn | PFN_DOWN(address & (page_level_size(level) - 1));
> +
> + /*
> + * If its a guest private page, then the fault cannot be resolved.
> + * Send a SIGBUS to terminate the process.
> + *
> + * As documented in APM vol3 pseudo-code for RMPUPDATE, when the 2M range
> + * is covered by a valid (Assigned=1) 2M entry, the middle 511 4k entries
> + * also have Assigned=1. This means that if there is an access to a page
> + * which happens to lie within an Assigned 2M entry, the 4k RMP entry
> + * will also have Assigned=1. Therefore, the kernel should see that
> + * the page is not a valid page and the fault cannot be resolved.
> + */
> + if (snp_lookup_rmpentry(pfn, &rmp_level)) {
> + pr_info("Fatal RMP page fault, terminating process, entry assigned for pfn 0x%llx\n",
> + pfn);
> + do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
> + return RMP_PF_RETRY;
> + }
WRT my reply to 12/56, for example here it might be useful to distinguish
the rmp being assigned from an error of snp_lookup_rmpentry()?
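I.e. something along these lines (untested sketch, just to illustrate the
distinction):

	int ret = snp_lookup_rmpentry(pfn, &rmp_level);

	if (ret < 0) {
		pr_warn("failed to read RMP entry for pfn 0x%llx: %d\n", pfn, ret);
		return RMP_PF_RETRY;
	} else if (ret) {
		pr_info("Fatal RMP page fault, terminating process, entry assigned for pfn 0x%llx\n",
			pfn);
		do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
		return RMP_PF_RETRY;
	}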
> +
> + /*
> + * The backing page level is higher than the RMP page level, request
> + * to split the page.
> + */
> + if (level > rmp_level)
> + return RMP_PF_SPLIT;
> +
> + return RMP_PF_RETRY;
> +}
> +
> /*
On 2/20/23 19:38, Michael Roth wrote:
> From: Ashish Kalra <[email protected]>
>
> Pages are unsafe to be released back to the page-allocator, if they
> have been transitioned to firmware/guest state and can't be reclaimed
> or transitioned back to hypervisor/shared state. In this case add
> them to an internal leaked pages list to ensure that they are not freed
> or touched/accessed to cause fatal page faults.
>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> drivers/crypto/ccp/sev-dev.c | 28 ++++++++++++++++++++++++++++
> include/linux/psp-sev.h | 8 ++++++++
> 2 files changed, 36 insertions(+)
>
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index 35f605936f1b..eca4e59b0f44 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -42,6 +42,12 @@
> static DEFINE_MUTEX(sev_cmd_mutex);
> static struct sev_misc_dev *misc_dev;
>
> +/* list of pages which are leaked and cannot be reclaimed */
> +static LIST_HEAD(snp_leaked_pages_list);
> +static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
> +
> +static atomic_long_t snp_nr_leaked_pages = ATOMIC_LONG_INIT(0);
> +
> static int psp_cmd_timeout = 100;
> module_param(psp_cmd_timeout, int, 0644);
> MODULE_PARM_DESC(psp_cmd_timeout, " default timeout value, in seconds, for PSP commands");
> @@ -188,6 +194,28 @@ static int sev_cmd_buffer_len(int cmd)
> return 0;
> }
>
> +void snp_mark_pages_offline(unsigned long pfn, unsigned int npages)
Why call it offline, which usually has a memory hotplug-related meaning? What
about e.g. snp_leak_bad_pages()?
> +{
> + struct page *page = pfn_to_page(pfn);
> +
> + WARN(1, "psc failed, pfn 0x%lx pages %d (marked offline)\n", pfn, npages);
> +
> + spin_lock(&snp_leaked_pages_list_lock);
> + while (npages--) {
> + /*
> + * Reuse the page's buddy list for chaining into the leaked
> + * pages list. This page should not be on a free list currently
> + * and is also unsafe to be added to a free list.
> + */
> + list_add_tail(&page->buddy_list, &snp_leaked_pages_list);
> + sev_dump_rmpentry(pfn);
> + pfn++;
> + }
> + spin_unlock(&snp_leaked_pages_list_lock);
> + atomic_long_inc(&snp_nr_leaked_pages);
> +}
> +EXPORT_SYMBOL_GPL(snp_mark_pages_offline);
> +
> static void *sev_fw_alloc(unsigned long len)
> {
> struct page *page;
> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
> index 46f61e3ae33b..8edf5c548fbf 100644
> --- a/include/linux/psp-sev.h
> +++ b/include/linux/psp-sev.h
> @@ -923,6 +923,12 @@ int sev_do_cmd(int cmd, void *data, int *psp_ret);
>
> void *psp_copy_user_blob(u64 uaddr, u32 len);
>
> +/**
> + * sev_mark_pages_offline - insert non-reclaimed firmware/guest pages
> + * into a leaked pages list.
> + */
> +void snp_mark_pages_offline(unsigned long pfn, unsigned int npages);
> +
> #else /* !CONFIG_CRYPTO_DEV_SP_PSP */
>
> static inline int
> @@ -951,6 +957,8 @@ sev_issue_cmd_external_user(struct file *filep, unsigned int id, void *data, int
>
> static inline void *psp_copy_user_blob(u64 __user uaddr, u32 len) { return ERR_PTR(-EINVAL); }
>
> +void snp_mark_pages_offline(unsigned long pfn, unsigned int npages) {}
> +
> #endif /* CONFIG_CRYPTO_DEV_SP_PSP */
>
> #endif /* __PSP_SEV_H__ */
On 03/03/23 19:30, Vlastimil Babka wrote:
> On 2/20/23 19:37, Michael Roth wrote:
>> From: Nikunj A Dadhania <[email protected]>
>>
>> Rename sev_{pin|unpin}_memory to sev_memory_{get|put}_pages. Apart
>> from pinning the pages, sev_pin_memory also populates the pages array
>> which is used by its callers. SEV guest using restricted memfd do not
>> to pin the memory but will require the pages array to be populated.
>
> ^need to?
>
Sure
Regards,
Nikunj
On Mon, Feb 20, 2023 at 12:37:52PM -0600,
Michael Roth <[email protected]> wrote:
> This callback is used by the KVM MMU to check whether a #NPF was for a
> private GPA or not.
>
> In some cases the full 64-bit error code for the #NPF will be needed to
> make this determination, so also update kvm_mmu_do_page_fault() to
> accept the full 64-bit value so it can be plumbed through to the
> callback.
We can split the 64-bit part into an independent patch.
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 3 +--
> arch/x86/kvm/mmu/mmu_internal.h | 37 +++++++++++++++++++++++++++---
> 4 files changed, 37 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 8dc345cc6318..72183da010b8 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -131,6 +131,7 @@ KVM_X86_OP(msr_filter_changed)
> KVM_X86_OP(complete_emulated_msr)
> KVM_X86_OP(vcpu_deliver_sipi_vector)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> +KVM_X86_OP_OPTIONAL_RET0(fault_is_private);
>
> #undef KVM_X86_OP
> #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e552374f2357..f856d689dda0 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1643,6 +1643,7 @@ struct kvm_x86_ops {
>
> void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> int root_level);
> + bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code, bool *private_fault);
>
> bool (*has_wbinvd_exit)(void);
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index eda615f3951c..fb3f34b7391c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5724,8 +5724,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> }
>
> if (r == RET_PF_INVALID) {
> - r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
> - lower_32_bits(error_code), false);
> + r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false);
> if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
> return -EIO;
> }
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index e642d431df4b..557a001210df 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -231,6 +231,37 @@ struct kvm_page_fault {
>
> int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>
> +static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
> +{
> + struct kvm_memory_slot *slot;
> + bool private_fault = false;
> + gfn_t gfn = gpa_to_gfn(gpa);
> +
> + slot = gfn_to_memslot(kvm, gfn);
> + if (!slot) {
> + pr_debug("%s: no slot, GFN: 0x%llx\n", __func__, gfn);
> + goto out;
> + }
> +
> + if (!kvm_slot_can_be_private(slot)) {
> + pr_debug("%s: slot is not private, GFN: 0x%llx\n", __func__, gfn);
> + goto out;
> + }
> +
> + if (static_call(kvm_x86_fault_is_private)(kvm, gpa, err, &private_fault))
> + goto out;
> +
> + /*
> + * Handling below is for UPM self-tests and guests that treat userspace
> + * as the authority on whether a fault should be private or not.
> + */
> + private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> +
> +out:
> + pr_debug("%s: GFN: 0x%llx, private: %d\n", __func__, gfn, private_fault);
> + return private_fault;
> +}
> +
> /*
> * Return values of handle_mmio_page_fault(), mmu.page_fault(), fast_page_fault(),
> * and of course kvm_mmu_do_page_fault().
> @@ -262,11 +293,11 @@ enum {
> };
>
> static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> - u32 err, bool prefetch)
> + u64 err, bool prefetch)
> {
> struct kvm_page_fault fault = {
> .addr = cr2_or_gpa,
> - .error_code = err,
> + .error_code = lower_32_bits(err),
> .exec = err & PFERR_FETCH_MASK,
> .write = err & PFERR_WRITE_MASK,
> .present = err & PFERR_PRESENT_MASK,
> @@ -280,7 +311,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> .max_level = KVM_MAX_HUGEPAGE_LEVEL,
> .req_level = PG_LEVEL_4K,
> .goal_level = PG_LEVEL_4K,
> - .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
> + .is_private = kvm_mmu_fault_is_private(vcpu->kvm, cr2_or_gpa, err),
I don't think we need kvm_mmu_fault_is_private() here; it's too heavy. We
can make it its own op, i.e. the following.
From b0f914a1a4d154f076c0294831ce9ef0df7eb3d3 Mon Sep 17 00:00:00 2001
Message-Id: <b0f914a1a4d154f076c0294831ce9ef0df7eb3d3.1679114841.git.isaku.yamahata@intel.com>
In-Reply-To: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
References: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
From: Isaku Yamahata <[email protected]>
Date: Fri, 17 Mar 2023 11:18:13 -0700
Subject: [PATCH 2/4] KVM: x86: Add 'fault_is_private' x86 op
This callback is used by the KVM MMU to check whether a KVM page fault was
for a private GPA or not.
Originally-by: Michael Roth <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu.h | 19 +++++++++++++++++++
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
arch/x86/kvm/x86.c | 8 ++++++++
5 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index e1f57905c8fe..dc5f18ac0bd5 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -99,6 +99,7 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP(fault_is_private)
KVM_X86_OP_OPTIONAL(link_private_spt)
KVM_X86_OP_OPTIONAL(free_private_spt)
KVM_X86_OP_OPTIONAL(split_private_spt)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 59196a80c3c8..0382d236fbf4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1730,6 +1730,7 @@ struct kvm_x86_ops {
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);
+ bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code);
int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
void *private_spt);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 4aaef2132b97..1f21680b9b97 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -289,6 +289,25 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
return translate_nested_gpa(vcpu, gpa, access, exception);
}
+static inline bool kvm_mmu_fault_is_private_default(struct kvm *kvm, gpa_t gpa, u64 err)
+{
+ struct kvm_memory_slot *slot;
+ gfn_t gfn = gpa_to_gfn(gpa);
+
+ slot = gfn_to_memslot(kvm, gfn);
+ if (!slot)
+ return false;
+
+ if (!kvm_slot_can_be_private(slot))
+ return false;
+
+ /*
+ * Handling below is for UPM self-tests and guests that treat userspace
+ * as the authority on whether a fault should be private or not.
+ */
+ return kvm_mem_is_private(kvm, gfn);
+}
+
static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm)
{
#ifdef CONFIG_KVM_MMU_PRIVATE
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index bb5709f1cb57..6b54b069d1ed 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -445,7 +445,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.max_level = vcpu->kvm->arch.tdp_max_page_level,
.req_level = PG_LEVEL_4K,
.goal_level = PG_LEVEL_4K,
- .is_private = kvm_is_private_gpa(vcpu->kvm, cr2_or_gpa),
+ .is_private = static_call(kvm_x86_fault_is_private)(vcpu->kvm, cr2_or_gpa, err),
};
int r;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fd14368c6bc8..0311ab450330 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9419,6 +9419,14 @@ static inline void kvm_ops_update(struct kvm_x86_init_ops *ops)
#undef __KVM_X86_OP
kvm_pmu_ops_update(ops->pmu_ops);
+
+ /*
+ * TODO: Once all backend fills this option, remove this and the default
+ * function.
+ */
+ if (!ops->runtime_ops->fault_is_private)
+ static_call_update(kvm_x86_fault_is_private,
+ kvm_mmu_fault_is_private_default);
}
static int kvm_x86_check_processor_compatibility(void)
--
2.25.1
--
Isaku Yamahata <[email protected]>
On Mon, Feb 20, 2023 at 12:37:52PM -0600,
Michael Roth <[email protected]> wrote:
> This callback is used by the KVM MMU to check whether a #NPF was for a
> private GPA or not.
>
> In some cases the full 64-bit error code for the #NPF will be needed to
> make this determination, so also update kvm_mmu_do_page_fault() to
> accept the full 64-bit value so it can be plumbed through to the
> callback.
Here is a patch to change error code 64-bit.
From 428a676face7a06a90e59dca1c32941c9b6ee001 Mon Sep 17 00:00:00 2001
Message-Id: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
From: Isaku Yamahata <[email protected]>
Date: Fri, 17 Mar 2023 12:58:42 -0700
Subject: [PATCH 1/4] KVM: x86/mmu: Pass round full 64-bit error code for the
KVM page fault
In some cases the full 64-bit error code for the KVM page fault will be
needed to make this determination, so also update kvm_mmu_do_page_fault()
to accept the full 64-bit value so it can be plumbed through to the
callback.
The upper 32 bits of the error code are currently discarded at
kvm_mmu_page_fault() by lower_32_bits(). Now it's passed down as the full
64 bits. It turns out that only FNAME(page_fault) depends on the masking,
so move lower_32_bits() into FNAME(page_fault).
The accesses of fault->error_code are as follows
- FNAME(page_fault): change to explicitly use lower_32_bits()
- kvm_tdp_page_fault(): explicit mask with PFERR_LEVEL_MASK
- kvm_mmu_page_fault(): explicit mask with PFERR_RSVD_MASK,
PFERR_NESTED_GUEST_PAGE
- mmutrace: changed u32 -> u64
- pgprintk(): change %x -> %llx
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 7 +++----
arch/x86/kvm/mmu/mmu_internal.h | 4 ++--
arch/x86/kvm/mmu/mmutrace.h | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 4 ++--
5 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index de9c6b98c41b..4aaef2132b97 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -156,7 +156,7 @@ static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
}
kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
- u32 error_code, int max_level);
+ u64 error_code, int max_level);
/*
* Check if a given access (described through the I/D, W/R and U/S bits of a
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 960609d72dd6..0ec94c72895c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4860,7 +4860,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
static int nonpaging_page_fault(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
{
- pgprintk("%s: gva %llx error %x\n", __func__, fault->addr, fault->error_code);
+ pgprintk("%s: gva %llx error %llx\n", __func__, fault->addr, fault->error_code);
/* This path builds a PAE pagetable, we can map 2mb pages at maximum. */
fault->max_level = PG_LEVEL_2M;
@@ -4986,7 +4986,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
}
kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
- u32 error_code, int max_level)
+ u64 error_code, int max_level)
{
int r;
struct kvm_page_fault fault = (struct kvm_page_fault) {
@@ -6238,8 +6238,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
}
if (r == RET_PF_INVALID) {
- r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
- lower_32_bits(error_code), false);
+ r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false);
if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
return -EIO;
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index aa0836191b5a..bb5709f1cb57 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -341,7 +341,7 @@ static inline bool is_nx_huge_page_enabled(struct kvm *kvm)
struct kvm_page_fault {
/* arguments to kvm_mmu_do_page_fault. */
const gpa_t addr;
- const u32 error_code;
+ const u64 error_code;
const bool prefetch;
/* Derived from error_code. */
@@ -427,7 +427,7 @@ enum {
};
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- u32 err, bool prefetch)
+ u64 err, bool prefetch)
{
struct kvm_page_fault fault = {
.addr = cr2_or_gpa,
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index 2d7555381955..2e77883c92f6 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -261,7 +261,7 @@ TRACE_EVENT(
TP_STRUCT__entry(
__field(int, vcpu_id)
__field(gpa_t, cr2_or_gpa)
- __field(u32, error_code)
+ __field(u64, error_code)
__field(u64 *, sptep)
__field(u64, old_spte)
__field(u64, new_spte)
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 594af2e1fd2f..cab6822709e2 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -791,7 +791,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
int r;
bool is_self_change_mapping;
- pgprintk("%s: addr %llx err %x\n", __func__, fault->addr, fault->error_code);
+ pgprintk("%s: addr %llx err %llx\n", __func__, fault->addr, fault->error_code);
WARN_ON_ONCE(fault->is_tdp);
/*
@@ -800,7 +800,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* The bit needs to be cleared before walking guest page tables.
*/
r = FNAME(walk_addr)(&walker, vcpu, fault->addr,
- fault->error_code & ~PFERR_RSVD_MASK);
+ lower_32_bits(fault->error_code) & ~PFERR_RSVD_MASK);
/*
* The page is not mapped by the guest. Let the guest handle it.
--
2.25.1
--
Isaku Yamahata <[email protected]>
On Mon, Feb 20, 2023 at 12:37:53PM -0600,
Michael Roth <[email protected]> wrote:
> This callback will do any platform-specific handling needed for
> converting pages between shared/private.
>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
> include/linux/kvm_host.h | 4 ++++
> virt/kvm/kvm_main.c | 29 +++++++++++++++++++++++++++++
> 5 files changed, 49 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 72183da010b8..a8aaf532c2ab 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -132,6 +132,7 @@ KVM_X86_OP(complete_emulated_msr)
> KVM_X86_OP(vcpu_deliver_sipi_vector)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> KVM_X86_OP_OPTIONAL_RET0(fault_is_private);
> +KVM_X86_OP_OPTIONAL_RET0(update_mem_attr)
>
> #undef KVM_X86_OP
> #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f856d689dda0..2da3fb2d5d1b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1644,6 +1644,8 @@ struct kvm_x86_ops {
> void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> int root_level);
> bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code, bool *private_fault);
> + int (*update_mem_attr)(struct kvm_memory_slot *slot, unsigned int attr,
> + gfn_t start, gfn_t end);
>
> bool (*has_wbinvd_exit)(void);
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index fb3f34b7391c..053bd77bbf52 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7251,4 +7251,17 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
> linfo_update_mixed(gfn, slot, level, mixed);
> }
> }
> +
> +void kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> + struct kvm_memory_slot *slot,
> + unsigned long attrs,
> + gfn_t start, gfn_t end)
> +{
> + int ret;
> +
> + ret = static_call(kvm_x86_update_mem_attr)(slot, attrs, start, end);
> + if (ret)
> + pr_warn_ratelimited("Failed to update GFN range 0x%llx-0x%llx with attributes 0x%lx. Ret: %d\n",
> + start, end, attrs, ret);
> +}
> #endif
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index fdc59479b3e2..d200b8f45583 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2330,6 +2330,10 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> unsigned long attrs,
> gfn_t start, gfn_t end);
> +void kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> + struct kvm_memory_slot *slot,
> + unsigned long attrs,
> + gfn_t start, gfn_t end);
>
> static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b68574ff6c30..8ec985f1c57d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2561,6 +2561,32 @@ static void kvm_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
> kvm_flush_remote_tlbs(kvm);
> }
>
> +static void kvm_post_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
> + gfn_t start_orig, gfn_t end_orig)
> +{
> + struct kvm_memory_slot *slot;
> + struct kvm_memslots *slots;
> + struct kvm_memslot_iter iter;
> + int i;
> +
> + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> + slots = __kvm_memslots(kvm, i);
> +
> + kvm_for_each_memslot_in_gfn_range(&iter, slots, start_orig, end_orig) {
> + gfn_t start, end;
> +
> + slot = iter.slot;
> + start = max(start_orig, slot->base_gfn);
> + end = min(end_orig, slot->base_gfn + slot->npages);
> +
> + if (start >= end)
> + continue;
> +
> + kvm_arch_post_set_memory_attributes(kvm, slot, attrs, start, end);
> + }
> + }
> +}
> +
> static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> struct kvm_memory_attributes *attrs)
> {
> @@ -2602,6 +2628,9 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> kvm_mmu_invalidate_end(kvm);
> KVM_MMU_UNLOCK(kvm);
>
> + if (i > start)
> + kvm_post_mem_attrs_changed(kvm, attrs->attributes, start, i);
> +
Doesn't kvm_arch_set_memory_attributes() work for you? I.e. the following patch.
The error check and pr_warn_ratelimited() can be pushed down into the callback.
From 7c618c1f3c236c382e64680efcbe7d8a672aa870 Mon Sep 17 00:00:00 2001
Message-Id: <7c618c1f3c236c382e64680efcbe7d8a672aa870.1679114841.git.isaku.yamahata@intel.com>
In-Reply-To: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
References: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
From: Isaku Yamahata <[email protected]>
Date: Fri, 17 Mar 2023 12:00:09 -0700
Subject: [PATCH 4/4] KVM: x86: Add 'set_mem_attr' x86 op
This callback will do any platform-specific handling needed for
converting pages between shared/private.
Originally-by: Michael Roth <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 1 +
3 files changed, 4 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index dc5f18ac0bd5..956db2ee25a5 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -100,6 +100,7 @@ KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
KVM_X86_OP(fault_is_private)
+KVM_X86_OP_OPTIONAL(set_mem_attr)
KVM_X86_OP_OPTIONAL(link_private_spt)
KVM_X86_OP_OPTIONAL(free_private_spt)
KVM_X86_OP_OPTIONAL(split_private_spt)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0382d236fbf4..88e11dd3afde 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1731,6 +1731,8 @@ struct kvm_x86_ops {
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);
bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code);
+ void (*set_mem_attr)(struct kvm *kvm, struct kvm_memory_slot *slot,
+ unsigned int attr, gfn_t start, gfn_t end);
int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
void *private_spt);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0ec94c72895c..329333486e64 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7908,6 +7908,7 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
gfn_t start, gfn_t end)
{
kvm_update_lpage_mixed_flag(kvm, slot, true, attrs, start, end);
+ static_call(kvm_x86_set_mem_attr)(kvm, slot, attrs, start, end);
}
void kvm_memory_attributes_create_memslot(struct kvm *kvm,
--
2.25.1
--
Isaku Yamahata <[email protected]>
On Mon, Feb 20, 2023 at 12:37:54PM -0600,
Michael Roth <[email protected]> wrote:
> In some cases, like with SEV-SNP, guest memory needs to be updated in a
> platform-specific manner before it can be safely freed back to the host.
> Add hooks to wire up handling of this sort to the invalidation notifiers
> for restricted memory.
>
> Also issue invalidations of all allocated pages during notifier/memslot
> unbinding so that the pages are not left in an unusable state when
> they eventually get freed back to the host upon FD release.
I'm just curious. Could you please elaborate?
Unbind happens only when a memory slot is deleted or the VM is destroyed. In
the case of memory slot deletion, the GPA region is zapped via
kvm_arch_commit_memory_region(). In the case of VM destroy, we have
kvm_flush_shadow_all(), which calls
kvm_arch_flush_shadow_all() => kvm_mmu_zap_all(). Doesn't that work?
Thanks,
--
Isaku Yamahata <[email protected]>
On Fri, Mar 17, 2023 at 09:51:37PM -0700, Isaku Yamahata wrote:
> On Mon, Feb 20, 2023 at 12:37:52PM -0600,
> Michael Roth <[email protected]> wrote:
>
> > This callback is used by the KVM MMU to check whether a #NPF was for a
> > private GPA or not.
> >
> > In some cases the full 64-bit error code for the #NPF will be needed to
> > make this determination, so also update kvm_mmu_do_page_fault() to
> > accept the full 64-bit value so it can be plumbed through to the
> > callback.
Hi Isaku, Zhi,
Thanks for your efforts trying to get us in sync on these shared
interfaces. Would be great to have a common base we can build on for the
SNP/TDX series. You mentioned a couple patches here that I couldn't find
on the list; are you planning to submit these as a separate series?
>
> We can split the 64-bit part into an independent patch.
Agreed that makes sense.
>
> > Signed-off-by: Michael Roth <[email protected]>
> > ---
<snip>
> > +static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
> > +{
> > + struct kvm_memory_slot *slot;
> > + bool private_fault = false;
> > + gfn_t gfn = gpa_to_gfn(gpa);
> > +
> > + slot = gfn_to_memslot(kvm, gfn);
> > + if (!slot) {
> > + pr_debug("%s: no slot, GFN: 0x%llx\n", __func__, gfn);
> > + goto out;
> > + }
> > +
> > + if (!kvm_slot_can_be_private(slot)) {
> > + pr_debug("%s: slot is not private, GFN: 0x%llx\n", __func__, gfn);
> > + goto out;
> > + }
> > +
> > + if (static_call(kvm_x86_fault_is_private)(kvm, gpa, err, &private_fault))
> > + goto out;
> > +
> > + /*
> > + * Handling below is for UPM self-tests and guests that treat userspace
> > + * as the authority on whether a fault should be private or not.
> > + */
> > + private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> > +
> > +out:
> > + pr_debug("%s: GFN: 0x%llx, private: %d\n", __func__, gfn, private_fault);
> > + return private_fault;
> > +}
> > +
> > /*
> > * Return values of handle_mmio_page_fault(), mmu.page_fault(), fast_page_fault(),
> > * and of course kvm_mmu_do_page_fault().
> > @@ -262,11 +293,11 @@ enum {
> > };
> >
> > static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > - u32 err, bool prefetch)
> > + u64 err, bool prefetch)
> > {
> > struct kvm_page_fault fault = {
> > .addr = cr2_or_gpa,
> > - .error_code = err,
> > + .error_code = lower_32_bits(err),
> > .exec = err & PFERR_FETCH_MASK,
> > .write = err & PFERR_WRITE_MASK,
> > .present = err & PFERR_PRESENT_MASK,
> > @@ -280,7 +311,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > .max_level = KVM_MAX_HUGEPAGE_LEVEL,
> > .req_level = PG_LEVEL_4K,
> > .goal_level = PG_LEVEL_4K,
> > - .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
> > + .is_private = kvm_mmu_fault_is_private(vcpu->kvm, cr2_or_gpa, err),
>
> I don't think we need kvm_mmu_fault_is_private() here; it's too heavy. We
> can make it its own op, i.e. the following.
Is it causing performance issues? If most of the overhead is due to the
gfn_to_memslot()/kvm_slot_can_be_private() check, then maybe that part
can be dropped. In the past Sean has mentioned that we shouldn't have to
do kvm_slot_can_be_private() checks prior to kvm_mem_is_private(), but I
haven't tried removing those yet to see if things still work as expected.
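If those checks can indeed be dropped, the helper would shrink to roughly
the following (untested sketch):

	static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
	{
		bool private_fault = false;

		/* Platform backend (SNP/TDX) decides based on the #NPF error code. */
		if (static_call(kvm_x86_fault_is_private)(kvm, gpa, err, &private_fault))
			return private_fault;

		/* Fallback for UPM self-tests / SEV: userspace-managed attributes. */
		return kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
	}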
>
> From b0f914a1a4d154f076c0294831ce9ef0df7eb3d3 Mon Sep 17 00:00:00 2001
> Message-Id: <b0f914a1a4d154f076c0294831ce9ef0df7eb3d3.1679114841.git.isaku.yamahata@intel.com>
> In-Reply-To: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
> References: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
> From: Isaku Yamahata <[email protected]>
> Date: Fri, 17 Mar 2023 11:18:13 -0700
> Subject: [PATCH 2/4] KVM: x86: Add 'fault_is_private' x86 op
>
> This callback is used by the KVM MMU to check whether a KVM page fault was
> for a private GPA or not.
>
> Originally-by: Michael Roth <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu.h | 19 +++++++++++++++++++
> arch/x86/kvm/mmu/mmu_internal.h | 2 +-
> arch/x86/kvm/x86.c | 8 ++++++++
> 5 files changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index e1f57905c8fe..dc5f18ac0bd5 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -99,6 +99,7 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
> KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> KVM_X86_OP(load_mmu_pgd)
> +KVM_X86_OP(fault_is_private)
> KVM_X86_OP_OPTIONAL(link_private_spt)
> KVM_X86_OP_OPTIONAL(free_private_spt)
> KVM_X86_OP_OPTIONAL(split_private_spt)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 59196a80c3c8..0382d236fbf4 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1730,6 +1730,7 @@ struct kvm_x86_ops {
>
> void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> int root_level);
> + bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code);
>
> int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> void *private_spt);
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 4aaef2132b97..1f21680b9b97 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -289,6 +289,25 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
> return translate_nested_gpa(vcpu, gpa, access, exception);
> }
>
> +static inline bool kvm_mmu_fault_is_private_default(struct kvm *kvm, gpa_t gpa, u64 err)
> +{
> + struct kvm_memory_slot *slot;
> + gfn_t gfn = gpa_to_gfn(gpa);
> +
> + slot = gfn_to_memslot(kvm, gfn);
> + if (!slot)
> + return false;
> +
> + if (!kvm_slot_can_be_private(slot))
> + return false;
> +
> + /*
> + * Handling below is for UPM self-tests and guests that treat userspace
> + * as the authority on whether a fault should be private or not.
> + */
> + return kvm_mem_is_private(kvm, gfn);
> +}
> +
> static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm)
> {
> #ifdef CONFIG_KVM_MMU_PRIVATE
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index bb5709f1cb57..6b54b069d1ed 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -445,7 +445,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> .max_level = vcpu->kvm->arch.tdp_max_page_level,
> .req_level = PG_LEVEL_4K,
> .goal_level = PG_LEVEL_4K,
> - .is_private = kvm_is_private_gpa(vcpu->kvm, cr2_or_gpa),
> + .is_private = static_call(kvm_x86_fault_is_private)(vcpu->kvm, cr2_or_gpa, err),
> };
> int r;
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fd14368c6bc8..0311ab450330 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9419,6 +9419,14 @@ static inline void kvm_ops_update(struct kvm_x86_init_ops *ops)
> #undef __KVM_X86_OP
>
> kvm_pmu_ops_update(ops->pmu_ops);
> +
> + /*
> + * TODO: Once all backend fills this option, remove this and the default
> + * function.
> + */
> + if (!ops->runtime_ops->fault_is_private)
> + static_call_update(kvm_x86_fault_is_private,
> + kvm_mmu_fault_is_private_default);
I'm not sure about this approach, since the self-tests (and possibly SEV,
which doesn't use a separate #NPF error bit like SNP/TDX) currently rely on
that kvm_mem_is_private() call to determine whether to handle a fault as
private or not. But to run either of those, we would need to load the
kvm_amd module, which will have already introduced its own
kvm_x86_fault_is_private implementation via svm_init(), so the handling
provided by kvm_mmu_fault_is_private_default would never be available and
we wouldn't be able to run the UPM self-tests.
To me it seems like that handling always needs to be in place as a
fallback when not running SNP/TDX. It doesn't necessarily need to be in the
kvm_x86_fault_is_private handler though, maybe some generic handling for
UPM selftests can be pushed down into KVM MMU. Doing so could also
address a race that Sean mentioned between the time kvm_mem_is_private()
is called here (which happens before mmu_invalidate_seq is recorded for
the #NPF) vs. when it actually gets used in __kvm_faultin_pfn().
If we take that approach, then the requirements for specific TDX/SNP
handling are reduced as well, since we only need to check the
encryption/shared bit, and that could maybe be done as a simple setting
where you tell KVM MMU the position of the bit and whether it indicates
shared vs. private; then both TDX/SNP could re-use a simple helper to check
the #NPF error code and set .is_private based on that. Then KVM MMU could,
if no bit is indicated, just fall back to using the value of
kvm_mem_is_private() somewhere in __kvm_faultin_pfn() or something.
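Roughly something like this is what I have in mind (field/helper names are
made up, just to illustrate the shape of it):

	/*
	 * mmu_private_fault_mask would be a new kvm->arch field that SNP/TDX
	 * set at VM creation; 0 means no encryption/shared bit is in use.
	 * Polarity (bit set == private vs. bit set == shared) would need a
	 * second knob, omitted here for brevity.
	 */
	static bool kvm_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
	{
		if (kvm->arch.mmu_private_fault_mask)
			return !!(err & kvm->arch.mmu_private_fault_mask);

		/* No bit configured (SEV, self-tests): fall back to attributes. */
		return kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
	}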
I mentioned this to Sean a while back, which I think is compatible with
what he was looking for:
https://lore.kernel.org/lkml/[email protected]/
Would be good to get his input before spending too much time adding new
state/configuration stuff in KVM MMU though.
As an interim solution, would my original patch work if we could
confirm that the gfn_to_memslot()/kvm_slot_can_be_private() sequence is
no longer needed?
Thanks!
-Mike
> }
>
> static int kvm_x86_check_processor_compatibility(void)
> --
> 2.25.1
>
>
>
>
> --
> Isaku Yamahata <[email protected]>
On Fri, Mar 17, 2023 at 09:56:11PM -0700, Isaku Yamahata wrote:
> On Mon, Feb 20, 2023 at 12:37:53PM -0600,
> Michael Roth <[email protected]> wrote:
>
> > This callback will do any platform-specific handling needed for
> > converting pages between shared/private.
> >
> > Signed-off-by: Michael Roth <[email protected]>
> > ---
> > arch/x86/include/asm/kvm-x86-ops.h | 1 +
> > arch/x86/include/asm/kvm_host.h | 2 ++
> > arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
> > include/linux/kvm_host.h | 4 ++++
> > virt/kvm/kvm_main.c | 29 +++++++++++++++++++++++++++++
> > 5 files changed, 49 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > index 72183da010b8..a8aaf532c2ab 100644
> > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > @@ -132,6 +132,7 @@ KVM_X86_OP(complete_emulated_msr)
> > KVM_X86_OP(vcpu_deliver_sipi_vector)
> > KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> > KVM_X86_OP_OPTIONAL_RET0(fault_is_private);
> > +KVM_X86_OP_OPTIONAL_RET0(update_mem_attr)
> >
> > #undef KVM_X86_OP
> > #undef KVM_X86_OP_OPTIONAL
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index f856d689dda0..2da3fb2d5d1b 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1644,6 +1644,8 @@ struct kvm_x86_ops {
> > void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> > int root_level);
> > bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code, bool *private_fault);
> > + int (*update_mem_attr)(struct kvm_memory_slot *slot, unsigned int attr,
> > + gfn_t start, gfn_t end);
> >
> > bool (*has_wbinvd_exit)(void);
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index fb3f34b7391c..053bd77bbf52 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -7251,4 +7251,17 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > linfo_update_mixed(gfn, slot, level, mixed);
> > }
> > }
> > +
> > +void kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> > + struct kvm_memory_slot *slot,
> > + unsigned long attrs,
> > + gfn_t start, gfn_t end)
> > +{
> > + int ret;
> > +
> > + ret = static_call(kvm_x86_update_mem_attr)(slot, attrs, start, end);
> > + if (ret)
> > + pr_warn_ratelimited("Failed to update GFN range 0x%llx-0x%llx with attributes 0x%lx. Ret: %d\n",
> > + start, end, attrs, ret);
> > +}
> > #endif
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index fdc59479b3e2..d200b8f45583 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2330,6 +2330,10 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > struct kvm_memory_slot *slot,
> > unsigned long attrs,
> > gfn_t start, gfn_t end);
> > +void kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> > + struct kvm_memory_slot *slot,
> > + unsigned long attrs,
> > + gfn_t start, gfn_t end);
> >
> > static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > {
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index b68574ff6c30..8ec985f1c57d 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2561,6 +2561,32 @@ static void kvm_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
> > kvm_flush_remote_tlbs(kvm);
> > }
> >
> > +static void kvm_post_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
> > + gfn_t start_orig, gfn_t end_orig)
> > +{
> > + struct kvm_memory_slot *slot;
> > + struct kvm_memslots *slots;
> > + struct kvm_memslot_iter iter;
> > + int i;
> > +
> > + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> > + slots = __kvm_memslots(kvm, i);
> > +
> > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start_orig, end_orig) {
> > + gfn_t start, end;
> > +
> > + slot = iter.slot;
> > + start = max(start_orig, slot->base_gfn);
> > + end = min(end_orig, slot->base_gfn + slot->npages);
> > +
> > + if (start >= end)
> > + continue;
> > +
> > + kvm_arch_post_set_memory_attributes(kvm, slot, attrs, start, end);
> > + }
> > + }
> > +}
> > +
> > static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > struct kvm_memory_attributes *attrs)
> > {
> > @@ -2602,6 +2628,9 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > kvm_mmu_invalidate_end(kvm);
> > KVM_MMU_UNLOCK(kvm);
> >
> > + if (i > start)
> > + kvm_post_mem_attrs_changed(kvm, attrs->attributes, start, i);
> > +
>
> Doesn't kvm_arch_set_memory_attributes() work for you? i.e the following patch.
> The error check and pr_warn_ratelimited() can be pushed down into the callback.
This is originally how I had it, but when CONFIG_PREEMPT_COUNT is set this
generates warnings for this callback as well as for the invalidation
callback, as reported in v7 here:
https://lore.kernel.org/lkml/Y80vhKwQyw8hS%2F22@notebook/
The main issue is that kvm_mem_attrs_changed() is called while holding
the KVM MMU lock, which disables preemption. But when updating
attributes for SNP, we also need to remove private pages from the kernel
directmap, which involves acquiring a mutex and results in
"BUG: scheduling while atomic" warnings.
So that's why we ended up somewhat duplicating some of the logic and
using a separate callback chain that happens outside of the KVM MMU lock.
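To illustrate the ordering that trips this up (call chain only, not actual
code, and assuming the callback were invoked from kvm_mem_attrs_changed()):
  KVM_MMU_LOCK(kvm)                           <- spinlock, preemption off
    kvm_mem_attrs_changed()
      static_call(kvm_x86_update_mem_attr)()
        rmpupdate()
          directmap unmap (takes a mutex)     <- may sleep
  KVM_MMU_UNLOCK(kvm)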
-Mike
>
> From 7c618c1f3c236c382e64680efcbe7d8a672aa870 Mon Sep 17 00:00:00 2001
> Message-Id: <7c618c1f3c236c382e64680efcbe7d8a672aa870.1679114841.git.isaku.yamahata@intel.com>
> In-Reply-To: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
> References: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
> From: Isaku Yamahata <[email protected]>
> Date: Fri, 17 Mar 2023 12:00:09 -0700
> Subject: [PATCH 4/4] KVM: x86: Add 'set_mem_attr' x86 op
>
> This callback will do any platform-specific handling needed for
> converting pages between shared/private.
>
> Originally-by: Michael Roth <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 1 +
> 3 files changed, 4 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index dc5f18ac0bd5..956db2ee25a5 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -100,6 +100,7 @@ KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> KVM_X86_OP(load_mmu_pgd)
> KVM_X86_OP(fault_is_private)
> +KVM_X86_OP_OPTIONAL(set_mem_attr)
> KVM_X86_OP_OPTIONAL(link_private_spt)
> KVM_X86_OP_OPTIONAL(free_private_spt)
> KVM_X86_OP_OPTIONAL(split_private_spt)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0382d236fbf4..88e11dd3afde 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1731,6 +1731,8 @@ struct kvm_x86_ops {
> void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> int root_level);
> bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code);
> + void (*set_mem_attr)(struct kvm *kvm, struct kvm_memory_slot *slot,
> + unsigned int attr, gfn_t start, gfn_t end);
>
> int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> void *private_spt);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0ec94c72895c..329333486e64 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7908,6 +7908,7 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
> gfn_t start, gfn_t end)
> {
> kvm_update_lpage_mixed_flag(kvm, slot, true, attrs, start, end);
> + static_call(kvm_x86_set_mem_attr)(kvm, slot, attrs, start, end);
> }
>
> void kvm_memory_attributes_create_memslot(struct kvm *kvm,
> --
> 2.25.1
>
> --
> Isaku Yamahata <[email protected]>
On Fri, Mar 17, 2023 at 10:13:22PM -0700, Isaku Yamahata wrote:
> On Mon, Feb 20, 2023 at 12:37:54PM -0600,
> Michael Roth <[email protected]> wrote:
>
> > In some cases, like with SEV-SNP, guest memory needs to be updated in a
> > platform-specific manner before it can be safely freed back to the host.
> > Add hooks to wire up handling of this sort to the invalidation notifiers
> > for restricted memory.
> >
> > Also issue invalidations of all allocated pages during notifier/memslot
> > unbinding so that the pages are not left in an unusable state when
> > they eventually get freed back to the host upon FD release.
>
> I'm just curios. Could you please elaborate?
> Unbind is happen only when memory slot is delete or vm is destroyed. In the
> case of memory slot deletion, the gpa region is zapped via
> kvm_arch_commit_memory_region(). In the case of VM destroy, we have
> kvm_flush_shadow_all() which calls
> kvm_arch_flush_shadow_all() =>kvm_mmu_zap_all(). Doesn't it work?
The main thing here is that unbind happens right before the restrictedmem
pages are released back to the host, and for SNP we need to clear the
associated RMP table entries to switch them from guest-owned back to
hypervisor-owned. It doesn't necessarily need to be a separate callback,
but I'm not sure it makes sense to squash that down into the various
MMU zapping helpers.
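Roughly, the invalidation hook just needs to do something like the
following per PFN (sketch only; it assumes an rmp_make_shared()
counterpart to the rmp_make_private() helper used elsewhere in the
series):
  static int snp_invalidate_private_pfns(kvm_pfn_t start, kvm_pfn_t end)
  {
          kvm_pfn_t pfn;
          int rc;

          for (pfn = start; pfn < end; pfn++) {
                  /* Return the page to a hypervisor-owned RMP state. */
                  rc = rmp_make_shared(pfn, PG_LEVEL_4K);
                  if (rc)
                          return rc;
          }
          return 0;
  }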
-Mike
>
> Thanks,
> --
> Isaku Yamahata <[email protected]>
On Mon, 20 Mar 2023 13:05:43 -0500
Michael Roth <[email protected]> wrote:
> On Fri, Mar 17, 2023 at 09:56:11PM -0700, Isaku Yamahata wrote:
> > On Mon, Feb 20, 2023 at 12:37:53PM -0600,
> > Michael Roth <[email protected]> wrote:
> >
> > > This callback will do any platform-specific handling needed for
> > > converting pages between shared/private.
> > >
> > > Signed-off-by: Michael Roth <[email protected]>
> > > ---
> > > arch/x86/include/asm/kvm-x86-ops.h | 1 +
> > > arch/x86/include/asm/kvm_host.h | 2 ++
> > > arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
> > > include/linux/kvm_host.h | 4 ++++
> > > virt/kvm/kvm_main.c | 29 +++++++++++++++++++++++++++++
> > > 5 files changed, 49 insertions(+)
> > >
> > > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > > index 72183da010b8..a8aaf532c2ab 100644
> > > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > > @@ -132,6 +132,7 @@ KVM_X86_OP(complete_emulated_msr)
> > > KVM_X86_OP(vcpu_deliver_sipi_vector)
> > > KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> > > KVM_X86_OP_OPTIONAL_RET0(fault_is_private);
> > > +KVM_X86_OP_OPTIONAL_RET0(update_mem_attr)
> > >
> > > #undef KVM_X86_OP
> > > #undef KVM_X86_OP_OPTIONAL
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index f856d689dda0..2da3fb2d5d1b 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -1644,6 +1644,8 @@ struct kvm_x86_ops {
> > > void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> > > int root_level);
> > > bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code, bool *private_fault);
> > > + int (*update_mem_attr)(struct kvm_memory_slot *slot, unsigned int attr,
> > > + gfn_t start, gfn_t end);
> > >
> > > bool (*has_wbinvd_exit)(void);
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index fb3f34b7391c..053bd77bbf52 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -7251,4 +7251,17 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > > linfo_update_mixed(gfn, slot, level, mixed);
> > > }
> > > }
> > > +
> > > +void kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> > > + struct kvm_memory_slot *slot,
> > > + unsigned long attrs,
> > > + gfn_t start, gfn_t end)
> > > +{
> > > + int ret;
> > > +
> > > + ret = static_call(kvm_x86_update_mem_attr)(slot, attrs, start, end);
> > > + if (ret)
> > > + pr_warn_ratelimited("Failed to update GFN range 0x%llx-0x%llx with attributes 0x%lx. Ret: %d\n",
> > > + start, end, attrs, ret);
> > > +}
> > > #endif
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index fdc59479b3e2..d200b8f45583 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -2330,6 +2330,10 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > > struct kvm_memory_slot *slot,
> > > unsigned long attrs,
> > > gfn_t start, gfn_t end);
> > > +void kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> > > + struct kvm_memory_slot *slot,
> > > + unsigned long attrs,
> > > + gfn_t start, gfn_t end);
> > >
> > > static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > > {
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index b68574ff6c30..8ec985f1c57d 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -2561,6 +2561,32 @@ static void kvm_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
> > > kvm_flush_remote_tlbs(kvm);
> > > }
> > >
> > > +static void kvm_post_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
> > > + gfn_t start_orig, gfn_t end_orig)
> > > +{
> > > + struct kvm_memory_slot *slot;
> > > + struct kvm_memslots *slots;
> > > + struct kvm_memslot_iter iter;
> > > + int i;
> > > +
> > > + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> > > + slots = __kvm_memslots(kvm, i);
> > > +
> > > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start_orig, end_orig) {
> > > + gfn_t start, end;
> > > +
> > > + slot = iter.slot;
> > > + start = max(start_orig, slot->base_gfn);
> > > + end = min(end_orig, slot->base_gfn + slot->npages);
> > > +
> > > + if (start >= end)
> > > + continue;
> > > +
> > > + kvm_arch_post_set_memory_attributes(kvm, slot, attrs, start, end);
> > > + }
> > > + }
> > > +}
> > > +
> > > static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > struct kvm_memory_attributes *attrs)
> > > {
> > > @@ -2602,6 +2628,9 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > kvm_mmu_invalidate_end(kvm);
> > > KVM_MMU_UNLOCK(kvm);
> > >
> > > + if (i > start)
> > > + kvm_post_mem_attrs_changed(kvm, attrs->attributes, start, i);
> > > +
> >
> > Doesn't kvm_arch_set_memory_attributes() work for you? i.e the following patch.
> > The error check and pr_warn_ratelimited() can be pushed down into the callback.
>
> This is originally how I had but when CONFIG_PREEMPT_COUNT is set this
> will generate warnings for this callback as well as the invalidation
> callback as reported in v7 here:
>
> https://lore.kernel.org/lkml/Y80vhKwQyw8hS%2F22@notebook/
>
> The main issue is that kvm_mem_attrs_changed() is called while holding
> the KVM MMU lock, which disables preemption. But when updating
> attributes for SNP, we also need to remove private pages from kernel
> directmap, which involves acquiring a mutex which results in
> "BUG: scheduling while atomic" warnings.
>
> So that's why we ended up somewhat duplicating some of the logic and
> using a separate callback chain that happens out of KVM MMU lock.
Let's split up the things involved in changing memory attributes:
1) Update the memory attributes in the xa array (both TDX and SNP)
2) Zap the EPT/NPT mappings (required by TDX)
3) Update the RMP table (required by SNP)
4) Update the kernel directmap (SNP, but I guess TDX needs it as well)
Does SNP really need to zap the NPT mappings when changing the memory
attributes? (The new mappings will be created later at fault time.) I don't
find this requirement in the APM.
If yes, can we postpone the update of the RMP table to the later fault,
like TDX does? That way we could drop this update_mem_attr x86 op, as
things would be handled in the SNP-specific fault handler.
If no, I guess we need an x86 op to tell whether zapping is required.
Back to the lock: updating the RMP table doesn't require a mutex. Taking
the lock is only required when updating the directmap, and both TDX/SNP
require updating the directmap when changing memory attributes.
Wouldn't it be better to factor out the part that touches the kernel
directmap? Then you could call the x86 ops.update_mem_attr() in
kvm_mem_attrs_changed(), and update the kernel direct mapping for both
TDX/SNP in kvm_post_mem_attrs_changed(), along the lines of the sketch
below.
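I.e. something along these lines (just a sketch of the idea, not working
code):
  KVM_MMU_LOCK(kvm)
  kvm_mem_attrs_changed():
    kvm_unmap_gfn_range() + kvm_flush_remote_tlbs()  // zap, if the arch needs it
    static_call(kvm_x86_update_mem_attr)()           // e.g. RMP update for SNP
  KVM_MMU_UNLOCK(kvm)

  kvm_post_mem_attrs_changed():
    update the kernel directmap for the range        // may sleep; TDX and SNP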
>
> -Mike
>
> >
> > From 7c618c1f3c236c382e64680efcbe7d8a672aa870 Mon Sep 17 00:00:00 2001
> > Message-Id: <7c618c1f3c236c382e64680efcbe7d8a672aa870.1679114841.git.isaku.yamahata@intel.com>
> > In-Reply-To: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
> > References: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
> > From: Isaku Yamahata <[email protected]>
> > Date: Fri, 17 Mar 2023 12:00:09 -0700
> > Subject: [PATCH 4/4] KVM: x86: Add 'set_mem_attr' x86 op
> >
> > This callback will do any platform-specific handling needed for
> > converting pages between shared/private.
> >
> > Originally-by: Michael Roth <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/include/asm/kvm-x86-ops.h | 1 +
> > arch/x86/include/asm/kvm_host.h | 2 ++
> > arch/x86/kvm/mmu/mmu.c | 1 +
> > 3 files changed, 4 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > index dc5f18ac0bd5..956db2ee25a5 100644
> > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > @@ -100,6 +100,7 @@ KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> > KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> > KVM_X86_OP(load_mmu_pgd)
> > KVM_X86_OP(fault_is_private)
> > +KVM_X86_OP_OPTIONAL(set_mem_attr)
> > KVM_X86_OP_OPTIONAL(link_private_spt)
> > KVM_X86_OP_OPTIONAL(free_private_spt)
> > KVM_X86_OP_OPTIONAL(split_private_spt)
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 0382d236fbf4..88e11dd3afde 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1731,6 +1731,8 @@ struct kvm_x86_ops {
> > void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> > int root_level);
> > bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code);
> > + void (*set_mem_attr)(struct kvm *kvm, struct kvm_memory_slot *slot,
> > + unsigned int attr, gfn_t start, gfn_t end);
> >
> > int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > void *private_spt);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 0ec94c72895c..329333486e64 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -7908,6 +7908,7 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > gfn_t start, gfn_t end)
> > {
> > kvm_update_lpage_mixed_flag(kvm, slot, true, attrs, start, end);
> > + static_call(kvm_x86_set_mem_attr)(kvm, slot, attrs, start, end);
> > }
> >
> > void kvm_memory_attributes_create_memslot(struct kvm *kvm,
> > --
> > 2.25.1
> >
> > --
> > Isaku Yamahata <[email protected]>
On Tue, 21 Mar 2023 20:58:38 -0500
Michael Roth <[email protected]> wrote:
> On Tue, Mar 21, 2023 at 01:21:36PM +0200, Zhi Wang wrote:
> > On Mon, 20 Mar 2023 13:05:43 -0500
> > Michael Roth <[email protected]> wrote:
> >
> > > On Fri, Mar 17, 2023 at 09:56:11PM -0700, Isaku Yamahata wrote:
> > > > On Mon, Feb 20, 2023 at 12:37:53PM -0600,
> > > > Michael Roth <[email protected]> wrote:
> > > >
> > > > > This callback will do any platform-specific handling needed for
> > > > > converting pages between shared/private.
> > > > >
> > > > > Signed-off-by: Michael Roth <[email protected]>
> > > > > ---
> > > > > arch/x86/include/asm/kvm-x86-ops.h | 1 +
> > > > > arch/x86/include/asm/kvm_host.h | 2 ++
> > > > > arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
> > > > > include/linux/kvm_host.h | 4 ++++
> > > > > virt/kvm/kvm_main.c | 29 +++++++++++++++++++++++++++++
> > > > > 5 files changed, 49 insertions(+)
> > > > >
> > > > > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > > > > index 72183da010b8..a8aaf532c2ab 100644
> > > > > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > > > > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > > > > @@ -132,6 +132,7 @@ KVM_X86_OP(complete_emulated_msr)
> > > > > KVM_X86_OP(vcpu_deliver_sipi_vector)
> > > > > KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> > > > > KVM_X86_OP_OPTIONAL_RET0(fault_is_private);
> > > > > +KVM_X86_OP_OPTIONAL_RET0(update_mem_attr)
> > > > >
> > > > > #undef KVM_X86_OP
> > > > > #undef KVM_X86_OP_OPTIONAL
> > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > > index f856d689dda0..2da3fb2d5d1b 100644
> > > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > > @@ -1644,6 +1644,8 @@ struct kvm_x86_ops {
> > > > > void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> > > > > int root_level);
> > > > > bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code, bool *private_fault);
> > > > > + int (*update_mem_attr)(struct kvm_memory_slot *slot, unsigned int attr,
> > > > > + gfn_t start, gfn_t end);
> > > > >
> > > > > bool (*has_wbinvd_exit)(void);
> > > > >
> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > index fb3f34b7391c..053bd77bbf52 100644
> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > > @@ -7251,4 +7251,17 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > > > > linfo_update_mixed(gfn, slot, level, mixed);
> > > > > }
> > > > > }
> > > > > +
> > > > > +void kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> > > > > + struct kvm_memory_slot *slot,
> > > > > + unsigned long attrs,
> > > > > + gfn_t start, gfn_t end)
> > > > > +{
> > > > > + int ret;
> > > > > +
> > > > > + ret = static_call(kvm_x86_update_mem_attr)(slot, attrs, start, end);
> > > > > + if (ret)
> > > > > + pr_warn_ratelimited("Failed to update GFN range 0x%llx-0x%llx with attributes 0x%lx. Ret: %d\n",
> > > > > + start, end, attrs, ret);
> > > > > +}
> > > > > #endif
> > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > index fdc59479b3e2..d200b8f45583 100644
> > > > > --- a/include/linux/kvm_host.h
> > > > > +++ b/include/linux/kvm_host.h
> > > > > @@ -2330,6 +2330,10 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > > > > struct kvm_memory_slot *slot,
> > > > > unsigned long attrs,
> > > > > gfn_t start, gfn_t end);
> > > > > +void kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> > > > > + struct kvm_memory_slot *slot,
> > > > > + unsigned long attrs,
> > > > > + gfn_t start, gfn_t end);
> > > > >
> > > > > static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > > > > {
> > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > index b68574ff6c30..8ec985f1c57d 100644
> > > > > --- a/virt/kvm/kvm_main.c
> > > > > +++ b/virt/kvm/kvm_main.c
> > > > > @@ -2561,6 +2561,32 @@ static void kvm_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
> > > > > kvm_flush_remote_tlbs(kvm);
> > > > > }
> > > > >
> > > > > +static void kvm_post_mem_attrs_changed(struct kvm *kvm, unsigned long attrs,
> > > > > + gfn_t start_orig, gfn_t end_orig)
> > > > > +{
> > > > > + struct kvm_memory_slot *slot;
> > > > > + struct kvm_memslots *slots;
> > > > > + struct kvm_memslot_iter iter;
> > > > > + int i;
> > > > > +
> > > > > + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> > > > > + slots = __kvm_memslots(kvm, i);
> > > > > +
> > > > > + kvm_for_each_memslot_in_gfn_range(&iter, slots, start_orig, end_orig) {
> > > > > + gfn_t start, end;
> > > > > +
> > > > > + slot = iter.slot;
> > > > > + start = max(start_orig, slot->base_gfn);
> > > > > + end = min(end_orig, slot->base_gfn + slot->npages);
> > > > > +
> > > > > + if (start >= end)
> > > > > + continue;
> > > > > +
> > > > > + kvm_arch_post_set_memory_attributes(kvm, slot, attrs, start, end);
> > > > > + }
> > > > > + }
> > > > > +}
> > > > > +
> > > > > static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > > > struct kvm_memory_attributes *attrs)
> > > > > {
> > > > > @@ -2602,6 +2628,9 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > > > kvm_mmu_invalidate_end(kvm);
> > > > > KVM_MMU_UNLOCK(kvm);
> > > > >
> > > > > + if (i > start)
> > > > > + kvm_post_mem_attrs_changed(kvm, attrs->attributes, start, i);
> > > > > +
> > > >
> > > > Doesn't kvm_arch_set_memory_attributes() work for you? i.e the following patch.
> > > > The error check and pr_warn_ratelimited() can be pushed down into the callback.
> > >
> > > This is originally how I had but when CONFIG_PREEMPT_COUNT is set this
> > > will generate warnings for this callback as well as the invalidation
> > > callback as reported in v7 here:
> > >
> > > https://lore.kernel.org/lkml/Y80vhKwQyw8hS%2F22@notebook/
> > >
> > > The main issue is that kvm_mem_attrs_changed() is called while holding
> > > the KVM MMU lock, which disables preemption. But when updating
> > > attributes for SNP, we also need to remove private pages from kernel
> > > directmap, which involves acquiring a mutex which results in
> > > "BUG: scheduling while atomic" warnings.
> > >
> > > So that's why we ended up somewhat duplicating some of the logic and
> > > using a separate callback chain that happens out of KVM MMU lock.
> >
> > Let's split the things of changing memory attributes:
> >
> > 1) Update the memory attributes in the xa array (Both TDX and SNP)
> > 2) Zapping the EPT/NPT mappings (Required by TDX)
> > 3) Update RMP table (Required by SNP)
> > 4) Update the directmap of kernel (SNP, but I guess TDX needs it as well)
>
Thanks for the detailed reply; it is very informative.
> I'm not so sure TDX requires this. I was under that impression, but
> Kirill raised some doubts about this and I'm not sure it's been
> confirmed. If it's purely an SNP thing then there may not be much value
> in creating a separate callback for it:
>
> https://lore.kernel.org/linux-mm/[email protected]/T/#meba4ce80709cd3afd3818b61e6419fd800287b9e
>
Hmm, Kirill and Isaku, can you confirm that TDX doesn't need this?
I think it is a generic requirement that TDX/SNP do not expect the
host to touch a private page, either from the kernel or from userspace.
> And for SNP, the current code does the unmapping/RMP update in the same
> function:
>
> [PATCH RFC v8 15/56] x86/sev: Invalidate pages from the direct map when adding them to the RMP table
>
> I'm not against splitting RMP/directmap handling, but just want to
> understand what the requirements are around that a bit better.
>
> Does handling the #3 / RMP update / kvm_arch_post_set_memory_attributes
> stuff outside of MMU lock cause issues on TDX side? What sort of
> handling is needed in these callbacks for TDX (if anything)?
>
No, it doesn't cause a problem for TDX, as TDX doesn't need such a callback.
Unlike SNP, which has (1 NPT + 1 RMP) with the enforced HW check done by the RMP, TDX has
two EPTs (similar to NPT): 1 shared + 1 private. Converting the memory attribute is achieved
by zapping the mapping from one EPT and creating the mapping in the other one at fault time
when the guest accesses the memory. The faulting GPA will carry the "SHARED" bit (!C-BIT), so
KVM knows which EPT should be chosen for populating the mapping.
I was trying to figure out what the proper callback should be, and at which layer it should
sit, for changing the memory attributes for both TDX/SNP. The current callback looks a little
bit hacky. Duplicating code because of locks implies the SW structure might need to be
reconsidered.
> >
> > Does SNP really need to zap the NPT mappings when changing the memory
> > attributes? (The new mappings will be created later in the fault). I don't
> > find this requirement from APM.
>
> I don't think we've added anything specifically for SNP. Do you mean the
> generic kvm_unmap_gfn_range/kvm_flush_remote_tlbs sequence below?
>
> kvm_vm_ioctl_set_mem_attributes():
> KVM_MMU_LOCK(kvm)
> kvm_mmu_invalidate_begin()
> ...
> KVM_MMU_UNLOCK(kvm)
>
> kvm_vm_set_region_attr() // xarray/attribute update
>
> ...
> KVM_MMU_LOCK(kvm)
> kvm_mem_attrs_changed():
> flush |= kvm_unmap_gfn_range()
> if (flush)
> kvm_flush_remote_tlbs()
> KVM_MMU_UNLOCK(kvm)
>
Yes, I was talking about the sequence above. I was confused about why changing
the RMP requires a zap-and-recreate flow of the NPT in SNP.
> In general, when the RMPUPDATE instruction happens, the TLB entries for
> the GPAs being modified will be flushed, so subsequent nested page fault
> should be able to obtain the updated mapping based on xarray/#NPF at that
> point. In that respect *maybe* we don't need to zap the entries there.
>
> But if the nested page fault occurs before the RMPUPDATE, I think we would
> have a race if the above sequence isn't in place to handle the unmap/flush,
> since in that case we might get a stale mapping because nothing would've
> forced a tlbflush.
>
> There's also stuff like the UPM selftests and SEV lazy-pinning where I
> think that kvm_unmap_gfn_range() sequence is also needed. But I might be
> misunderstanding the question here.
>
In this case, would an extra tlbflush solve it? Still, the unnecessary
zapping/recreating of mappings is not promising. I understand that the way
this patch goes is probably meant to minimize the changes, but it would be
nice to focus more on what is really needed in a common path and abstract
and refactor from there.
Can you elaborate more on how the lazy-pinning unpin path is connected
with the zapping here, so that I can dig more into it?
Selftests are a minor case; I guess we could deal with them via a switch,
e.g. a prop in debugfs.
> > If yes, can we postpone the update of the RMP table in the later fault,
> > like TDX? So that we can save this update_mem_attr x86 ops as things
> > will be solved in the SNP-specific fault handler.
>
> Hmm, I think this would be possible. But it's nice to be able to handle
> the RMPUPDATE as part of KVM_SET_MEMORY_ATTRIBUTES, since it allows
> KVM MMU code to rely solely on xarray state and not have to query RMP
> table to check if a particular PFN needs an RMPUPDATE before mapping it
> into RMP table.
>
> At least... it would *in theory*, if the RMPUPDATE happened under
> protection of mmu_invalidate_seq (in which case it could inherit all the
> same protections KVM MMU has around mmu_invalidate_seq/fault->mmu_seq,
> e.g. letting the guest retry the #PF if fault->mmu_seq is stale).
>
> But currently, RMPUPDATE (via kvm_arch_post_set_memory_attributes) happens
> *after* the invalidation sequence above, so in theory a guest could fault
> on a page just after xarray state is updated, but before the RMPUPDATE has
> been done, in which case the KVM MMU code would properly map the page
> accordingly to xarray, but since RMPUPDATE wouldn't have happened yet, the
> state of the corresponding PFN in RMP table won't match the shared/private
> access type expected by the guest, so when it tries to access it it will
> get another #NPF with RMP bit set in the error code, which will get
> handled as a no-op in handle_rmp_page_fault() (patch #44) and loop like
> this until the RMPUPDATE is finally done. So it still works out, but
> maybe not keeping as much in sync with xarray state and could be.
>
I see. The RMP fault handler only deals with page size mismatches for now.
> But deferring RMPUPDATE to fault time goes in the other direction of
> that. Are there benefits/requirements for doing things this way for TDX?
> I could see it being beneficial in terms of reducing overhead for
> uneeded page-state transitions, since they are only done on-demand but
> doesn't seem like it would be that much overhead compared to some of the
> other operations being done.
>
Besides the HW design, I guess one major purpose is to optimize the boot
time of VMs with large memory. Post-migration can be another case.
Out of curiosity, what is the average cost of RMPUPDATE? I assume it is an x86
instruction and doesn't go through the PSP firmware.
> >
> > If no, guess we need a x86 ops to tell if a zapping is required.
>
> Sorry don't think I quite understand the suggestion. What would this
> zapping be covering vs. the invalidation sequence that currently happens
> in kvm_vm_ioctl_set_mem_attributes()?
I was thinking that zapping of the mapping in the EPT/NPT was required by TDX,
while SNP might only need an RMP update + TLB flush. Thus, the abstraction
of kvm_x86_ops.update_mem_attr() should sit at that level. But let's
scratch this for now, as I need to dig more into the lazy-pinning stuff.
>
> >
> > Back to the lock, updating RMP table doesn't require a mutex. Taking
> > the lock is required when updating the directmap. both TDX/SNP requires
> > this update the directmap when changing memory attributes.
>
> Is that confirmed? If so, do you have a pointer to the associated
> documentation? I'm a bit unclear on this due to above-mentioned
> discussion.
>
> >
> > Wouldn't it better to factor the touching directmap of kernel part out?
>
> It actually needs to happen before the RMPUPDATE. As soon as there is a
> shared->private conversion in the RMP table for a particular PFN, then
> any access via directmap by any particular kernel thread to any PFN that
> happens to be in the same physical 2M range can cause an RMP fault on
> the host, which would be fatal. So the rmpupdate() helper in this series
> will unmap directmap entry corresponding the PFN before a shared->private
> RMPUPDATE, and restore mappings after private->shared RMPUPDATE
>
> So we could still factor it out, but it would be something like:
>
> if (attr == private)
> kvm_unmap_directmap(start, end)
> kvm_mem_attrs_changed()
> if (attr == shared)
> kvm_map_directmap(start, end)
>
> >
> > Then you can call the x86 ops.update_mem_attr() in kvm_mem_attrs_changed().
> > And update the direct kernel mapping for both TDX/SNP in the
> > kvm_post_mem_attrs_changed().
>
> Or, adjusting for the above logic, move the unmapping/mapping to a new
> kvm_pre_mem_attrs_changed() and kvm_post_mem_attrs_changed(), respectively.
>
> Which seems pretty reasonable to me. Then we can:
> - drop duplicating the kvm_for_each_memslot_in_gfn_range() walk stuff because
> we'd just need to know what PFNs to map/unmap from directmap
> (although we'd still need a loop around kvm_restrictedmem_get_pfn()
> for the GFN range so not necessarily prettier)
> - call the RMPUPDATE / corresponding TDX handling via kvm_mem_attrs_changed()
> which brings it both under KVM MMU lock and also let's it piggyback
> off the fault->mmu_seq handling so it doesn't get out of sync with
> xarray during fault time.
>
That sounds better. I am just a little bit worried that update_mem_attr() will
end up as an SNP-only callback.
> But would be good to hear others' opinions on this. And also confirm
> whether TDX needs that pre/post directmap handle or not.
Yes.
>
> Thanks!
>
> -Mike
>
> >
> > >
> > > -Mike
> > >
> > > >
> > > > From 7c618c1f3c236c382e64680efcbe7d8a672aa870 Mon Sep 17 00:00:00 2001
> > > > Message-Id: <7c618c1f3c236c382e64680efcbe7d8a672aa870.1679114841.git.isaku.yamahata@intel.com>
> > > > In-Reply-To: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
> > > > References: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
> > > > From: Isaku Yamahata <[email protected]>
> > > > Date: Fri, 17 Mar 2023 12:00:09 -0700
> > > > Subject: [PATCH 4/4] KVM: x86: Add 'set_mem_attr' x86 op
> > > >
> > > > This callback will do any platform-specific handling needed for
> > > > converting pages between shared/private.
> > > >
> > > > Originally-by: Michael Roth <[email protected]>
> > > > Signed-off-by: Isaku Yamahata <[email protected]>
> > > > ---
> > > > arch/x86/include/asm/kvm-x86-ops.h | 1 +
> > > > arch/x86/include/asm/kvm_host.h | 2 ++
> > > > arch/x86/kvm/mmu/mmu.c | 1 +
> > > > 3 files changed, 4 insertions(+)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > > > index dc5f18ac0bd5..956db2ee25a5 100644
> > > > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > > > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > > > @@ -100,6 +100,7 @@ KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> > > > KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> > > > KVM_X86_OP(load_mmu_pgd)
> > > > KVM_X86_OP(fault_is_private)
> > > > +KVM_X86_OP_OPTIONAL(set_mem_attr)
> > > > KVM_X86_OP_OPTIONAL(link_private_spt)
> > > > KVM_X86_OP_OPTIONAL(free_private_spt)
> > > > KVM_X86_OP_OPTIONAL(split_private_spt)
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 0382d236fbf4..88e11dd3afde 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1731,6 +1731,8 @@ struct kvm_x86_ops {
> > > > void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> > > > int root_level);
> > > > bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code);
> > > > + void (*set_mem_attr)(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > > + unsigned int attr, gfn_t start, gfn_t end);
> > > >
> > > > int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > > > void *private_spt);
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 0ec94c72895c..329333486e64 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -7908,6 +7908,7 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > > > gfn_t start, gfn_t end)
> > > > {
> > > > kvm_update_lpage_mixed_flag(kvm, slot, true, attrs, start, end);
> > > > + static_call(kvm_x86_set_mem_attr)(kvm, slot, attrs, start, end);
> > > > }
> > > >
> > > > void kvm_memory_attributes_create_memslot(struct kvm *kvm,
> > > > --
> > > > 2.25.1
> > > >
> > > > --
> > > > Isaku Yamahata <[email protected]>
> >
> >
On 20.02.23 19:38, Michael Roth wrote:
> From: Brijesh Singh <[email protected]>
>
> The KVM_SEV_SNP_LAUNCH_FINISH finalize the cryptographic digest and stores
> it as the measurement of the guest at launch.
>
> While finalizing the launch flow, it also issues the LAUNCH_UPDATE command
> to encrypt the VMSA pages.
>
> If its an SNP guest, then VMSA was added in the RMP entry as
> a guest owned page and also removed from the kernel direct map
> so flush it later after it is transitioned back to hypervisor
> state and restored in the direct map.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Harald Hoyer <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> ---
> .../virt/kvm/x86/amd-memory-encryption.rst | 23 ++++
> arch/x86/kvm/svm/sev.c | 122 ++++++++++++++++++
> include/uapi/linux/kvm.h | 14 ++
> 3 files changed, 159 insertions(+)
>
[...]
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 03dd227f6090..515e22d0dc30 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2280,6 +2280,109 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> snp_launch_update_gfn_handler, argp);
> }
>
> +static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_launch_update data = {};
> + struct kvm_vcpu *vcpu;
> + unsigned long i;
> + int ret;
> +
> + data.gctx_paddr = __psp_pa(sev->snp_context);
> + data.page_type = SNP_PAGE_TYPE_VMSA;
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + struct vcpu_svm *svm = to_svm(vcpu);
> + u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
> +
> + /* Perform some pre-encryption checks against the VMSA */
> + ret = sev_es_sync_vmsa(svm);
> + if (ret)
> + return ret;
> +
> + /* Transition the VMSA page to a firmware state. */
> + ret = rmp_make_private(pfn, -1, PG_LEVEL_4K, sev->asid, true);
> + if (ret)
> + return ret;
> +
> + /* Issue the SNP command to encrypt the VMSA */
> + data.address = __sme_pa(svm->sev_es.vmsa);
> + ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
> + &data, &argp->error);
There is no contract in KVM that dictates that the first entry in the
vcpu list needs to be vcpu_id==0 (BSP). That means if you use a user
space that spawns vCPUs in parallel on init, you will end up with the
BSP behind APs in the LAUNCH_UPDATE order.
This is a problem because for LAUNCH_UPDATE, the order matters. BSP and
AP vCPUs have different initial state, so if you want to reconstruct
the launch digest, you need to ensure that the guest knows the order.
The easiest way I can think of to fix this is to call
snp_launch_update_vmsa twice: once filtering for vcpu_id == 0 and once
for vcpu_id != 0, along the lines of the sketch below.
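Rough sketch (with the per-vCPU body of the existing loop factored into a
hypothetical helper, here called snp_launch_update_vmsa_one()):
  /* First pass: BSP only, so its VMSA lands first in the launch digest. */
  kvm_for_each_vcpu(i, vcpu, kvm) {
          if (vcpu->vcpu_id != 0)
                  continue;
          ret = snp_launch_update_vmsa_one(kvm, argp, vcpu);
          if (ret)
                  return ret;
  }
  /* Second pass: the APs. */
  kvm_for_each_vcpu(i, vcpu, kvm) {
          if (vcpu->vcpu_id == 0)
                  continue;
          ret = snp_launch_update_vmsa_one(kvm, argp, vcpu);
          if (ret)
                  return ret;
  }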
Thanks,
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On Mon, Feb 20, 2023 at 11:37:09PM +0200, Zhi Wang wrote:
> On Mon, 20 Feb 2023 12:37:55 -0600
> Michael Roth <[email protected]> wrote:
>
> > From: Vishal Annapurve <[email protected]>
> >
> > Introduce HVA range operator so that other KVM subsystems
> > can operate on HVA range.
> >
> > Signed-off-by: Vishal Annapurve <[email protected]>
> > [mdr: minor checkpatch alignment fixups]
> > Signed-off-by: Michael Roth <[email protected]>
> > ---
> > include/linux/kvm_host.h | 6 +++++
> > virt/kvm/kvm_main.c | 48 ++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 54 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 4d542060cd93..c615650ed256 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1402,6 +1402,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm);
> > void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
> > void kvm_mmu_invalidate_end(struct kvm *kvm);
> >
> > +typedef int (*kvm_hva_range_op_t)(struct kvm *kvm,
> > + struct kvm_gfn_range *range, void *data);
> > +
> > +int kvm_vm_do_hva_range_op(struct kvm *kvm, unsigned long hva_start,
> > + unsigned long hva_end, kvm_hva_range_op_t handler, void *data);
> > +
> > long kvm_arch_dev_ioctl(struct file *filp,
> > unsigned int ioctl, unsigned long arg);
> > long kvm_arch_vcpu_ioctl(struct file *filp,
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index f7e00593cc5d..4ccd655dd5af 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -642,6 +642,54 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> > return (int)ret;
> > }
> >
>
> Below function seems a reduced duplicate of __kvm_handle_hva_range()
> in virt/kvm/kvm_main.c. It would be nice to factor __kvm_handle_hva_range().
A few differences make it difficult to refactor this cleanly:
- This handler is mainly used for loading initial contents into the guest
  image before booting and doesn't rely on the MMU lock being held. It
  also *can't* be called with the MMU lock held, because it suffers from
  the same issue as the mem_attr_update() hook, where it needs to take a
  mutex as part of unmapping from the directmap when transitioning a page
  to private state in the RMP table.
- This handler wants to return an error code, as opposed to the existing
  handlers, which return true/false values that are passed along to the
  MMU notifier call-site and handled differently.
- This handler wants to stop iterating through memslots as soon as it
  encounters the first failure, whereas the existing handlers expect to
  be called for each slot regardless of return value.
So it's a pretty different use-case, and it adds enough complexity to
__kvm_handle_hva_range() that it might not be worth refactoring it,
since it complicates some bits that are closely tied to dealing with
invalidations, where any extra complexity really needs to be
well-warranted.
I took a stab at it here for reference, but even with what seems to be
the minimal set of changes it doesn't save on any code, and ultimately I
think it makes it harder to make sense of what's going on:
https://github.com/mdroth/linux/commit/976c5fb708f7babe899fd80e27e19f8ba3f6818d
Is there a better approach?
Thanks,
-Mike
>
> > +int kvm_vm_do_hva_range_op(struct kvm *kvm, unsigned long hva_start,
> > + unsigned long hva_end, kvm_hva_range_op_t handler, void *data)
> > +{
> > + int ret = 0;
> > + struct kvm_gfn_range gfn_range;
> > + struct kvm_memory_slot *slot;
> > + struct kvm_memslots *slots;
> > + int i, idx;
> > +
> > + if (WARN_ON_ONCE(hva_end <= hva_start))
> > + return -EINVAL;
> > +
> > + idx = srcu_read_lock(&kvm->srcu);
> > +
> > + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> > + struct interval_tree_node *node;
> > +
> > + slots = __kvm_memslots(kvm, i);
> > + kvm_for_each_memslot_in_hva_range(node, slots,
> > + hva_start, hva_end - 1) {
> > + unsigned long start, end;
> > +
> > + slot = container_of(node, struct kvm_memory_slot,
> > + hva_node[slots->node_idx]);
> > + start = max(hva_start, slot->userspace_addr);
> > + end = min(hva_end, slot->userspace_addr +
> > + (slot->npages << PAGE_SHIFT));
> > +
> > + /*
> > + * {gfn(page) | page intersects with [hva_start, hva_end)} =
> > + * {gfn_start, gfn_start+1, ..., gfn_end-1}.
> > + */
> > + gfn_range.start = hva_to_gfn_memslot(start, slot);
> > + gfn_range.end = hva_to_gfn_memslot(end + PAGE_SIZE - 1, slot);
> > + gfn_range.slot = slot;
> > +
> > + ret = handler(kvm, &gfn_range, data);
> > + if (ret)
> > + goto e_ret;
> > + }
> > + }
> > +
> > +e_ret:
> > + srcu_read_unlock(&kvm->srcu, idx);
> > +
> > + return ret;
> > +}
> > +
> > static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
> > unsigned long start,
> > unsigned long end,
>
On Thu, Mar 23, 2023 at 08:17:16PM +0200, Zhi Wang wrote:
> On Tue, 21 Mar 2023 20:58:38 -0500
> Michael Roth <[email protected]> wrote:
>
> > On Tue, Mar 21, 2023 at 01:21:36PM +0200, Zhi Wang wrote:
> > > On Mon, 20 Mar 2023 13:05:43 -0500
> > > Michael Roth <[email protected]> wrote:
> > >
> > > > On Fri, Mar 17, 2023 at 09:56:11PM -0700, Isaku Yamahata wrote:
> > > > > On Mon, Feb 20, 2023 at 12:37:53PM -0600,
> > > > > Michael Roth <[email protected]> wrote:
> > > > >
> > > > > > This callback will do any platform-specific handling needed for
> > > > > > converting pages between shared/private.
> > > > > >
> > > > > > Signed-off-by: Michael Roth <[email protected]>
> > > > > > ---
<snip>
> > > > > > static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > > > > struct kvm_memory_attributes *attrs)
> > > > > > {
> > > > > > @@ -2602,6 +2628,9 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > > > > kvm_mmu_invalidate_end(kvm);
> > > > > > KVM_MMU_UNLOCK(kvm);
> > > > > >
> > > > > > + if (i > start)
> > > > > > + kvm_post_mem_attrs_changed(kvm, attrs->attributes, start, i);
> > > > > > +
> > > > >
> > > > > Doesn't kvm_arch_set_memory_attributes() work for you? i.e the following patch.
> > > > > The error check and pr_warn_ratelimited() can be pushed down into the callback.
> > > >
> > > > This is originally how I had but when CONFIG_PREEMPT_COUNT is set this
> > > > will generate warnings for this callback as well as the invalidation
> > > > callback as reported in v7 here:
> > > >
> > > > https://lore.kernel.org/lkml/Y80vhKwQyw8hS%2F22@notebook/
> > > >
> > > > The main issue is that kvm_mem_attrs_changed() is called while holding
> > > > the KVM MMU lock, which disables preemption. But when updating
> > > > attributes for SNP, we also need to remove private pages from kernel
> > > > directmap, which involves acquiring a mutex which results in
> > > > "BUG: scheduling while atomic" warnings.
> > > >
> > > > So that's why we ended up somewhat duplicating some of the logic and
> > > > using a separate callback chain that happens out of KVM MMU lock.
> > >
> > > Let's split the things of changing memory attributes:
> > >
> > > 1) Update the memory attributes in the xa array (Both TDX and SNP)
> > > 2) Zapping the EPT/NPT mappings (Required by TDX)
> > > 3) Update RMP table (Required by SNP)
> > > 4) Update the directmap of kernel (SNP, but I guess TDX needs it as well)
> >
>
> Thanks for the effort of detailed reply. It is very informative.
>
> > I'm not so sure TDX requires this. I was under that impression, but
> > Kirill raised some doubts about this and I'm not sure it's been
> > confirmed. If it's purely an SNP thing then there may not be much value
> > in creating a separate callback for it:
> >
> > https://lore.kernel.org/linux-mm/[email protected]/T/#meba4ce80709cd3afd3818b61e6419fd800287b9e
> >
>
> Hmm, Krill and Isaku, can you confirm that TDX doesn't need this?
>
> I think it is a generic requirement that TDX/SNP are not expecting the
> host to touch a private page either from the kernel or the userspace.
The main issue is that, in the case of the RMP table, a write through a 2M
mapping is interpreted as a write to a private page even if the write
didn't actually touch any private pages within the range. It may be
possible for firmware/hardware to distinguish between these 2 cases, but I'm
not sure whether that's the case for TDX or not.
>
> > And for SNP, the current code does the unmapping/RMP update in the same
> > function:
> >
> > [PATCH RFC v8 15/56] x86/sev: Invalidate pages from the direct map when adding them to the RMP table
> >
> > I'm not against splitting RMP/directmap handling, but just want to
> > understand what the requirements are around that a bit better.
> >
> > Does handling the #3 / RMP update / kvm_arch_post_set_memory_attributes
> > stuff outside of MMU lock cause issues on TDX side? What sort of
> > handling is needed in these callbacks for TDX (if anything)?
> >
>
> No, it doesn't cause problem for TDX as TDX doesn't need such callback.
>
> Unlike SNP, which has (1 NPT + 1 RMP) and the enforced HW check is done by RMP, TDX has
> two EPT(smiliar with NPT)s (1 shared + 1 private). Converting the memory attr is achieved
> by zapping the mapping from one EPT and creating the mapping in the other one in the fault
> when guest access the memory. The fault GPA will carry the "SHARED" bit (!C-BIT), so
> KVM knows which EPT should be chosen for populating the mapping.
>
> I was trying to figure out what should be the proper callback and at which layer it should
> sit for achieving changing memory attr for both TDX/SNP. The current callback looks a little
> bit hacky. Duplicate code pieces because of locks implies the SW structure might need to be
> re-considered.
>
> > >
> > > Does SNP really need to zap the NPT mappings when changing the memory
> > > attributes? (The new mappings will be created later in the fault). I don't
> > > find this requirement from APM.
> >
> > I don't think we've added anything specifically for SNP. Do you mean the
> > generic kvm_unmap_gfn_range/kvm_flush_remote_tlbs sequence below?
> >
> > kvm_vm_ioctl_set_mem_attributes():
> > KVM_MMU_LOCK(kvm)
> > kvm_mmu_invalidate_begin()
> > ...
> > KVM_MMU_UNLOCK(kvm)
> >
> > kvm_vm_set_region_attr() // xarray/attribute update
> >
> > ...
> > KVM_MMU_LOCK(kvm)
> > kvm_mem_attrs_changed():
> > flush |= kvm_unmap_gfn_range()
> > if (flush)
> > kvm_flush_remote_tlbs()
> > KVM_MMU_UNLOCK(kvm)
> >
>
> Yes, I was talking about the sequence above. I was confused of why changing
> RMP requires a zapping-recreating flow of NPT in SNP.
Hmm, so you're suggesting doing something like moving the
kvm_unmap_gfn_range()/kvm_flush_remote_tlbs() calls into the .update_mem_attr()
callback so the platform can decide whether it needs the zap/flush, rather
than handling it in generic code?
If SNP really didn't need the handling it would be an interesting thought, but
I think it is needed there after all...
>
> > In general, when the RMPUPDATE instruction happens, the TLB entries for
> > the GPAs being modified will be flushed, so subsequent nested page fault
> > should be able to obtain the updated mapping based on xarray/#NPF at that
> > point. In that respect *maybe* we don't need to zap the entries there.
> >
> > But if the nested page fault occurs before the RMPUPDATE, I think we would
> > have a race if the above sequence isn't in place to handle the unmap/flush,
> > since in that case we might get a stale mapping because nothing would've
> > forced a tlbflush.
> >
> > There's also stuff like the UPM selftests and SEV lazy-pinning where I
> > think that kvm_unmap_gfn_range() sequence is also needed. But I might be
> > misunderstanding the question here.
> >
>
> In this case, an extra tlbflush would solve? Still, the unnecessary
> zapping/recreating of mapping is not promising. I understand that the way
> how this patch goes is probably to minimize the changes, but it would be
> nice to focus more on what is really needed in a common path and abstract
> and re-factor from there.
I tested this, and just doing the tlbflush is insufficient. If the
entries are still present in the nested page table, the HV doesn't necessarily
get an #NPF, so the guest can still end up with a stale mapping. After a
shared->private conversion the guest will cause an #NPF with the RMP/ENC bits
set once it tries to access the page as a private page, but for a
private->shared conversion the guest can subsequently access the page
with ENC-bit=0 without causing an #NPF. In that case the GFN can still
be mapped to the PFN that restrictedmem was using to back it while it was in
the private state, instead of to normal non-restricted memory.
So it seems like SNP needs the zapping behavior as well, and it
isn't very different from the TDX/SEV-lazy/selftest users, so having
common handling seems worthwhile.
>
> Can you elaborate more about how the lazy-pinning unpin path is connected
> with the zapping here? So that I can dig more about it.
Just to be clear, this is with regard to SEV lazy-pinning, which makes
use of restrictedmem mainly for lazy-pinning of private pages, rather
than SNP (which also inherits lazy-pinning from restrictedmem).
The main thing with SEV is that the private/shared status of a page is
completely up to how the guest decides to map it in its page tables,
unlike with SNP, where a mismatch between guest-expected status and
actual status in the RMP table will generate an #NPF. So SEV would be
even more reliant on the current zapping behavior to ensure the NPT gets
updated.
>
> Selftest is a minor case, guess we deal with them via enabling a switch.
> E.g. a prop in debugfs.
We wouldn't want to break things if a guest were running while a selftest
was running, or something along those lines. Especially since it seems to be a
common requirement given the above.
>
> > > If yes, can we postpone the update of the RMP table in the later fault,
> > > like TDX? So that we can save this update_mem_attr x86 ops as things
> > > will be solved in the SNP-specific fault handler.
> >
> > Hmm, I think this would be possible. But it's nice to be able to handle
> > the RMPUPDATE as part of KVM_SET_MEMORY_ATTRIBUTES, since it allows
> > KVM MMU code to rely solely on xarray state and not have to query RMP
> > table to check if a particular PFN needs an RMPUPDATE before mapping it
> > into RMP table.
> >
> > At least... it would *in theory*, if the RMPUPDATE happened under
> > protection of mmu_invalidate_seq (in which case it could inherit all the
> > same protections KVM MMU has around mmu_invalidate_seq/fault->mmu_seq,
> > e.g. letting the guest retry the #PF if fault->mmu_seq is stale).
> >
> > But currently, RMPUPDATE (via kvm_arch_post_set_memory_attributes) happens
> > *after* the invalidation sequence above, so in theory a guest could fault
> > on a page just after xarray state is updated, but before the RMPUPDATE has
> > been done, in which case the KVM MMU code would properly map the page
> > accordingly to xarray, but since RMPUPDATE wouldn't have happened yet, the
> > state of the corresponding PFN in RMP table won't match the shared/private
> > access type expected by the guest, so when it tries to access it it will
> > get another #NPF with RMP bit set in the error code, which will get
> > handled as a no-op in handle_rmp_page_fault() (patch #44) and loop like
> > this until the RMPUPDATE is finally done. So it still works out, but
> > maybe not keeping as much in sync with xarray state and could be.
> >
>
> I see. rmp fault handler only deals with page size mismatch for now.
That's correct.
>
> > But deferring RMPUPDATE to fault time goes in the other direction of
> > that. Are there benefits/requirements for doing things this way for TDX?
> > I could see it being beneficial in terms of reducing overhead for
> > uneeded page-state transitions, since they are only done on-demand but
> > doesn't seem like it would be that much overhead compared to some of the
> > other operations being done.
> >
>
> Besides the HW design, I guess one major purpose is to optimize the
> booting time of VMs with large memory. Also, post migration can be another case.
It seems like without lazy-acceptance support in the guest there isn't
much reason to optimize here, since the guest will necessarily fault
in every page as part of pre-accepting the memory in OVMF.
And if we're making use of lazy-acceptance, for SNP at least, we wouldn't
end up getting .update_mem_attr() callbacks in the first place, since
those are ultimately the result of the guest issuing a shared->private
conversion request, which generally wouldn't happen until just before the
guest decides to accept/pvalidate that GFN. So with lazy-acceptance the
guest optimizes most of this potential overhead away already.
>
> Out of curiosity, What is the avg cost of RMUPDATE? Suppose it is an x86
> instruction and not going through PSP firmware.
Yes, it's an x86 instruction, no firmware calls there. The average seems to
be about 1us per instruction. It's not insignificant: up to ~5s for a
16GB guest in a worst-case scenario where the guest does not optimize
for 2MB shared->private conversions and has no lazy-acceptance support,
but it doesn't seem like it would be common to try to boot large guests
in such a configuration.
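(For reference on that worst case: 16GB is roughly 4.2 million 4K pages,
so at ~1us each that's on the order of 4 seconds spent in RMPUPDATE alone.)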
>
> > >
> > > If no, guess we need a x86 ops to tell if a zapping is required.
> >
> > Sorry don't think I quite understand the suggestion. What would this
> > zapping be covering vs. the invalidation sequence that currently happens
> > in kvm_vm_ioctl_set_mem_attributes()?
>
> I was thinking that zapping of the mapping in EPT/NPT was required by TDX
> while SNP might only need an RMP update + TLB flush. Thus, the abstraction
> of the kvm_x86_ops.update_mem_attr should sit at this level. But let's
> scratch this for now as I need to dig more about the lazy pinning stuff.
>
> >
> > >
> > > Back to the lock, updating RMP table doesn't require a mutex. Taking
> > > the lock is required when updating the directmap. both TDX/SNP requires
> > > this update the directmap when changing memory attributes.
> >
> > Is that confirmed? If so, do you have a pointer to the associated
> > documentation? I'm a bit unclear on this due to above-mentioned
> > discussion.
> >
> > >
> > > Wouldn't it better to factor the touching directmap of kernel part out?
> >
> > It actually needs to happen before the RMPUPDATE. As soon as there is a
> > shared->private conversion in the RMP table for a particular PFN, then
> > any access via directmap by any particular kernel thread to any PFN that
> > happens to be in the same physical 2M range can cause an RMP fault on
> > the host, which would be fatal. So the rmpupdate() helper in this series
> > will unmap the directmap entry corresponding to the PFN before a shared->private
> > RMPUPDATE, and restore the mapping after a private->shared RMPUPDATE.
> >
> > So we could still factor it out, but it would be something like:
> >
> > if (attr == private)
> > kvm_unmap_directmap(start, end)
> > kvm_mem_attrs_changed()
> > if (attr == shared)
> > kvm_map_directmap(start, end)
> >
> > >
> > > Then you can call the x86 ops.update_mem_attr() in kvm_mem_attrs_changed().
> > > And update the direct kernel mapping for both TDX/SNP in the
> > > kvm_post_mem_attrs_changed().
> >
> > Or, adjusting for the above logic, move the unmapping/mapping to a new
> > kvm_pre_mem_attrs_changed() and kvm_post_mem_attrs_changed(), respectively.
> >
> > Which seems pretty reasonable to me. Then we can:
> > - drop duplicating the kvm_for_each_memslot_in_gfn_range() walk stuff because
> > we'd just need to know what PFNs to map/unmap from directmap
> > (although we'd still need a loop around kvm_restrictedmem_get_pfn()
> > for the GFN range so not necessarily prettier)
> > - call the RMPUPDATE / corresponding TDX handling via kvm_mem_attrs_changed()
> > which brings it both under KVM MMU lock and also lets it piggyback
> > off the fault->mmu_seq handling so it doesn't get out of sync with
> > xarray during fault time.
> >
>
> That sounds better. I am just a little bit worried that update_mem_attr() will
> end up as an SNP-only callback.
If it really ends up looking like an SNP-only thing, I don't see any
immediate issue with deferring this handling until fault time. But the
previous constraints remain:
- directmap unmap needs to happen before shared->private RMPUPDATE,
can't be called while holding KVM MMU lock or other spinlock
- the RMPUPDATE itself then also needs to happen outside of those locks
- directmap restore needs to happen after private->shared RMPUPDATE,
can't be called while holding KVM MMU lock or other spinlock
I saw the TDX patches added some x86 ops / hooks in KVM MMU to handle
mapping secure pages. Is there anything there you think is worth
re-using/re-purposing for SNP use-case?
Thanks,
-Mike
>
> > But it would be good to hear others' opinions on this. And also to confirm
> > whether TDX needs that pre/post directmap handling or not.
>
> Yes.
>
> >
> > Thanks!
> >
> > -Mike
> >
> > >
> > > >
> > > > -Mike
> > > >
> > > > >
> > > > > From 7c618c1f3c236c382e64680efcbe7d8a672aa870 Mon Sep 17 00:00:00 2001
> > > > > Message-Id: <7c618c1f3c236c382e64680efcbe7d8a672aa870.1679114841.git.isaku.yamahata@intel.com>
> > > > > In-Reply-To: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
> > > > > References: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
> > > > > From: Isaku Yamahata <[email protected]>
> > > > > Date: Fri, 17 Mar 2023 12:00:09 -0700
> > > > > Subject: [PATCH 4/4] KVM: x86: Add 'set_mem_attr' x86 op
> > > > >
> > > > > This callback will do any platform-specific handling needed for
> > > > > converting pages between shared/private.
> > > > >
> > > > > Originally-by: Michael Roth <[email protected]>
> > > > > Signed-off-by: Isaku Yamahata <[email protected]>
> > > > > ---
> > > > > arch/x86/include/asm/kvm-x86-ops.h | 1 +
> > > > > arch/x86/include/asm/kvm_host.h | 2 ++
> > > > > arch/x86/kvm/mmu/mmu.c | 1 +
> > > > > 3 files changed, 4 insertions(+)
> > > > >
> > > > > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > > > > index dc5f18ac0bd5..956db2ee25a5 100644
> > > > > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > > > > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > > > > @@ -100,6 +100,7 @@ KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> > > > > KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> > > > > KVM_X86_OP(load_mmu_pgd)
> > > > > KVM_X86_OP(fault_is_private)
> > > > > +KVM_X86_OP_OPTIONAL(set_mem_attr)
> > > > > KVM_X86_OP_OPTIONAL(link_private_spt)
> > > > > KVM_X86_OP_OPTIONAL(free_private_spt)
> > > > > KVM_X86_OP_OPTIONAL(split_private_spt)
> > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > > index 0382d236fbf4..88e11dd3afde 100644
> > > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > > @@ -1731,6 +1731,8 @@ struct kvm_x86_ops {
> > > > > void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> > > > > int root_level);
> > > > > bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code);
> > > > > + void (*set_mem_attr)(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > > > + unsigned int attr, gfn_t start, gfn_t end);
> > > > >
> > > > > int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > > > > void *private_spt);
> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > index 0ec94c72895c..329333486e64 100644
> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > > @@ -7908,6 +7908,7 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > > > > gfn_t start, gfn_t end)
> > > > > {
> > > > > kvm_update_lpage_mixed_flag(kvm, slot, true, attrs, start, end);
> > > > > + static_call(kvm_x86_set_mem_attr)(kvm, slot, attrs, start, end);
> > > > > }
> > > > >
> > > > > void kvm_memory_attributes_create_memslot(struct kvm *kvm,
> > > > > --
> > > > > 2.25.1
> > > > >
> > > > > --
> > > > > Isaku Yamahata <[email protected]>
> > >
> > >
>
On Wed, Mar 01, 2023 at 08:21:17AM -0800, Dave Hansen wrote:
> On 2/20/23 10:38, Michael Roth wrote:
> > +static int handle_split_page_fault(struct vm_fault *vmf)
> > +{
> > + __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
> > + return 0;
> > +}
> > +
> > /*
> > * By the time we get here, we already hold the mm semaphore
> > *
> > @@ -5078,6 +5084,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> > pmd_migration_entry_wait(mm, vmf.pmd);
> > return 0;
> > }
> > +
> > + if (flags & FAULT_FLAG_PAGE_SPLIT)
> > + return handle_split_page_fault(&vmf);
>
> I asked this long ago, but how do you prevent these faults from
> occurring on hugetlbfs mappings that can't be split?
>
In v6 there used to be a KVM ioctl to register a user HVA range for use
with SEV-SNP guests, and as part of that registration the code would scan
all the VMAs encompassed by that range and check for VM_HUGETLB in
vma->vm_flags.
With v7+ this registration mechanism has been replaced with the
new restricted memfd implementation provided by UPM to manage private guest
memory. Normal shmem/memfd backend can specify HugeTLBFS via a
MFD_HUGETLB flag when creating the memfd, but for restricted memfd no
special flags are allowed, so HugeTLBFS isn't possible for the pages
that are used for private memory. Though it might make sense to enforce
that in SNP-specific code still, in case restricted memfd does
eventually gain that ability...
But now, with v7+, the non-private memory that doesn't get allocated via
restricted memfd (and thus can actually be mapped into userspace and
used for things like buffers shared between host/guest), can still be
allocated via HugeTLBFS since there is nothing SNP is doing to
specifically guard against that. So we'd probably want to reimplement
similar logic to what was in v6 to guard against this, since it's these
mappings that would potentially be triggering the RMP faults and require
splitting.
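Something along the lines of the old v6 check is what I have in mind, e.g.
the following rough sketch (hypothetical helper name/placement, not actual
patch code):

  static int snp_check_hva_range(struct kvm *kvm, unsigned long hva_start,
                                 unsigned long hva_end)
  {
          struct vm_area_struct *vma;
          int ret = 0;

          mmap_read_lock(kvm->mm);

          /* Reject hugetlbfs-backed VMAs in the range, since they can't be split. */
          for (vma = find_vma(kvm->mm, hva_start);
               vma && vma->vm_start < hva_end;
               vma = find_vma(kvm->mm, vma->vm_end)) {
                  if (is_vm_hugetlb_page(vma)) {
                          ret = -EINVAL;
                          break;
                  }
          }

          mmap_read_unlock(kvm->mm);

          return ret;
  }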
However...
The fact that any pages potentially triggering these #PFs are able to be
mapped as 2M in the first place means that all the PFNs covered by that
2M mapping must also have been allocated via mappable/VMA memory rather
than via restricted memfd where userspace mappings are not possible.
So I think we should be able to drop this patch entirely, as well as
allow the use of HugeTLBFS for non-restricted memfd memory (though
eventually the guest will switch all its memory to private/restricted
so not gaining much there other than reducing management complexity).
-Mike
On Wed, Mar 29, 2023 at 02:00:11AM +0300, Zhi Wang wrote:
> On Mon, 27 Mar 2023 23:36:35 -0500
> Michael Roth <[email protected]> wrote:
>
> > On Thu, Mar 23, 2023 at 08:17:16PM +0200, Zhi Wang wrote:
> > > On Tue, 21 Mar 2023 20:58:38 -0500
> > > Michael Roth <[email protected]> wrote:
> > >
> > > > On Tue, Mar 21, 2023 at 01:21:36PM +0200, Zhi Wang wrote:
> > > > > On Mon, 20 Mar 2023 13:05:43 -0500
> > > > > Michael Roth <[email protected]> wrote:
> > > > >
> > > > > > On Fri, Mar 17, 2023 at 09:56:11PM -0700, Isaku Yamahata wrote:
> > > > > > > On Mon, Feb 20, 2023 at 12:37:53PM -0600,
> > > > > > > Michael Roth <[email protected]> wrote:
> > > > > > >
> > > > > > > > This callback will do any platform-specific handling needed for
> > > > > > > > converting pages between shared/private.
> > > > > > > >
> > > > > > > > Signed-off-by: Michael Roth <[email protected]>
> > > > > > > > ---
> >
> > <snip>
> >
> > > > > > > > static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > > > > > > struct kvm_memory_attributes *attrs)
> > > > > > > > {
> > > > > > > > @@ -2602,6 +2628,9 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > > > > > > kvm_mmu_invalidate_end(kvm);
> > > > > > > > KVM_MMU_UNLOCK(kvm);
> > > > > > > >
> > > > > > > > + if (i > start)
> > > > > > > > + kvm_post_mem_attrs_changed(kvm, attrs->attributes, start, i);
> > > > > > > > +
> > > > > > >
> > > > > > > Doesn't kvm_arch_set_memory_attributes() work for you? i.e the following patch.
> > > > > > > The error check and pr_warn_ratelimited() can be pushed down into the callback.
> > > > > >
> > > > > > This is originally how I had but when CONFIG_PREEMPT_COUNT is set this
> > > > > > will generate warnings for this callback as well as the invalidation
> > > > > > callback as reported in v7 here:
> > > > > >
> > > > > > https://lore.kernel.org/lkml/Y80vhKwQyw8hS%2F22@notebook/
> > > > > >
> > > > > > The main issue is that kvm_mem_attrs_changed() is called while holding
> > > > > > the KVM MMU lock, which disables preemption. But when updating
> > > > > > attributes for SNP, we also need to remove private pages from kernel
> > > > > > directmap, which involves acquiring a mutex which results in
> > > > > > "BUG: scheduling while atomic" warnings.
> > > > > >
> > > > > > So that's why we ended up somewhat duplicating some of the logic and
> > > > > > using a separate callback chain that happens out of KVM MMU lock.
> > > > >
> > > > > Let's split the things of changing memory attributes:
> > > > >
> > > > > 1) Update the memory attributes in the xa array (Both TDX and SNP)
> > > > > 2) Zapping the EPT/NPT mappings (Required by TDX)
> > > > > 3) Update RMP table (Required by SNP)
> > > > > 4) Update the directmap of kernel (SNP, but I guess TDX needs it as well)
> > > >
> > >
> > > Thanks for the effort of detailed reply. It is very informative.
> > >
> > > > I'm not so sure TDX requires this. I was under that impression, but
> > > > Kirill raised some doubts about this and I'm not sure it's been
> > > > confirmed. If it's purely an SNP thing then there may not be much value
> > > > in creating a separate callback for it:
> > > >
> > > > https://lore.kernel.org/linux-mm/[email protected]/T/#meba4ce80709cd3afd3818b61e6419fd800287b9e
> > > >
> > >
> > > Hmm, Kirill and Isaku, can you confirm that TDX doesn't need this?
> > >
> > > I think it is a generic requirement that TDX/SNP are not expecting the
> > > host to touch a private page either from the kernel or the userspace.
> >
> > The main issue is that in the case of the RMP table, a write to a 2M
> > mapping is interpreted as a write to a private page even if the write
> > didn't actually touch any private pages within the range. It may be
> > possible in firmware/hardware to distinguish between these 2 cases, but I'm
> > not sure whether that's the case for TDX or not.
> >
>
> I was thinking about this today. SNP uses this trick to cope with the existing
> HW design, so it is mandatory for SNP. What was the benefit of
> removing/restoring the directmap for the purpose of restricting access
> from the kernel for TDX? Probably catching an errant kernel touch of a
> private page instead of triggering an MCE.
Not sure, but yah might've been something like that.
>
> My point is to let them stay in SNP code for now, as they seem closely coupled
> with RMPUPDATE so far and the timing of doing the RMPUPDATE is still under
> discussion. Maybe we invent some new pre/post hooks, but they might turn out
> to be unnecessary later if the timing of the RMPUPDATE is changed.
Makes sense.
>
> > >
> > > > And for SNP, the current code does the unmapping/RMP update in the same
> > > > function:
> > > >
> > > > [PATCH RFC v8 15/56] x86/sev: Invalidate pages from the direct map when adding them to the RMP table
> > > >
> > > > I'm not against splitting RMP/directmap handling, but just want to
> > > > understand what the requirements are around that a bit better.
> > > >
> > > > Does handling the #3 / RMP update / kvm_arch_post_set_memory_attributes
> > > > stuff outside of MMU lock cause issues on TDX side? What sort of
> > > > handling is needed in these callbacks for TDX (if anything)?
> > > >
> > >
> > > No, it doesn't cause problem for TDX as TDX doesn't need such callback.
> > >
> > > Unlike SNP, which has (1 NPT + 1 RMP) and the enforced HW check is done by RMP, TDX has
> > > two EPT(smiliar with NPT)s (1 shared + 1 private). Converting the memory attr is achieved
> > > by zapping the mapping from one EPT and creating the mapping in the other one in the fault
> > > when guest access the memory. The fault GPA will carry the "SHARED" bit (!C-BIT), so
> > > KVM knows which EPT should be chosen for populating the mapping.
> > >
> > > I was trying to figure out what should be the proper callback and at which layer it should
> > > sit to achieve changing the memory attr for both TDX/SNP. The current callback looks a little
> > > bit hacky. Duplicating code pieces because of locks implies the SW structure might need to be
> > > re-considered.
> > >
> > > > >
> > > > > Does SNP really need to zap the NPT mappings when changing the memory
> > > > > attributes? (The new mappings will be created later in the fault). I don't
> > > > > find this requirement from APM.
> > > >
> > > > I don't think we've added anything specifically for SNP. Do you mean the
> > > > generic kvm_unmap_gfn_range/kvm_flush_remote_tlbs sequence below?
> > > >
> > > > kvm_vm_ioctl_set_mem_attributes():
> > > > KVM_MMU_LOCK(kvm)
> > > > kvm_mmu_invalidate_begin()
> > > > ...
> > > > KVM_MMU_UNLOCK(kvm)
> > > >
> > > > kvm_vm_set_region_attr() // xarray/attribute update
> > > >
> > > > ...
> > > > KVM_MMU_LOCK(kvm)
> > > > kvm_mem_attrs_changed():
> > > > flush |= kvm_unmap_gfn_range()
> > > > if (flush)
> > > > kvm_flush_remote_tlbs()
> > > > KVM_MMU_UNLOCK(kvm)
> > > >
> > >
> > > Yes, I was talking about the sequence above. I was confused about why changing
> > > the RMP requires a zap-and-recreate flow of the NPT in SNP.
> >
> > Hmm, so you're suggesting doing something like moving the
> > kvm_unmap_gfn_range()/kvm_flush_remote_tlbs() into .update_mem_attr()
> > callback so the platform can decide if it needs the zap/flush rather
> > than handling it in generic code?
> >
> > If SNP really did't need the handling it's an interesting thought, but
> > I think it is needed there after all...
>
> Yes. I was thinking so. When I was reading the APM, I thought one of the
> advantages of the SNP architecture was splitting the HW-enforced check from NPT
> mapping management. Thus when updating the RMP, the NPT can stay as it is. I
> guess that's why I was so keen on eliminating the extra zapping.
Yah it was an interesting thought. Maybe before the switch to UPM it might've
been possible to avoid the zapping, but seems like it's a fairly general
requirement now.
>
> After reading the code of the UPM, I think we did need the zapping here for
> SNP as pages for private memory and shared memory seem from different
> sources. shared <->private memory conversion needs to wipe the mapping of
> pages from one source and go through a fault-mapping flow on the pages
> from the other source. Still wondering if there is any good idea to improve
> it based on the current approaches.
It may still be possible to avoid zapping for shared->private
conversions since the subsequent access with C-bit set would cause an
#NPF with RMP bit set, but yah, it won't work for private->shared and
doesn't seem worth it to complicate the logic for anything SNP-specific
unless it would make a significant performance impact, but I think it
would be lost in the noise vs. RMPUPDATE/initial faulting of private
pages/directmap handling/guest exits for page-state change/etc.
>
> >
> > >
> > > > In general, when the RMPUPDATE instruction happens, the TLB entries for
> > > > the GPAs being modified will be flushed, so subsequent nested page fault
> > > > should be able to obtain the updated mapping based on xarray/#NPF at that
> > > > point. In that respect *maybe* we don't need to zap the entries there.
> > > >
> > > > But if the nested page fault occurs before the RMPUPDATE, I think we would
> > > > have a race if the above sequence isn't in place to handle the unmap/flush,
> > > > since in that case we might get a stale mapping because nothing would've
> > > > forced a tlbflush.
> > > >
> > > > There's also stuff like the UPM selftests and SEV lazy-pinning where I
> > > > think that kvm_unmap_gfn_range() sequence is also needed. But I might be
> > > > misunderstanding the question here.
> > > >
> > >
> > > In this case, wouldn't an extra tlbflush solve it? Still, the unnecessary
> > > zapping/recreating of the mapping is not promising.
> > > how this patch goes is probably to minimize the changes, but it would be
> > > nice to focus more on what is really needed in a common path and abstract
> > > and re-factor from there.
> >
> > I tested this and just doing the tlbflush is insufficient. If the
> > entries are still present in the nested page table the HV doesn't necessarily
> > get an #NPF so the guest can still end up with a stale mapping. After a
> > shared->private conversion the guest would cause #NPF with RMP/ENC bits
> > set once it tries to access it as a private page, but for a
> > private->shared conversion the guest can subsequently access the page
> > with ENC-bit=0 without causing an #NPF, but in this case GFN can still
> > be mapped to the PFN restrictedmem was using to back it when it was in
> > private state, instead of normal non-restricted memory.
> >
> > So it seems like SNP needs the zapping behavior as well, and that it
> > isn't very different from the TDX/SEV-lazy/selftest users. So having
> > common handling seems worthwhile.
> >
> > >
> > > Can you elaborate more about how the lazy-pinning unpin path is connected
> > > with the zapping here? So that I can dig more about it.
> >
> > Just to be clear this is with regard to SEV lazy-pinning, which makes
> > use of restrictedmem mainly for lazy-pinning of private pages, rather
> > than SNP (which also inherits lazy-pinning from restrictedmem).
> >
> > The main thing with SEV is that the private/shared status of a page is
> > completely up to how the guest decides to map it in its page tables,
> > unlike with SNP where a mismatch between guest-expected status and
> > actual status in the RMP table will generate a #NPF. So SEV would be
> > even more reliant on current zapping behavior to ensure the NPT will
> > be updated.
> >
> > >
> > > Selftests are a minor case; I guess we could deal with them via a switch,
> > > e.g. a prop in debugfs.
> >
> > Wouldn't want to break things if a guest was running while selftest was
> > running, or something along that line. Especially since it seems to be a
> > common requirement given the above.
> >
> > >
> > > > > If yes, can we postpone the update of the RMP table in the later fault,
> > > > > like TDX? So that we can save this update_mem_attr x86 ops as things
> > > > > will be solved in the SNP-specific fault handler.
> > > >
> > > > Hmm, I think this would be possible. But it's nice to be able to handle
> > > > the RMPUPDATE as part of KVM_SET_MEMORY_ATTRIBUTES, since it allows
> > > > KVM MMU code to rely solely on xarray state and not have to query RMP
> > > > table to check if a particular PFN needs an RMPUPDATE before mapping it
> > > > into RMP table.
> > > >
> > > > At least... it would *in theory*, if the RMPUPDATE happened under
> > > > protection of mmu_invalidate_seq (in which case it could inherit all the
> > > > same protections KVM MMU has around mmu_invalidate_seq/fault->mmu_seq,
> > > > e.g. letting the guest retry the #PF if fault->mmu_seq is stale).
> > > >
> > > > But currently, RMPUPDATE (via kvm_arch_post_set_memory_attributes) happens
> > > > *after* the invalidation sequence above, so in theory a guest could fault
> > > > on a page just after xarray state is updated, but before the RMPUPDATE has
> > > > been done, in which case the KVM MMU code would properly map the page
> > > > accordingly to xarray, but since RMPUPDATE wouldn't have happened yet, the
> > > > state of the corresponding PFN in RMP table won't match the shared/private
> > > > access type expected by the guest, so when it tries to access it it will
> > > > get another #NPF with RMP bit set in the error code, which will get
> > > > handled as a no-op in handle_rmp_page_fault() (patch #44) and loop like
> > > > this until the RMPUPDATE is finally done. So it still works out, but
> > > > it's not keeping things as in sync with the xarray state as it could be.
> > > >
> > >
> > > I see. rmp fault handler only deals with page size mismatch for now.
> >
> > That's correct.
> >
> > >
> > > > But deferring RMPUPDATE to fault time goes in the other direction of
> > > > that. Are there benefits/requirements for doing things this way for TDX?
> > > > I could see it being beneficial in terms of reducing overhead for
> > > > unneeded page-state transitions, since they are only done on-demand, but
> > > > it doesn't seem like it would be that much overhead compared to some of the
> > > > other operations being done.
> > > >
> > >
> > > Besides the HW design, I guess one major purpose is to optimize the
> > > booting time of VMs with large memory. Also, post migration can be another case.
> >
> > It seems like without lazy-acceptance support in the guest there isn't
> > too much reason to optimize here, since the guest will necessarily fault
> > in every page as part of pre-accepting the memory in OVMF.
> >
>
> Do you mean pre-accepting the whole of system memory, or only the initial
> memory allocated for OVMF?
I mean "pre-accepting" in the sense of "guests that don't have
lazy-acceptance enabled", so yes the whole of system memory.
>
> TDX has already been working in a lazy-acceptance style. I believe the
> lazy-acceptance will be the future direction. But I understand that part
> can be in stage 2 with the lazy-acceptance for SNP.
For SNP the OVMF bits are there:
https://www.mail-archive.com/[email protected]/msg53857.html
but guest-side kernel support still requires these patches getting upstream:
Kirill's base+TDX lazy-acceptance support:
https://lore.kernel.org/linux-mm/[email protected]/T/
Tom's follow-on SNP support:
https://lore.kernel.org/lkml/[email protected]/
Dionna's patch to instruct OVMF to not pre-accept remaining memory
after EFI_EXIT_BOOT_SERVICES:
https://lore.kernel.org/lkml/[email protected]/T/
But yes, I agree lazy-acceptance will probably be the common default for
TDX/SNP guests, where we'd only be zapping entries that we know the
guest is going to be using anyway, as well as avoiding all the other
related overhead.
>
> > And if we're making use of lazy-acceptance, for SNP at least, we wouldn't
> > end up getting .update_mem_attr() callbacks in the first place since
> > those are ultimately the result of the guest issuing a shared->private
> > conversion request, which generally wouldn't happen until just before the
> > guest decides to accept/pvalidate that GFN. So with lazy-acceptance the
> > guest optimizes most of this potential overhead away already.
> I understand there can be
> >
> > >
> > > Out of curiosity, what is the avg cost of RMPUPDATE? I suppose it is an x86
> > > instruction and doesn't go through PSP firmware.
> >
> > Yes, it's an x86 instruction, no firmware calls there. Average seems to
> > be about 1us per instruction. It's not insignificant, up to 5s for a
> > 16GB guest in a worst case scenario where the guest does not optimize
> > for 2MB shared->private conversions and has no lazy-acceptance support,
> > but it doesn't seem like it would be common to try to boot large guests
> > in such a configuration.
> >
> I see.
> > >
> > > > >
> > > > > If no, I guess we need an x86 op to tell if zapping is required.
> > > >
> > > > Sorry don't think I quite understand the suggestion. What would this
> > > > zapping be covering vs. the invalidation sequence that currently happens
> > > > in kvm_vm_ioctl_set_mem_attributes()?
> > >
> > > I was thinking that zapping of the mapping in EPT/NPT was required by TDX
> > > while SNP might only need an RMP update + TLB flush. Thus, the abstraction
> > > of the kvm_x86_ops.update_mem_attr should sit at this level. But let's
> > > scratch this for now as I need to dig more about the lazy pinning stuff.
> > >
> > > >
> > > > >
> > > > > Back to the lock, updating the RMP table doesn't require a mutex. Taking
> > > > > the lock is required when updating the directmap. Both TDX and SNP require
> > > > > updating the directmap when changing memory attributes.
> > > >
> > > > Is that confirmed? If so, do you have a pointer to the associated
> > > > documentation? I'm a bit unclear on this due to above-mentioned
> > > > discussion.
> > > >
> > > > >
> > > > > Wouldn't it be better to factor out the part that touches the kernel directmap?
> > > >
> > > > It actually needs to happen before the RMPUPDATE. As soon as there is a
> > > > shared->private conversion in the RMP table for a particular PFN, then
> > > > any access via directmap by any particular kernel thread to any PFN that
> > > > happens to be in the same physical 2M range can cause an RMP fault on
> > > > the host, which would be fatal. So the rmpupdate() helper in this series
> > > > will unmap the directmap entry corresponding to the PFN before a shared->private
> > > > RMPUPDATE, and restore the mapping after a private->shared RMPUPDATE.
> > > >
> > > > So we could still factor it out, but it would be something like:
> > > >
> > > > if (attr == private)
> > > > kvm_unmap_directmap(start, end)
> > > > kvm_mem_attrs_changed()
> > > > if (attr == shared)
> > > > kvm_map_directmap(start, end)
> > > >
> > > > >
> > > > > Then you can call the x86 ops.update_mem_attr() in kvm_mem_attrs_changed().
> > > > > And update the direct kernel mapping for both TDX/SNP in the
> > > > > kvm_post_mem_attrs_changed().
> > > >
> > > > Or, adjusting for the above logic, move the unmapping/mapping to a new
> > > > kvm_pre_mem_attrs_changed() and kvm_post_mem_attrs_changed(), respectively.
> > > >
> > > > Which seems pretty reasonable to me. Then we can:
> > > > - drop duplicating the kvm_for_each_memslot_in_gfn_range() walk stuff because
> > > > we'd just need to know what PFNs to map/unmap from directmap
> > > > (although we'd still need a loop around kvm_restrictedmem_get_pfn()
> > > > for the GFN range so not necessarily prettier)
> > > > - call the RMPUPDATE / corresponding TDX handling via kvm_mem_attrs_changed()
> > > > which brings it both under KVM MMU lock and also lets it piggyback
> > > > off the fault->mmu_seq handling so it doesn't get out of sync with
> > > > xarray during fault time.
> > > >
> > >
> > > That sounds better. I am just a little bit worried that update_mem_attr() will
> > > end up as an SNP-only callback.
> >
> > If it really ends up looking like an SNP-only thing, I don't see any
> > immediate issue with deferring this handling until fault time. But the
> > previous constraints remain:
> >
> > - directmap unmap needs to happen before shared->private RMPUPDATE,
> > can't be called while holding KVM MMU lock or other spinlock
> > - the RMPUPDATE itself then also needs to happen outside of those locks
> > - directmap restore needs to happen after private->shared RMPUPDATE,
> > can't be called while holding KVM MMU lock or other spinlock
> >
> > I saw the TDX patches added some x86 ops / hooks in KVM MMU to handle
> > mapping secure pages. Is there anything there you think is worth
> > re-using/re-purposing for SNP use-case?
> >
>
> They are mainly for the TDP MMU to manipulate the TDX secure EPT through
> firmware call. Due to the TDX architecture (1 shared + 1 secure EPT) and
> mirror secure EPT design, TDP MMU is aware there is a "secure EPT but with
> reduced-features", when a #PF happens which EPT should be operated, and
> when memory attr is changed which EPT should be zapped. Those callbacks
> in TDP MMU are used for maintaining the mirror secure EPT.
>
> SNP is designed in a different way. It doesn't need the TDP MMU to be aware
> that "there is another kind of NPT" since the enforced check is done in the RMP, so SNP can
> easily keep all the RMP routines out of the TDP MMU. Also, SNP doesn't need
> a mirror NPT. Thus, SNP doesn't need those callbacks.
>
> In *theory*, it is possible to re-purpose some of the callbacks and merge
> the RMP update into the TDP MMU, e.g. omitting the mirror secure EPT manipulation
> for SNP but updating the RMP in the callback for setting an SPTE. But I have to
> think about what would be the benefit as it doesn't fit the SNP HW nature.
Yah, can't think of any benefit currently. With both pre-acceptance and
lazy-acceptance those pages will get touched shortly after
KVM_SET_MEMORY_ATTRIBUTES, so the more obvious benefit of avoiding
unnecessary RMPUPDATEs doesn't end up working here. And in terms of re-using
callbacks it seems like the TDX and SNP hooks are both where it makes the
most sense to have them. Would be nice to have a shared callback for
this but might not be the right approach here.
-Mike
>
>
> > Thanks,
> >
> > -Mike
> >
> > >
> > > > But it would be good to hear others' opinions on this. And also to confirm
> > > > whether TDX needs that pre/post directmap handling or not.
> > >
> > > Yes.
> > >
> > > >
> > > > Thanks!
> > > >
> > > > -Mike
> > > >
> > > > >
> > > > > >
> > > > > > -Mike
> > > > > >
> > > > > > >
> > > > > > > From 7c618c1f3c236c382e64680efcbe7d8a672aa870 Mon Sep 17 00:00:00 2001
> > > > > > > Message-Id: <7c618c1f3c236c382e64680efcbe7d8a672aa870.1679114841.git.isaku.yamahata@intel.com>
> > > > > > > In-Reply-To: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
> > > > > > > References: <428a676face7a06a90e59dca1c32941c9b6ee001.1679114841.git.isaku.yamahata@intel.com>
> > > > > > > From: Isaku Yamahata <[email protected]>
> > > > > > > Date: Fri, 17 Mar 2023 12:00:09 -0700
> > > > > > > Subject: [PATCH 4/4] KVM: x86: Add 'set_mem_attr' x86 op
> > > > > > >
> > > > > > > This callback will do any platform-specific handling needed for
> > > > > > > converting pages between shared/private.
> > > > > > >
> > > > > > > Originally-by: Michael Roth <[email protected]>
> > > > > > > Signed-off-by: Isaku Yamahata <[email protected]>
> > > > > > > ---
> > > > > > > arch/x86/include/asm/kvm-x86-ops.h | 1 +
> > > > > > > arch/x86/include/asm/kvm_host.h | 2 ++
> > > > > > > arch/x86/kvm/mmu/mmu.c | 1 +
> > > > > > > 3 files changed, 4 insertions(+)
> > > > > > >
> > > > > > > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > > > > > > index dc5f18ac0bd5..956db2ee25a5 100644
> > > > > > > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > > > > > > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > > > > > > @@ -100,6 +100,7 @@ KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> > > > > > > KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> > > > > > > KVM_X86_OP(load_mmu_pgd)
> > > > > > > KVM_X86_OP(fault_is_private)
> > > > > > > +KVM_X86_OP_OPTIONAL(set_mem_attr)
> > > > > > > KVM_X86_OP_OPTIONAL(link_private_spt)
> > > > > > > KVM_X86_OP_OPTIONAL(free_private_spt)
> > > > > > > KVM_X86_OP_OPTIONAL(split_private_spt)
> > > > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > > > > index 0382d236fbf4..88e11dd3afde 100644
> > > > > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > > > > @@ -1731,6 +1731,8 @@ struct kvm_x86_ops {
> > > > > > > void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> > > > > > > int root_level);
> > > > > > > bool (*fault_is_private)(struct kvm *kvm, gpa_t gpa, u64 error_code);
> > > > > > > + void (*set_mem_attr)(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > > > > > + unsigned int attr, gfn_t start, gfn_t end);
> > > > > > >
> > > > > > > int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > > > > > > void *private_spt);
> > > > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > > > index 0ec94c72895c..329333486e64 100644
> > > > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > > > > @@ -7908,6 +7908,7 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
> > > > > > > gfn_t start, gfn_t end)
> > > > > > > {
> > > > > > > kvm_update_lpage_mixed_flag(kvm, slot, true, attrs, start, end);
> > > > > > > + static_call(kvm_x86_set_mem_attr)(kvm, slot, attrs, start, end);
> > > > > > > }
> > > > > > >
> > > > > > > void kvm_memory_attributes_create_memslot(struct kvm *kvm,
> > > > > > > --
> > > > > > > 2.25.1
> > > > > > >
> > > > > > > --
> > > > > > > Isaku Yamahata <[email protected]>
> > > > >
> > > > >
> > >
>
On Fri, Feb 24, 2023 at 01:37:48PM +0100, Alexander Graf wrote:
>
> On 20.02.23 19:38, Michael Roth wrote:
> > From: Tom Lendacky <[email protected]>
> >
> > Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
> > guests to alter the register state of the APs on their own. This allows
> > the guest a way of simulating INIT-SIPI.
> >
> > A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
> > so as to avoid updating the VMSA pointer while the vCPU is running.
> >
> > For CREATE
> > The guest supplies the GPA of the VMSA to be used for the vCPU with
> > the specified APIC ID. The GPA is saved in the svm struct of the
> > target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
> > to the vCPU and then the vCPU is kicked.
> >
> > For CREATE_ON_INIT:
> > The guest supplies the GPA of the VMSA to be used for the vCPU with
> > the specified APIC ID the next time an INIT is performed. The GPA is
> > saved in the svm struct of the target vCPU.
> >
> > For DESTROY:
> > The guest indicates it wishes to stop the vCPU. The GPA is cleared
> > from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is
> > added to vCPU and then the vCPU is kicked.
> >
> > The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked
> > as a result of the event or as a result of an INIT. The handler sets the
> > vCPU to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will
> > leave the vCPU as not runnable. Any previous VMSA pages that were
> > installed as part of an SEV-SNP AP Creation NAE event are un-pinned. If
> > a new VMSA is to be installed, the VMSA guest page is pinned and set as
> > the VMSA in the vCPU VMCB and the vCPU state is set to
> > KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is
> > cleared in the vCPU VMCB and the vCPU state is left as
> > KVM_MP_STATE_UNINITIALIZED to prevent it from being run.
> >
> > Signed-off-by: Tom Lendacky <[email protected]>
> > Signed-off-by: Brijesh Singh <[email protected]>
> > Signed-off-by: Ashish Kalra <[email protected]>
> > [mdr: add handling for restrictedmem]
> > Signed-off-by: Michael Roth <[email protected]>
>
>
> What is the intended boot sequence for SEV-SNP guests? FWIW with this
> interface in place, guests will typically use in-guest VMSA pages to hold
> secondary vcpu state. But that means we're now allocating 4kb of memory for
> every vcpu that we create that will be for most of the guest's lifetime
> superfluous.
>
> Wouldn't it make more sense to have a model where we only allocate the VMSA
> for the boot CPU and leave secondary allocation to the guest? We already
> need firmware changes for SEV-SNP - may as well make this one more.
I don't think we'd necessarily need a firmware change. We could just
free the original VMSA back to the hypervisor as soon as those APs come
online. The down-side to that, versus deferring cleanup till guest
shutdown, is that there is some flushing activity (see:
sev_flush_encrypted_page()) that would now likely be occurring during
guest boot-up where the overhead might be more noticeable. But for SNP
the host likely supports X86_FEATURE_SME_COHERENT so the overhead
probably isn't that bad.
>
> [...]
>
> > +
> > +static int sev_snp_ap_creation(struct vcpu_svm *svm)
> > +{
> > + struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
> > + struct kvm_vcpu *vcpu = &svm->vcpu;
> > + struct kvm_vcpu *target_vcpu;
> > + struct vcpu_svm *target_svm;
> > + unsigned int request;
> > + unsigned int apic_id;
> > + bool kick;
> > + int ret;
> > +
> > + request = lower_32_bits(svm->vmcb->control.exit_info_1);
> > + apic_id = upper_32_bits(svm->vmcb->control.exit_info_1);
> > +
> > + /* Validate the APIC ID */
> > + target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id);
>
>
> Out of curiosity: The target CPU can be my own vCPU, right?
I don't think that would be the normal behavior, but maybe with some
care it's possible for a guest to do things that way. I haven't seen
anything strictly prohibiting this in the relevant specs.
>
>
> > + if (!target_vcpu) {
> > + vcpu_unimpl(vcpu, "vmgexit: invalid AP APIC ID [%#x] from guest\n",
> > + apic_id);
> > + return -EINVAL;
> > + }
> > +
> > + ret = 0;
> > +
> > + target_svm = to_svm(target_vcpu);
> > +
> > + /*
> > + * The target vCPU is valid, so the vCPU will be kicked unless the
> > + * request is for CREATE_ON_INIT. For any errors at this stage, the
> > + * kick will place the vCPU in an non-runnable state.
> > + */
> > + kick = true;
> > +
> > + mutex_lock(&target_svm->sev_es.snp_vmsa_mutex);
> > +
> > + target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE;
> > + target_svm->sev_es.snp_ap_create = true;
> > +
> > + /* Interrupt injection mode shouldn't change for AP creation */
> > + if (request < SVM_VMGEXIT_AP_DESTROY) {
> > + u64 sev_features;
> > +
> > + sev_features = vcpu->arch.regs[VCPU_REGS_RAX];
> > + sev_features ^= sev->sev_features;
> > + if (sev_features & SVM_SEV_FEAT_INT_INJ_MODES) {
> > + vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n",
> > + vcpu->arch.regs[VCPU_REGS_RAX]);
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > + }
> > +
> > + switch (request) {
> > + case SVM_VMGEXIT_AP_CREATE_ON_INIT:
> > + kick = false;
> > + fallthrough;
> > + case SVM_VMGEXIT_AP_CREATE:
> > + if (!page_address_valid(vcpu, svm->vmcb->control.exit_info_2)) {
> > + vcpu_unimpl(vcpu, "vmgexit: invalid AP VMSA address [%#llx] from guest\n",
> > + svm->vmcb->control.exit_info_2);
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + /*
> > + * Malicious guest can RMPADJUST a large page into VMSA which
> > + * will hit the SNP erratum where the CPU will incorrectly signal
> > + * an RMP violation #PF if a hugepage collides with the RMP entry
> > + * of VMSA page, reject the AP CREATE request if VMSA address from
> > + * guest is 2M aligned.
>
>
> This will break genuine current Linux kernels that just happen to allocate a
> guest page, no? In fact, given enough vCPUs you're almost guaranteed to hit
> an aligned structure somewhere. What is the guest supposed to do in that
> situation?
The initial SNP support for guest kernels already made use of
snp_alloc_vmsa_page() to do the appropriate workaround to avoid allocating
2MB-aligned VMSA pages.
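Roughly, that guest-side workaround looks something like the following
(paraphrased, so details may differ from the actual code):

  static void *snp_alloc_vmsa_page(void)
  {
          struct page *p;

          /*
           * Allocate an 8K (order-1) block and return the second 4K page:
           * since the pair is only 8K-aligned, the second page can never be
           * 2MB-aligned, avoiding the erratum where a hugepage collides with
           * the VMSA's RMP entry.
           */
          p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
          if (!p)
                  return NULL;

          split_page(p, 1);

          /* Free the first 4K page, keep the non-2M-aligned second one. */
          __free_page(p);

          return page_address(p + 1);
  }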
-Mike
>
>
>
On 4/4/23 17:48, Michael Roth wrote:
> On Fri, Feb 24, 2023 at 01:37:48PM +0100, Alexander Graf wrote:
>>
>> On 20.02.23 19:38, Michael Roth wrote:
>>> From: Tom Lendacky <[email protected]>
>>>
>>> Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
>>> guests to alter the register state of the APs on their own. This allows
>>> the guest a way of simulating INIT-SIPI.
>>>
>>> A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
>>> so as to avoid updating the VMSA pointer while the vCPU is running.
>>>
>>> For CREATE
>>> The guest supplies the GPA of the VMSA to be used for the vCPU with
>>> the specified APIC ID. The GPA is saved in the svm struct of the
>>> target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
>>> to the vCPU and then the vCPU is kicked.
>>>
>>> For CREATE_ON_INIT:
>>> The guest supplies the GPA of the VMSA to be used for the vCPU with
>>> the specified APIC ID the next time an INIT is performed. The GPA is
>>> saved in the svm struct of the target vCPU.
>>>
>>> For DESTROY:
>>> The guest indicates it wishes to stop the vCPU. The GPA is cleared
>>> from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is
>>> added to vCPU and then the vCPU is kicked.
>>>
>>> The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked
>>> as a result of the event or as a result of an INIT. The handler sets the
>>> vCPU to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will
>>> leave the vCPU as not runnable. Any previous VMSA pages that were
>>> installed as part of an SEV-SNP AP Creation NAE event are un-pinned. If
>>> a new VMSA is to be installed, the VMSA guest page is pinned and set as
>>> the VMSA in the vCPU VMCB and the vCPU state is set to
>>> KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is
>>> cleared in the vCPU VMCB and the vCPU state is left as
>>> KVM_MP_STATE_UNINITIALIZED to prevent it from being run.
>>>
>>> Signed-off-by: Tom Lendacky <[email protected]>
>>> Signed-off-by: Brijesh Singh <[email protected]>
>>> Signed-off-by: Ashish Kalra <[email protected]>
>>> [mdr: add handling for restrictedmem]
>>> Signed-off-by: Michael Roth <[email protected]>
>>
>>
>> What is the intended boot sequence for SEV-SNP guests? FWIW with this
>> interface in place, guests will typically use in-guest VMSA pages to hold
>> secondary vcpu state. But that means we're now allocating 4kb of memory for
>> every vcpu that we create that will be for most of the guest's lifetime
>> superfluous.
>>
>> Wouldn't it make more sense to have a model where we only allocate the VMSA
>> for the boot CPU and leave secondary allocation to the guest? We already
>> need firmware changes for SEV-SNP - may as well make this one more.
>
> I don't think we'd necessarily need a firmware change. We could just
> free the original VMSA back to the hypervisor as soon as those APs come
> online. The down-side to that, versus deferring cleanup till guest
> shutdown, is that there is some flushing activity (see:
> sev_flush_encrypted_page()) that would now likely be occurring during
> guest boot-up where the overhead might be more noticeable. But for SNP
> the host likely supports X86_FEATURE_SME_COHERENT so the overhead
> probably isn't that bad.
Currently, OVMF code will perform a broadcast IPI to start all the APs
because it doesn't know the APIC IDs until they start for the first time.
Until the APIC IDs are known, the guest BSP can't create the VMSAs.
However, a new GHCB event is planned to retrieve the APIC IDs for the
guest. Once that is in place, then you could create just a single VMSA for
the BSP and then allow the guest to create the remainder (the current OVMF
PoC patches to support an SVSM do this). The VMM would have to know that
the hypervisor and the firmware both support that, though. That could be
advertised as part of the GUID table of the firmware (in the case of OVMF)
and as a capability from KVM.
Thanks,
Tom
>
>>
>> [...]
>>
>>> +
>>> +static int sev_snp_ap_creation(struct vcpu_svm *svm)
>>> +{
>>> + struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
>>> + struct kvm_vcpu *vcpu = &svm->vcpu;
>>> + struct kvm_vcpu *target_vcpu;
>>> + struct vcpu_svm *target_svm;
>>> + unsigned int request;
>>> + unsigned int apic_id;
>>> + bool kick;
>>> + int ret;
>>> +
>>> + request = lower_32_bits(svm->vmcb->control.exit_info_1);
>>> + apic_id = upper_32_bits(svm->vmcb->control.exit_info_1);
>>> +
>>> + /* Validate the APIC ID */
>>> + target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id);
>>
>>
>> Out of curiosity: The target CPU can be my own vCPU, right?
>
> I don't think that would be the normal behavior, but maybe with some
> care it's possible for a guest to do things that way. I haven't seen
> anything strictly prohibiting this in the relevant specs.
>
>>
>>
>>> + if (!target_vcpu) {
>>> + vcpu_unimpl(vcpu, "vmgexit: invalid AP APIC ID [%#x] from guest\n",
>>> + apic_id);
>>> + return -EINVAL;
>>> + }
>>> +
>>> + ret = 0;
>>> +
>>> + target_svm = to_svm(target_vcpu);
>>> +
>>> + /*
>>> + * The target vCPU is valid, so the vCPU will be kicked unless the
>>> + * request is for CREATE_ON_INIT. For any errors at this stage, the
>>> + * kick will place the vCPU in an non-runnable state.
>>> + */
>>> + kick = true;
>>> +
>>> + mutex_lock(&target_svm->sev_es.snp_vmsa_mutex);
>>> +
>>> + target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE;
>>> + target_svm->sev_es.snp_ap_create = true;
>>> +
>>> + /* Interrupt injection mode shouldn't change for AP creation */
>>> + if (request < SVM_VMGEXIT_AP_DESTROY) {
>>> + u64 sev_features;
>>> +
>>> + sev_features = vcpu->arch.regs[VCPU_REGS_RAX];
>>> + sev_features ^= sev->sev_features;
>>> + if (sev_features & SVM_SEV_FEAT_INT_INJ_MODES) {
>>> + vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n",
>>> + vcpu->arch.regs[VCPU_REGS_RAX]);
>>> + ret = -EINVAL;
>>> + goto out;
>>> + }
>>> + }
>>> +
>>> + switch (request) {
>>> + case SVM_VMGEXIT_AP_CREATE_ON_INIT:
>>> + kick = false;
>>> + fallthrough;
>>> + case SVM_VMGEXIT_AP_CREATE:
>>> + if (!page_address_valid(vcpu, svm->vmcb->control.exit_info_2)) {
>>> + vcpu_unimpl(vcpu, "vmgexit: invalid AP VMSA address [%#llx] from guest\n",
>>> + svm->vmcb->control.exit_info_2);
>>> + ret = -EINVAL;
>>> + goto out;
>>> + }
>>> +
>>> + /*
>>> + * Malicious guest can RMPADJUST a large page into VMSA which
>>> + * will hit the SNP erratum where the CPU will incorrectly signal
>>> + * an RMP violation #PF if a hugepage collides with the RMP entry
>>> + * of VMSA page, reject the AP CREATE request if VMSA address from
>>> + * guest is 2M aligned.
>>
>>
>> This will break genuine current Linux kernels that just happen to allocate a
>> guest page, no? In fact, given enough vCPUs you're almost guaranteed to hit
>> an aligned structure somewhere. What is the guest supposed to do in that
>> situation?
>
> The initial SNP support for guest kernels already made use of
> snp_alloc_vmsa_page() to do the appropriate workaround to avoid allocating
> 2MB-aligned VMSA pages.
>
> -Mike
>
>>
>>
>>
On Wed, Mar 01, 2023 at 03:37:00PM -0800, Dave Hansen wrote:
> On 2/20/23 10:38, Michael Roth wrote:
> > + /*
> > + * TODO: The RMP entry's hugepage bit is ignored for
> > + * shared/unassigned pages. Either handle looping through each
> > + * sub-page as part of snp_make_page_shared(), or remove the
> > + * level argument.
> > + */
> > + if (op == SNP_PAGE_STATE_PRIVATE && order &&
> > + IS_ALIGNED(gfn, 1 << order) && (gfn + (1 << order)) <= end) {
> > + level = order_to_level(order);
> > + npages = 1 << order;
> > + }
>
> That's a wee bit obtuse.
>
> First of all, I assume that the 'RFC' is because of these TODOs and they
> won't survive to the point when you ask for this to be merged.
Yes, will make sure to have all the TODOs in the tree addressed before
dropping the RFC tag.
>
> BTW, what keeps the restrictedmem_get_page() offset and the gfn aligned?
I don't think anything enforces that currently, but there is a TODO in the
UPM v10 patchset to enforce that:
https://github.com/AMDESE/linux/commit/5c86db7f98701f614c48946b733f2542c962f139#diff-e7514a224c92c2e47224f99919405a37ee7edc4612953135229cfb6e07a680d8R2131
So currently this patch relies on the following:
- the fact that the memslot alignment/sizes for a standard x86 guest's
associated memory regions are 2M-aligned, so when they are bound to a
restrictedmem FD they are naturally packed in at restrictedmem offsets
that are also 2M-aligned. But of course we can't assume userspace will
live up to this assumption and need the above TODO in KVM to enforce
this when registering new memslots.
- that restrictedmem/shmem will ensure that THPs are only allocated for
restrictedmem offsets that are 2M-aligned. I think this enforcement
happens in shmem_alloc_hugefolio().
which both seem to hold in testing. But it's probably a good idea to add
an explicit check for this, at least until KVM implements something to
enforce this earlier in guest life-cycle.
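E.g., something along these lines on top of the current check (sketch only,
where 'pfn' is assumed to be the PFN restrictedmem_get_page() returned for
the start of the range):

  /*
   * Only attempt a 2M RMP entry when the GFN, the backing PFN, and the
   * remaining range all line up on a 2M boundary; otherwise fall back
   * to 4K RMP entries.
   */
  if (op == SNP_PAGE_STATE_PRIVATE && order >= 9 &&
      IS_ALIGNED(gfn, 512) && IS_ALIGNED(pfn, 512) &&
      (gfn + 512) <= end) {
          level = PG_LEVEL_2M;
          npages = 512;
  }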
>
> Let's start with this:
>
> > +static inline u8 order_to_level(int order)
> > +{
> > + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > +
> > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > + return PG_LEVEL_1G;
> > +
> > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > + return PG_LEVEL_2M;
> > +
> > + return PG_LEVEL_4K;
> > +}
>
> Right now, 'order' comes only from restrictedmem_get_page(), which I dug
> out of:
>
> > https://github.com/mdroth/linux/commit/c6792672cd11737fd255dff10b2d5b6bccc626a8
>
> That order is *only* filled in by THPs. That makes the PG_LEVEL_1G
> stuff here kinda silly. I guess it might be seen as thorough, but it's
> dead code. I'd probably just make this work on order==9 || order==0 and
> warn on anything else.
Ok, makes sense.
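i.e. something along these lines (sketch):

  static inline u8 order_to_level(int order)
  {
          /* Only 4K (order 0) and 2M THP (order 9) are expected here. */
          WARN_ON_ONCE(order && order != 9);

          return order == 9 ? PG_LEVEL_2M : PG_LEVEL_4K;
  }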
>
> I'd also highly recommend some comments about how racy this all is. I
> guess it probably works, but it would be good to add some comments about
> page splits and collapsing.
Collapsing while in this code path should be ok since the 4K sub-pages will
just end up getting mapped as 4K in the RMP table. KVM MMU will then map
them into the nested page table as 4K as well and we'll get non-optimal
performance, but things should otherwise work.
Splitting is a similar story: if we map as 2M in the RMP table, and the
page gets split afterward, then KVM MMU at fault time would map
the pages in the NPT as 4K, and when the guest attempts to access
private pages of this sort they'll generate a nested page fault with
PFERR_GUEST_RMP_BIT and PFERR_GUEST_SIZEM_BIT set, and the code in
handle_rmp_page_fault() will issue a PSMASH instruction to split the
2M RMP entry into 512 4K RMP entries.
Will add some comments around this.
>
> It's also not obvious why this only cares about private pages.
Mainly because the shared memory that actually gets mapped into the guest
is always shared in the RMP table. It is normal VMA memory that is not
allocated by UPM/restrictedmem. We will never attempt to make them
private, so there is never a need to bother with switching them back to
shared.
So we only need to handle RMP updates for the UPM/restrictedmem PFNs.
Obviously for shared->private conversion before mapping it into the
guest, but also for private->shared conversion since we will still
get RMP check failures if we try to leave the PFNs as private in the
RMP table and map the above-mentioned VMA memory into the guest
instead.
Will add some more comments around this.
>
> Anyway, this is the exact kind of thing where I really like a
> well-commented helper:
>
> bool can_install_large_rmp_entry(gfn, order)
> {
> // small pages, blah blah
> if (!order)
> return false;
>
> // The region being updated must be aligned
> if (!IS_ALIGNED(gfn, 1 << order))
> return false;
> // ... and fit
> if ((gfn + (1 << order)) > end)
> return false;
>
> return true;
> }
>
> Which gets used like this:
>
> if (op == SNP_PAGE_STATE_PRIVATE &&
> can_install_large_rmp_entry(gfn, order)) {
> level = ...
> }
Makes sense, will implement something along these lines.
Thanks!
-Mike
On 3/28/23 16:31, Michael Roth wrote:
> However...
>
> The fact that any pages potentially triggering these #PFs are able to be
> mapped as 2M in the first place means that all the PFNs covered by that
> 2M mapping must also have been allocated via mappable/VMA memory rather
> than via restricted memfd where userspace mappings are not possible.
>
> So I think we should be able to drop this patch entirely, as well as
> allow the use of HugeTLBFS for non-restricted memfd memory (though
> eventually the guest will switch all its memory to private/restricted
> so not gaining much there other than reducing management complexity).
This is sounding a bit voodoo-ish to me.
It sounds like this whole series is predicated on having its memory supplied
via one very specific ABI with very specific behavior.
That connection and the associated contract aren't spelled out very
clearly in this series. I'm sure it works on your machine and is clear
to _you_ but I'm worried that nobody else is going to be able to figure
out the voodoo.
Could we make sure that this stuff is made very clear in the
Documentation and cover letter, please?
On 3/30/23 00:59, Michael Roth wrote:
> On Fri, Mar 03, 2023 at 04:28:39PM +0100, Vlastimil Babka wrote:
>> On 2/20/23 19:38, Michael Roth wrote:
>> > From: Brijesh Singh <[email protected]>
>> >
>> > The snp_lookup_page_in_rmptable() can be used by the host to read the RMP
>> > entry for a given page. The RMP entry format is documented in AMD PPR, see
>> > https://bugzilla.kernel.org/attachment.cgi?id=296015.
>> >
>> > Co-developed-by: Ashish Kalra <[email protected]>
>> > Signed-off-by: Ashish Kalra <[email protected]>
>> > Signed-off-by: Brijesh Singh <[email protected]>
>> > Signed-off-by: Michael Roth <[email protected]>
>> > ---
>>
>> > +/*
>> > + * Return 1 if the RMP entry is assigned, 0 if it exists but is not assigned,
>> > + * and -errno if there is no corresponding RMP entry.
>> > + */
>>
>> Hmm IMHO the kernel's idiomatic way is to return 0 on "success" and I'd
>> assume the more intuitive expectation of success here if the entry is
>> assigned?
>
> In general I'd agree. Here it's a little awkward though.
> snp_lookup_rmpentry() sort of wants to be a bool, where true indicates
> an assigned entry was found, false indicates no assigned entry.
>
> But it also has to deal with error values, so the most direct way to
> encapsulate that is true == 1, false == 0, and < 0 for errors.
>
> Inverting it to align more with kernel expectations of 0 == success/true
> gets awkward too, because stuff like:
>
> if (snp_lookup_rmpentry(...))
> //error
>
> still doesn't work the way most other functions written in this way
> would since it could still be "successful" if we were expecting PFN to
> be in shared state. So the return value needs special handling there
> too.
>
> Would it make sense to define it something like this?:
>
> /*
> * Query information about the RMP entry corresponding to the given
> * PFN.
> *
> * Returns 0 on success, and -errno if there was a problem accessing
> * the RMP entry.
> */
> int snp_lookup_rmpentry(u64 pfn, int *level, bool *assigned)
Yeah that looks fine to me. Hope you find it makes things easier to work
with in the callers too.
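E.g., a typical caller could then read as (sketch only):

  bool assigned;
  int level, ret;

  ret = snp_lookup_rmpentry(pfn, &level, &assigned);
  if (ret)
          return ret;     /* problem accessing the RMP entry */

  if (!assigned) {
          /* PFN is shared; handle accordingly */
  }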
>
>> The various callers seem to differ though so I guess it depends on
>> context. Some however don't distinguish their "failure" from an ERR and
>> maybe they should, at least for the purposes of the various printks?
>
> Yes, regardless of what we decide above, the call-sites should properly
> distinguish between failure/assigned/not-assigned and report the
> information accordingly. I'll get those fixed up where needed.
Great, thanks!
> Thanks,
>
> -Mike
>
>>
>> > +int snp_lookup_rmpentry(u64 pfn, int *level)
>> > +{
>> > + struct rmpentry *e;
>> > +
>> > + e = __snp_lookup_rmpentry(pfn, level);
>> > + if (IS_ERR(e))
>> > + return PTR_ERR(e);
>> > +
>> > + return !!rmpentry_assigned(e);
>> > +}
>> > +EXPORT_SYMBOL_GPL(snp_lookup_rmpentry);
>>
On 20.02.2023 20:38, Michael Roth wrote:
> From: Brijesh Singh <[email protected]>
>
>
> +static int snp_decommission_context(struct kvm *kvm)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_addr data = {};
> + int ret;
> +
> + /* If context is not created then do nothing */
> + if (!sev->snp_context)
> + return 0;
> +
> + data.gctx_paddr = __sme_pa(sev->snp_context);
> + ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
> + if (WARN_ONCE(ret, "failed to release guest context"))
> + return ret;
> +
> + /* free the context page now */
> + snp_free_firmware_page(sev->snp_context);
> + sev->snp_context = NULL;
> +
> + return 0;
> +}
> +
Even though it's not documented, SNP_DECOMMISSION seems to clear the
WBINVD indicator just like DEACTIVATE does for SEV.
Won't ASID recycling race with SNP_DECOMMISSION if the latter isn't
guarded with sev_deactivate_lock?
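E.g., guarding it the same way sev_unbind_asid() guards DEACTIVATE (sketch):

  down_read(&sev_deactivate_lock);
  ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
  up_read(&sev_deactivate_lock);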
> void sev_vm_destroy(struct kvm *kvm)
> {
> struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> @@ -2333,7 +2440,15 @@ void sev_vm_destroy(struct kvm *kvm)
> }
> }
>
> - sev_unbind_asid(kvm, sev->handle);
> + if (sev_snp_guest(kvm)) {
> + if (snp_decommission_context(kvm)) {
> + WARN_ONCE(1, "Failed to free SNP guest context, leaking asid!\n");
> + return;
> + }
> + } else {
> + sev_unbind_asid(kvm, sev->handle);
> + }
> +
> sev_asid_free(sev);
> }
>
We discovered that the kdump crashkernel would not work with our SEV-SNP configuration.
After reading the device table from the previous kernel we would
see a lot of
AMD-Vi: Completion-Wait loop timed out
errors and finally crash:
Kernel panic - not syncing: timer doesn't work through Interrupt-remapped IO-APIC
We found that disabling SNP in the outgoing (crashing) kernel
would enable the crashkernel to take over the IOMMU config and
boot from there.
We opened a PR over on github against the rfc-v9 branch to discuss
the issue:
https://github.com/AMDESE/linux/pull/5
Cheers Johanna