2022-06-20 23:01:56

by Ashish Kalra

Subject: [PATCH Part2 v6 00/49] Add AMD Secure Nested Paging (SEV-SNP)

From: Ashish Kalra <[email protected]>

This part of the Secure Nested Paging (SEV-SNP) series focuses on the
changes required in a host OS for SEV-SNP support. The series builds upon
the SEV-SNP guest support that is now part of mainline.

This series provides the basic building blocks to support booting SEV-SNP
VMs; it does not cover all of the security enhancements introduced by
SEV-SNP, such as interrupt protection.

The CCP driver is enhanced to provide new APIs that use the SEV-SNP
specific commands defined in the SEV-SNP firmware specification. The KVM
driver uses those APIs to create and manage the SEV-SNP guests.

The GHCB specification version 2 introduces a new set of NAE events that
are used by the SEV-SNP guest to communicate with the hypervisor. The
series provides support to handle the following new NAE events (a rough
dispatch sketch follows the list):
- Register GHCB GPA
- Page State Change Request
- Hypervisor feature
- Guest message request
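
For orientation, a hypervisor-side VMGEXIT handler would dispatch these
events roughly as sketched below. The exit-code macro names follow the
GHCB v2 definitions already used by the guest-side code; the handler
names are purely illustrative. GHCB GPA registration is carried over the
GHCB MSR protocol rather than as a separate NAE exit code.

  switch (exit_code) {
  case SVM_VMGEXIT_PSC:                           /* Page State Change request */
          ret = snp_handle_psc(svm);              /* illustrative */
          break;
  case SVM_VMGEXIT_GUEST_REQUEST:                 /* Guest message request */
          ret = snp_handle_guest_request(svm);    /* illustrative */
          break;
  case SVM_VMGEXIT_HV_FEATURES:                   /* Hypervisor feature query */
          ret = snp_handle_hv_features(svm);      /* illustrative */
          break;
  }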

The RMP check is enforced as soon as SEV-SNP is enabled. Not every memory
access requires an RMP check. In particular, read accesses from the
hypervisor do not require RMP checks because data confidentiality is
already protected via memory encryption. When hardware encounters an RMP
check failure, it raises a page-fault exception. If the RMP check failure
is due to a page-size mismatch, the large page is split to resolve the
fault.
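
At a high level, the host #PF handler added later in the series resolves
an RMP fault roughly as in the sketch below (simplified; send_sigbus()
and split_host_mapping() are illustrative stand-ins for the actual
helpers used in patch 09):

  /* Simplified sketch of resolving a host-side RMP #PF (see patch 09/49). */
  pte = lookup_address_in_mm(current->mm, address, &level);
  if (!pte || !pte_present(*pte))
          return;                                 /* raced with an unmap; just retry */

  assigned = snp_lookup_rmpentry(pte_pfn(*pte), &rmp_level);
  if (assigned == 1) {
          /* A write to a guest-private page cannot be resolved by the host. */
          send_sigbus(current);                   /* illustrative helper name */
          return;
  }

  if (assigned == 0 && level > rmp_level)
          split_host_mapping(address);            /* illustrative: split 2M -> 4K */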

The series does not provide support for interrupt security or migration;
those features will be added after the base support.

Please note that some areas, such as how private guest pages are
managed/pinned/protected, are likely to change once Unmapped Private Memory
support is further along in development/design and can be incorporated
into this series. We are posting these patches without UPM support for now
to hopefully get some review on other aspects of the series in the meantime.

Here is a link to latest UPM v6 patches:
https://lore.kernel.org/linux-mm/[email protected]/

A branch containing these patches is available here:
https://github.com/AMDESE/linux/tree/sev-snp-5.18-rc3-v3

Changes since v5:
* Rebase to 5.18.0-rc3. These patches are posted for review, so they
are based on the 5.18.0-rc3 linux-next release, which included the
SNP guest patches that were not yet in mainline.
* Using kvm_write_guest() to sync the GHCB scratch buffer can fail
due to the host mapping being 2M but the RMP entry being 4K. The page
fault handling in do_user_addr_fault() fails to split the 2M page to
handle the RMP fault because it is called in a non-preemptible context.
Instead, use the already kernel-mapped GHCB to sync the scratch buffer
when the scratch buffer is contained within the GHCB.
* Warn and retry failed RMPUPDATEs.
* Fix for stale per-CPU pointer due to cond_resched() during
GHCB mapping.
* Multiple fixes for SEV-SNP AP Creation.
* Remove the SRCU used to synchronize the PSC and gfn mapping,
replacing it with a spinlock.
* Remove the generic post_{map,unmap}_gfn ops; these need to be
revisited later with respect to UPM support.
* Fix kvm_mmu_get_tdp_walk() to handle "suspicious RCU usage"
warning.
* Fix sev_snp_init() to do WBINVD/DF_FLUSH command after SNP_INIT
command has been issued.
* Fix sev_free_vcpu() to flush the VMSA page after it is transitioned
back to hypervisor state and restored in the kernel direct map.

Changes since v4:
* Move the RMP entry definition to x86 specific header file.
* Move the dump RMP entry function to SEV specific file.
* Use BIT_ULL while defining the #PF bit fields.
* Add helper function to check the IOMMU support for SEV-SNP feature.
* Add helper functions for the page state transition.
* Map and unmap the pages from the direct map when a page is added to
or removed from the RMP table.
* Enforce the minimum SEV-SNP firmware version.
* Extend the LAUNCH_UPDATE to accept the base_gfn and remove the
logic to calculate the gfn from the hva.
* Add a check in LAUNCH_UPDATE to ensure that all the pages are
shared before calling the PSP.
* Mark the memory failure when failing to remove the page from the
RMP table or clearing the immutable bit.
* Exclude the encrypted hva range from the KSM.
* Remove the gfn tracking during the kvm_gfn_map() and use SRCU to
synchronize the PSC and gfn mapping.
* Allow PSC on the registered hva range only.
* Add support for the Preferred GPA VMGEXIT.
* Simplify the PSC handling routines.
* Use the static_call() for the newly added kvm_x86_ops.
* Remove the long-lived GHCB map.
* Move the snp enable module parameter to the end of the file.
* Remove the kvm_x86_op for the RMP fault handling. Call the
fault handler directly from the #NPF interception.

Changes since v3:
* Add support for extended guest message request.
* Add ioctl to query the SNP Platform status.
* Add ioctl to get and set the SNP config.
* Add check to verify that memory reserved for the RMP covers the full system RAM.
* Start the SNP specific commands from 256 instead of 255.
* Multiple cleanup and fixes based on the review feedback.

Changes since v2:
* Add AP creation support.
* Drop the patch to handle the RMP fault for the kernel address.
* Add functions to track the write access from the hypervisor.
* Do not enable the SNP feature when IOMMU is disabled or is in passthrough mode.
* Dump the RMP entry on RMP violation for the debug.
* Shorten the GHCB macro names.
* Start the SNP_INIT command id from 255 to give some gap for the legacy SEV.
* Sync the header with the latest 0.9 SNP spec.

Changes since v1:
* Add AP reset MSR protocol VMGEXIT NAE.
* Add Hypervisor features VMGEXIT NAE.
* Move the RMP table initialization and RMPUPDATE/PSMASH helper in
arch/x86/kernel/sev.c.
* Add support to map/unmap SEV legacy command buffer to firmware state when
SNP is active.
* Enhance PSP driver to provide helper to allocate/free memory used for the
firmware context page.
* Add support to handle RMP fault for the kernel address.
* Add support to handle GUEST_REQUEST NAE event for attestation.
* Rename RMP table lookup helper.
* Drop typedef from rmpentry struct definition.
* Drop SNP static key and use cpu_feature_enabled() to check whether SEV-SNP
is active.
* Multiple cleanup/fixes to address Boris review feedback.


Ashish Kalra (1):
KVM: SVM: Sync the GHCB scratch buffer using already mapped ghcb

Brijesh Singh (42):
x86/cpufeatures: Add SEV-SNP CPU feature
iommu/amd: Introduce function to check SEV-SNP support
x86/sev: Add the host SEV-SNP initialization support
x86/sev: set SYSCFG.MFDM
x86/sev: Add RMP entry lookup helpers
x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction
x86/sev: Invalidate pages from direct map when adding them to RMP table
x86/traps: Define RMP violation #PF error code
x86/fault: Add support to handle the RMP fault for user address
x86/fault: Add support to dump RMP entry on fault
crypto:ccp: Define the SEV-SNP commands
crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP
crypto:ccp: Provide APIs to issue SEV-SNP commands
crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
crypto: ccp: Handle the legacy SEV command when SNP is enabled
crypto: ccp: Add the SNP_PLATFORM_STATUS command
crypto: ccp: Add the SNP_{SET,GET}_EXT_CONFIG command
crypto: ccp: Provide APIs to query extended attestation report
KVM: SVM: Provide the Hypervisor Feature support VMGEXIT
KVM: SVM: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
KVM: SVM: Add initial SEV-SNP support
KVM: SVM: Add KVM_SNP_INIT command
KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command
KVM: SVM: Disallow registering memory range from HugeTLB for SNP guest
KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command
KVM: SVM: Mark the private vma unmergeable for SEV-SNP guests
KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command
KVM: X86: Keep the NPT and RMP page level in sync
KVM: x86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use
KVM: x86: Define RMP page fault error bits for #NPF
KVM: x86: Update page-fault trace to log full 64-bit error code
KVM: SVM: Do not use long-lived GHCB map while setting scratch area
KVM: SVM: Remove the long-lived GHCB host map
KVM: SVM: Add support to handle GHCB GPA register VMGEXIT
KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT
KVM: SVM: Add support to handle Page State Change VMGEXIT
KVM: SVM: Introduce ops for the post gfn map and unmap
KVM: x86: Export the kvm_zap_gfn_range() for the SNP use
KVM: SVM: Add support to handle the RMP nested page fault
KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event
KVM: SVM: Add module parameter to enable the SEV-SNP
ccp: add support to decrypt the page

Michael Roth (2):
*fix for stale per-cpu pointer due to cond_resched during ghcb
mapping
*debug: warn and retry failed rmpupdates

Sean Christopherson (1):
KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX and SNP

Tom Lendacky (3):
KVM: SVM: Add support to handle AP reset MSR protocol
KVM: SVM: Use a VMSA physical address variable for populating VMCB
KVM: SVM: Support SEV-SNP AP Creation NAE event

Documentation/virt/coco/sevguest.rst | 54 +
.../virt/kvm/x86/amd-memory-encryption.rst | 102 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/kvm-x86-ops.h | 2 +
arch/x86/include/asm/kvm_host.h | 15 +
arch/x86/include/asm/msr-index.h | 9 +
arch/x86/include/asm/sev-common.h | 28 +
arch/x86/include/asm/sev.h | 45 +
arch/x86/include/asm/svm.h | 6 +
arch/x86/include/asm/trap_pf.h | 18 +-
arch/x86/kernel/cpu/amd.c | 3 +-
arch/x86/kernel/sev.c | 400 ++++
arch/x86/kvm/lapic.c | 5 +-
arch/x86/kvm/mmu.h | 7 +-
arch/x86/kvm/mmu/mmu.c | 90 +
arch/x86/kvm/svm/sev.c | 1703 ++++++++++++++++-
arch/x86/kvm/svm/svm.c | 62 +-
arch/x86/kvm/svm/svm.h | 75 +-
arch/x86/kvm/trace.h | 40 +-
arch/x86/kvm/x86.c | 10 +-
arch/x86/mm/fault.c | 84 +-
drivers/crypto/ccp/sev-dev.c | 908 ++++++++-
drivers/crypto/ccp/sev-dev.h | 17 +
drivers/iommu/amd/init.c | 30 +
include/linux/iommu.h | 9 +
include/linux/mm.h | 3 +-
include/linux/mm_types.h | 3 +
include/linux/psp-sev.h | 346 ++++
include/linux/sev.h | 32 +
include/uapi/linux/kvm.h | 56 +
include/uapi/linux/psp-sev.h | 60 +
mm/memory.c | 13 +
tools/arch/x86/include/asm/cpufeatures.h | 1 +
34 files changed, 4090 insertions(+), 155 deletions(-)
create mode 100644 include/linux/sev.h

--
2.25.1


2022-06-20 23:03:18

by Ashish Kalra

Subject: [PATCH Part2 v6 03/49] x86/sev: Add the host SEV-SNP initialization support

From: Brijesh Singh <[email protected]>

The memory integrity guarantees of SEV-SNP are enforced through a new
structure called the Reverse Map Table (RMP). The RMP is a single data
structure shared across the system that contains one entry for every 4K
page of DRAM that may be used by SEV-SNP VMs. The goal of the RMP is to
track the owner of each page of memory. Pages of memory can be owned by
the hypervisor, owned by a specific VM, or owned by the AMD-SP. See APM2
section 15.36.3 for more detail on the RMP.

The RMP table is used to enforce access control to memory. The table itself
is not directly writable by the software. New CPU instructions (RMPUPDATE,
PVALIDATE, RMPADJUST) are used to manipulate the RMP entries.

Based on the platform configuration, the BIOS reserves the memory used
for the RMP table. The start and end address of the RMP table must be
queried by reading the RMP_BASE and RMP_END MSRs. If RMP_BASE and
RMP_END are not set, the SEV-SNP feature is disabled.
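A minimal sketch of that BIOS-reservation check, condensed from the
get_rmptable_info() and snp_rmptable_init() helpers added in the diff
below (shown here only to illustrate the MSR usage):

  u64 rmp_base, rmp_end;

  rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
  rdmsrl(MSR_AMD64_RMP_END, rmp_end);

  if (!rmp_base || !rmp_end) {
          /* BIOS did not reserve RMP memory: keep SEV-SNP disabled. */
          setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
          return;
  }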

The SEV-SNP feature is enabled only after the RMP table is successfully
initialized.

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/msr-index.h | 6 +
arch/x86/kernel/sev.c | 144 +++++++++++++++++++++++
3 files changed, 157 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 36369e76cc63..c1be3091a383 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -68,6 +68,12 @@
# define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
#endif

+#ifdef CONFIG_AMD_MEM_ENCRYPT
+# define DISABLE_SEV_SNP 0
+#else
+# define DISABLE_SEV_SNP (1 << (X86_FEATURE_SEV_SNP & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -91,7 +97,7 @@
DISABLE_ENQCMD)
#define DISABLED_MASK17 0
#define DISABLED_MASK18 0
-#define DISABLED_MASK19 0
+#define DISABLED_MASK19 (DISABLE_SEV_SNP)
#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 20)

#endif /* _ASM_X86_DISABLED_FEATURES_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 9e2e7185fc1d..57a8280e283a 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -507,6 +507,8 @@
#define MSR_AMD64_SEV_ENABLED BIT_ULL(MSR_AMD64_SEV_ENABLED_BIT)
#define MSR_AMD64_SEV_ES_ENABLED BIT_ULL(MSR_AMD64_SEV_ES_ENABLED_BIT)
#define MSR_AMD64_SEV_SNP_ENABLED BIT_ULL(MSR_AMD64_SEV_SNP_ENABLED_BIT)
+#define MSR_AMD64_RMP_BASE 0xc0010132
+#define MSR_AMD64_RMP_END 0xc0010133

#define MSR_AMD64_VIRT_SPEC_CTRL 0xc001011f

@@ -581,6 +583,10 @@
#define MSR_AMD64_SYSCFG 0xc0010010
#define MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT 23
#define MSR_AMD64_SYSCFG_MEM_ENCRYPT BIT_ULL(MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT)
+#define MSR_AMD64_SYSCFG_SNP_EN_BIT 24
+#define MSR_AMD64_SYSCFG_SNP_EN BIT_ULL(MSR_AMD64_SYSCFG_SNP_EN_BIT)
+#define MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT 25
+#define MSR_AMD64_SYSCFG_SNP_VMPL_EN BIT_ULL(MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT)
#define MSR_K8_INT_PENDING_MSG 0xc0010055
/* C1E active bits in int pending message */
#define K8_INTP_C1E_ACTIVE_MASK 0x18000000
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index f01f4550e2c6..3a233b5d47c5 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -22,6 +22,8 @@
#include <linux/efi.h>
#include <linux/platform_device.h>
#include <linux/io.h>
+#include <linux/cpumask.h>
+#include <linux/iommu.h>

#include <asm/cpu_entry_area.h>
#include <asm/stacktrace.h>
@@ -38,6 +40,7 @@
#include <asm/apic.h>
#include <asm/cpuid.h>
#include <asm/cmdline.h>
+#include <asm/iommu.h>

#define DR7_RESET_VALUE 0x400

@@ -57,6 +60,12 @@
#define AP_INIT_CR0_DEFAULT 0x60000010
#define AP_INIT_MXCSR_DEFAULT 0x1f80

+/*
+ * The first 16KB from the RMP_BASE is used by the processor for the
+ * bookkeeping, the range need to be added during the RMP entry lookup.
+ */
+#define RMPTABLE_CPU_BOOKKEEPING_SZ 0x4000
+
/* For early boot hypervisor communication in SEV-ES enabled guests */
static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);

@@ -69,6 +78,10 @@ static struct ghcb *boot_ghcb __section(".data");
/* Bitmap of SEV features supported by the hypervisor */
static u64 sev_hv_features __ro_after_init;

+static unsigned long rmptable_start __ro_after_init;
+static unsigned long rmptable_end __ro_after_init;
+
+
/* #VC handler runtime per-CPU data */
struct sev_es_runtime_data {
struct ghcb ghcb_page;
@@ -2218,3 +2231,134 @@ static int __init snp_init_platform_device(void)
return 0;
}
device_initcall(snp_init_platform_device);
+
+#undef pr_fmt
+#define pr_fmt(fmt) "SEV-SNP: " fmt
+
+static int __snp_enable(unsigned int cpu)
+{
+ u64 val;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return 0;
+
+ rdmsrl(MSR_AMD64_SYSCFG, val);
+
+ val |= MSR_AMD64_SYSCFG_SNP_EN;
+ val |= MSR_AMD64_SYSCFG_SNP_VMPL_EN;
+
+ wrmsrl(MSR_AMD64_SYSCFG, val);
+
+ return 0;
+}
+
+static __init void snp_enable(void *arg)
+{
+ __snp_enable(smp_processor_id());
+}
+
+static bool get_rmptable_info(u64 *start, u64 *len)
+{
+ u64 calc_rmp_sz, rmp_sz, rmp_base, rmp_end, nr_pages;
+
+ rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
+ rdmsrl(MSR_AMD64_RMP_END, rmp_end);
+
+ if (!rmp_base || !rmp_end) {
+ pr_info("Memory for the RMP table has not been reserved by BIOS\n");
+ return false;
+ }
+
+ rmp_sz = rmp_end - rmp_base + 1;
+
+ /*
+ * Calculate the amount of memory that must be reserved by the BIOS to
+ * address the full system RAM. The reserved memory should also cover the
+ * RMP table itself.
+ *
+ * See PPR Family 19h Model 01h, Revision B1 section 2.1.4.2 for more
+ * information on memory requirement.
+ */
+ nr_pages = totalram_pages();
+ calc_rmp_sz = (((rmp_sz >> PAGE_SHIFT) + nr_pages) << 4) + RMPTABLE_CPU_BOOKKEEPING_SZ;
+
+ if (calc_rmp_sz > rmp_sz) {
+ pr_info("Memory reserved for the RMP table does not cover full system RAM (expected 0x%llx got 0x%llx)\n",
+ calc_rmp_sz, rmp_sz);
+ return false;
+ }
+
+ *start = rmp_base;
+ *len = rmp_sz;
+
+ pr_info("RMP table physical address 0x%016llx - 0x%016llx\n", rmp_base, rmp_end);
+
+ return true;
+}
+
+static __init int __snp_rmptable_init(void)
+{
+ u64 rmp_base, sz;
+ void *start;
+ u64 val;
+
+ if (!get_rmptable_info(&rmp_base, &sz))
+ return 1;
+
+ start = memremap(rmp_base, sz, MEMREMAP_WB);
+ if (!start) {
+ pr_err("Failed to map RMP table 0x%llx+0x%llx\n", rmp_base, sz);
+ return 1;
+ }
+
+ /*
+ * Check if SEV-SNP is already enabled, this can happen if we are coming from
+ * kexec boot.
+ */
+ rdmsrl(MSR_AMD64_SYSCFG, val);
+ if (val & MSR_AMD64_SYSCFG_SNP_EN)
+ goto skip_enable;
+
+ /* Initialize the RMP table to zero */
+ memset(start, 0, sz);
+
+ /* Flush the caches to ensure that data is written before SNP is enabled. */
+ wbinvd_on_all_cpus();
+
+ /* Enable SNP on all CPUs. */
+ on_each_cpu(snp_enable, NULL, 1);
+
+skip_enable:
+ rmptable_start = (unsigned long)start;
+ rmptable_end = rmptable_start + sz;
+
+ return 0;
+}
+
+static int __init snp_rmptable_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
+ return 0;
+
+ if (!iommu_sev_snp_supported())
+ goto nosnp;
+
+ if (__snp_rmptable_init())
+ goto nosnp;
+
+ cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/rmptable_init:online", __snp_enable, NULL);
+
+ return 0;
+
+nosnp:
+ setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
+ return 1;
+}
+
+/*
+ * This must be called after the PCI subsystem. This is because before enabling
+ * the SNP feature we need to ensure that IOMMU supports the SEV-SNP feature.
+ * The iommu_sev_snp_supported() is used for checking the feature, and it is
+ * available after subsys_initcall().
+ */
+fs_initcall(snp_rmptable_init);
--
2.25.1

2022-06-20 23:03:21

by Ashish Kalra

Subject: [PATCH Part2 v6 04/49] x86/sev: set SYSCFG.MFDM

From: Brijesh Singh <[email protected]>

SEV-SNP firmware >= 1.51 requires that SYSCFG.MFDM be set.

Subsequent CCP patches will require 1.51 as the minimum SEV-SNP
firmware version.

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/msr-index.h | 3 +++
arch/x86/kernel/sev.c | 24 ++++++++++++++++++++++++
2 files changed, 27 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 57a8280e283a..1e36f16daa56 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -587,6 +587,9 @@
#define MSR_AMD64_SYSCFG_SNP_EN BIT_ULL(MSR_AMD64_SYSCFG_SNP_EN_BIT)
#define MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT 25
#define MSR_AMD64_SYSCFG_SNP_VMPL_EN BIT_ULL(MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT)
+#define MSR_AMD64_SYSCFG_MFDM_BIT 19
+#define MSR_AMD64_SYSCFG_MFDM BIT_ULL(MSR_AMD64_SYSCFG_MFDM_BIT)
+
#define MSR_K8_INT_PENDING_MSG 0xc0010055
/* C1E active bits in int pending message */
#define K8_INTP_C1E_ACTIVE_MASK 0x18000000
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 3a233b5d47c5..25c7feb367f6 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -2257,6 +2257,27 @@ static __init void snp_enable(void *arg)
__snp_enable(smp_processor_id());
}

+static int __mfdm_enable(unsigned int cpu)
+{
+ u64 val;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return 0;
+
+ rdmsrl(MSR_AMD64_SYSCFG, val);
+
+ val |= MSR_AMD64_SYSCFG_MFDM;
+
+ wrmsrl(MSR_AMD64_SYSCFG, val);
+
+ return 0;
+}
+
+static __init void mfdm_enable(void *arg)
+{
+ __mfdm_enable(smp_processor_id());
+}
+
static bool get_rmptable_info(u64 *start, u64 *len)
{
u64 calc_rmp_sz, rmp_sz, rmp_base, rmp_end, nr_pages;
@@ -2325,6 +2346,9 @@ static __init int __snp_rmptable_init(void)
/* Flush the caches to ensure that data is written before SNP is enabled. */
wbinvd_on_all_cpus();

+ /* MFDM must be enabled on all the CPUs prior to enabling SNP. */
+ on_each_cpu(mfdm_enable, NULL, 1);
+
/* Enable SNP on all CPUs. */
on_each_cpu(snp_enable, NULL, 1);

--
2.25.1

2022-06-20 23:04:09

by Ashish Kalra

Subject: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

From: Brijesh Singh <[email protected]>

The snp_lookup_rmpentry() helper can be used by the host to read the RMP
entry for a given page. The RMP entry format is documented in the AMD PPR,
see https://bugzilla.kernel.org/attachment.cgi?id=296015.
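
As a usage sketch of the interface added below: a caller that wants to
know whether a pfn is guest-owned, and at what RMP page size, could do
roughly the following.

  int level, assigned;

  assigned = snp_lookup_rmpentry(pfn, &level);
  if (assigned < 0)
          return assigned;        /* no RMP entry / SNP not enabled */

  if (assigned)                   /* page is assigned (guest or AMD-SP owned) */
          pr_debug("pfn 0x%llx is private, RMP level %d\n", pfn, level);
  else
          pr_debug("pfn 0x%llx is shared/hypervisor-owned\n", pfn);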

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/sev.h | 27 ++++++++++++++++++++++++
arch/x86/kernel/sev.c | 43 ++++++++++++++++++++++++++++++++++++++
include/linux/sev.h | 30 ++++++++++++++++++++++++++
3 files changed, 100 insertions(+)
create mode 100644 include/linux/sev.h

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 9c2d33f1cfee..cb16f0e5b585 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -9,6 +9,7 @@
#define __ASM_ENCRYPTED_STATE_H

#include <linux/types.h>
+#include <linux/sev.h>
#include <asm/insn.h>
#include <asm/sev-common.h>
#include <asm/bootparam.h>
@@ -84,6 +85,32 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);

/* RMP page size */
#define RMP_PG_SIZE_4K 0
+#define RMP_TO_X86_PG_LEVEL(level) (((level) == RMP_PG_SIZE_4K) ? PG_LEVEL_4K : PG_LEVEL_2M)
+
+/*
+ * The RMP entry format is not architectural. The format is defined in PPR
+ * Family 19h Model 01h, Rev B1 processor.
+ */
+struct __packed rmpentry {
+ union {
+ struct {
+ u64 assigned : 1,
+ pagesize : 1,
+ immutable : 1,
+ rsvd1 : 9,
+ gpa : 39,
+ asid : 10,
+ vmsa : 1,
+ validated : 1,
+ rsvd2 : 1;
+ } info;
+ u64 low;
+ };
+ u64 high;
+};
+
+#define rmpentry_assigned(x) ((x)->info.assigned)
+#define rmpentry_pagesize(x) ((x)->info.pagesize)

#define RMPADJUST_VMSA_PAGE_BIT BIT(16)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 25c7feb367f6..59e7ec6b0326 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -65,6 +65,8 @@
* bookkeeping, the range need to be added during the RMP entry lookup.
*/
#define RMPTABLE_CPU_BOOKKEEPING_SZ 0x4000
+#define RMPENTRY_SHIFT 8
+#define rmptable_page_offset(x) (RMPTABLE_CPU_BOOKKEEPING_SZ + (((unsigned long)x) >> RMPENTRY_SHIFT))

/* For early boot hypervisor communication in SEV-ES enabled guests */
static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
@@ -2386,3 +2388,44 @@ static int __init snp_rmptable_init(void)
* available after subsys_initcall().
*/
fs_initcall(snp_rmptable_init);
+
+static struct rmpentry *__snp_lookup_rmpentry(u64 pfn, int *level)
+{
+ unsigned long vaddr, paddr = pfn << PAGE_SHIFT;
+ struct rmpentry *entry, *large_entry;
+
+ if (!pfn_valid(pfn))
+ return ERR_PTR(-EINVAL);
+
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return ERR_PTR(-ENXIO);
+
+ vaddr = rmptable_start + rmptable_page_offset(paddr);
+ if (unlikely(vaddr > rmptable_end))
+ return ERR_PTR(-ENXIO);
+
+ entry = (struct rmpentry *)vaddr;
+
+ /* Read a large RMP entry to get the correct page level used in RMP entry. */
+ vaddr = rmptable_start + rmptable_page_offset(paddr & PMD_MASK);
+ large_entry = (struct rmpentry *)vaddr;
+ *level = RMP_TO_X86_PG_LEVEL(rmpentry_pagesize(large_entry));
+
+ return entry;
+}
+
+/*
+ * Return 1 if the RMP entry is assigned, 0 if it exists but is not assigned,
+ * and -errno if there is no corresponding RMP entry.
+ */
+int snp_lookup_rmpentry(u64 pfn, int *level)
+{
+ struct rmpentry *e;
+
+ e = __snp_lookup_rmpentry(pfn, level);
+ if (IS_ERR(e))
+ return PTR_ERR(e);
+
+ return !!rmpentry_assigned(e);
+}
+EXPORT_SYMBOL_GPL(snp_lookup_rmpentry);
diff --git a/include/linux/sev.h b/include/linux/sev.h
new file mode 100644
index 000000000000..1a68842789e1
--- /dev/null
+++ b/include/linux/sev.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD Secure Encrypted Virtualization
+ *
+ * Author: Brijesh Singh <[email protected]>
+ */
+
+#ifndef __LINUX_SEV_H
+#define __LINUX_SEV_H
+
+/* RMUPDATE detected 4K page and 2MB page overlap. */
+#define RMPUPDATE_FAIL_OVERLAP 7
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+int snp_lookup_rmpentry(u64 pfn, int *level);
+int psmash(u64 pfn);
+int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid, bool immutable);
+int rmp_make_shared(u64 pfn, enum pg_level level);
+#else
+static inline int snp_lookup_rmpentry(u64 pfn, int *level) { return 0; }
+static inline int psmash(u64 pfn) { return -ENXIO; }
+static inline int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid,
+ bool immutable)
+{
+ return -ENODEV;
+}
+static inline int rmp_make_shared(u64 pfn, enum pg_level level) { return -ENODEV; }
+
+#endif /* CONFIG_AMD_MEM_ENCRYPT */
+#endif /* __LINUX_SEV_H */
--
2.25.1

2022-06-20 23:04:23

by Ashish Kalra

Subject: [PATCH Part2 v6 08/49] x86/traps: Define RMP violation #PF error code

From: Brijesh Singh <[email protected]>

Bit 31 in the page-fault error code will be set when the processor
encounters an RMP violation.

While at it, use the BIT_ULL() macro.
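
With the new bit defined, fault-handling code can distinguish RMP
violations from other page faults; a minimal illustration (not taken
from the patch itself):

  /* Example check: was this #PF caused by an RMP violation? */
  if (error_code & X86_PF_RMP)
          pr_alert("#PF due to RMP violation at address 0x%lx\n", address);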

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/trap_pf.h | 18 +++++++++++-------
arch/x86/mm/fault.c | 1 +
2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
index 10b1de500ab1..89b705114b3f 100644
--- a/arch/x86/include/asm/trap_pf.h
+++ b/arch/x86/include/asm/trap_pf.h
@@ -2,6 +2,8 @@
#ifndef _ASM_X86_TRAP_PF_H
#define _ASM_X86_TRAP_PF_H

+#include <linux/bits.h> /* BIT() macro */
+
/*
* Page fault error code bits:
*
@@ -12,15 +14,17 @@
* bit 4 == 1: fault was an instruction fetch
* bit 5 == 1: protection keys block access
* bit 15 == 1: SGX MMU page-fault
+ * bit 31 == 1: fault was due to RMP violation
*/
enum x86_pf_error_code {
- X86_PF_PROT = 1 << 0,
- X86_PF_WRITE = 1 << 1,
- X86_PF_USER = 1 << 2,
- X86_PF_RSVD = 1 << 3,
- X86_PF_INSTR = 1 << 4,
- X86_PF_PK = 1 << 5,
- X86_PF_SGX = 1 << 15,
+ X86_PF_PROT = BIT_ULL(0),
+ X86_PF_WRITE = BIT_ULL(1),
+ X86_PF_USER = BIT_ULL(2),
+ X86_PF_RSVD = BIT_ULL(3),
+ X86_PF_INSTR = BIT_ULL(4),
+ X86_PF_PK = BIT_ULL(5),
+ X86_PF_SGX = BIT_ULL(15),
+ X86_PF_RMP = BIT_ULL(31),
};

#endif /* _ASM_X86_TRAP_PF_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fad8faa29d04..a4c270e99f7f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -546,6 +546,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
!(error_code & X86_PF_PROT) ? "not-present page" :
(error_code & X86_PF_RSVD) ? "reserved bit violation" :
(error_code & X86_PF_PK) ? "protection keys violation" :
+ (error_code & X86_PF_RMP) ? "RMP violation" :
"permissions violation");

if (!(error_code & X86_PF_USER) && user_mode(regs)) {
--
2.25.1

2022-06-20 23:04:39

by Ashish Kalra

Subject: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

From: Brijesh Singh <[email protected]>

The RMPUPDATE instruction writes a new RMP entry in the RMP Table. The
hypervisor will use the instruction to add pages to the RMP table. See
APM3 for details on the instruction operations.

The PSMASH instruction expands a 2MB RMP entry into a corresponding set of
contiguous 4KB-Page RMP entries. The hypervisor will use this instruction
to adjust the RMP entry without invalidating the previous RMP entry.
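
A rough usage sketch of the helpers introduced below, e.g. for a
hypervisor that assigns a 2MB-aligned region to a guest and later splits
and reclaims it (values and the wrapper function are illustrative):

  /*
   * Illustrative only: assign a 2MB-aligned pfn to a guest, smash it to
   * 4K granularity, then return the first 4K page to the shared state.
   */
  static int snp_assign_example(u64 pfn, u64 gpa, int asid)
  {
          int ret;

          ret = rmp_make_private(pfn, gpa, PG_LEVEL_2M, asid, false);
          if (ret)
                  return ret;

          ret = psmash(pfn);      /* split the 2MB RMP entry into 4K entries */
          if (ret)
                  return ret;

          return rmp_make_shared(pfn, PG_LEVEL_4K);       /* back to hypervisor-owned */
  }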

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/sev.h | 11 ++++++
arch/x86/kernel/sev.c | 72 ++++++++++++++++++++++++++++++++++++++
2 files changed, 83 insertions(+)

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index cb16f0e5b585..6ab872311544 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -85,7 +85,9 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);

/* RMP page size */
#define RMP_PG_SIZE_4K 0
+#define RMP_PG_SIZE_2M 1
#define RMP_TO_X86_PG_LEVEL(level) (((level) == RMP_PG_SIZE_4K) ? PG_LEVEL_4K : PG_LEVEL_2M)
+#define X86_TO_RMP_PG_LEVEL(level) (((level) == PG_LEVEL_4K) ? RMP_PG_SIZE_4K : RMP_PG_SIZE_2M)

/*
* The RMP entry format is not architectural. The format is defined in PPR
@@ -126,6 +128,15 @@ struct snp_guest_platform_data {
u64 secrets_gpa;
};

+struct rmpupdate {
+ u64 gpa;
+ u8 assigned;
+ u8 pagesize;
+ u8 immutable;
+ u8 rsvd;
+ u32 asid;
+} __packed;
+
#ifdef CONFIG_AMD_MEM_ENCRYPT
extern struct static_key_false sev_es_enable_key;
extern void __sev_es_ist_enter(struct pt_regs *regs);
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 59e7ec6b0326..f6c64a722e94 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -2429,3 +2429,75 @@ int snp_lookup_rmpentry(u64 pfn, int *level)
return !!rmpentry_assigned(e);
}
EXPORT_SYMBOL_GPL(snp_lookup_rmpentry);
+
+int psmash(u64 pfn)
+{
+ unsigned long paddr = pfn << PAGE_SHIFT;
+ int ret;
+
+ if (!pfn_valid(pfn))
+ return -EINVAL;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return -ENXIO;
+
+ /* Binutils version 2.36 supports the PSMASH mnemonic. */
+ asm volatile(".byte 0xF3, 0x0F, 0x01, 0xFF"
+ : "=a"(ret)
+ : "a"(paddr)
+ : "memory", "cc");
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(psmash);
+
+static int rmpupdate(u64 pfn, struct rmpupdate *val)
+{
+ unsigned long paddr = pfn << PAGE_SHIFT;
+ int ret;
+
+ if (!pfn_valid(pfn))
+ return -EINVAL;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return -ENXIO;
+
+ /* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
+ asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
+ : "=a"(ret)
+ : "a"(paddr), "c"((unsigned long)val)
+ : "memory", "cc");
+ return ret;
+}
+
+int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid, bool immutable)
+{
+ struct rmpupdate val;
+
+ if (!pfn_valid(pfn))
+ return -EINVAL;
+
+ memset(&val, 0, sizeof(val));
+ val.assigned = 1;
+ val.asid = asid;
+ val.immutable = immutable;
+ val.gpa = gpa;
+ val.pagesize = X86_TO_RMP_PG_LEVEL(level);
+
+ return rmpupdate(pfn, &val);
+}
+EXPORT_SYMBOL_GPL(rmp_make_private);
+
+int rmp_make_shared(u64 pfn, enum pg_level level)
+{
+ struct rmpupdate val;
+
+ if (!pfn_valid(pfn))
+ return -EINVAL;
+
+ memset(&val, 0, sizeof(val));
+ val.pagesize = X86_TO_RMP_PG_LEVEL(level);
+
+ return rmpupdate(pfn, &val);
+}
+EXPORT_SYMBOL_GPL(rmp_make_shared);
--
2.25.1

2022-06-20 23:04:52

by Ashish Kalra

Subject: [PATCH Part2 v6 07/49] x86/sev: Invalidate pages from direct map when adding them to RMP table

From: Brijesh Singh <[email protected]>

The integrity guarantee of SEV-SNP is enforced through the RMP table.
The RMP is used with standard x86 and IOMMU page tables to enforce memory
restrictions and page access rights. The RMP check is enforced as soon as
SEV-SNP is enabled globally in the system. When hardware encounters an
RMP check failure, it raises a page-fault exception.

The rmp_make_private() and rmp_make_shared() helpers are used to add
pages to or remove pages from the RMP table. Improve rmp_make_private()
to invalidate the pages in the direct map so that they cannot be used
there after they are added to the RMP table, and restore the default
valid permissions after the pages are removed from the RMP table.

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/kernel/sev.c | 61 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 60 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index f6c64a722e94..734cddd837f5 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -2451,10 +2451,42 @@ int psmash(u64 pfn)
}
EXPORT_SYMBOL_GPL(psmash);

+static int restore_direct_map(u64 pfn, int npages)
+{
+ int i, ret = 0;
+
+ for (i = 0; i < npages; i++) {
+ ret = set_direct_map_default_noflush(pfn_to_page(pfn + i));
+ if (ret)
+ goto cleanup;
+ }
+
+cleanup:
+ WARN(ret > 0, "Failed to restore direct map for pfn 0x%llx\n", pfn + i);
+ return ret;
+}
+
+static int invalid_direct_map(unsigned long pfn, int npages)
+{
+ int i, ret = 0;
+
+ for (i = 0; i < npages; i++) {
+ ret = set_direct_map_invalid_noflush(pfn_to_page(pfn + i));
+ if (ret)
+ goto cleanup;
+ }
+
+ return 0;
+
+cleanup:
+ restore_direct_map(pfn, i);
+ return ret;
+}
+
static int rmpupdate(u64 pfn, struct rmpupdate *val)
{
unsigned long paddr = pfn << PAGE_SHIFT;
- int ret;
+ int ret, level, npages;

if (!pfn_valid(pfn))
return -EINVAL;
@@ -2462,11 +2494,38 @@ static int rmpupdate(u64 pfn, struct rmpupdate *val)
if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
return -ENXIO;

+ level = RMP_TO_X86_PG_LEVEL(val->pagesize);
+ npages = page_level_size(level) / PAGE_SIZE;
+
+ /*
+ * If page is getting assigned in the RMP table then unmap it from the
+ * direct map.
+ */
+ if (val->assigned) {
+ if (invalid_direct_map(pfn, npages)) {
+ pr_err("Failed to unmap pfn 0x%llx pages %d from direct_map\n",
+ pfn, npages);
+ return -EFAULT;
+ }
+ }
+
/* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
: "=a"(ret)
: "a"(paddr), "c"((unsigned long)val)
: "memory", "cc");
+
+ /*
+ * Restore the direct map after the page is removed from the RMP table.
+ */
+ if (!ret && !val->assigned) {
+ if (restore_direct_map(pfn, npages)) {
+ pr_err("Failed to map pfn 0x%llx pages %d in direct_map\n",
+ pfn, npages);
+ return -EFAULT;
+ }
+ }
+
return ret;
}

--
2.25.1

2022-06-20 23:05:13

by Ashish Kalra

Subject: [PATCH Part2 v6 10/49] x86/fault: Add support to dump RMP entry on fault

From: Brijesh Singh <[email protected]>

When SEV-SNP is enabled globally, a write from the host goes through the
RMP check. If the hardware encounters a check failure, then it raises a
#PF (with the RMP bit set in the error code). Dump the RMP entry at the
faulting pfn to help debugging.
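
A minimal usage sketch (the actual hook-up into dump_pagetable() is in
the diff below; pfn here is the pfn of the leaf page-table entry that
faulted):

  if (error_code & X86_PF_RMP)
          dump_rmpentry(pfn);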

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/sev.h | 7 +++++++
arch/x86/kernel/sev.c | 43 ++++++++++++++++++++++++++++++++++++++
arch/x86/mm/fault.c | 17 +++++++++++----
include/linux/sev.h | 2 ++
4 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 6ab872311544..c0c4df817159 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -113,6 +113,11 @@ struct __packed rmpentry {

#define rmpentry_assigned(x) ((x)->info.assigned)
#define rmpentry_pagesize(x) ((x)->info.pagesize)
+#define rmpentry_vmsa(x) ((x)->info.vmsa)
+#define rmpentry_asid(x) ((x)->info.asid)
+#define rmpentry_validated(x) ((x)->info.validated)
+#define rmpentry_gpa(x) ((unsigned long)(x)->info.gpa)
+#define rmpentry_immutable(x) ((x)->info.immutable)

#define RMPADJUST_VMSA_PAGE_BIT BIT(16)

@@ -205,6 +210,7 @@ void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
void snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
+void dump_rmpentry(u64 pfn);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
@@ -229,6 +235,7 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
{
return -ENOTTY;
}
+static inline void dump_rmpentry(u64 pfn) {}
#endif

#endif
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 734cddd837f5..6640a639fffc 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -2414,6 +2414,49 @@ static struct rmpentry *__snp_lookup_rmpentry(u64 pfn, int *level)
return entry;
}

+void dump_rmpentry(u64 pfn)
+{
+ unsigned long pfn_end;
+ struct rmpentry *e;
+ int level;
+
+ e = __snp_lookup_rmpentry(pfn, &level);
+ if (!e) {
+ pr_alert("failed to read RMP entry pfn 0x%llx\n", pfn);
+ return;
+ }
+
+ if (rmpentry_assigned(e)) {
+ pr_alert("RMPEntry paddr 0x%llx [assigned=%d immutable=%d pagesize=%d gpa=0x%lx"
+ " asid=%d vmsa=%d validated=%d]\n", pfn << PAGE_SHIFT,
+ rmpentry_assigned(e), rmpentry_immutable(e), rmpentry_pagesize(e),
+ rmpentry_gpa(e), rmpentry_asid(e), rmpentry_vmsa(e),
+ rmpentry_validated(e));
+ return;
+ }
+
+ /*
+ * If the RMP entry at the faulting pfn was not assigned, then we do not
+ * know what caused the RMP violation. To get some useful debug information,
+ * let us iterate through the entire 2MB region, and dump the RMP entries if
+ * one of the bits in the RMP entry is set.
+ */
+ pfn = pfn & ~(PTRS_PER_PMD - 1);
+ pfn_end = pfn + PTRS_PER_PMD;
+
+ while (pfn < pfn_end) {
+ e = __snp_lookup_rmpentry(pfn, &level);
+ if (!e)
+ return;
+
+ if (e->low || e->high)
+ pr_alert("RMPEntry paddr 0x%llx: [high=0x%016llx low=0x%016llx]\n",
+ pfn << PAGE_SHIFT, e->high, e->low);
+ pfn++;
+ }
+}
+EXPORT_SYMBOL_GPL(dump_rmpentry);
+
/*
* Return 1 if the RMP entry is assigned, 0 if it exists but is not assigned,
* and -errno if there is no corresponding RMP entry.
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f5de9673093a..25896a6ba04a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -34,6 +34,7 @@
#include <asm/kvm_para.h> /* kvm_handle_async_pf */
#include <asm/vdso.h> /* fixup_vdso_exception() */
#include <asm/irq_stack.h>
+#include <asm/sev.h> /* dump_rmpentry() */

#define CREATE_TRACE_POINTS
#include <asm/trace/exceptions.h>
@@ -290,7 +291,7 @@ static bool low_pfn(unsigned long pfn)
return pfn < max_low_pfn;
}

-static void dump_pagetable(unsigned long address)
+static void dump_pagetable(unsigned long address, bool show_rmpentry)
{
pgd_t *base = __va(read_cr3_pa());
pgd_t *pgd = &base[pgd_index(address)];
@@ -346,10 +347,11 @@ static int bad_address(void *p)
return get_kernel_nofault(dummy, (unsigned long *)p);
}

-static void dump_pagetable(unsigned long address)
+static void dump_pagetable(unsigned long address, bool show_rmpentry)
{
pgd_t *base = __va(read_cr3_pa());
pgd_t *pgd = base + pgd_index(address);
+ unsigned long pfn;
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
@@ -367,6 +369,7 @@ static void dump_pagetable(unsigned long address)
if (bad_address(p4d))
goto bad;

+ pfn = p4d_pfn(*p4d);
pr_cont("P4D %lx ", p4d_val(*p4d));
if (!p4d_present(*p4d) || p4d_large(*p4d))
goto out;
@@ -375,6 +378,7 @@ static void dump_pagetable(unsigned long address)
if (bad_address(pud))
goto bad;

+ pfn = pud_pfn(*pud);
pr_cont("PUD %lx ", pud_val(*pud));
if (!pud_present(*pud) || pud_large(*pud))
goto out;
@@ -383,6 +387,7 @@ static void dump_pagetable(unsigned long address)
if (bad_address(pmd))
goto bad;

+ pfn = pmd_pfn(*pmd);
pr_cont("PMD %lx ", pmd_val(*pmd));
if (!pmd_present(*pmd) || pmd_large(*pmd))
goto out;
@@ -391,9 +396,13 @@ static void dump_pagetable(unsigned long address)
if (bad_address(pte))
goto bad;

+ pfn = pte_pfn(*pte);
pr_cont("PTE %lx", pte_val(*pte));
out:
pr_cont("\n");
+
+ if (show_rmpentry)
+ dump_rmpentry(pfn);
return;
bad:
pr_info("BAD\n");
@@ -579,7 +588,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
show_ldttss(&gdt, "TR", tr);
}

- dump_pagetable(address);
+ dump_pagetable(address, error_code & X86_PF_RMP);
}

static noinline void
@@ -596,7 +605,7 @@ pgtable_bad(struct pt_regs *regs, unsigned long error_code,

printk(KERN_ALERT "%s: Corrupted page table at address %lx\n",
tsk->comm, address);
- dump_pagetable(address);
+ dump_pagetable(address, false);

if (__die("Bad pagetable", regs, error_code))
sig = 0;
diff --git a/include/linux/sev.h b/include/linux/sev.h
index 1a68842789e1..734b13a69c54 100644
--- a/include/linux/sev.h
+++ b/include/linux/sev.h
@@ -16,6 +16,7 @@ int snp_lookup_rmpentry(u64 pfn, int *level);
int psmash(u64 pfn);
int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid, bool immutable);
int rmp_make_shared(u64 pfn, enum pg_level level);
+void dump_rmpentry(u64 pfn);
#else
static inline int snp_lookup_rmpentry(u64 pfn, int *level) { return 0; }
static inline int psmash(u64 pfn) { return -ENXIO; }
@@ -25,6 +26,7 @@ static inline int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int as
return -ENODEV;
}
static inline int rmp_make_shared(u64 pfn, enum pg_level level) { return -ENODEV; }
+static inline void dump_rmpentry(u64 pfn) { }

#endif /* CONFIG_AMD_MEM_ENCRYPT */
#endif /* __LINUX_SEV_H */
--
2.25.1

2022-06-20 23:05:14

by Ashish Kalra

Subject: [PATCH Part2 v6 11/49] crypto:ccp: Define the SEV-SNP commands

From: Brijesh Singh <[email protected]>

AMD introduced the next generation of SEV called SEV-SNP (Secure Nested
Paging). SEV-SNP builds upon existing SEV and SEV-ES functionality
while adding new hardware security protection.

Define the commands and structures used to communicate with the AMD-SP
when creating and managing the SEV-SNP guests. The SEV-SNP firmware spec
is available at developer.amd.com/sev.
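
As an illustration of how these definitions are meant to be used, a
hypothetical caller could issue SNP_PLATFORM_STATUS roughly as follows.
This is a sketch only: sev_do_cmd() stands in for the CCP driver's
internal command helper, and the firmware page-state requirements handled
by the real SNP_PLATFORM_STATUS support later in the series are ignored.

  struct sev_user_data_snp_status *status;
  struct sev_data_snp_platform_status_buf buf;
  int error, ret;

  status = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
  if (!status)
          return -ENOMEM;

  buf.status_paddr = __psp_pa(status);
  ret = sev_do_cmd(SEV_CMD_SNP_PLATFORM_STATUS, &buf, &error);
  if (!ret)
          pr_info("SNP API %u.%u build %u, state %u\n", status->api_major,
                  status->api_minor, status->build_id, status->state);

  free_page((unsigned long)status);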

Signed-off-by: Brijesh Singh <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 14 +++
include/linux/psp-sev.h | 222 +++++++++++++++++++++++++++++++++++
include/uapi/linux/psp-sev.h | 42 +++++++
3 files changed, 278 insertions(+)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index fd928199bf1e..9cb3265f3bef 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -153,6 +153,20 @@ static int sev_cmd_buffer_len(int cmd)
case SEV_CMD_GET_ID: return sizeof(struct sev_data_get_id);
case SEV_CMD_ATTESTATION_REPORT: return sizeof(struct sev_data_attestation_report);
case SEV_CMD_SEND_CANCEL: return sizeof(struct sev_data_send_cancel);
+ case SEV_CMD_SNP_GCTX_CREATE: return sizeof(struct sev_data_snp_gctx_create);
+ case SEV_CMD_SNP_LAUNCH_START: return sizeof(struct sev_data_snp_launch_start);
+ case SEV_CMD_SNP_LAUNCH_UPDATE: return sizeof(struct sev_data_snp_launch_update);
+ case SEV_CMD_SNP_ACTIVATE: return sizeof(struct sev_data_snp_activate);
+ case SEV_CMD_SNP_DECOMMISSION: return sizeof(struct sev_data_snp_decommission);
+ case SEV_CMD_SNP_PAGE_RECLAIM: return sizeof(struct sev_data_snp_page_reclaim);
+ case SEV_CMD_SNP_GUEST_STATUS: return sizeof(struct sev_data_snp_guest_status);
+ case SEV_CMD_SNP_LAUNCH_FINISH: return sizeof(struct sev_data_snp_launch_finish);
+ case SEV_CMD_SNP_DBG_DECRYPT: return sizeof(struct sev_data_snp_dbg);
+ case SEV_CMD_SNP_DBG_ENCRYPT: return sizeof(struct sev_data_snp_dbg);
+ case SEV_CMD_SNP_PAGE_UNSMASH: return sizeof(struct sev_data_snp_page_unsmash);
+ case SEV_CMD_SNP_PLATFORM_STATUS: return sizeof(struct sev_data_snp_platform_status_buf);
+ case SEV_CMD_SNP_GUEST_REQUEST: return sizeof(struct sev_data_snp_guest_request);
+ case SEV_CMD_SNP_CONFIG: return sizeof(struct sev_user_data_snp_config);
default: return 0;
}

diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index 1595088c428b..01ba9dc46ca3 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -86,6 +86,34 @@ enum sev_cmd {
SEV_CMD_DBG_DECRYPT = 0x060,
SEV_CMD_DBG_ENCRYPT = 0x061,

+ /* SNP specific commands */
+ SEV_CMD_SNP_INIT = 0x81,
+ SEV_CMD_SNP_SHUTDOWN = 0x82,
+ SEV_CMD_SNP_PLATFORM_STATUS = 0x83,
+ SEV_CMD_SNP_DF_FLUSH = 0x84,
+ SEV_CMD_SNP_INIT_EX = 0x85,
+ SEV_CMD_SNP_DECOMMISSION = 0x90,
+ SEV_CMD_SNP_ACTIVATE = 0x91,
+ SEV_CMD_SNP_GUEST_STATUS = 0x92,
+ SEV_CMD_SNP_GCTX_CREATE = 0x93,
+ SEV_CMD_SNP_GUEST_REQUEST = 0x94,
+ SEV_CMD_SNP_ACTIVATE_EX = 0x95,
+ SEV_CMD_SNP_LAUNCH_START = 0xA0,
+ SEV_CMD_SNP_LAUNCH_UPDATE = 0xA1,
+ SEV_CMD_SNP_LAUNCH_FINISH = 0xA2,
+ SEV_CMD_SNP_DBG_DECRYPT = 0xB0,
+ SEV_CMD_SNP_DBG_ENCRYPT = 0xB1,
+ SEV_CMD_SNP_PAGE_SWAP_OUT = 0xC0,
+ SEV_CMD_SNP_PAGE_SWAP_IN = 0xC1,
+ SEV_CMD_SNP_PAGE_MOVE = 0xC2,
+ SEV_CMD_SNP_PAGE_MD_INIT = 0xC3,
+ SEV_CMD_SNP_PAGE_MD_RECLAIM = 0xC4,
+ SEV_CMD_SNP_PAGE_RO_RECLAIM = 0xC5,
+ SEV_CMD_SNP_PAGE_RO_RESTORE = 0xC6,
+ SEV_CMD_SNP_PAGE_RECLAIM = 0xC7,
+ SEV_CMD_SNP_PAGE_UNSMASH = 0xC8,
+ SEV_CMD_SNP_CONFIG = 0xC9,
+
SEV_CMD_MAX,
};

@@ -531,6 +559,200 @@ struct sev_data_attestation_report {
u32 len; /* In/Out */
} __packed;

+/**
+ * struct sev_data_snp_platform_status_buf - SNP_PLATFORM_STATUS command params
+ *
+ * @address: physical address where the status should be copied
+ */
+struct sev_data_snp_platform_status_buf {
+ u64 status_paddr; /* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_download_firmware - SNP_DOWNLOAD_FIRMWARE command params
+ *
+ * @address: physical address of firmware image
+ * @len: len of the firmware image
+ */
+struct sev_data_snp_download_firmware {
+ u64 address; /* In */
+ u32 len; /* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_gctx_create - SNP_GCTX_CREATE command params
+ *
+ * @gctx_paddr: system physical address of the page donated to firmware by
+ * the hypervisor to contain the guest context.
+ */
+struct sev_data_snp_gctx_create {
+ u64 gctx_paddr; /* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_activate - SNP_ACTIVATE command params
+ *
+ * @gctx_paddr: system physical address guest context page
+ * @asid: ASID to bind to the guest
+ */
+struct sev_data_snp_activate {
+ u64 gctx_paddr; /* In */
+ u32 asid; /* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_decommission - SNP_DECOMMISSION command params
+ *
+ * @address: system physical address guest context page
+ */
+struct sev_data_snp_decommission {
+ u64 gctx_paddr; /* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_launch_start - SNP_LAUNCH_START command params
+ *
+ * @gctx_addr: system physical address of guest context page
+ * @policy: guest policy
+ * @ma_gctx_addr: system physical address of migration agent
+ * @imi_en: launch flow is launching an IMI for the purpose of
+ * guest-assisted migration.
+ * @ma_en: the guest is associated with a migration agent
+ */
+struct sev_data_snp_launch_start {
+ u64 gctx_paddr; /* In */
+ u64 policy; /* In */
+ u64 ma_gctx_paddr; /* In */
+ u32 ma_en:1; /* In */
+ u32 imi_en:1; /* In */
+ u32 rsvd:30;
+ u8 gosvw[16]; /* In */
+} __packed;
+
+/* SNP support page type */
+enum {
+ SNP_PAGE_TYPE_NORMAL = 0x1,
+ SNP_PAGE_TYPE_VMSA = 0x2,
+ SNP_PAGE_TYPE_ZERO = 0x3,
+ SNP_PAGE_TYPE_UNMEASURED = 0x4,
+ SNP_PAGE_TYPE_SECRET = 0x5,
+ SNP_PAGE_TYPE_CPUID = 0x6,
+
+ SNP_PAGE_TYPE_MAX
+};
+
+/**
+ * struct sev_data_snp_launch_update - SNP_LAUNCH_UPDATE command params
+ *
+ * @gctx_addr: system physical address of guest context page
+ * @imi_page: indicates that this page is part of the IMI of the guest
+ * @page_type: encoded page type
+ * @page_size: page size 0 indicates 4K and 1 indicates 2MB page
+ * @address: system physical address of destination page to encrypt
+ * @vmpl1_perms: VMPL permission mask for VMPL1
+ * @vmpl2_perms: VMPL permission mask for VMPL2
+ * @vmpl3_perms: VMPL permission mask for VMPL3
+ */
+struct sev_data_snp_launch_update {
+ u64 gctx_paddr; /* In */
+ u32 page_size:1; /* In */
+ u32 page_type:3; /* In */
+ u32 imi_page:1; /* In */
+ u32 rsvd:27;
+ u32 rsvd2;
+ u64 address; /* In */
+ u32 rsvd3:8;
+ u32 vmpl1_perms:8; /* In */
+ u32 vmpl2_perms:8; /* In */
+ u32 vmpl3_perms:8; /* In */
+ u32 rsvd4;
+} __packed;
+
+/**
+ * struct sev_data_snp_launch_finish - SNP_LAUNCH_FINISH command params
+ *
+ * @gctx_addr: system physical address of guest context page
+ */
+struct sev_data_snp_launch_finish {
+ u64 gctx_paddr;
+ u64 id_block_paddr;
+ u64 id_auth_paddr;
+ u8 id_block_en:1;
+ u8 auth_key_en:1;
+ u64 rsvd:62;
+ u8 host_data[32];
+} __packed;
+
+/**
+ * struct sev_data_snp_guest_status - SNP_GUEST_STATUS command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @address: system physical address of guest status page
+ */
+struct sev_data_snp_guest_status {
+ u64 gctx_paddr;
+ u64 address;
+} __packed;
+
+/**
+ * struct sev_data_snp_page_reclaim - SNP_PAGE_RECLAIM command params
+ *
+ * @paddr: system physical address of page to be claimed. The BIT0 indicate
+ * the page size. 0h indicates 4 kB and 1h indicates 2 MB page.
+ */
+struct sev_data_snp_page_reclaim {
+ u64 paddr;
+} __packed;
+
+/**
+ * struct sev_data_snp_page_unsmash - SNP_PAGE_UNSMASH command params
+ *
+ * @paddr: system physical address of page to be unsmashed. The BIT0 indicates
+ * the page size. 0h indicates 4 kB and 1h indicates 2 MB page.
+ */
+struct sev_data_snp_page_unsmash {
+ u64 paddr;
+} __packed;
+
+/**
+ * struct sev_data_snp_dbg - SNP_DBG_ENCRYPT/SNP_DBG_DECRYPT command parameters
+ *
+ * @handle: handle of the VM to perform debug operation
+ * @src_addr: source address of data to operate on
+ * @dst_addr: destination address of data to operate on
+ * @len: len of data to operate on
+ */
+struct sev_data_snp_dbg {
+ u64 gctx_paddr; /* In */
+ u64 src_addr; /* In */
+ u64 dst_addr; /* In */
+ u32 len; /* In */
+} __packed;
+
+/**
+ * struct sev_snp_guest_request - SNP_GUEST_REQUEST command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @req_paddr: system physical address of request page
+ * @res_paddr: system physical address of response page
+ */
+struct sev_data_snp_guest_request {
+ u64 gctx_paddr; /* In */
+ u64 req_paddr; /* In */
+ u64 res_paddr; /* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_init_ex - SNP_INIT_EX structure
+ *
+ * @init_rmp: indicate that the RMP should be initialized.
+ */
+struct sev_data_snp_init_ex {
+ u32 init_rmp:1;
+ u32 rsvd:31;
+ u8 rsvd1[60];
+} __packed;
+
#ifdef CONFIG_CRYPTO_DEV_SP_PSP

/**
diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
index 91b4c63d5cbf..bed65a891223 100644
--- a/include/uapi/linux/psp-sev.h
+++ b/include/uapi/linux/psp-sev.h
@@ -61,6 +61,13 @@ typedef enum {
SEV_RET_INVALID_PARAM,
SEV_RET_RESOURCE_LIMIT,
SEV_RET_SECURE_DATA_INVALID,
+ SEV_RET_INVALID_PAGE_SIZE,
+ SEV_RET_INVALID_PAGE_STATE,
+ SEV_RET_INVALID_MDATA_ENTRY,
+ SEV_RET_INVALID_PAGE_OWNER,
+ SEV_RET_INVALID_PAGE_AEAD_OFLOW,
+ SEV_RET_RMP_INIT_REQUIRED,
+
SEV_RET_MAX,
} sev_ret_code;

@@ -147,6 +154,41 @@ struct sev_user_data_get_id2 {
__u32 length; /* In/Out */
} __packed;

+/**
+ * struct sev_user_data_snp_status - SNP status
+ *
+ * @major: API major version
+ * @minor: API minor version
+ * @state: current platform state
+ * @build: firmware build id for the API version
+ * @guest_count: the number of guest currently managed by the firmware
+ * @tcb_version: current TCB version
+ */
+struct sev_user_data_snp_status {
+ __u8 api_major; /* Out */
+ __u8 api_minor; /* Out */
+ __u8 state; /* Out */
+ __u8 rsvd;
+ __u32 build_id; /* Out */
+ __u32 rsvd1;
+ __u32 guest_count; /* Out */
+ __u64 tcb_version; /* Out */
+ __u64 rsvd2;
+} __packed;
+
+/*
+ * struct sev_user_data_snp_config - system wide configuration value for SNP.
+ *
+ * @reported_tcb: The TCB version to report in the guest attestation report.
+ * @mask_chip_id: Indicates that the CHIP_ID field in the attestation report
+ * will always be zero.
+ */
+struct sev_user_data_snp_config {
+ __u64 reported_tcb; /* In */
+ __u32 mask_chip_id; /* In */
+ __u8 rsvd[52];
+} __packed;
+
/**
* struct sev_issue_cmd - SEV ioctl parameters
*
--
2.25.1

2022-06-20 23:06:07

by Ashish Kalra

Subject: [PATCH Part2 v6 12/49] crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP

From: Brijesh Singh <[email protected]>

Before SNP VMs can be launched, the platform must be appropriately
configured and initialized. Platform initialization is accomplished via
the SNP_INIT command. Make sure to do a WBINVD and issue the DF_FLUSH
command to prepare for the first SNP guest launch after INIT.
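
Callers are expected to use the new API roughly like this (a sketch that
mirrors how sev_pci_init() wires it up in the diff below; sev is the
driver's struct sev_device):

  int error, rc;

  rc = sev_snp_init(&error);
  if (rc)
          /* SNP init failed: continue with legacy SEV initialization. */
          dev_err(sev->dev, "SEV-SNP: failed to INIT, error %#x\n", error);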

Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 121 +++++++++++++++++++++++++++++++++++
drivers/crypto/ccp/sev-dev.h | 2 +
include/linux/psp-sev.h | 16 +++++
3 files changed, 139 insertions(+)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 9cb3265f3bef..f1173221d0b9 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -33,6 +33,10 @@
#define SEV_FW_FILE "amd/sev.fw"
#define SEV_FW_NAME_SIZE 64

+/* Minimum firmware version required for the SEV-SNP support */
+#define SNP_MIN_API_MAJOR 1
+#define SNP_MIN_API_MINOR 51
+
static DEFINE_MUTEX(sev_cmd_mutex);
static struct sev_misc_dev *misc_dev;

@@ -775,6 +779,98 @@ static int sev_update_firmware(struct device *dev)
return ret;
}

+static void snp_set_hsave_pa(void *arg)
+{
+ wrmsrl(MSR_VM_HSAVE_PA, 0);
+}
+
+static int __sev_snp_init_locked(int *error)
+{
+ struct psp_device *psp = psp_master;
+ struct sev_device *sev;
+ int rc = 0;
+
+ if (!psp || !psp->sev_data)
+ return -ENODEV;
+
+ sev = psp->sev_data;
+
+ if (sev->snp_inited)
+ return 0;
+
+ /*
+ * The SNP_INIT requires the MSR_VM_HSAVE_PA must be set to 0h
+ * across all cores.
+ */
+ on_each_cpu(snp_set_hsave_pa, NULL, 1);
+
+ /* Issue the SNP_INIT firmware command. */
+ rc = __sev_do_cmd_locked(SEV_CMD_SNP_INIT, NULL, error);
+ if (rc)
+ return rc;
+
+ /* Prepare for first SNP guest launch after INIT */
+ wbinvd_on_all_cpus();
+ rc = __sev_do_cmd_locked(SEV_CMD_SNP_DF_FLUSH, NULL, error);
+ if (rc)
+ return rc;
+
+ sev->snp_inited = true;
+ dev_dbg(sev->dev, "SEV-SNP firmware initialized\n");
+
+ return rc;
+}
+
+int sev_snp_init(int *error)
+{
+ int rc;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return -ENODEV;
+
+ mutex_lock(&sev_cmd_mutex);
+ rc = __sev_snp_init_locked(error);
+ mutex_unlock(&sev_cmd_mutex);
+
+ return rc;
+}
+EXPORT_SYMBOL_GPL(sev_snp_init);
+
+static int __sev_snp_shutdown_locked(int *error)
+{
+ struct sev_device *sev = psp_master->sev_data;
+ int ret;
+
+ if (!sev->snp_inited)
+ return 0;
+
+ /* SHUTDOWN requires the DF_FLUSH */
+ wbinvd_on_all_cpus();
+ __sev_do_cmd_locked(SEV_CMD_SNP_DF_FLUSH, NULL, NULL);
+
+ ret = __sev_do_cmd_locked(SEV_CMD_SNP_SHUTDOWN, NULL, error);
+ if (ret) {
+ dev_err(sev->dev, "SEV-SNP firmware shutdown failed\n");
+ return ret;
+ }
+
+ sev->snp_inited = false;
+ dev_dbg(sev->dev, "SEV-SNP firmware shutdown\n");
+
+ return ret;
+}
+
+static int sev_snp_shutdown(int *error)
+{
+ int rc;
+
+ mutex_lock(&sev_cmd_mutex);
+ rc = __sev_snp_shutdown_locked(NULL);
+ mutex_unlock(&sev_cmd_mutex);
+
+ return rc;
+}
+
static int sev_ioctl_do_pek_import(struct sev_issue_cmd *argp, bool writable)
{
struct sev_device *sev = psp_master->sev_data;
@@ -1231,6 +1327,8 @@ static void sev_firmware_shutdown(struct sev_device *sev)
get_order(NV_LENGTH));
sev_init_ex_buffer = NULL;
}
+
+ sev_snp_shutdown(NULL);
}

void sev_dev_destroy(struct psp_device *psp)
@@ -1287,6 +1385,26 @@ void sev_pci_init(void)
}
}

+ /*
+ * If boot CPU supports the SNP, then first attempt to initialize
+ * the SNP firmware.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SEV_SNP)) {
+ if (!sev_version_greater_or_equal(SNP_MIN_API_MAJOR, SNP_MIN_API_MINOR)) {
+ dev_err(sev->dev, "SEV-SNP support requires firmware version >= %d:%d\n",
+ SNP_MIN_API_MAJOR, SNP_MIN_API_MINOR);
+ } else {
+ rc = sev_snp_init(&error);
+ if (rc) {
+ /*
+ * If we failed to INIT SNP then don't abort the probe.
+ * Continue to initialize the legacy SEV firmware.
+ */
+ dev_err(sev->dev, "SEV-SNP: failed to INIT error %#x\n", error);
+ }
+ }
+ }
+
/* Obtain the TMR memory area for SEV-ES use */
sev_es_tmr = sev_fw_alloc(SEV_ES_TMR_SIZE);
if (!sev_es_tmr)
@@ -1302,6 +1420,9 @@ void sev_pci_init(void)
dev_err(sev->dev, "SEV: failed to INIT error %#x, rc %d\n",
error, rc);

+ dev_info(sev->dev, "SEV%s API:%d.%d build:%d\n", sev->snp_inited ?
+ "-SNP" : "", sev->api_major, sev->api_minor, sev->build);
+
return;

err:
diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
index 666c21eb81ab..186ad20cbd24 100644
--- a/drivers/crypto/ccp/sev-dev.h
+++ b/drivers/crypto/ccp/sev-dev.h
@@ -52,6 +52,8 @@ struct sev_device {
u8 build;

void *cmd_buf;
+
+ bool snp_inited;
};

int sev_dev_init(struct psp_device *psp);
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index 01ba9dc46ca3..ef4d42e8c96e 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -769,6 +769,20 @@ struct sev_data_snp_init_ex {
*/
int sev_platform_init(int *error);

+/**
+ * sev_snp_init - perform SEV SNP_INIT command
+ *
+ * @error: SEV command return code
+ *
+ * Returns:
+ * 0 if the SEV successfully processed the command
+ * -%ENODEV if the SEV device is not available
+ * -%ENOTSUPP if the SEV does not support SEV
+ * -%ETIMEDOUT if the SEV command timed out
+ * -%EIO if the SEV returned a non-zero return code
+ */
+int sev_snp_init(int *error);
+
/**
* sev_platform_status - perform SEV PLATFORM_STATUS command
*
@@ -876,6 +890,8 @@ sev_platform_status(struct sev_user_data_status *status, int *error) { return -E

static inline int sev_platform_init(int *error) { return -ENODEV; }

+static inline int sev_snp_init(int *error) { return -ENODEV; }
+
static inline int
sev_guest_deactivate(struct sev_data_deactivate *data, int *error) { return -ENODEV; }

--
2.25.1

2022-06-20 23:06:07

by Ashish Kalra

Subject: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

From: Brijesh Singh <[email protected]>

When SEV-SNP is enabled globally, a write from the host goes through the
RMP check. When the host writes to pages, hardware checks the following
conditions at the end of page walk:

1. Assigned bit in the RMP table is zero (i.e. the page is shared).
2. If the page table entry that gives the sPA indicates that the target
page size is a large page, then all RMP entries for the 4KB
constituting pages of the target must have the assigned bit 0.
3. Immutable bit in the RMP table is not set.

The hardware raises a page fault if any of the above conditions is not
met. Try to resolve the fault instead of taking it again and again. If
the host attempts to write to guest private memory, then send a SIGBUS
signal to kill the process. If the page level between the host and the
RMP entry does not match, then split the address to keep the RMP and host
page levels in sync.
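
For the large-page case above, here is a minimal standalone sketch (not
kernel code; PAGE_SHIFT and the 512 4KB-pages-per-2MB figure are the usual
x86 values) of how the faulting 4KB pfn is derived from the huge page's
base pfn and the faulting virtual address:

  #include <stdio.h>

  #define PAGE_SHIFT    12
  #define PTRS_PER_PMD  512                          /* 4KB pages per 2MB page */

  int main(void)
  {
          unsigned long huge_pfn = 0x100000;         /* pfn of the 2MB backing page */
          unsigned long address  = 0x7f0000035000UL; /* faulting user address */
          unsigned long mask     = PTRS_PER_PMD - 1;
          unsigned long pfn;

          /* Index of the faulting 4KB page within the 2MB page. */
          pfn = huge_pfn | ((address >> PAGE_SHIFT) & mask);
          printf("faulting 4KB pfn = 0x%lx\n", pfn); /* prints 0x100035 */
          return 0;
  }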

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/mm/fault.c | 66 ++++++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 3 +-
include/linux/mm_types.h | 3 ++
mm/memory.c | 13 ++++++++
4 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a4c270e99f7f..f5de9673093a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -19,6 +19,7 @@
#include <linux/uaccess.h> /* faulthandler_disabled() */
#include <linux/efi.h> /* efi_crash_gracefully_on_page_fault()*/
#include <linux/mm_types.h>
+#include <linux/sev.h> /* snp_lookup_rmpentry() */

#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/traps.h> /* dotraplinkage, ... */
@@ -1209,6 +1210,60 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
}
NOKPROBE_SYMBOL(do_kern_addr_fault);

+static inline size_t pages_per_hpage(int level)
+{
+ return page_level_size(level) / PAGE_SIZE;
+}
+
+/*
+ * Return 1 if the caller needs to retry, 0 if the address needs to be split
+ * in order to resolve the fault.
+ */
+static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
+ unsigned long address)
+{
+ int rmp_level, level;
+ pte_t *pte;
+ u64 pfn;
+
+ pte = lookup_address_in_mm(current->mm, address, &level);
+
+ /*
+ * This can happen if there was a race between an unmap event and
+ * the RMP fault delivery.
+ */
+ if (!pte || !pte_present(*pte))
+ return 1;
+
+ pfn = pte_pfn(*pte);
+
+ /* If it's a large page then calculate the fault pfn */
+ if (level > PG_LEVEL_4K) {
+ unsigned long mask;
+
+ mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
+ pfn |= (address >> PAGE_SHIFT) & mask;
+ }
+
+ /*
+ * If it's a guest private page, then the fault cannot be resolved.
+ * Send a SIGBUS to terminate the process.
+ */
+ if (snp_lookup_rmpentry(pfn, &rmp_level)) {
+ do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
+ return 1;
+ }
+
+ /*
+ * The backing page level is higher than the RMP page level, request
+ * to split the page.
+ */
+ if (level > rmp_level)
+ return 0;
+
+ return 1;
+}
+
/*
* Handle faults in the user portion of the address space. Nothing in here
* should check X86_PF_USER without a specific justification: for almost
@@ -1306,6 +1361,17 @@ void do_user_addr_fault(struct pt_regs *regs,
if (error_code & X86_PF_INSTR)
flags |= FAULT_FLAG_INSTRUCTION;

+ /*
+ * If it's an RMP violation, try resolving it.
+ */
+ if (error_code & X86_PF_RMP) {
+ if (handle_user_rmp_page_fault(regs, error_code, address))
+ return;
+
+ /* Ask to split the page */
+ flags |= FAULT_FLAG_PAGE_SPLIT;
+ }
+
#ifdef CONFIG_X86_64
/*
* Faults in the vsyscall page might need emulation. The
diff --git a/include/linux/mm.h b/include/linux/mm.h
index de32c0383387..2ccc562d166f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -463,7 +463,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
{ FAULT_FLAG_USER, "USER" }, \
{ FAULT_FLAG_REMOTE, "REMOTE" }, \
{ FAULT_FLAG_INSTRUCTION, "INSTRUCTION" }, \
- { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
+ { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
+ { FAULT_FLAG_PAGE_SPLIT, "PAGESPLIT" }

/*
* vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6dfaf271ebf8..aa2d8d48ce3e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -818,6 +818,8 @@ typedef struct {
* mapped R/O.
* @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
* We should only access orig_pte if this flag set.
+ * @FAULT_FLAG_PAGE_SPLIT: The fault was due to a page size mismatch, split the
+ * region to a smaller page size and retry.
*
* About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
* whether we would allow page faults to retry by specifying these two
@@ -855,6 +857,7 @@ enum fault_flag {
FAULT_FLAG_INTERRUPTIBLE = 1 << 9,
FAULT_FLAG_UNSHARE = 1 << 10,
FAULT_FLAG_ORIG_PTE_VALID = 1 << 11,
+ FAULT_FLAG_PAGE_SPLIT = 1 << 12,
};

typedef unsigned int __bitwise zap_flags_t;
diff --git a/mm/memory.c b/mm/memory.c
index 7274f2b52bca..c2187ffcbb8e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4945,6 +4945,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
return 0;
}

+static int handle_split_page_fault(struct vm_fault *vmf)
+{
+ if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
+ return VM_FAULT_SIGBUS;
+
+ __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
+ return 0;
+}
+
/*
* By the time we get here, we already hold the mm semaphore
*
@@ -5024,6 +5033,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
pmd_migration_entry_wait(mm, vmf.pmd);
return 0;
}
+
+ if (flags & FAULT_FLAG_PAGE_SPLIT)
+ return handle_split_page_fault(&vmf);
+
if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&vmf);
--
2.25.1

2022-06-20 23:06:21

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 13/49] crypto:ccp: Provide APIs to issue SEV-SNP commands

From: Brijesh Singh <[email protected]>

Provide the APIs for the hypervisor to manage an SEV-SNP guest. The
commands for SEV-SNP are defined in the SEV-SNP firmware specification.
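
As a hedged illustration only (not part of this patch), a KVM-side caller
might use the new APIs roughly as follows; the gctx_paddr field name follows
the SEV-SNP firmware specification and error handling is abbreviated:

  #include <linux/psp-sev.h>

  static int example_snp_guest_teardown(u64 gctx_paddr)
  {
          struct sev_data_snp_decommission decommission = {};
          int rc, fw_err;

          decommission.gctx_paddr = gctx_paddr;    /* guest context page (SPA) */

          rc = snp_guest_decommission(&decommission, &fw_err);
          if (rc) {
                  pr_err("SNP_DECOMMISSION failed, fw_err %#x\n", fw_err);
                  return rc;
          }

          /* Flush the data fabric before the ASID can be safely reused. */
          return snp_guest_df_flush(&fw_err);
  }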

Signed-off-by: Brijesh Singh <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 24 ++++++++++++
include/linux/psp-sev.h | 73 ++++++++++++++++++++++++++++++++++++
2 files changed, 97 insertions(+)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index f1173221d0b9..35d76333e120 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1205,6 +1205,30 @@ int sev_guest_df_flush(int *error)
}
EXPORT_SYMBOL_GPL(sev_guest_df_flush);

+int snp_guest_decommission(struct sev_data_snp_decommission *data, int *error)
+{
+ return sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, data, error);
+}
+EXPORT_SYMBOL_GPL(snp_guest_decommission);
+
+int snp_guest_df_flush(int *error)
+{
+ return sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, error);
+}
+EXPORT_SYMBOL_GPL(snp_guest_df_flush);
+
+int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error)
+{
+ return sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, data, error);
+}
+EXPORT_SYMBOL_GPL(snp_guest_page_reclaim);
+
+int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
+{
+ return sev_do_cmd(SEV_CMD_SNP_DBG_DECRYPT, data, error);
+}
+EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt);
+
static void sev_exit(struct kref *ref)
{
misc_deregister(&misc_dev->misc);
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index ef4d42e8c96e..9f921d221b75 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -881,6 +881,64 @@ int sev_guest_df_flush(int *error);
*/
int sev_guest_decommission(struct sev_data_decommission *data, int *error);

+/**
+ * snp_guest_df_flush - perform SNP DF_FLUSH command
+ *
+ * @sev_ret: sev command return code
+ *
+ * Returns:
+ * 0 if the sev successfully processed the command
+ * -%ENODEV if the sev device is not available
+ * -%ENOTSUPP if the sev does not support SEV
+ * -%ETIMEDOUT if the sev command timed out
+ * -%EIO if the sev returned a non-zero return code
+ */
+int snp_guest_df_flush(int *error);
+
+/**
+ * snp_guest_decommission - perform SNP_DECOMMISSION command
+ *
+ * @decommission: sev_data_decommission structure to be processed
+ * @sev_ret: sev command return code
+ *
+ * Returns:
+ * 0 if the sev successfully processed the command
+ * -%ENODEV if the sev device is not available
+ * -%ENOTSUPP if the sev does not support SEV
+ * -%ETIMEDOUT if the sev command timed out
+ * -%EIO if the sev returned a non-zero return code
+ */
+int snp_guest_decommission(struct sev_data_snp_decommission *data, int *error);
+
+/**
+ * snp_guest_page_reclaim - perform SNP_PAGE_RECLAIM command
+ *
+ * @decommission: sev_snp_page_reclaim structure to be processed
+ * @sev_ret: sev command return code
+ *
+ * Returns:
+ * 0 if the sev successfully processed the command
+ * -%ENODEV if the sev device is not available
+ * -%ENOTSUPP if the sev does not support SEV
+ * -%ETIMEDOUT if the sev command timed out
+ * -%EIO if the sev returned a non-zero return code
+ */
+int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error);
+
+/**
+ * snp_guest_dbg_decrypt - perform SEV SNP_DBG_DECRYPT command
+ *
+ * @sev_ret: sev command return code
+ *
+ * Returns:
+ * 0 if the sev successfully processed the command
+ * -%ENODEV if the sev device is not available
+ * -%ENOTSUPP if the sev does not support SEV
+ * -%ETIMEDOUT if the sev command timed out
+ * -%EIO if the sev returned a non-zero return code
+ */
+int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error);
+
void *psp_copy_user_blob(u64 uaddr, u32 len);

#else /* !CONFIG_CRYPTO_DEV_SP_PSP */
@@ -908,6 +966,21 @@ sev_issue_cmd_external_user(struct file *filep, unsigned int id, void *data, int

static inline void *psp_copy_user_blob(u64 __user uaddr, u32 len) { return ERR_PTR(-EINVAL); }

+static inline int
+snp_guest_decommission(struct sev_data_snp_decommission *data, int *error) { return -ENODEV; }
+
+static inline int snp_guest_df_flush(int *error) { return -ENODEV; }
+
+static inline int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error)
+{
+ return -ENODEV;
+}
+
+static inline int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
+{
+ return -ENODEV;
+}
+
#endif /* CONFIG_CRYPTO_DEV_SP_PSP */

#endif /* __PSP_SEV_H__ */
--
2.25.1

2022-06-20 23:06:58

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

From: Brijesh Singh <[email protected]>

The behavior and requirement for the SEV-legacy command is altered when
the SNP firmware is in the INIT state. See SEV-SNP firmware specification
for more details.

Allocate the Trusted Memory Region (TMR) as a 2MB sized/aligned region
when SNP is enabled to satisfy the new requirements for SNP. Continue
allocating a 1MB region for the !SNP configuration.

While at it, provide an API that can be used by others to allocate a page
that can be used by the firmware. The immediate user for this API will
be the KVM driver. The KVM driver needs to allocate a firmware context
page during guest creation, and the context page needs to be updated
by the firmware. See the SEV-SNP specification for further details.
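
A minimal usage sketch (illustrative only, not part of this patch) of the
new firmware-page helpers; the RMP state transitions happen inside the
helpers, so callers only deal with regular kernel pointers:

  #include <linux/psp-sev.h>

  static void *example_alloc_fw_page(void)
  {
          /* On success the page is in the firmware state in the RMP table. */
          return snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
  }

  static void example_free_fw_page(void *page)
  {
          /* Reclaims the page back to the shared state, then frees it. */
          snp_free_firmware_page(page);
  }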

Signed-off-by: Brijesh Singh <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 173 +++++++++++++++++++++++++++++++++--
include/linux/psp-sev.h | 11 +++
2 files changed, 178 insertions(+), 6 deletions(-)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 35d76333e120..0dbd99f29b25 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -79,6 +79,14 @@ static void *sev_es_tmr;
#define NV_LENGTH (32 * 1024)
static void *sev_init_ex_buffer;

+/* When SEV-SNP is enabled the TMR needs to be 2MB aligned and 2MB size. */
+#define SEV_SNP_ES_TMR_SIZE (2 * 1024 * 1024)
+
+static size_t sev_es_tmr_size = SEV_ES_TMR_SIZE;
+
+static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret);
+static int sev_do_cmd(int cmd, void *data, int *psp_ret);
+
static inline bool sev_version_greater_or_equal(u8 maj, u8 min)
{
struct sev_device *sev = psp_master->sev_data;
@@ -177,11 +185,161 @@ static int sev_cmd_buffer_len(int cmd)
return 0;
}

+static void snp_leak_pages(unsigned long pfn, unsigned int npages)
+{
+ WARN(1, "psc failed, pfn 0x%lx pages %d (leaking)\n", pfn, npages);
+ while (npages--) {
+ memory_failure(pfn, 0);
+ dump_rmpentry(pfn);
+ pfn++;
+ }
+}
+
+static int snp_reclaim_pages(unsigned long pfn, unsigned int npages, bool locked)
+{
+ struct sev_data_snp_page_reclaim data;
+ int ret, err, i, n = 0;
+
+ for (i = 0; i < npages; i++) {
+ memset(&data, 0, sizeof(data));
+ data.paddr = pfn << PAGE_SHIFT;
+
+ if (locked)
+ ret = __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
+ else
+ ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
+ if (ret)
+ goto cleanup;
+
+ ret = rmp_make_shared(pfn, PG_LEVEL_4K);
+ if (ret)
+ goto cleanup;
+
+ pfn++;
+ n++;
+ }
+
+ return 0;
+
+cleanup:
+ /*
+ * If we failed to reclaim the page then it is no longer safe to
+ * be released, leak it.
+ */
+ snp_leak_pages(pfn, npages - n);
+ return ret;
+}
+
+static inline int rmp_make_firmware(unsigned long pfn, int level)
+{
+ return rmp_make_private(pfn, 0, level, 0, true);
+}
+
+static int snp_set_rmp_state(unsigned long paddr, unsigned int npages, bool to_fw, bool locked,
+ bool need_reclaim)
+{
+ unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT; /* C-bit may be set in the paddr */
+ int rc, n = 0, i;
+
+ for (i = 0; i < npages; i++) {
+ if (to_fw)
+ rc = rmp_make_firmware(pfn, PG_LEVEL_4K);
+ else
+ rc = need_reclaim ? snp_reclaim_pages(pfn, 1, locked) :
+ rmp_make_shared(pfn, PG_LEVEL_4K);
+ if (rc)
+ goto cleanup;
+
+ pfn++;
+ n++;
+ }
+
+ return 0;
+
+cleanup:
+ /* Try unrolling the firmware state changes */
+ if (to_fw) {
+ /*
+ * Reclaim the pages which were already changed to the
+ * firmware state.
+ */
+ snp_reclaim_pages(paddr >> PAGE_SHIFT, n, locked);
+
+ return rc;
+ }
+
+ /*
+ * If we failed to change the page state to shared, then it is not safe
+ * to release the page back to the system; leak it.
+ */
+ snp_leak_pages(pfn, npages - n);
+
+ return rc;
+}
+
+static struct page *__snp_alloc_firmware_pages(gfp_t gfp_mask, int order, bool locked)
+{
+ unsigned long npages = 1ul << order, paddr;
+ struct sev_device *sev;
+ struct page *page;
+
+ if (!psp_master || !psp_master->sev_data)
+ return NULL;
+
+ page = alloc_pages(gfp_mask, order);
+ if (!page)
+ return NULL;
+
+ /* If SEV-SNP is initialized then add the page to the RMP table. */
+ sev = psp_master->sev_data;
+ if (!sev->snp_inited)
+ return page;
+
+ paddr = __pa((unsigned long)page_address(page));
+ if (snp_set_rmp_state(paddr, npages, true, locked, false))
+ return NULL;
+
+ return page;
+}
+
+void *snp_alloc_firmware_page(gfp_t gfp_mask)
+{
+ struct page *page;
+
+ page = __snp_alloc_firmware_pages(gfp_mask, 0, false);
+
+ return page ? page_address(page) : NULL;
+}
+EXPORT_SYMBOL_GPL(snp_alloc_firmware_page);
+
+static void __snp_free_firmware_pages(struct page *page, int order, bool locked)
+{
+ unsigned long paddr, npages = 1ul << order;
+
+ if (!page)
+ return;
+
+ paddr = __pa((unsigned long)page_address(page));
+ if (snp_set_rmp_state(paddr, npages, false, locked, true))
+ return;
+
+ __free_pages(page, order);
+}
+
+void snp_free_firmware_page(void *addr)
+{
+ if (!addr)
+ return;
+
+ __snp_free_firmware_pages(virt_to_page(addr), 0, false);
+}
+EXPORT_SYMBOL(snp_free_firmware_page);
+
static void *sev_fw_alloc(unsigned long len)
{
struct page *page;

- page = alloc_pages(GFP_KERNEL, get_order(len));
+ page = __snp_alloc_firmware_pages(GFP_KERNEL, get_order(len), false);
if (!page)
return NULL;

@@ -393,7 +551,7 @@ static int __sev_init_locked(int *error)
data.tmr_address = __pa(sev_es_tmr);

data.flags |= SEV_INIT_FLAGS_SEV_ES;
- data.tmr_len = SEV_ES_TMR_SIZE;
+ data.tmr_len = sev_es_tmr_size;
}

return __sev_do_cmd_locked(SEV_CMD_INIT, &data, error);
@@ -421,7 +579,7 @@ static int __sev_init_ex_locked(int *error)
data.tmr_address = __pa(sev_es_tmr);

data.flags |= SEV_INIT_FLAGS_SEV_ES;
- data.tmr_len = SEV_ES_TMR_SIZE;
+ data.tmr_len = sev_es_tmr_size;
}

return __sev_do_cmd_locked(SEV_CMD_INIT_EX, &data, error);
@@ -818,6 +976,8 @@ static int __sev_snp_init_locked(int *error)
sev->snp_inited = true;
dev_dbg(sev->dev, "SEV-SNP firmware initialized\n");

+ sev_es_tmr_size = SEV_SNP_ES_TMR_SIZE;
+
return rc;
}

@@ -1341,8 +1501,9 @@ static void sev_firmware_shutdown(struct sev_device *sev)
/* The TMR area was encrypted, flush it from the cache */
wbinvd_on_all_cpus();

- free_pages((unsigned long)sev_es_tmr,
- get_order(SEV_ES_TMR_SIZE));
+ __snp_free_firmware_pages(virt_to_page(sev_es_tmr),
+ get_order(sev_es_tmr_size),
+ false);
sev_es_tmr = NULL;
}

@@ -1430,7 +1591,7 @@ void sev_pci_init(void)
}

/* Obtain the TMR memory area for SEV-ES use */
- sev_es_tmr = sev_fw_alloc(SEV_ES_TMR_SIZE);
+ sev_es_tmr = sev_fw_alloc(sev_es_tmr_size);
if (!sev_es_tmr)
dev_warn(sev->dev,
"SEV: TMR allocation failed, SEV-ES support unavailable\n");
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index 9f921d221b75..a3bb792bb842 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -12,6 +12,8 @@
#ifndef __PSP_SEV_H__
#define __PSP_SEV_H__

+#include <linux/sev.h>
+
#include <uapi/linux/psp-sev.h>

#ifdef CONFIG_X86
@@ -940,6 +942,8 @@ int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error);
int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error);

void *psp_copy_user_blob(u64 uaddr, u32 len);
+void *snp_alloc_firmware_page(gfp_t mask);
+void snp_free_firmware_page(void *addr);

#else /* !CONFIG_CRYPTO_DEV_SP_PSP */

@@ -981,6 +985,13 @@ static inline int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *erro
return -ENODEV;
}

+static inline void *snp_alloc_firmware_page(gfp_t mask)
+{
+ return NULL;
+}
+
+static inline void snp_free_firmware_page(void *addr) { }
+
#endif /* CONFIG_CRYPTO_DEV_SP_PSP */

#endif /* __PSP_SEV_H__ */
--
2.25.1

2022-06-20 23:09:44

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 15/49] crypto: ccp: Handle the legacy SEV command when SNP is enabled

From: Brijesh Singh <[email protected]>

The behavior of the SEV-legacy commands is altered when the SNP firmware
is in the INIT state. When SNP is in the INIT state, any memory that the
firmware writes to on behalf of an SEV-legacy command must be in the
firmware state in the RMP table before the command is issued.

A command buffer may contain a system physical address that the firmware
may write to. There are two cases that need to be handled:

1) the system physical address points to guest memory
2) the system physical address points to host memory

To handle case #1, change the page state to firmware in the RMP table
before issuing the command and restore the state to shared after the
command completes.

For case #2, use a bounce buffer to complete the request.
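
A simplified sketch of the resulting flow (illustrative only; in this patch
the real sequence is driven from inside __sev_do_cmd_locked()):

  static int example_legacy_cmd_flow(int cmd, void *cmd_buf, u64 *paddr, u32 len,
                                     bool guest, struct snp_host_map *map)
  {
          int rc, fw_err;

          /*
           * Case #1 (guest memory): flip the target pages to the firmware
           * state in the RMP table.
           * Case #2 (host memory): redirect *paddr to a pre-allocated bounce
           * buffer that is already safe for the firmware to write to.
           */
          rc = map_firmware_writeable(paddr, len, guest, map);
          if (rc)
                  return rc;

          rc = __sev_do_cmd_locked(cmd, cmd_buf, &fw_err);

          /* Restore the RMP state or copy the bounce buffer back to the caller. */
          unmap_firmware_writeable(paddr, len, guest, map);

          return rc;
  }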

Signed-off-by: Brijesh Singh <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 346 ++++++++++++++++++++++++++++++++++-
drivers/crypto/ccp/sev-dev.h | 12 ++
2 files changed, 348 insertions(+), 10 deletions(-)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 0dbd99f29b25..75f5c4ed9ac3 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -441,12 +441,295 @@ static void sev_write_init_ex_file_if_required(int cmd_id)
sev_write_init_ex_file();
}

+static int alloc_snp_host_map(struct sev_device *sev)
+{
+ struct page *page;
+ int i;
+
+ for (i = 0; i < MAX_SNP_HOST_MAP_BUFS; i++) {
+ struct snp_host_map *map = &sev->snp_host_map[i];
+
+ memset(map, 0, sizeof(*map));
+
+ page = alloc_pages(GFP_KERNEL_ACCOUNT, get_order(SEV_FW_BLOB_MAX_SIZE));
+ if (!page)
+ return -ENOMEM;
+
+ map->host = page_address(page);
+ }
+
+ return 0;
+}
+
+static void free_snp_host_map(struct sev_device *sev)
+{
+ int i;
+
+ for (i = 0; i < MAX_SNP_HOST_MAP_BUFS; i++) {
+ struct snp_host_map *map = &sev->snp_host_map[i];
+
+ if (map->host) {
+ __free_pages(virt_to_page(map->host), get_order(SEV_FW_BLOB_MAX_SIZE));
+ memset(map, 0, sizeof(*map));
+ }
+ }
+}
+
+static int map_firmware_writeable(u64 *paddr, u32 len, bool guest, struct snp_host_map *map)
+{
+ unsigned int npages = PAGE_ALIGN(len) >> PAGE_SHIFT;
+
+ map->active = false;
+
+ if (!paddr || !len)
+ return 0;
+
+ map->paddr = *paddr;
+ map->len = len;
+
+ /* If paddr points to guest memory then change the page state to firmware. */
+ if (guest) {
+ if (snp_set_rmp_state(*paddr, npages, true, true, false))
+ return -EFAULT;
+
+ goto done;
+ }
+
+ if (!map->host)
+ return -ENOMEM;
+
+ /* Check if the pre-allocated buffer can be used to fulfill the request. */
+ if (len > SEV_FW_BLOB_MAX_SIZE)
+ return -EINVAL;
+
+ /* Transition the pre-allocated buffer to the firmware state. */
+ if (snp_set_rmp_state(__pa(map->host), npages, true, true, false))
+ return -EFAULT;
+
+ /* Set the paddr to use pre-allocated firmware buffer */
+ *paddr = __psp_pa(map->host);
+
+done:
+ map->active = true;
+ return 0;
+}
+
+static int unmap_firmware_writeable(u64 *paddr, u32 len, bool guest, struct snp_host_map *map)
+{
+ unsigned int npages = PAGE_ALIGN(len) >> PAGE_SHIFT;
+
+ if (!map->active)
+ return 0;
+
+ /* If paddr points to guest memory then restore the page state to hypervisor. */
+ if (guest) {
+ if (snp_set_rmp_state(*paddr, npages, false, true, true))
+ return -EFAULT;
+
+ goto done;
+ }
+
+ /*
+ * Transition the pre-allocated buffer to hypervisor state before the access.
+ *
+ * This is because while changing the page state to firmware, the kernel unmaps
+ * the pages from the direct map, and to restore the direct map we must
+ * transition the pages to shared state.
+ */
+ if (snp_set_rmp_state(__pa(map->host), npages, false, true, true))
+ return -EFAULT;
+
+ /* Copy the response data from the firmware buffer to the caller's buffer. */
+ memcpy(__va(__sme_clr(map->paddr)), map->host, min_t(size_t, len, map->len));
+ *paddr = map->paddr;
+
+done:
+ map->active = false;
+ return 0;
+}
+
+static bool sev_legacy_cmd_buf_writable(int cmd)
+{
+ switch (cmd) {
+ case SEV_CMD_PLATFORM_STATUS:
+ case SEV_CMD_GUEST_STATUS:
+ case SEV_CMD_LAUNCH_START:
+ case SEV_CMD_RECEIVE_START:
+ case SEV_CMD_LAUNCH_MEASURE:
+ case SEV_CMD_SEND_START:
+ case SEV_CMD_SEND_UPDATE_DATA:
+ case SEV_CMD_SEND_UPDATE_VMSA:
+ case SEV_CMD_PEK_CSR:
+ case SEV_CMD_PDH_CERT_EXPORT:
+ case SEV_CMD_GET_ID:
+ case SEV_CMD_ATTESTATION_REPORT:
+ return true;
+ default:
+ return false;
+ }
+}
+
+#define prep_buffer(name, addr, len, guest, map) \
+ func(&((typeof(name *))cmd_buf)->addr, ((typeof(name *))cmd_buf)->len, guest, map)
+
+static int __snp_cmd_buf_copy(int cmd, void *cmd_buf, bool to_fw, int fw_err)
+{
+ int (*func)(u64 *paddr, u32 len, bool guest, struct snp_host_map *map);
+ struct sev_device *sev = psp_master->sev_data;
+ bool from_fw = !to_fw;
+
+ /*
+ * After the command is completed, change the command buffer memory to
+ * hypervisor state.
+ *
+ * The immutable bit is automatically cleared by the firmware, so
+ * there is no need to reclaim the page.
+ */
+ if (from_fw && sev_legacy_cmd_buf_writable(cmd)) {
+ if (snp_set_rmp_state(__pa(cmd_buf), 1, false, true, false))
+ return -EFAULT;
+
+ /* No need to go further if firmware failed to execute command. */
+ if (fw_err)
+ return 0;
+ }
+
+ if (to_fw)
+ func = map_firmware_writeable;
+ else
+ func = unmap_firmware_writeable;
+
+ /*
+ * A command buffer may contain a system physical address. If the address
+ * points to host memory then use an intermediate firmware page, otherwise
+ * change the page state in the RMP table.
+ */
+ switch (cmd) {
+ case SEV_CMD_PDH_CERT_EXPORT:
+ if (prep_buffer(struct sev_data_pdh_cert_export, pdh_cert_address,
+ pdh_cert_len, false, &sev->snp_host_map[0]))
+ goto err;
+ if (prep_buffer(struct sev_data_pdh_cert_export, cert_chain_address,
+ cert_chain_len, false, &sev->snp_host_map[1]))
+ goto err;
+ break;
+ case SEV_CMD_GET_ID:
+ if (prep_buffer(struct sev_data_get_id, address, len,
+ false, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_PEK_CSR:
+ if (prep_buffer(struct sev_data_pek_csr, address, len,
+ false, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_LAUNCH_UPDATE_DATA:
+ if (prep_buffer(struct sev_data_launch_update_data, address, len,
+ true, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_LAUNCH_UPDATE_VMSA:
+ if (prep_buffer(struct sev_data_launch_update_vmsa, address, len,
+ true, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_LAUNCH_MEASURE:
+ if (prep_buffer(struct sev_data_launch_measure, address, len,
+ false, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_LAUNCH_UPDATE_SECRET:
+ if (prep_buffer(struct sev_data_launch_secret, guest_address, guest_len,
+ true, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_DBG_DECRYPT:
+ if (prep_buffer(struct sev_data_dbg, dst_addr, len, false,
+ &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_DBG_ENCRYPT:
+ if (prep_buffer(struct sev_data_dbg, dst_addr, len, true,
+ &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_ATTESTATION_REPORT:
+ if (prep_buffer(struct sev_data_attestation_report, address, len,
+ false, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_SEND_START:
+ if (prep_buffer(struct sev_data_send_start, session_address,
+ session_len, false, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_SEND_UPDATE_DATA:
+ if (prep_buffer(struct sev_data_send_update_data, hdr_address, hdr_len,
+ false, &sev->snp_host_map[0]))
+ goto err;
+ if (prep_buffer(struct sev_data_send_update_data, trans_address,
+ trans_len, false, &sev->snp_host_map[1]))
+ goto err;
+ break;
+ case SEV_CMD_SEND_UPDATE_VMSA:
+ if (prep_buffer(struct sev_data_send_update_vmsa, hdr_address, hdr_len,
+ false, &sev->snp_host_map[0]))
+ goto err;
+ if (prep_buffer(struct sev_data_send_update_vmsa, trans_address,
+ trans_len, false, &sev->snp_host_map[1]))
+ goto err;
+ break;
+ case SEV_CMD_RECEIVE_UPDATE_DATA:
+ if (prep_buffer(struct sev_data_receive_update_data, guest_address,
+ guest_len, true, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ case SEV_CMD_RECEIVE_UPDATE_VMSA:
+ if (prep_buffer(struct sev_data_receive_update_vmsa, guest_address,
+ guest_len, true, &sev->snp_host_map[0]))
+ goto err;
+ break;
+ default:
+ break;
+ }
+
+ /* The command buffer needs to be in the firmware state. */
+ if (to_fw && sev_legacy_cmd_buf_writable(cmd)) {
+ if (snp_set_rmp_state(__pa(cmd_buf), 1, true, true, false))
+ return -EFAULT;
+ }
+
+ return 0;
+
+err:
+ return -EINVAL;
+}
+
+static inline bool need_firmware_copy(int cmd)
+{
+ struct sev_device *sev = psp_master->sev_data;
+
+ /* After SNP is INIT'ed, the behavior of legacy SEV commands is changed. */
+ return ((cmd < SEV_CMD_SNP_INIT) && sev->snp_inited) ? true : false;
+}
+
+static int snp_aware_copy_to_firmware(int cmd, void *data)
+{
+ return __snp_cmd_buf_copy(cmd, data, true, 0);
+}
+
+static int snp_aware_copy_from_firmware(int cmd, void *data, int fw_err)
+{
+ return __snp_cmd_buf_copy(cmd, data, false, fw_err);
+}
+
static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
{
struct psp_device *psp = psp_master;
struct sev_device *sev;
unsigned int phys_lsb, phys_msb;
unsigned int reg, ret = 0;
+ void *cmd_buf;
int buf_len;

if (!psp || !psp->sev_data)
@@ -466,12 +749,28 @@ static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
* work for some memory, e.g. vmalloc'd addresses, and @data may not be
* physically contiguous.
*/
- if (data)
- memcpy(sev->cmd_buf, data, buf_len);
+ if (data) {
+ if (sev->cmd_buf_active > 2)
+ return -EBUSY;
+
+ cmd_buf = sev->cmd_buf_active ? sev->cmd_buf_backup : sev->cmd_buf;
+
+ memcpy(cmd_buf, data, buf_len);
+ sev->cmd_buf_active++;
+
+ /*
+ * The behavior of the SEV-legacy commands is altered when the
+ * SNP firmware is in the INIT state.
+ */
+ if (need_firmware_copy(cmd) && snp_aware_copy_to_firmware(cmd, sev->cmd_buf))
+ return -EFAULT;
+ } else {
+ cmd_buf = sev->cmd_buf;
+ }

/* Get the physical address of the command buffer */
- phys_lsb = data ? lower_32_bits(__psp_pa(sev->cmd_buf)) : 0;
- phys_msb = data ? upper_32_bits(__psp_pa(sev->cmd_buf)) : 0;
+ phys_lsb = data ? lower_32_bits(__psp_pa(cmd_buf)) : 0;
+ phys_msb = data ? upper_32_bits(__psp_pa(cmd_buf)) : 0;

dev_dbg(sev->dev, "sev command id %#x buffer 0x%08x%08x timeout %us\n",
cmd, phys_msb, phys_lsb, psp_timeout);
@@ -514,15 +813,24 @@ static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
sev_write_init_ex_file_if_required(cmd);
}

- print_hex_dump_debug("(out): ", DUMP_PREFIX_OFFSET, 16, 2, data,
- buf_len, false);
-
/*
* Copy potential output from the PSP back to data. Do this even on
* failure in case the caller wants to glean something from the error.
*/
- if (data)
- memcpy(data, sev->cmd_buf, buf_len);
+ if (data) {
+ /*
+ * Restore the page state after the command completes.
+ */
+ if (need_firmware_copy(cmd) &&
+ snp_aware_copy_from_firmware(cmd, cmd_buf, ret))
+ return -EFAULT;
+
+ memcpy(data, cmd_buf, buf_len);
+ sev->cmd_buf_active--;
+ }
+
+ print_hex_dump_debug("(out): ", DUMP_PREFIX_OFFSET, 16, 2, data,
+ buf_len, false);

return ret;
}
@@ -1451,10 +1759,12 @@ int sev_dev_init(struct psp_device *psp)
if (!sev)
goto e_err;

- sev->cmd_buf = (void *)devm_get_free_pages(dev, GFP_KERNEL, 0);
+ sev->cmd_buf = (void *)devm_get_free_pages(dev, GFP_KERNEL, 1);
if (!sev->cmd_buf)
goto e_sev;

+ sev->cmd_buf_backup = (uint8_t *)sev->cmd_buf + PAGE_SIZE;
+
psp->sev_data = sev;

sev->dev = dev;
@@ -1513,6 +1823,12 @@ static void sev_firmware_shutdown(struct sev_device *sev)
sev_init_ex_buffer = NULL;
}

+ /*
+ * The host map needs to clear the immutable bit, so it must be freed before the
+ * SNP firmware shutdown.
+ */
+ free_snp_host_map(sev);
+
sev_snp_shutdown(NULL);
}

@@ -1588,6 +1904,14 @@ void sev_pci_init(void)
dev_err(sev->dev, "SEV-SNP: failed to INIT error %#x\n", error);
}
}
+
+ /*
+ * Allocate the intermediate buffers used for the legacy command handling.
+ */
+ if (alloc_snp_host_map(sev)) {
+ dev_notice(sev->dev, "Failed to alloc host map (disabling legacy SEV)\n");
+ goto skip_legacy;
+ }
}

/* Obtain the TMR memory area for SEV-ES use */
@@ -1605,12 +1929,14 @@ void sev_pci_init(void)
dev_err(sev->dev, "SEV: failed to INIT error %#x, rc %d\n",
error, rc);

+skip_legacy:
dev_info(sev->dev, "SEV%s API:%d.%d build:%d\n", sev->snp_inited ?
"-SNP" : "", sev->api_major, sev->api_minor, sev->build);

return;

err:
+ free_snp_host_map(sev);
psp_master->sev_data = NULL;
}

diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
index 186ad20cbd24..fe5d7a3ebace 100644
--- a/drivers/crypto/ccp/sev-dev.h
+++ b/drivers/crypto/ccp/sev-dev.h
@@ -29,11 +29,20 @@
#define SEV_CMDRESP_CMD_SHIFT 16
#define SEV_CMDRESP_IOC BIT(0)

+#define MAX_SNP_HOST_MAP_BUFS 2
+
struct sev_misc_dev {
struct kref refcount;
struct miscdevice misc;
};

+struct snp_host_map {
+ u64 paddr;
+ u32 len;
+ void *host;
+ bool active;
+};
+
struct sev_device {
struct device *dev;
struct psp_device *psp;
@@ -52,8 +61,11 @@ struct sev_device {
u8 build;

void *cmd_buf;
+ void *cmd_buf_backup;
+ int cmd_buf_active;

bool snp_inited;
+ struct snp_host_map snp_host_map[MAX_SNP_HOST_MAP_BUFS];
};

int sev_dev_init(struct psp_device *psp);
--
2.25.1

2022-06-20 23:09:44

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 18/49] crypto: ccp: Provide APIs to query extended attestation report

From: Brijesh Singh <[email protected]>

Version 2 of the GHCB specification defines a VMGEXIT that is used to get
the extended attestation report. The extended attestation report includes
the certificate blobs provided through the SNP_SET_EXT_CONFIG.

The snp_guest_ext_guest_request() will be used by the hypervisor to get
the extended attestation report. See the GHCB specification for more
details.
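
A hedged usage sketch (not part of this patch) of how a caller might consume
the new API; 'data' is a prepared struct sev_data_snp_guest_request and
'vaddr'/'npages' describe the buffer reserved for the certificate blob:

  static int example_ext_guest_request(struct sev_data_snp_guest_request *data,
                                       unsigned long vaddr, unsigned long npages)
  {
          unsigned long fw_err = 0;
          int rc;

          rc = snp_guest_ext_guest_request(data, vaddr, &npages, &fw_err);
          if (rc && fw_err == SNP_GUEST_REQ_INVALID_LEN) {
                  /*
                   * The buffer was too small; 'npages' now holds the number
                   * of pages required for the certificate blob.
                   */
                  return -ENOSPC;
          }

          return rc;
  }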

Signed-off-by: Brijesh Singh <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 43 ++++++++++++++++++++++++++++++++++++
include/linux/psp-sev.h | 24 ++++++++++++++++++++
2 files changed, 67 insertions(+)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 97b479d5aa86..f6306b820b86 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -25,6 +25,7 @@
#include <linux/fs.h>

#include <asm/smp.h>
+#include <asm/sev.h>

#include "psp-dev.h"
#include "sev-dev.h"
@@ -1857,6 +1858,48 @@ int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
}
EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt);

+int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
+ unsigned long vaddr, unsigned long *npages, unsigned long *fw_err)
+{
+ unsigned long expected_npages;
+ struct sev_device *sev;
+ int rc;
+
+ if (!psp_master || !psp_master->sev_data)
+ return -ENODEV;
+
+ sev = psp_master->sev_data;
+
+ if (!sev->snp_inited)
+ return -EINVAL;
+
+ /*
+ * Check if there is enough space to copy the certificate chain. Otherwise
+ * return the error code defined in the GHCB specification.
+ */
+ expected_npages = sev->snp_certs_len >> PAGE_SHIFT;
+ if (*npages < expected_npages) {
+ *npages = expected_npages;
+ *fw_err = SNP_GUEST_REQ_INVALID_LEN;
+ return -EINVAL;
+ }
+
+ rc = sev_do_cmd(SEV_CMD_SNP_GUEST_REQUEST, data, (int *)fw_err);
+ if (rc)
+ return rc;
+
+ /* Copy the certificate blob */
+ if (sev->snp_certs_data) {
+ *npages = expected_npages;
+ memcpy((void *)vaddr, sev->snp_certs_data, *npages << PAGE_SHIFT);
+ } else {
+ *npages = 0;
+ }
+
+ return rc;
+}
+EXPORT_SYMBOL_GPL(snp_guest_ext_guest_request);
+
static void sev_exit(struct kref *ref)
{
misc_deregister(&misc_dev->misc);
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index a3bb792bb842..cd37ccd1fa1f 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -945,6 +945,23 @@ void *psp_copy_user_blob(u64 uaddr, u32 len);
void *snp_alloc_firmware_page(gfp_t mask);
void snp_free_firmware_page(void *addr);

+/**
+ * snp_guest_ext_guest_request - perform the SNP extended guest request command
+ * defined in the GHCB specification.
+ *
+ * @data: the input guest request structure
+ * @vaddr: address where the certificate blob needs to be copied.
+ * @npages: number of pages for the certificate blob.
+ * If the specified page count is less than the certificate blob size, then the
+ * required page count is returned with error code defined in the GHCB spec.
+ * If the specified page count is more than the certificate blob size, then
+ * page count is updated to reflect the amount of valid data copied in the
+ * vaddr.
+ */
+int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
+ unsigned long vaddr, unsigned long *npages,
+ unsigned long *error);
+
#else /* !CONFIG_CRYPTO_DEV_SP_PSP */

static inline int
@@ -992,6 +1009,13 @@ static inline void *snp_alloc_firmware_page(gfp_t mask)

static inline void snp_free_firmware_page(void *addr) { }

+static inline int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
+ unsigned long vaddr, unsigned long *n,
+ unsigned long *error)
+{
+ return -ENODEV;
+}
+
#endif /* CONFIG_CRYPTO_DEV_SP_PSP */

#endif /* __PSP_SEV_H__ */
--
2.25.1

2022-06-20 23:11:12

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 16/49] crypto: ccp: Add the SNP_PLATFORM_STATUS command

From: Brijesh Singh <[email protected]>

The command can be used by userspace to query the SNP platform status
report. See the SEV-SNP spec for more details.
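
A hypothetical userspace sketch (not part of this patch) of issuing the
command through the existing /dev/sev ioctl; the sev_user_data_snp_status
field names are assumed from the uapi header added earlier in this series:

  #include <stdio.h>
  #include <string.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/psp-sev.h>

  int main(void)
  {
          struct sev_user_data_snp_status status;
          struct sev_issue_cmd cmd;
          int fd, rc;

          fd = open("/dev/sev", O_RDWR);
          if (fd < 0)
                  return 1;

          memset(&status, 0, sizeof(status));
          memset(&cmd, 0, sizeof(cmd));
          cmd.cmd = SNP_PLATFORM_STATUS;
          cmd.data = (unsigned long)&status;

          rc = ioctl(fd, SEV_ISSUE_CMD, &cmd);
          if (rc)
                  fprintf(stderr, "SNP_PLATFORM_STATUS failed, fw error %#x\n", cmd.error);
          else
                  printf("SNP API %u.%u\n", status.api_major, status.api_minor);

          close(fd);
          return rc ? 1 : 0;
  }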

Signed-off-by: Brijesh Singh <[email protected]>
---
Documentation/virt/coco/sevguest.rst | 27 +++++++++++++++++
drivers/crypto/ccp/sev-dev.c | 45 ++++++++++++++++++++++++++++
include/uapi/linux/psp-sev.h | 1 +
3 files changed, 73 insertions(+)

diff --git a/Documentation/virt/coco/sevguest.rst b/Documentation/virt/coco/sevguest.rst
index bf593e88cfd9..11ea67c944df 100644
--- a/Documentation/virt/coco/sevguest.rst
+++ b/Documentation/virt/coco/sevguest.rst
@@ -61,6 +61,22 @@ counter (e.g. counter overflow), then -EIO will be returned.
__u64 fw_err;
};

+The host ioctls should be issued to the /dev/sev device. The ioctl accepts a
+command id and a command input structure.
+
+::
+ struct sev_issue_cmd {
+ /* Command ID */
+ __u32 cmd;
+
+ /* Command request structure */
+ __u64 data;
+
+ /* firmware error code on failure (see psp-sev.h) */
+ __u32 error;
+ };
+
+
2.1 SNP_GET_REPORT
------------------

@@ -118,6 +134,17 @@ be updated with the expected value.

See GHCB specification for further detail on how to parse the certificate blob.

+2.4 SNP_PLATFORM_STATUS
+-----------------------
+:Technology: sev-snp
+:Type: hypervisor ioctl cmd
+:Parameters (in): struct sev_data_snp_platform_status
+:Returns (out): 0 on success, -negative on error
+
+The SNP_PLATFORM_STATUS command is used to query the SNP platform status. The
+status includes API major, minor version and more. See the SEV-SNP
+specification for further details.
+
3. SEV-SNP CPUID Enforcement
============================

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 75f5c4ed9ac3..b9b6fab31a82 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1574,6 +1574,48 @@ static int sev_ioctl_do_pdh_export(struct sev_issue_cmd *argp, bool writable)
return ret;
}

+static int sev_ioctl_snp_platform_status(struct sev_issue_cmd *argp)
+{
+ struct sev_device *sev = psp_master->sev_data;
+ struct sev_data_snp_platform_status_buf buf;
+ struct page *status_page;
+ void *data;
+ int ret;
+
+ if (!sev->snp_inited || !argp->data)
+ return -EINVAL;
+
+ status_page = alloc_page(GFP_KERNEL_ACCOUNT);
+ if (!status_page)
+ return -ENOMEM;
+
+ data = page_address(status_page);
+ if (snp_set_rmp_state(__pa(data), 1, true, true, false)) {
+ __free_pages(status_page, 0);
+ return -EFAULT;
+ }
+
+ buf.status_paddr = __psp_pa(data);
+ ret = __sev_do_cmd_locked(SEV_CMD_SNP_PLATFORM_STATUS, &buf, &argp->error);
+
+ /* Change the page state before accessing it */
+ if (snp_set_rmp_state(__pa(data), 1, false, true, true)) {
+ snp_leak_pages(__pa(data) >> PAGE_SHIFT, 1);
+ return -EFAULT;
+ }
+
+ if (ret)
+ goto cleanup;
+
+ if (copy_to_user((void __user *)argp->data, data,
+ sizeof(struct sev_user_data_snp_status)))
+ ret = -EFAULT;
+
+cleanup:
+ __free_pages(status_page, 0);
+ return ret;
+}
+
static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
{
void __user *argp = (void __user *)arg;
@@ -1625,6 +1667,9 @@ static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
case SEV_GET_ID2:
ret = sev_ioctl_do_get_id2(&input);
break;
+ case SNP_PLATFORM_STATUS:
+ ret = sev_ioctl_snp_platform_status(&input);
+ break;
default:
ret = -EINVAL;
goto out;
diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
index bed65a891223..ffd60e8b0a31 100644
--- a/include/uapi/linux/psp-sev.h
+++ b/include/uapi/linux/psp-sev.h
@@ -28,6 +28,7 @@ enum {
SEV_PEK_CERT_IMPORT,
SEV_GET_ID, /* This command is deprecated, use SEV_GET_ID2 */
SEV_GET_ID2,
+ SNP_PLATFORM_STATUS,

SEV_MAX,
};
--
2.25.1

2022-06-20 23:11:23

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 19/49] KVM: SVM: Add support to handle AP reset MSR protocol

From: Tom Lendacky <[email protected]>

Add support for AP Reset Hold being invoked using the GHCB MSR protocol,
available in version 2 of the GHCB specification.
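
For reference, an illustrative helper (not part of this patch) showing how
the AP Reset Hold MSR protocol response is composed from the constants added
below; GHCBData[11:0] carries the response code and GHCBData[63:12] the
result, which becomes non-zero once a SIPI has been delivered:

  static inline u64 example_ap_reset_hold_msr_resp(u64 result)
  {
          /* Response code in bits [11:0], result value in bits [63:12]. */
          return GHCB_MSR_AP_RESET_HOLD_RESP |
                 ((result & GHCB_MSR_AP_RESET_HOLD_RESULT_MASK) <<
                  GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
  }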

Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/sev-common.h | 2 ++
arch/x86/kvm/svm/sev.c | 56 ++++++++++++++++++++++++++-----
arch/x86/kvm/svm/svm.h | 1 +
3 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index b8357d6ecd47..e15548d88f2a 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -56,6 +56,8 @@
/* AP Reset Hold */
#define GHCB_MSR_AP_RESET_HOLD_REQ 0x006
#define GHCB_MSR_AP_RESET_HOLD_RESP 0x007
+#define GHCB_MSR_AP_RESET_HOLD_RESULT_POS 12
+#define GHCB_MSR_AP_RESET_HOLD_RESULT_MASK GENMASK_ULL(51, 0)

/* GHCB GPA Register */
#define GHCB_MSR_REG_GPA_REQ 0x012
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 609471204c6e..a1318236acd2 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -56,6 +56,10 @@ module_param_named(sev_es, sev_es_enabled, bool, 0444);
#define sev_es_enabled false
#endif /* CONFIG_KVM_AMD_SEV */

+#define AP_RESET_HOLD_NONE 0
+#define AP_RESET_HOLD_NAE_EVENT 1
+#define AP_RESET_HOLD_MSR_PROTO 2
+
static u8 sev_enc_bit;
static DECLARE_RWSEM(sev_deactivate_lock);
static DEFINE_MUTEX(sev_bitmap_lock);
@@ -2511,6 +2515,9 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)

void sev_es_unmap_ghcb(struct vcpu_svm *svm)
{
+ /* Clear any indication that the vCPU is in a type of AP Reset Hold */
+ svm->sev_es.ap_reset_hold_type = AP_RESET_HOLD_NONE;
+
if (!svm->sev_es.ghcb)
return;

@@ -2723,6 +2730,22 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
GHCB_MSR_INFO_POS);
break;
}
+ case GHCB_MSR_AP_RESET_HOLD_REQ:
+ svm->sev_es.ap_reset_hold_type = AP_RESET_HOLD_MSR_PROTO;
+ ret = kvm_emulate_ap_reset_hold(&svm->vcpu);
+
+ /*
+ * Preset the result to a non-SIPI return and then only set
+ * the result to non-zero when delivering a SIPI.
+ */
+ set_ghcb_msr_bits(svm, 0,
+ GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
+ GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
+
+ set_ghcb_msr_bits(svm, GHCB_MSR_AP_RESET_HOLD_RESP,
+ GHCB_MSR_INFO_MASK,
+ GHCB_MSR_INFO_POS);
+ break;
case GHCB_MSR_TERM_REQ: {
u64 reason_set, reason_code;

@@ -2823,6 +2846,7 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
ret = svm_invoke_exit_handler(vcpu, SVM_EXIT_IRET);
break;
case SVM_VMGEXIT_AP_HLT_LOOP:
+ svm->sev_es.ap_reset_hold_type = AP_RESET_HOLD_NAE_EVENT;
ret = kvm_emulate_ap_reset_hold(vcpu);
break;
case SVM_VMGEXIT_AP_JUMP_TABLE: {
@@ -2966,13 +2990,29 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
return;
}

- /*
- * Subsequent SIPI: Return from an AP Reset Hold VMGEXIT, where
- * the guest will set the CS and RIP. Set SW_EXIT_INFO_2 to a
- * non-zero value.
- */
- if (!svm->sev_es.ghcb)
- return;
+ /* Subsequent SIPI */
+ switch (svm->sev_es.ap_reset_hold_type) {
+ case AP_RESET_HOLD_NAE_EVENT:
+ /*
+ * Return from an AP Reset Hold VMGEXIT, where the guest will
+ * set the CS and RIP. Set SW_EXIT_INFO_2 to a non-zero value.
+ */
+ ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
+ break;
+ case AP_RESET_HOLD_MSR_PROTO:
+ /*
+ * Return from an AP Reset Hold VMGEXIT, where the guest will
+ * set the CS and RIP. Set GHCB data field to a non-zero value.
+ */
+ set_ghcb_msr_bits(svm, 1,
+ GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
+ GHCB_MSR_AP_RESET_HOLD_RESULT_POS);

- ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
+ set_ghcb_msr_bits(svm, GHCB_MSR_AP_RESET_HOLD_RESP,
+ GHCB_MSR_INFO_MASK,
+ GHCB_MSR_INFO_POS);
+ break;
+ default:
+ break;
+ }
}
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index bb9ec9139af3..9f7eb1f18893 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -186,6 +186,7 @@ struct vcpu_sev_es_state {
struct ghcb *ghcb;
struct kvm_host_map ghcb_map;
bool received_first_sipi;
+ unsigned int ap_reset_hold_type;

/* SEV-ES scratch area support */
void *ghcb_sa;
--
2.25.1

2022-06-20 23:11:48

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 17/49] crypto: ccp: Add the SNP_{SET,GET}_EXT_CONFIG command

From: Brijesh Singh <[email protected]>

The SEV-SNP firmware provides the SNP_CONFIG command used to set the
system-wide configuration value for SNP guests. The information includes
the TCB version string to be reported in guest attestation reports.

Version 2 of the GHCB specification adds an NAE (SNP extended guest
request) that a guest can use to query the reports that include additional
certificates.

In both cases, userspace-provided additional data is included in the
attestation reports. The userspace will use the SNP_SET_EXT_CONFIG
command to give the certificate blob and the reported TCB version string
at once. Note that the specification defines the certificate blob with a
specific GUID format; the userspace is responsible for building the
proper certificate blob. The ioctl treats it as an opaque blob.

While it is not defined in the spec, also add an SNP_GET_EXT_CONFIG
command that can be used to obtain the data programmed through
SNP_SET_EXT_CONFIG.
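
A hypothetical userspace sketch (not part of this patch) installing a new
certificate blob via SNP_SET_EXT_CONFIG while leaving the reported TCB
untouched (config_address == 0); the blob contents are placeholders and the
real GUID-table layout is up to the caller, as noted above:

  #include <stdlib.h>
  #include <string.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/psp-sev.h>

  int main(void)
  {
          struct sev_user_data_ext_snp_config ext;
          struct sev_issue_cmd cmd;
          void *certs;
          int fd, rc;

          certs = aligned_alloc(4096, 4096);      /* certs_len must be page aligned */
          if (!certs)
                  return 1;
          memset(certs, 0, 4096);                 /* caller fills in the GUID table */

          memset(&ext, 0, sizeof(ext));
          ext.config_address = 0;                 /* do not update reported_tcb */
          ext.certs_address = (unsigned long)certs;
          ext.certs_len = 4096;

          memset(&cmd, 0, sizeof(cmd));
          cmd.cmd = SNP_SET_EXT_CONFIG;
          cmd.data = (unsigned long)&ext;

          fd = open("/dev/sev", O_RDWR);
          if (fd < 0) {
                  free(certs);
                  return 1;
          }

          rc = ioctl(fd, SEV_ISSUE_CMD, &cmd);    /* cmd.error holds the fw error on failure */

          close(fd);
          free(certs);
          return rc ? 1 : 0;
  }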

Signed-off-by: Brijesh Singh <[email protected]>
---
Documentation/virt/coco/sevguest.rst | 27 +++++++
drivers/crypto/ccp/sev-dev.c | 115 +++++++++++++++++++++++++++
drivers/crypto/ccp/sev-dev.h | 3 +
include/uapi/linux/psp-sev.h | 17 ++++
4 files changed, 162 insertions(+)

diff --git a/Documentation/virt/coco/sevguest.rst b/Documentation/virt/coco/sevguest.rst
index 11ea67c944df..3014de47e4ce 100644
--- a/Documentation/virt/coco/sevguest.rst
+++ b/Documentation/virt/coco/sevguest.rst
@@ -145,6 +145,33 @@ The SNP_PLATFORM_STATUS command is used to query the SNP platform status. The
status includes API major, minor version and more. See the SEV-SNP
specification for further details.

+2.5 SNP_SET_EXT_CONFIG
+----------------------
+:Technology: sev-snp
+:Type: hypervisor ioctl cmd
+:Parameters (in): struct sev_data_snp_ext_config
+:Returns (out): 0 on success, -negative on error
+
+The SNP_SET_EXT_CONFIG is used to set the system-wide configuration such as
+reported TCB version in the attestation report. The command is similar to
+SNP_CONFIG command defined in the SEV-SNP spec. The main difference is the
+command also accepts an additional certificate blob defined in the GHCB
+specification.
+
+If the certs_address is zero, then the previous certificate blob will be deleted.
+For more information on the certificate blob layout, see the GHCB spec
+(extended guest request message).
+
+2.6 SNP_GET_EXT_CONFIG
+----------------------
+:Technology: sev-snp
+:Type: hypervisor ioctl cmd
+:Parameters (in): struct sev_data_snp_ext_config
+:Returns (out): 0 on success, -negative on error
+
+The SNP_GET_EXT_CONFIG is used to query the system-wide configuration set
+through the SNP_SET_EXT_CONFIG.
+
3. SEV-SNP CPUID Enforcement
============================

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index b9b6fab31a82..97b479d5aa86 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1312,6 +1312,10 @@ static int __sev_snp_shutdown_locked(int *error)
if (!sev->snp_inited)
return 0;

+ /* Free the memory used for caching the certificate data */
+ kfree(sev->snp_certs_data);
+ sev->snp_certs_data = NULL;
+
/* SHUTDOWN requires the DF_FLUSH */
wbinvd_on_all_cpus();
__sev_do_cmd_locked(SEV_CMD_SNP_DF_FLUSH, NULL, NULL);
@@ -1616,6 +1620,111 @@ static int sev_ioctl_snp_platform_status(struct sev_issue_cmd *argp)
return ret;
}

+static int sev_ioctl_snp_get_config(struct sev_issue_cmd *argp)
+{
+ struct sev_device *sev = psp_master->sev_data;
+ struct sev_user_data_ext_snp_config input;
+ int ret;
+
+ if (!sev->snp_inited || !argp->data)
+ return -EINVAL;
+
+ if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
+ return -EFAULT;
+
+ /* Copy the TCB version programmed through the SET_CONFIG to userspace */
+ if (input.config_address) {
+ if (copy_to_user((void __user *)input.config_address,
+ &sev->snp_config, sizeof(struct sev_user_data_snp_config)))
+ return -EFAULT;
+ }
+
+ /* Copy the extended certs programmed through the SNP_SET_CONFIG */
+ if (input.certs_address && sev->snp_certs_data) {
+ if (input.certs_len < sev->snp_certs_len) {
+ /* Return the certs length to userspace */
+ input.certs_len = sev->snp_certs_len;
+
+ ret = -ENOSR;
+ goto e_done;
+ }
+
+ if (copy_to_user((void __user *)input.certs_address,
+ sev->snp_certs_data, sev->snp_certs_len))
+ return -EFAULT;
+ }
+
+ ret = 0;
+
+e_done:
+ if (copy_to_user((void __user *)argp->data, &input, sizeof(input)))
+ ret = -EFAULT;
+
+ return ret;
+}
+
+static int sev_ioctl_snp_set_config(struct sev_issue_cmd *argp, bool writable)
+{
+ struct sev_device *sev = psp_master->sev_data;
+ struct sev_user_data_ext_snp_config input;
+ struct sev_user_data_snp_config config;
+ void *certs = NULL;
+ int ret = 0;
+
+ if (!sev->snp_inited || !argp->data)
+ return -EINVAL;
+
+ if (!writable)
+ return -EPERM;
+
+ if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
+ return -EFAULT;
+
+ /* Copy the certs from userspace */
+ if (input.certs_address) {
+ if (!input.certs_len || !IS_ALIGNED(input.certs_len, PAGE_SIZE))
+ return -EINVAL;
+
+ certs = psp_copy_user_blob(input.certs_address, input.certs_len);
+ if (IS_ERR(certs))
+ return PTR_ERR(certs);
+ }
+
+ /* Issue the PSP command to update the TCB version using the SNP_CONFIG. */
+ if (input.config_address) {
+ if (copy_from_user(&config,
+ (void __user *)input.config_address, sizeof(config))) {
+ ret = -EFAULT;
+ goto e_free;
+ }
+
+ ret = __sev_do_cmd_locked(SEV_CMD_SNP_CONFIG, &config, &argp->error);
+ if (ret)
+ goto e_free;
+
+ memcpy(&sev->snp_config, &config, sizeof(config));
+ }
+
+ /*
+ * If new certs are passed then cache them, else free the old certs.
+ */
+ if (certs) {
+ kfree(sev->snp_certs_data);
+ sev->snp_certs_data = certs;
+ sev->snp_certs_len = input.certs_len;
+ } else {
+ kfree(sev->snp_certs_data);
+ sev->snp_certs_data = NULL;
+ sev->snp_certs_len = 0;
+ }
+
+ return 0;
+
+e_free:
+ kfree(certs);
+ return ret;
+}
+
static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
{
void __user *argp = (void __user *)arg;
@@ -1670,6 +1779,12 @@ static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
case SNP_PLATFORM_STATUS:
ret = sev_ioctl_snp_platform_status(&input);
break;
+ case SNP_SET_EXT_CONFIG:
+ ret = sev_ioctl_snp_set_config(&input, writable);
+ break;
+ case SNP_GET_EXT_CONFIG:
+ ret = sev_ioctl_snp_get_config(&input);
+ break;
default:
ret = -EINVAL;
goto out;
diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
index fe5d7a3ebace..d2fe1706311a 100644
--- a/drivers/crypto/ccp/sev-dev.h
+++ b/drivers/crypto/ccp/sev-dev.h
@@ -66,6 +66,9 @@ struct sev_device {

bool snp_inited;
struct snp_host_map snp_host_map[MAX_SNP_HOST_MAP_BUFS];
+ void *snp_certs_data;
+ u32 snp_certs_len;
+ struct sev_user_data_snp_config snp_config;
};

int sev_dev_init(struct psp_device *psp);
diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
index ffd60e8b0a31..60e7a8d1a18e 100644
--- a/include/uapi/linux/psp-sev.h
+++ b/include/uapi/linux/psp-sev.h
@@ -29,6 +29,8 @@ enum {
SEV_GET_ID, /* This command is deprecated, use SEV_GET_ID2 */
SEV_GET_ID2,
SNP_PLATFORM_STATUS,
+ SNP_SET_EXT_CONFIG,
+ SNP_GET_EXT_CONFIG,

SEV_MAX,
};
@@ -190,6 +192,21 @@ struct sev_user_data_snp_config {
__u8 rsvd[52];
} __packed;

+/**
+ * struct sev_data_snp_ext_config - system wide configuration value for SNP.
+ *
+ * @config_address: address of the struct sev_user_data_snp_config or 0 when
+ * reported_tcb does not need to be updated.
+ * @certs_address: address of extended guest request certificate chain or
+ * 0 when previous certificate should be removed on SNP_SET_EXT_CONFIG.
+ * @certs_len: length of the certs
+ */
+struct sev_user_data_ext_snp_config {
+ __u64 config_address; /* In */
+ __u64 certs_address; /* In */
+ __u32 certs_len; /* In */
+};
+
/**
* struct sev_issue_cmd - SEV ioctl parameters
*
--
2.25.1

2022-06-20 23:15:01

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 20/49] KVM: SVM: Provide the Hypervisor Feature support VMGEXIT

From: Brijesh Singh <[email protected]>

Version 2 of the GHCB specification introduced advertisement of features
that are supported by the Hypervisor.

Now that KVM supports version 2 of the GHCB specification, bump the
maximum supported protocol version.
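
For context, an illustrative guest-side sketch (not part of this patch,
which implements only the hypervisor side) of how the feature bitmap is
queried over the GHCB MSR protocol; sev_es_wr_ghcb_msr(),
sev_es_rd_ghcb_msr() and VMGEXIT() are assumed guest helpers:

  static u64 example_query_hv_features(void)
  {
          u64 val;

          sev_es_wr_ghcb_msr(GHCB_MSR_HV_FT_REQ);
          VMGEXIT();
          val = sev_es_rd_ghcb_msr();

          if ((val & GHCB_MSR_INFO_MASK) != GHCB_MSR_HV_FT_RESP)
                  return 0;

          /* GHCBData[63:12] carries the hypervisor feature bitmap. */
          return GHCB_MSR_HV_FT_RESP_VAL(val);
  }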

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/sev-common.h | 2 ++
arch/x86/kvm/svm/sev.c | 14 ++++++++++++++
arch/x86/kvm/svm/svm.h | 3 ++-
3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index e15548d88f2a..539de6b93420 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -101,6 +101,8 @@ enum psc_op {
/* GHCB Hypervisor Feature Request/Response */
#define GHCB_MSR_HV_FT_REQ 0x080
#define GHCB_MSR_HV_FT_RESP 0x081
+#define GHCB_MSR_HV_FT_POS 12
+#define GHCB_MSR_HV_FT_MASK GENMASK_ULL(51, 0)
#define GHCB_MSR_HV_FT_RESP_VAL(v) \
/* GHCBData[63:12] */ \
(((u64)(v) & GENMASK_ULL(63, 12)) >> 12)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index a1318236acd2..b49c370d5ae9 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2480,6 +2480,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
case SVM_VMGEXIT_AP_HLT_LOOP:
case SVM_VMGEXIT_AP_JUMP_TABLE:
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
+ case SVM_VMGEXIT_HV_FEATURES:
break;
default:
reason = GHCB_ERR_INVALID_EVENT;
@@ -2746,6 +2747,13 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
GHCB_MSR_INFO_MASK,
GHCB_MSR_INFO_POS);
break;
+ case GHCB_MSR_HV_FT_REQ: {
+ set_ghcb_msr_bits(svm, GHCB_HV_FT_SUPPORTED,
+ GHCB_MSR_HV_FT_MASK, GHCB_MSR_HV_FT_POS);
+ set_ghcb_msr_bits(svm, GHCB_MSR_HV_FT_RESP,
+ GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
+ break;
+ }
case GHCB_MSR_TERM_REQ: {
u64 reason_set, reason_code;

@@ -2871,6 +2879,12 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
ret = 1;
break;
}
+ case SVM_VMGEXIT_HV_FEATURES: {
+ ghcb_set_sw_exit_info_2(ghcb, GHCB_HV_FT_SUPPORTED);
+
+ ret = 1;
+ break;
+ }
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
vcpu_unimpl(vcpu,
"vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 9f7eb1f18893..1f4a8bd09c9e 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -629,9 +629,10 @@ unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu);

/* sev.c */

-#define GHCB_VERSION_MAX 1ULL
+#define GHCB_VERSION_MAX 2ULL
#define GHCB_VERSION_MIN 1ULL

+#define GHCB_HV_FT_SUPPORTED 0

extern unsigned int max_sev_asid;

--
2.25.1

2022-06-20 23:15:04

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 22/49] KVM: SVM: Add initial SEV-SNP support

From: Brijesh Singh <[email protected]>

The next generation of SEV is called SEV-SNP (Secure Nested Paging).
SEV-SNP builds upon existing SEV and SEV-ES functionality while adding new
hardware based security protection. SEV-SNP adds strong memory encryption
integrity protection to help prevent malicious hypervisor-based attacks
such as data replay, memory re-mapping, and more, to create an isolated
execution environment.

The SNP feature is added incrementally; later patches add a new module
parameter that can be used to enable SEV-SNP in KVM.

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/kvm/svm/sev.c | 10 +++++++++-
arch/x86/kvm/svm/svm.h | 8 ++++++++
2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 93365996bd59..dc1f69a28aa7 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -56,6 +56,9 @@ module_param_named(sev_es, sev_es_enabled, bool, 0444);
#define sev_es_enabled false
#endif /* CONFIG_KVM_AMD_SEV */

+/* enable/disable SEV-SNP support */
+static bool sev_snp_enabled;
+
#define AP_RESET_HOLD_NONE 0
#define AP_RESET_HOLD_NAE_EVENT 1
#define AP_RESET_HOLD_MSR_PROTO 2
@@ -2120,6 +2123,7 @@ void __init sev_hardware_setup(void)
{
#ifdef CONFIG_KVM_AMD_SEV
unsigned int eax, ebx, ecx, edx, sev_asid_count, sev_es_asid_count;
+ bool sev_snp_supported = false;
bool sev_es_supported = false;
bool sev_supported = false;

@@ -2190,12 +2194,16 @@ void __init sev_hardware_setup(void)
if (misc_cg_set_capacity(MISC_CG_RES_SEV_ES, sev_es_asid_count))
goto out;

- pr_info("SEV-ES supported: %u ASIDs\n", sev_es_asid_count);
sev_es_supported = true;
+ sev_snp_supported = sev_snp_enabled && cpu_feature_enabled(X86_FEATURE_SEV_SNP);
+
+ pr_info("SEV-ES %ssupported: %u ASIDs\n",
+ sev_snp_supported ? "and SEV-SNP " : "", sev_es_asid_count);

out:
sev_enabled = sev_supported;
sev_es_enabled = sev_es_supported;
+ sev_snp_enabled = sev_snp_supported;
#endif
}

diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 9672e25a338d..edecc5066517 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -75,6 +75,7 @@ enum {
struct kvm_sev_info {
bool active; /* SEV enabled guest */
bool es_active; /* SEV-ES enabled guest */
+ bool snp_active; /* SEV-SNP enabled guest */
unsigned int asid; /* ASID used for this guest */
unsigned int handle; /* SEV firmware handle */
int fd; /* SEV device fd */
@@ -314,6 +315,13 @@ static __always_inline bool sev_es_guest(struct kvm *kvm)
#endif
}

+static inline bool sev_snp_guest(struct kvm *kvm)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+
+ return sev_es_guest(kvm) && sev->snp_active;
+}
+
static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
{
vmcb->control.clean = 0;
--
2.25.1

2022-06-20 23:15:04

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 24/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command

From: Brijesh Singh <[email protected]>

KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
The command initializes a cryptographic digest context used to construct
the measurement of the guest. If the guest is expected to be migrated,
the command also binds a migration agent (MA) to the guest.

For more information see the SEV-SNP specification.
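
As a rough illustration, a VMM might issue the command through the existing
KVM_MEMORY_ENCRYPT_OP vm ioctl roughly as below. This is a minimal sketch;
the policy value, the vm_fd/sev_fd handling and the error reporting are
illustrative assumptions, not part of this patch:

    #include <err.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Minimal sketch: begin the SNP launch flow for a VM. */
    static void snp_launch_start(int vm_fd, int sev_fd)
    {
        struct kvm_sev_snp_launch_start start = {
            .policy = 0x30000,      /* example policy value, see the SNP spec */
        };
        struct kvm_sev_cmd cmd = {
            .id = KVM_SEV_SNP_LAUNCH_START,
            .data = (uint64_t)(uintptr_t)&start,
            .sev_fd = (uint32_t)sev_fd,     /* fd of /dev/sev */
        };

        if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd))
            err(1, "KVM_SEV_SNP_LAUNCH_START (fw error %u)", cmd.error);
    }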

Signed-off-by: Brijesh Singh <[email protected]>
---
.../virt/kvm/x86/amd-memory-encryption.rst | 24 ++++
arch/x86/kvm/svm/sev.c | 115 +++++++++++++++++-
arch/x86/kvm/svm/svm.h | 1 +
include/uapi/linux/kvm.h | 10 ++
4 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index 903023f524af..878711f2dca6 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -462,6 +462,30 @@ The flags bitmap is defined as::
If the specified flags are not supported, -EOPNOTSUPP is returned and the
supported flags are written back.

+19. KVM_SNP_LAUNCH_START
+------------------------
+
+The KVM_SNP_LAUNCH_START command is used for creating the memory encryption
+context for the SEV-SNP guest. To create the encryption context, the user must
+provide a guest policy, a migration agent (if any) and a guest OS visible
+workarounds value as defined in the SEV-SNP specification.
+
+Parameters (in): struct kvm_snp_launch_start
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_snp_launch_start {
+ __u64 policy; /* Guest policy to use. */
+ __u64 ma_uaddr; /* userspace address of migration agent */
+ __u8 ma_en; /* 1 if the migration agent is enabled */
+ __u8 imi_en; /* set IMI to 1. */
+ __u8 gosvw[16]; /* guest OS visible workarounds */
+ };
+
+See the SEV-SNP specification for further detail on the launch input.
+
References
==========

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 813bda7f7b55..9e6fc7a94ed7 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -21,6 +21,7 @@
#include <asm/pkru.h>
#include <asm/trapnr.h>
#include <asm/fpu/xcr.h>
+#include <asm/sev.h>

#include "x86.h"
#include "svm.h"
@@ -73,6 +74,8 @@ static unsigned int nr_asids;
static unsigned long *sev_asid_bitmap;
static unsigned long *sev_reclaim_asid_bitmap;

+static int snp_decommission_context(struct kvm *kvm);
+
struct enc_region {
struct list_head list;
unsigned long npages;
@@ -98,12 +101,17 @@ static int sev_flush_asids(int min_asid, int max_asid)
down_write(&sev_deactivate_lock);

wbinvd_on_all_cpus();
- ret = sev_guest_df_flush(&error);
+
+ if (sev_snp_enabled)
+ ret = snp_guest_df_flush(&error);
+ else
+ ret = sev_guest_df_flush(&error);

up_write(&sev_deactivate_lock);

if (ret)
- pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error);
+ pr_err("SEV%s: DF_FLUSH failed, ret=%d, error=%#x\n",
+ sev_snp_enabled ? "-SNP" : "", ret, error);

return ret;
}
@@ -1825,6 +1833,74 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
return ret;
}

+static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct sev_data_snp_gctx_create data = {};
+ void *context;
+ int rc;
+
+ /* Allocate memory for context page */
+ context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
+ if (!context)
+ return NULL;
+
+ data.gctx_paddr = __psp_pa(context);
+ rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
+ if (rc) {
+ snp_free_firmware_page(context);
+ return NULL;
+ }
+
+ return context;
+}
+
+static int snp_bind_asid(struct kvm *kvm, int *error)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct sev_data_snp_activate data = {0};
+
+ data.gctx_paddr = __psp_pa(sev->snp_context);
+ data.asid = sev_get_asid(kvm);
+ return sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
+}
+
+static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct sev_data_snp_launch_start start = {0};
+ struct kvm_sev_snp_launch_start params;
+ int rc;
+
+ if (!sev_snp_guest(kvm))
+ return -ENOTTY;
+
+ if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+ return -EFAULT;
+
+ sev->snp_context = snp_context_create(kvm, argp);
+ if (!sev->snp_context)
+ return -ENOTTY;
+
+ start.gctx_paddr = __psp_pa(sev->snp_context);
+ start.policy = params.policy;
+ memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw));
+ rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_START, &start, &argp->error);
+ if (rc)
+ goto e_free_context;
+
+ sev->fd = argp->sev_fd;
+ rc = snp_bind_asid(kvm, &argp->error);
+ if (rc)
+ goto e_free_context;
+
+ return 0;
+
+e_free_context:
+ snp_decommission_context(kvm);
+
+ return rc;
+}
+
int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_sev_cmd sev_cmd;
@@ -1915,6 +1991,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
case KVM_SEV_RECEIVE_FINISH:
r = sev_receive_finish(kvm, &sev_cmd);
break;
+ case KVM_SEV_SNP_LAUNCH_START:
+ r = snp_launch_start(kvm, &sev_cmd);
+ break;
default:
r = -EINVAL;
goto out;
@@ -2106,6 +2185,28 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
return ret;
}

+static int snp_decommission_context(struct kvm *kvm)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct sev_data_snp_decommission data = {};
+ int ret;
+
+ /* If context is not created then do nothing */
+ if (!sev->snp_context)
+ return 0;
+
+ data.gctx_paddr = __sme_pa(sev->snp_context);
+ ret = snp_guest_decommission(&data, NULL);
+ if (WARN_ONCE(ret, "failed to release guest context"))
+ return ret;
+
+ /* free the context page now */
+ snp_free_firmware_page(sev->snp_context);
+ sev->snp_context = NULL;
+
+ return 0;
+}
+
void sev_vm_destroy(struct kvm *kvm)
{
struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -2147,7 +2248,15 @@ void sev_vm_destroy(struct kvm *kvm)
}
}

- sev_unbind_asid(kvm, sev->handle);
+ if (sev_snp_guest(kvm)) {
+ if (snp_decommission_context(kvm)) {
+ WARN_ONCE(1, "Failed to free SNP guest context, leaking asid!\n");
+ return;
+ }
+ } else {
+ sev_unbind_asid(kvm, sev->handle);
+ }
+
sev_asid_free(sev);
}

diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 2f45589ee596..71c011af098e 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -91,6 +91,7 @@ struct kvm_sev_info {
struct misc_cg *misc_cg; /* For misc cgroup accounting */
atomic_t migration_in_progress;
u64 snp_init_flags;
+ void *snp_context; /* SNP guest context page */
};

struct kvm_svm {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0f912cefc544..0cb119d66ae5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1812,6 +1812,7 @@ enum sev_cmd_id {

/* SNP specific commands */
KVM_SEV_SNP_INIT,
+ KVM_SEV_SNP_LAUNCH_START,

KVM_SEV_NR_MAX,
};
@@ -1919,6 +1920,15 @@ struct kvm_snp_init {
__u64 flags;
};

+struct kvm_sev_snp_launch_start {
+ __u64 policy;
+ __u64 ma_uaddr;
+ __u8 ma_en;
+ __u8 imi_en;
+ __u8 gosvw[16];
+ __u8 pad[6];
+};
+
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
#define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
--
2.25.1

2022-06-20 23:15:04

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 26/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command

From: Brijesh Singh <[email protected]>

The KVM_SEV_SNP_LAUNCH_UPDATE command can be used to insert data into the
guest's memory. The data is encrypted with the cryptographic context
created with the KVM_SEV_SNP_LAUNCH_START.

In addition to inserting data, it can insert two special pages
into the guest's memory: the secrets page and the CPUID page.

While terminating the guest, reclaim the guest pages added to the RMP
table. If the reclaim fails, then the page is no longer safe to release
back to the system, so leak it instead.

For more information see the SEV-SNP specification.
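
For illustration only, a VMM might drive this for one region roughly as
below. This is a minimal sketch assuming the region was already registered
with KVM_MEMORY_ENCRYPT_REG_REGION; the gfn/length handling is an assumption:

    #include <err.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Minimal sketch: encrypt and measure one registered region as NORMAL pages. */
    static void snp_launch_update(int vm_fd, int sev_fd, uint64_t start_gfn,
                                  void *uaddr, uint32_t len)
    {
        struct kvm_sev_snp_launch_update update = {
            .start_gfn = start_gfn,
            .uaddr = (uint64_t)(uintptr_t)uaddr,
            .len = len,
            .page_type = KVM_SEV_SNP_PAGE_TYPE_NORMAL,
        };
        struct kvm_sev_cmd cmd = {
            .id = KVM_SEV_SNP_LAUNCH_UPDATE,
            .data = (uint64_t)(uintptr_t)&update,
            .sev_fd = (uint32_t)sev_fd,
        };

        if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd))
            err(1, "KVM_SEV_SNP_LAUNCH_UPDATE (fw error %u)", cmd.error);
    }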

Signed-off-by: Brijesh Singh <[email protected]>
---
.../virt/kvm/x86/amd-memory-encryption.rst | 29 +++
arch/x86/kvm/svm/sev.c | 187 ++++++++++++++++++
include/uapi/linux/kvm.h | 19 ++
3 files changed, 235 insertions(+)

diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index 878711f2dca6..62abd5c1f72b 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -486,6 +486,35 @@ Returns: 0 on success, -negative on error

See the SEV-SNP specification for further detail on the launch input.

+20. KVM_SNP_LAUNCH_UPDATE
+-------------------------
+
+The KVM_SNP_LAUNCH_UPDATE is used for encrypting a memory region. It also
+calculates a measurement of the memory contents. The measurement is a signature
+of the memory contents that can be sent to the guest owner as an attestation
+that the memory was encrypted correctly by the firmware.
+
+Parameters (in): struct kvm_snp_launch_update
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_snp_launch_update {
+ __u64 start_gfn; /* Guest page number to start from. */
+ __u64 uaddr; /* userspace address of the memory to be encrypted */
+ __u32 len; /* length of memory region */
+ __u8 imi_page; /* 1 if memory is part of the IMI */
+ __u8 page_type; /* page type */
+ __u8 vmpl3_perms; /* VMPL3 permission mask */
+ __u8 vmpl2_perms; /* VMPL2 permission mask */
+ __u8 vmpl1_perms; /* VMPL1 permission mask */
+ };
+
+See the SEV-SNP spec for further details on how to build the VMPL permission
+mask and page type.
+
+
References
==========

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 41b83aa6b5f4..b5f0707d7ed6 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -18,6 +18,7 @@
#include <linux/processor.h>
#include <linux/trace_events.h>
#include <linux/hugetlb.h>
+#include <linux/sev.h>

#include <asm/pkru.h>
#include <asm/trapnr.h>
@@ -233,6 +234,49 @@ static void sev_decommission(unsigned int handle)
sev_guest_decommission(&decommission, NULL);
}

+static inline void snp_leak_pages(u64 pfn, enum pg_level level)
+{
+ unsigned int npages = page_level_size(level) >> PAGE_SHIFT;
+
+ WARN(1, "psc failed pfn 0x%llx pages %d (leaking)\n", pfn, npages);
+
+ while (npages) {
+ memory_failure(pfn, 0);
+ dump_rmpentry(pfn);
+ npages--;
+ pfn++;
+ }
+}
+
+static int snp_page_reclaim(u64 pfn)
+{
+ struct sev_data_snp_page_reclaim data = {0};
+ int err, rc;
+
+ data.paddr = __sme_set(pfn << PAGE_SHIFT);
+ rc = snp_guest_page_reclaim(&data, &err);
+ if (rc) {
+ /*
+ * If the reclaim failed, then page is no longer safe
+ * to use.
+ */
+ snp_leak_pages(pfn, PG_LEVEL_4K);
+ }
+
+ return rc;
+}
+
+static int host_rmp_make_shared(u64 pfn, enum pg_level level, bool leak)
+{
+ int rc;
+
+ rc = rmp_make_shared(pfn, level);
+ if (rc && leak)
+ snp_leak_pages(pfn, level);
+
+ return rc;
+}
+
static void sev_unbind_asid(struct kvm *kvm, unsigned int handle)
{
struct sev_data_deactivate deactivate;
@@ -1902,6 +1946,123 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
return rc;
}

+static bool is_hva_registered(struct kvm *kvm, hva_t hva, size_t len)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct list_head *head = &sev->regions_list;
+ struct enc_region *i;
+
+ lockdep_assert_held(&kvm->lock);
+
+ list_for_each_entry(i, head, list) {
+ u64 start = i->uaddr;
+ u64 end = start + i->size;
+
+ if (start <= hva && end >= (hva + len))
+ return true;
+ }
+
+ return false;
+}
+
+static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct sev_data_snp_launch_update data = {0};
+ struct kvm_sev_snp_launch_update params;
+ unsigned long npages, pfn, n = 0;
+ int *error = &argp->error;
+ struct page **inpages;
+ int ret, i, level;
+ u64 gfn;
+
+ if (!sev_snp_guest(kvm))
+ return -ENOTTY;
+
+ if (!sev->snp_context)
+ return -EINVAL;
+
+ if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+ return -EFAULT;
+
+ /* Verify that the specified address range is registered. */
+ if (!is_hva_registered(kvm, params.uaddr, params.len))
+ return -EINVAL;
+
+ /*
+ * The userspace memory is already locked so technically we don't
+ * need to lock it again. Later part of the function needs to know
+ * pfn so call the sev_pin_memory() so that we can get the list of
+ * pages to iterate through.
+ */
+ inpages = sev_pin_memory(kvm, params.uaddr, params.len, &npages, 1);
+ if (!inpages)
+ return -ENOMEM;
+
+ /*
+ * Verify that all the pages are marked shared in the RMP table before
+ * going further. This is to avoid cases where the userspace may try
+ * updating the same page twice.
+ */
+ for (i = 0; i < npages; i++) {
+ if (snp_lookup_rmpentry(page_to_pfn(inpages[i]), &level) != 0) {
+ sev_unpin_memory(kvm, inpages, npages);
+ return -EFAULT;
+ }
+ }
+
+ gfn = params.start_gfn;
+ level = PG_LEVEL_4K;
+ data.gctx_paddr = __psp_pa(sev->snp_context);
+
+ for (i = 0; i < npages; i++) {
+ pfn = page_to_pfn(inpages[i]);
+
+ ret = rmp_make_private(pfn, gfn << PAGE_SHIFT, level, sev_get_asid(kvm), true);
+ if (ret) {
+ ret = -EFAULT;
+ goto e_unpin;
+ }
+
+ n++;
+ data.address = __sme_page_pa(inpages[i]);
+ data.page_size = X86_TO_RMP_PG_LEVEL(level);
+ data.page_type = params.page_type;
+ data.vmpl3_perms = params.vmpl3_perms;
+ data.vmpl2_perms = params.vmpl2_perms;
+ data.vmpl1_perms = params.vmpl1_perms;
+ ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, &data, error);
+ if (ret) {
+ /*
+ * If the command failed then need to reclaim the page.
+ */
+ snp_page_reclaim(pfn);
+ goto e_unpin;
+ }
+
+ gfn++;
+ }
+
+e_unpin:
+ /* Content of memory is updated, mark pages dirty */
+ for (i = 0; i < n; i++) {
+ set_page_dirty_lock(inpages[i]);
+ mark_page_accessed(inpages[i]);
+
+ /*
+ * If its an error, then update RMP entry to change page ownership
+ * to the hypervisor.
+ */
+ if (ret)
+ host_rmp_make_shared(pfn, level, true);
+ }
+
+ /* Unlock the user pages */
+ sev_unpin_memory(kvm, inpages, npages);
+
+ return ret;
+}
+
int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_sev_cmd sev_cmd;
@@ -1995,6 +2156,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
case KVM_SEV_SNP_LAUNCH_START:
r = snp_launch_start(kvm, &sev_cmd);
break;
+ case KVM_SEV_SNP_LAUNCH_UPDATE:
+ r = snp_launch_update(kvm, &sev_cmd);
+ break;
default:
r = -EINVAL;
goto out;
@@ -2113,6 +2277,29 @@ find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
static void __unregister_enc_region_locked(struct kvm *kvm,
struct enc_region *region)
{
+ unsigned long i, pfn;
+ int level;
+
+ /*
+ * The guest memory pages are assigned in the RMP table. Unassign it
+ * before releasing the memory.
+ */
+ if (sev_snp_guest(kvm)) {
+ for (i = 0; i < region->npages; i++) {
+ pfn = page_to_pfn(region->pages[i]);
+
+ if (!snp_lookup_rmpentry(pfn, &level))
+ continue;
+
+ cond_resched();
+
+ if (level > PG_LEVEL_4K)
+ pfn &= ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+
+ host_rmp_make_shared(pfn, level, true);
+ }
+ }
+
sev_unpin_memory(kvm, region->pages, region->npages);
list_del(&region->list);
kfree(region);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0cb119d66ae5..9b36b07414ea 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1813,6 +1813,7 @@ enum sev_cmd_id {
/* SNP specific commands */
KVM_SEV_SNP_INIT,
KVM_SEV_SNP_LAUNCH_START,
+ KVM_SEV_SNP_LAUNCH_UPDATE,

KVM_SEV_NR_MAX,
};
@@ -1929,6 +1930,24 @@ struct kvm_sev_snp_launch_start {
__u8 pad[6];
};

+#define KVM_SEV_SNP_PAGE_TYPE_NORMAL 0x1
+#define KVM_SEV_SNP_PAGE_TYPE_VMSA 0x2
+#define KVM_SEV_SNP_PAGE_TYPE_ZERO 0x3
+#define KVM_SEV_SNP_PAGE_TYPE_UNMEASURED 0x4
+#define KVM_SEV_SNP_PAGE_TYPE_SECRETS 0x5
+#define KVM_SEV_SNP_PAGE_TYPE_CPUID 0x6
+
+struct kvm_sev_snp_launch_update {
+ __u64 start_gfn;
+ __u64 uaddr;
+ __u32 len;
+ __u8 imi_page;
+ __u8 page_type;
+ __u8 vmpl3_perms;
+ __u8 vmpl2_perms;
+ __u8 vmpl1_perms;
+};
+
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
#define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
--
2.25.1

2022-06-20 23:15:05

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 21/49] KVM: SVM: Make AVIC backing, VMSA and VMCB memory allocation SNP safe

From: Brijesh Singh <[email protected]>

Implement a workaround for an SNP erratum where the CPU will incorrectly
signal an RMP violation #PF if a hugepage (2MB or 1GB) collides with the
RMP entry of a VMCB, VMSA or AVIC backing page.

When SEV-SNP is globally enabled, the CPU marks the VMCB, VMSA, and AVIC
backing pages as "in-use" in the RMP after a successful VMRUN. This
is done for _all_ VMs, not just SNP-Active VMs.

If the hypervisor accesses an in-use page through a writable
translation, the CPU will throw an RMP violation #PF. On early SNP
hardware, if an in-use page is 2MB-aligned and software accesses any
part of the associated 2MB region with a hugepage, the CPU will
incorrectly treat the entire 2MB region as in-use and signal a spurious
RMP violation #PF.

The recommended workaround is to not use a hugepage for the VMCB, VMSA or
AVIC backing page. Add a generic allocator that ensures the returned page
is not part of a hugepage (2MB or 1GB) mapping and is safe to use when
SEV-SNP is enabled.

Co-developed-by: Marc Orr <[email protected]>
Signed-off-by: Marc Orr <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/lapic.c | 5 ++++-
arch/x86/kvm/svm/sev.c | 35 ++++++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.c | 16 ++++++++++++--
arch/x86/kvm/svm/svm.h | 1 +
6 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index da47f60a4650..a66292dae698 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -128,6 +128,7 @@ KVM_X86_OP(msr_filter_changed)
KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
+KVM_X86_OP(alloc_apic_backing_page)

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c24a72ddc93b..0205e2944067 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1512,6 +1512,8 @@ struct kvm_x86_ops {
* Returns vCPU specific APICv inhibit reasons
*/
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
+
+ void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 66b0eb0bda94..7c7fc6c4a7f9 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2506,7 +2506,10 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)

vcpu->arch.apic = apic;

- apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+ if (kvm_x86_ops.alloc_apic_backing_page)
+ apic->regs = static_call(kvm_x86_alloc_apic_backing_page)(vcpu);
+ else
+ apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
if (!apic->regs) {
printk(KERN_ERR "malloc apic regs error for vcpu %x\n",
vcpu->vcpu_id);
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index b49c370d5ae9..93365996bd59 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3030,3 +3030,38 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
break;
}
}
+
+struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
+{
+ unsigned long pfn;
+ struct page *p;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+
+ /*
+ * Allocate an SNP safe page to workaround the SNP erratum where
+ * the CPU will incorrectly signal an RMP violation #PF if a
+ * hugepage (2mb or 1gb) collides with the RMP entry of VMCB, VMSA
+ * or AVIC backing page. The recommended workaround is to not use the
+ * hugepage.
+ *
+ * Allocate one extra page, use a page which is not 2mb aligned
+ * and free the other.
+ */
+ p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
+ if (!p)
+ return NULL;
+
+ split_page(p, 1);
+
+ pfn = page_to_pfn(p);
+ if (IS_ALIGNED(__pfn_to_phys(pfn), PMD_SIZE)) {
+ pfn++;
+ __free_page(p);
+ } else {
+ __free_page(pfn_to_page(pfn + 1));
+ }
+
+ return pfn_to_page(pfn);
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index efc7623d0f90..b4bd64f94d3a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1260,7 +1260,7 @@ static int svm_vcpu_create(struct kvm_vcpu *vcpu)
svm = to_svm(vcpu);

err = -ENOMEM;
- vmcb01_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ vmcb01_page = snp_safe_alloc_page(vcpu);
if (!vmcb01_page)
goto out;

@@ -1269,7 +1269,7 @@ static int svm_vcpu_create(struct kvm_vcpu *vcpu)
* SEV-ES guests require a separate VMSA page used to contain
* the encrypted register state of the guest.
*/
- vmsa_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ vmsa_page = snp_safe_alloc_page(vcpu);
if (!vmsa_page)
goto error_free_vmcb_page;

@@ -4598,6 +4598,16 @@ static int svm_vm_init(struct kvm *kvm)
return 0;
}

+static void *svm_alloc_apic_backing_page(struct kvm_vcpu *vcpu)
+{
+ struct page *page = snp_safe_alloc_page(vcpu);
+
+ if (!page)
+ return NULL;
+
+ return page_address(page);
+}
+
static struct kvm_x86_ops svm_x86_ops __initdata = {
.name = "kvm_amd",

@@ -4722,6 +4732,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {

.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
+
+ .alloc_apic_backing_page = svm_alloc_apic_backing_page,
};

/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 1f4a8bd09c9e..9672e25a338d 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -659,6 +659,7 @@ void sev_es_vcpu_reset(struct vcpu_svm *svm);
void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa);
void sev_es_unmap_ghcb(struct vcpu_svm *svm);
+struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);

/* vmenter.S */

--
2.25.1

2022-06-20 23:15:06

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 25/49] KVM: SVM: Disallow registering memory range from HugeTLB for SNP guest

From: Brijesh Singh <[email protected]>

While creating the VM, userspace calls the KVM_MEMORY_ENCRYPT_REG_REGION
ioctl to register the memory regions for the guest. A registered
memory region is typically used as guest RAM. Later, the guest may
issue a page state change (PSC) request that requires splitting a
large page into smaller pages. If the memory is allocated from
HugeTLB, the hypervisor will not be able to split it.

Do not allow registering a memory range backed by HugeTLB until
hypervisor support is added to handle that case.

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/kvm/svm/sev.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 9e6fc7a94ed7..41b83aa6b5f4 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -17,6 +17,7 @@
#include <linux/misc_cgroup.h>
#include <linux/processor.h>
#include <linux/trace_events.h>
+#include <linux/hugetlb.h>

#include <asm/pkru.h>
#include <asm/trapnr.h>
@@ -2007,6 +2008,35 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
return r;
}

+static bool is_range_hugetlb(struct kvm *kvm, struct kvm_enc_region *range)
+{
+ struct vm_area_struct *vma;
+ u64 start, end;
+ bool ret = true;
+
+ start = range->addr;
+ end = start + range->size;
+
+ mmap_read_lock(kvm->mm);
+
+ do {
+ vma = find_vma_intersection(kvm->mm, start, end);
+ if (!vma)
+ goto unlock;
+
+ if (is_vm_hugetlb_page(vma))
+ goto unlock;
+
+ start = vma->vm_end;
+ } while (end > vma->vm_end);
+
+ ret = false;
+
+unlock:
+ mmap_read_unlock(kvm->mm);
+ return ret;
+}
+
int sev_mem_enc_register_region(struct kvm *kvm,
struct kvm_enc_region *range)
{
@@ -2024,6 +2054,13 @@ int sev_mem_enc_register_region(struct kvm *kvm,
if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
return -EINVAL;

+ /*
+ * SEV-SNP does not support the backing pages from the HugeTLB. Verify
+ * that the registered memory range is not from the HugeTLB.
+ */
+ if (sev_snp_guest(kvm) && is_range_hugetlb(kvm, range))
+ return -EINVAL;
+
region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
if (!region)
return -ENOMEM;
--
2.25.1

2022-06-20 23:15:05

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 23/49] KVM: SVM: Add KVM_SNP_INIT command

From: Brijesh Singh <[email protected]>

The KVM_SNP_INIT command is used by the hypervisor to initialize the
SEV-SNP platform context. In a typical workflow, this command should be the
first command issued. When creating an SEV-SNP guest, the VMM must use this
command instead of KVM_SEV_INIT or KVM_SEV_ES_INIT.

The flags value must be zero; it will be extended in future SNP support to
communicate optional features (such as restricted interrupt injection).
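
A minimal userspace sketch of issuing the command (the fd handling and the
error reporting below are assumptions for illustration, not part of this
patch):

    #include <err.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Minimal sketch: initialize the SEV-SNP context for a VM with no
     * optional features requested. On -EOPNOTSUPP the kernel writes the
     * supported flags back into init.flags. */
    static void snp_init(int vm_fd, int sev_fd)
    {
        struct kvm_snp_init init = { .flags = 0 };
        struct kvm_sev_cmd cmd = {
            .id = KVM_SEV_SNP_INIT,
            .data = (uint64_t)(uintptr_t)&init,
            .sev_fd = (uint32_t)sev_fd,
        };

        if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd))
            err(1, "KVM_SEV_SNP_INIT (supported flags %#llx, fw error %u)",
                (unsigned long long)init.flags, cmd.error);
    }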

Co-developed-by: Pavan Kumar Paluri <[email protected]>
Signed-off-by: Pavan Kumar Paluri <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
---
.../virt/kvm/x86/amd-memory-encryption.rst | 27 ++++++++++++
arch/x86/include/asm/svm.h | 1 +
arch/x86/kvm/svm/sev.c | 44 ++++++++++++++++++-
arch/x86/kvm/svm/svm.h | 4 ++
include/uapi/linux/kvm.h | 13 ++++++
5 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index 2d307811978c..903023f524af 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -435,6 +435,33 @@ issued by the hypervisor to make the guest ready for execution.

Returns: 0 on success, -negative on error

+18. KVM_SNP_INIT
+----------------
+
+The KVM_SNP_INIT command can be used by the hypervisor to initialize SEV-SNP
+context. In a typical workflow, this command should be the first command issued.
+
+Parameters (in/out): struct kvm_snp_init
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_snp_init {
+ __u64 flags;
+ };
+
+The flags bitmap is defined as::
+
+ /* enable the restricted injection */
+ #define KVM_SEV_SNP_RESTRICTED_INJET (1<<0)
+
+ /* enable the restricted injection timer */
+ #define KVM_SEV_SNP_RESTRICTED_TIMER_INJET (1<<1)
+
+If the specified flags are not supported, -EOPNOTSUPP is returned and the
+supported flags are written back.
+
References
==========

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 1b07fba11704..284a8113227e 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -263,6 +263,7 @@ enum avic_ipi_failure_cause {
#define AVIC_HPA_MASK ~((0xFFFULL << 52) | 0xFFF)
#define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL

+#define SVM_SEV_FEAT_SNP_ACTIVE BIT(0)

struct vmcb_seg {
u16 selector;
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index dc1f69a28aa7..813bda7f7b55 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -241,6 +241,25 @@ static void sev_unbind_asid(struct kvm *kvm, unsigned int handle)
sev_decommission(handle);
}

+static int verify_snp_init_flags(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_snp_init params;
+ int ret = 0;
+
+ if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+ return -EFAULT;
+
+ if (params.flags & ~SEV_SNP_SUPPORTED_FLAGS)
+ ret = -EOPNOTSUPP;
+
+ params.flags = SEV_SNP_SUPPORTED_FLAGS;
+
+ if (copy_to_user((void __user *)(uintptr_t)argp->data, &params, sizeof(params)))
+ ret = -EFAULT;
+
+ return ret;
+}
+
static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
{
struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -254,13 +273,23 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
return ret;

sev->active = true;
- sev->es_active = argp->id == KVM_SEV_ES_INIT;
+ sev->es_active = (argp->id == KVM_SEV_ES_INIT || argp->id == KVM_SEV_SNP_INIT);
+ sev->snp_active = argp->id == KVM_SEV_SNP_INIT;
asid = sev_asid_new(sev);
if (asid < 0)
goto e_no_asid;
sev->asid = asid;

- ret = sev_platform_init(&argp->error);
+ if (sev->snp_active) {
+ ret = verify_snp_init_flags(kvm, argp);
+ if (ret)
+ goto e_free;
+
+ ret = sev_snp_init(&argp->error);
+ } else {
+ ret = sev_platform_init(&argp->error);
+ }
+
if (ret)
goto e_free;

@@ -275,6 +304,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
sev_asid_free(sev);
sev->asid = 0;
e_no_asid:
+ sev->snp_active = false;
sev->es_active = false;
sev->active = false;
return ret;
@@ -610,6 +640,10 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm)
save->xss = svm->vcpu.arch.ia32_xss;
save->dr6 = svm->vcpu.arch.dr6;

+ /* Enable the SEV-SNP feature */
+ if (sev_snp_guest(svm->vcpu.kvm))
+ save->sev_features |= SVM_SEV_FEAT_SNP_ACTIVE;
+
return 0;
}

@@ -1815,6 +1849,12 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
}

switch (sev_cmd.id) {
+ case KVM_SEV_SNP_INIT:
+ if (!sev_snp_enabled) {
+ r = -ENOTTY;
+ goto out;
+ }
+ fallthrough;
case KVM_SEV_ES_INIT:
if (!sev_es_enabled) {
r = -ENOTTY;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index edecc5066517..2f45589ee596 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -72,6 +72,9 @@ enum {
/* TPR and CR2 are always written before VMRUN */
#define VMCB_ALWAYS_DIRTY_MASK ((1U << VMCB_INTR) | (1U << VMCB_CR2))

+/* Supported init feature flags */
+#define SEV_SNP_SUPPORTED_FLAGS 0x0
+
struct kvm_sev_info {
bool active; /* SEV enabled guest */
bool es_active; /* SEV-ES enabled guest */
@@ -87,6 +90,7 @@ struct kvm_sev_info {
struct list_head mirror_entry; /* Use as a list entry of mirrors */
struct misc_cg *misc_cg; /* For misc cgroup accounting */
atomic_t migration_in_progress;
+ u64 snp_init_flags;
};

struct kvm_svm {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 68ce07185f03..0f912cefc544 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1810,6 +1810,9 @@ enum sev_cmd_id {
/* Guest Migration Extension */
KVM_SEV_SEND_CANCEL,

+ /* SNP specific commands */
+ KVM_SEV_SNP_INIT,
+
KVM_SEV_NR_MAX,
};

@@ -1906,6 +1909,16 @@ struct kvm_sev_receive_update_data {
__u32 trans_len;
};

+/* enable the restricted injection */
+#define KVM_SEV_SNP_RESTRICTED_INJET (1 << 0)
+
+/* enable the restricted injection timer */
+#define KVM_SEV_SNP_RESTRICTED_TIMER_INJET (1 << 1)
+
+struct kvm_snp_init {
+ __u64 flags;
+};
+
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
#define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
--
2.25.1

2022-06-20 23:15:25

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 28/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command

From: Brijesh Singh <[email protected]>

The KVM_SEV_SNP_LAUNCH_FINISH command finalizes the cryptographic digest and
stores it as the measurement of the guest at launch.

While finalizing the launch flow, it also issues the LAUNCH_UPDATE command
to encrypt the VMSA pages.

For an SNP guest, the VMSA page was added to the RMP table as a
guest-owned page and also removed from the kernel direct map, so
flush it only after it has been transitioned back to hypervisor
state and restored in the direct map.
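
For illustration, once all LAUNCH_UPDATE calls are done the VMM finalizes
the launch roughly as below. This is a minimal sketch with no ID block; the
host_data value here is an arbitrary example:

    #include <err.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Minimal sketch: finalize the launch measurement, no ID block supplied. */
    static void snp_launch_finish(int vm_fd, int sev_fd)
    {
        struct kvm_sev_snp_launch_finish finish = {};
        struct kvm_sev_cmd cmd = {
            .id = KVM_SEV_SNP_LAUNCH_FINISH,
            .data = (uint64_t)(uintptr_t)&finish,
            .sev_fd = (uint32_t)sev_fd,
        };

        /* Arbitrary example value; per the SNP spec, HOST_DATA is reported
         * back in the guest's attestation reports. */
        memset(finish.host_data, 0xaa, sizeof(finish.host_data));

        if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd))
            err(1, "KVM_SEV_SNP_LAUNCH_FINISH (fw error %u)", cmd.error);
    }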

Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
---
.../virt/kvm/x86/amd-memory-encryption.rst | 22 ++++
arch/x86/kvm/svm/sev.c | 119 ++++++++++++++++++
include/uapi/linux/kvm.h | 14 +++
3 files changed, 155 insertions(+)

diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index 62abd5c1f72b..750162cff87b 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -514,6 +514,28 @@ Returns: 0 on success, -negative on error
See the SEV-SNP spec for further details on how to build the VMPL permission
mask and page type.

+21. KVM_SNP_LAUNCH_FINISH
+-------------------------
+
+After completion of the SNP guest launch flow, the KVM_SNP_LAUNCH_FINISH command can be
+issued to make the guest ready for execution.
+
+Parameters (in): struct kvm_sev_snp_launch_finish
+
+Returns: 0 on success, -negative on error
+
+::
+
+ struct kvm_sev_snp_launch_finish {
+ __u64 id_block_uaddr;
+ __u64 id_auth_uaddr;
+ __u8 id_block_en;
+ __u8 auth_key_en;
+ __u8 host_data[32];
+ };
+
+
+See SEV-SNP specification for further details on launch finish input parameters.

References
==========
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index a9461d352eda..a5b90469683f 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2095,6 +2095,106 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
return ret;
}

+static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct sev_data_snp_launch_update data = {};
+ int i, ret;
+
+ data.gctx_paddr = __psp_pa(sev->snp_context);
+ data.page_type = SNP_PAGE_TYPE_VMSA;
+
+ for (i = 0; i < kvm->created_vcpus; i++) {
+ struct vcpu_svm *svm = to_svm(xa_load(&kvm->vcpu_array, i));
+ u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
+
+ /* Perform some pre-encryption checks against the VMSA */
+ ret = sev_es_sync_vmsa(svm);
+ if (ret)
+ return ret;
+
+ /* Transition the VMSA page to a firmware state. */
+ ret = rmp_make_private(pfn, -1, PG_LEVEL_4K, sev->asid, true);
+ if (ret)
+ return ret;
+
+ /* Issue the SNP command to encrypt the VMSA */
+ data.address = __sme_pa(svm->sev_es.vmsa);
+ ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
+ &data, &argp->error);
+ if (ret) {
+ snp_page_reclaim(pfn);
+ return ret;
+ }
+
+ svm->vcpu.arch.guest_state_protected = true;
+ }
+
+ return 0;
+}
+
+static int snp_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ struct sev_data_snp_launch_finish *data;
+ void *id_block = NULL, *id_auth = NULL;
+ struct kvm_sev_snp_launch_finish params;
+ int ret;
+
+ if (!sev_snp_guest(kvm))
+ return -ENOTTY;
+
+ if (!sev->snp_context)
+ return -EINVAL;
+
+ if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+ return -EFAULT;
+
+ /* Measure all vCPUs using LAUNCH_UPDATE before we finalize the launch flow. */
+ ret = snp_launch_update_vmsa(kvm, argp);
+ if (ret)
+ return ret;
+
+ data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
+ if (!data)
+ return -ENOMEM;
+
+ if (params.id_block_en) {
+ id_block = psp_copy_user_blob(params.id_block_uaddr, KVM_SEV_SNP_ID_BLOCK_SIZE);
+ if (IS_ERR(id_block)) {
+ ret = PTR_ERR(id_block);
+ goto e_free;
+ }
+
+ data->id_block_en = 1;
+ data->id_block_paddr = __sme_pa(id_block);
+ }
+
+ if (params.auth_key_en) {
+ id_auth = psp_copy_user_blob(params.id_auth_uaddr, KVM_SEV_SNP_ID_AUTH_SIZE);
+ if (IS_ERR(id_auth)) {
+ ret = PTR_ERR(id_auth);
+ goto e_free_id_block;
+ }
+
+ data->auth_key_en = 1;
+ data->id_auth_paddr = __sme_pa(id_auth);
+ }
+
+ data->gctx_paddr = __psp_pa(sev->snp_context);
+ ret = sev_issue_cmd(kvm, SEV_CMD_SNP_LAUNCH_FINISH, data, &argp->error);
+
+ kfree(id_auth);
+
+e_free_id_block:
+ kfree(id_block);
+
+e_free:
+ kfree(data);
+
+ return ret;
+}
+
int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_sev_cmd sev_cmd;
@@ -2191,6 +2291,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
case KVM_SEV_SNP_LAUNCH_UPDATE:
r = snp_launch_update(kvm, &sev_cmd);
break;
+ case KVM_SEV_SNP_LAUNCH_FINISH:
+ r = snp_launch_finish(kvm, &sev_cmd);
+ break;
default:
r = -EINVAL;
goto out;
@@ -2696,11 +2799,27 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)

svm = to_svm(vcpu);

+ /*
+ * If its an SNP guest, then VMSA was added in the RMP entry as
+ * a guest owned page. Transition the page to hypervisor state
+ * before releasing it back to the system.
+ * Also the page is removed from the kernel direct map, so flush it
+ * later after it is transitioned back to hypervisor state and
+ * restored in the direct map.
+ */
+ if (sev_snp_guest(vcpu->kvm)) {
+ u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
+
+ if (host_rmp_make_shared(pfn, PG_LEVEL_4K, false))
+ goto skip_vmsa_free;
+ }
+
if (vcpu->arch.guest_state_protected)
sev_flush_encrypted_page(vcpu, svm->sev_es.vmsa);

__free_page(virt_to_page(svm->sev_es.vmsa));

+skip_vmsa_free:
if (svm->sev_es.ghcb_sa_free)
kvfree(svm->sev_es.ghcb_sa);
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 9b36b07414ea..5a4662716b6a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1814,6 +1814,7 @@ enum sev_cmd_id {
KVM_SEV_SNP_INIT,
KVM_SEV_SNP_LAUNCH_START,
KVM_SEV_SNP_LAUNCH_UPDATE,
+ KVM_SEV_SNP_LAUNCH_FINISH,

KVM_SEV_NR_MAX,
};
@@ -1948,6 +1949,19 @@ struct kvm_sev_snp_launch_update {
__u8 vmpl1_perms;
};

+#define KVM_SEV_SNP_ID_BLOCK_SIZE 96
+#define KVM_SEV_SNP_ID_AUTH_SIZE 4096
+#define KVM_SEV_SNP_FINISH_DATA_SIZE 32
+
+struct kvm_sev_snp_launch_finish {
+ __u64 id_block_uaddr;
+ __u64 id_auth_uaddr;
+ __u8 id_block_en;
+ __u8 auth_key_en;
+ __u8 host_data[KVM_SEV_SNP_FINISH_DATA_SIZE];
+ __u8 pad[6];
+};
+
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
#define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
--
2.25.1

2022-06-20 23:15:32

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 30/49] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX and SNP

From: Sean Christopherson <[email protected]>

Introduce a helper to directly (pun intended) fault-in a TDP page
without having to go through the full page fault path. This allows
TDX to get the resulting pfn and also allows the RET_PF_* enums to
stay in mmu.c where they belong.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/kvm/mmu.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 51 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 54 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e6cae6f22683..c99b15e97a0a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -204,6 +204,9 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
return vcpu->arch.mmu->page_fault(vcpu, &fault);
}

+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+ u32 error_code, int max_level);
+
/*
* Check if a given access (described through the I/D, W/R and U/S bits of a
* page fault error code pfec) causes a permission fault with the given PTE
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 997318ecebd1..569021af349a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4100,6 +4100,57 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return direct_page_fault(vcpu, fault);
}

+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+ u32 err, int max_level)
+{
+ struct kvm_page_fault fault = {
+ .addr = gpa,
+ .error_code = err,
+ .exec = err & PFERR_FETCH_MASK,
+ .write = err & PFERR_WRITE_MASK,
+ .present = err & PFERR_PRESENT_MASK,
+ .rsvd = err & PFERR_RSVD_MASK,
+ .user = err & PFERR_USER_MASK,
+ .prefetch = false,
+ .is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
+ .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),
+
+ .max_level = max_level,
+ .req_level = PG_LEVEL_4K,
+ .goal_level = PG_LEVEL_4K,
+ };
+ int r;
+
+ if (mmu_topup_memory_caches(vcpu, false))
+ return KVM_PFN_ERR_FAULT;
+
+ /*
+ * Loop on the page fault path to handle the case where an mmu_notifier
+ * invalidation triggers RET_PF_RETRY. In the normal page fault path,
+ * KVM needs to resume the guest in case the invalidation changed any
+ * of the page fault properties, i.e. the gpa or error code. For this
+ * path, the gpa and error code are fixed by the caller, and the caller
+ * expects failure if and only if the page fault can't be fixed.
+ */
+ do {
+ /*
+ * TODO: this should probably go through kvm_mmu_do_page_fault(),
+ * but we need a way to control the max_level, so maybe a direct
+ * call to kvm_tdp_page_fault, which will call into
+ * direct_page_fault() when appropriate.
+ */
+ //r = direct_page_fault(vcpu, &fault);
+#ifdef CONFIG_RETPOLINE
+ if (fault.is_tdp)
+ r = kvm_tdp_page_fault(vcpu, &fault);
+#else
+ r = vcpu->arch.mmu->page_fault(vcpu, &fault);
+#endif
+ } while (r == RET_PF_RETRY && !is_error_noslot_pfn(fault.pfn));
+ return fault.pfn;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
+
static void nonpaging_init_context(struct kvm_mmu *context)
{
context->page_fault = nonpaging_page_fault;
--
2.25.1

2022-06-20 23:15:38

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 31/49] KVM: x86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use

From: Brijesh Singh <[email protected]>

An SEV-SNP VM may call the page state change VMGEXIT to mark a GPA
as private or shared in the RMP table. The page state change VMGEXIT
contains the RMP page level to be used in the RMP entry. If the page
level in the TDP does not match the page level in the RMP entry, the
access will result in a nested page fault (RMP violation).

The SEV-SNP VMGEXIT handler will use kvm_mmu_get_tdp_walk() to get
the current page level in the TDP for the given GPA and calculate a
workable page level. If a GPA is mapped as a 4K page in the TDP, but
the guest requested to add the GPA as 2M in the RMP entry, then the
2M request will be broken into 4K pages to keep the RMP and TDP
page levels in sync, as sketched below.
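
A hypothetical sketch of that reconciliation, assuming the usual KVM
internal headers; the helper name and return convention are assumptions,
not taken from this patch:

    /* Hypothetical sketch only: pick an RMP level that the current TDP
     * mapping can support for this GPA. */
    static int psc_workable_level(struct kvm_vcpu *vcpu, gpa_t gpa, int req_level)
    {
        kvm_pfn_t pfn;
        int npt_level;

        if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level))
            return PG_LEVEL_4K;     /* not mapped yet, be conservative */

        /* A 2M request against a 4K NPT mapping is demoted to 4K. */
        return min(req_level, npt_level);
    }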

TDP SPTEs are RCU protected, so kvm_mmu_get_tdp_walk() needs to be put in an
RCU read-side critical section by using walk_shadow_page_lockless_begin() and
walk_shadow_page_lockless_end(). This fixes the "suspicious RCU usage"
message seen with a lockdep-enabled kernel build.

Signed-off-by: Brijesh Singh <[email protected]>
Signed-off by: Ashish Kalra <[email protected]>
---
arch/x86/kvm/mmu.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 33 +++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index c99b15e97a0a..d55b5166389a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -178,6 +178,8 @@ static inline bool is_nx_huge_page_enabled(void)
return READ_ONCE(nx_huge_pages);
}

+bool kvm_mmu_get_tdp_walk(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t *pfn, int *level);
+
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
u32 err, bool prefetch)
{
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 569021af349a..c1ac486e096e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4151,6 +4151,39 @@ kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
}
EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);

+bool kvm_mmu_get_tdp_walk(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t *pfn, int *level)
+{
+ u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
+ int leaf, root;
+
+ walk_shadow_page_lockless_begin(vcpu);
+
+ if (is_tdp_mmu(vcpu->arch.mmu))
+ leaf = kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, &root);
+ else
+ leaf = get_walk(vcpu, gpa, sptes, &root);
+
+ walk_shadow_page_lockless_end(vcpu);
+
+ if (unlikely(leaf < 0))
+ return false;
+
+ /* Check if the leaf SPTE is present */
+ if (!is_shadow_present_pte(sptes[leaf]))
+ return false;
+
+ *pfn = spte_to_pfn(sptes[leaf]);
+ if (leaf > PG_LEVEL_4K) {
+ u64 page_mask = KVM_PAGES_PER_HPAGE(leaf) - KVM_PAGES_PER_HPAGE(leaf - 1);
+ *pfn |= (gpa_to_gfn(gpa) & page_mask);
+ }
+
+ *level = leaf;
+
+ return true;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_get_tdp_walk);
+
static void nonpaging_init_context(struct kvm_mmu *context)
{
context->page_fault = nonpaging_page_fault;
--
2.25.1

2022-06-20 23:15:42

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 32/49] KVM: x86: Define RMP page fault error bits for #NPF

From: Brijesh Singh <[email protected]>

When SEV-SNP is enabled globally, the hardware places restrictions on all
memory accesses based on the RMP entry, whether the hypervisor or a VM
performs the access. When hardware encounters an RMP access violation
during a guest access, it will cause a #VMEXIT(NPF).

See APM2 section 16.36.10 for more details.
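
For illustration, a fault handler can key off the new bits roughly as
follows. This is a kernel-side sketch assuming the usual KVM headers; the
function name is hypothetical, but the masks match the definitions added
below:

    /* Illustrative sketch only; the handler name is hypothetical. */
    static void decode_rmp_fault(u64 error_code)
    {
        if (!(error_code & PFERR_GUEST_RMP_MASK))
            return;     /* not an RMP violation */

        if (error_code & PFERR_GUEST_SIZEM_MASK) {
            /* page-size mismatch between the NPT and the RMP entry */
        }

        if (error_code & PFERR_GUEST_VMPL_MASK) {
            /* access failed the VMPL permission checks */
        }
    }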

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2748c69609e3..49b217dc8d7e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -247,9 +247,13 @@ enum x86_intercept_stage;
#define PFERR_FETCH_BIT 4
#define PFERR_PK_BIT 5
#define PFERR_SGX_BIT 15
+#define PFERR_GUEST_RMP_BIT 31
#define PFERR_GUEST_FINAL_BIT 32
#define PFERR_GUEST_PAGE_BIT 33
#define PFERR_IMPLICIT_ACCESS_BIT 48
+#define PFERR_GUEST_ENC_BIT 34
+#define PFERR_GUEST_SIZEM_BIT 35
+#define PFERR_GUEST_VMPL_BIT 36

#define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT)
#define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT)
@@ -261,6 +265,10 @@ enum x86_intercept_stage;
#define PFERR_GUEST_FINAL_MASK (1ULL << PFERR_GUEST_FINAL_BIT)
#define PFERR_GUEST_PAGE_MASK (1ULL << PFERR_GUEST_PAGE_BIT)
#define PFERR_IMPLICIT_ACCESS (1ULL << PFERR_IMPLICIT_ACCESS_BIT)
+#define PFERR_GUEST_RMP_MASK (1ULL << PFERR_GUEST_RMP_BIT)
+#define PFERR_GUEST_ENC_MASK (1ULL << PFERR_GUEST_ENC_BIT)
+#define PFERR_GUEST_SIZEM_MASK (1ULL << PFERR_GUEST_SIZEM_BIT)
+#define PFERR_GUEST_VMPL_MASK (1ULL << PFERR_GUEST_VMPL_BIT)

#define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
PFERR_WRITE_MASK | \
--
2.25.1

2022-06-20 23:16:02

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 27/49] KVM: SVM: Mark the private vma unmergeable for SEV-SNP guests

From: Brijesh Singh <[email protected]>

When SEV-SNP is enabled, guest private pages are added to the RMP
table; while adding the pages, rmp_make_private() unmaps them from
the direct map. If KSM attempts to access those unmapped pages, it
will trigger a #PF (page-not-present).

Encrypted guest pages cannot be shared between processes, so userspace
should not mark the region mergeable; but to be safe, mark the process
VMA unmergeable before adding the pages to the RMP table.

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/kvm/svm/sev.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index b5f0707d7ed6..a9461d352eda 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -19,11 +19,13 @@
#include <linux/trace_events.h>
#include <linux/hugetlb.h>
#include <linux/sev.h>
+#include <linux/ksm.h>

#include <asm/pkru.h>
#include <asm/trapnr.h>
#include <asm/fpu/xcr.h>
#include <asm/sev.h>
+#include <asm/mman.h>

#include "x86.h"
#include "svm.h"
@@ -1965,6 +1967,30 @@ static bool is_hva_registered(struct kvm *kvm, hva_t hva, size_t len)
return false;
}

+static int snp_mark_unmergable(struct kvm *kvm, u64 start, u64 size)
+{
+ struct vm_area_struct *vma;
+ u64 end = start + size;
+ int ret;
+
+ do {
+ vma = find_vma_intersection(kvm->mm, start, end);
+ if (!vma) {
+ ret = -EINVAL;
+ break;
+ }
+
+ ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
+ MADV_UNMERGEABLE, &vma->vm_flags);
+ if (ret)
+ break;
+
+ start = vma->vm_end;
+ } while (end > vma->vm_end);
+
+ return ret;
+}
+
static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
{
struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -1989,6 +2015,12 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
if (!is_hva_registered(kvm, params.uaddr, params.len))
return -EINVAL;

+ mmap_write_lock(kvm->mm);
+ ret = snp_mark_unmergable(kvm, params.uaddr, params.len);
+ mmap_write_unlock(kvm->mm);
+ if (ret)
+ return -EFAULT;
+
/*
* The userspace memory is already locked so technically we don't
* need to lock it again. Later part of the function needs to know
--
2.25.1

2022-06-20 23:16:21

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 29/49] KVM: X86: Keep the NPT and RMP page level in sync

From: Brijesh Singh <[email protected]>

When running an SEV-SNP VM, the system physical address (SPA) used to index
the RMP entry is obtained through the NPT translation (gva->gpa->spa). The
NPT page level is checked against the page level programmed in the RMP entry.
If the page levels do not match, the access will cause a nested page fault
with the RMP bit set to indicate the RMP violation.

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu/mmu.c | 5 ++++
arch/x86/kvm/svm/sev.c | 46 ++++++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.c | 1 +
arch/x86/kvm/svm/svm.h | 1 +
6 files changed, 55 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index a66292dae698..e0068e702692 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -129,6 +129,7 @@ KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP(alloc_apic_backing_page)
+KVM_X86_OP_OPTIONAL(rmp_page_level_adjust)

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0205e2944067..2748c69609e3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1514,6 +1514,7 @@ struct kvm_x86_ops {
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);

void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
+ void (*rmp_page_level_adjust)(struct kvm *kvm, kvm_pfn_t pfn, int *level);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c623019929a7..997318ecebd1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -43,6 +43,7 @@
#include <linux/hash.h>
#include <linux/kern_levels.h>
#include <linux/kthread.h>
+#include <linux/sev.h>

#include <asm/page.h>
#include <asm/memtype.h>
@@ -2824,6 +2825,10 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
if (unlikely(!pte))
return PG_LEVEL_4K;

+ /* Adjust the page level based on the SEV-SNP RMP page level. */
+ if (kvm_x86_ops.rmp_page_level_adjust)
+ static_call(kvm_x86_rmp_page_level_adjust)(kvm, pfn, &level);
+
return level;
}

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index a5b90469683f..91d3d24e60d2 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3597,3 +3597,49 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)

return pfn_to_page(pfn);
}
+
+static bool is_pfn_range_shared(kvm_pfn_t start, kvm_pfn_t end)
+{
+ int level;
+
+ while (end > start) {
+ if (snp_lookup_rmpentry(start, &level) != 0)
+ return false;
+ start++;
+ }
+
+ return true;
+}
+
+void sev_rmp_page_level_adjust(struct kvm *kvm, kvm_pfn_t pfn, int *level)
+{
+ int rmp_level, assigned;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+ return;
+
+ assigned = snp_lookup_rmpentry(pfn, &rmp_level);
+ if (unlikely(assigned < 0))
+ return;
+
+ if (!assigned) {
+ /*
+ * If all the pages are shared then no need to keep the RMP
+ * and NPT in sync.
+ */
+ pfn = pfn & ~(PTRS_PER_PMD - 1);
+ if (is_pfn_range_shared(pfn, pfn + PTRS_PER_PMD))
+ return;
+ }
+
+ /*
+ * The hardware installs 2MB TLB entries to access to 1GB pages,
+ * therefore allow NPT to use 1GB pages when pfn was added as 2MB
+ * in the RMP table.
+ */
+ if (rmp_level == PG_LEVEL_2M && (*level == PG_LEVEL_1G))
+ return;
+
+ /* Adjust the level to keep the NPT and RMP in sync */
+ *level = min_t(size_t, *level, rmp_level);
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b4bd64f94d3a..18e2cd4d9559 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4734,6 +4734,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,

.alloc_apic_backing_page = svm_alloc_apic_backing_page,
+ .rmp_page_level_adjust = sev_rmp_page_level_adjust,
};

/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 71c011af098e..7782312a1cda 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -673,6 +673,7 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa);
void sev_es_unmap_ghcb(struct vcpu_svm *svm);
struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
+void sev_rmp_page_level_adjust(struct kvm *kvm, kvm_pfn_t pfn, int *level);

/* vmenter.S */

--
2.25.1

2022-06-20 23:16:25

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 36/49] KVM: SVM: Add support to handle GHCB GPA register VMGEXIT

From: Brijesh Singh <[email protected]>

SEV-SNP guests are required to perform GHCB GPA registration. Before
using a GHCB GPA for a vCPU for the first time, the guest must register
the vCPU's GHCB GPA. If the hypervisor can work with the guest-requested
GPA, it must respond with the same GPA; otherwise, it returns -1.

On VMEXIT, verify that the GHCB GPA matches the registered value. If a
mismatch is detected, abort the guest.
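
As a rough guest-side illustration of the MSR protocol (guest code is not
part of this patch; the helper below is a hypothetical sketch that only uses
the constants added to sev-common.h and the usual kernel headers):

    /* Hypothetical sketch: build the GHCB-MSR value a guest would write to
     * register its GHCB GPA (GHCBInfo 0x012, GHCBData = GFN of the GHCB). */
    static u64 ghcb_gpa_register_req(u64 ghcb_gpa)
    {
        return GHCB_MSR_REG_GPA_REQ |
               ((ghcb_gpa >> PAGE_SHIFT) << GHCB_MSR_GPA_VALUE_POS);
    }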

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/sev-common.h | 8 ++++++++
arch/x86/kvm/svm/sev.c | 27 +++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.h | 7 +++++++
3 files changed, 42 insertions(+)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 539de6b93420..0a9055cdfae2 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -59,6 +59,14 @@
#define GHCB_MSR_AP_RESET_HOLD_RESULT_POS 12
#define GHCB_MSR_AP_RESET_HOLD_RESULT_MASK GENMASK_ULL(51, 0)

+/* Preferred GHCB GPA Request */
+#define GHCB_MSR_PREF_GPA_REQ 0x010
+#define GHCB_MSR_GPA_VALUE_POS 12
+#define GHCB_MSR_GPA_VALUE_MASK GENMASK_ULL(51, 0)
+
+#define GHCB_MSR_PREF_GPA_RESP 0x011
+#define GHCB_MSR_PREF_GPA_NONE 0xfffffffffffff
+
/* GHCB GPA Register */
#define GHCB_MSR_REG_GPA_REQ 0x012
#define GHCB_MSR_REG_GPA_REQ_VAL(v) \
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index c70f3f7e06a8..6de48130e414 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3331,6 +3331,27 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
break;
}
+ case GHCB_MSR_PREF_GPA_REQ: {
+ set_ghcb_msr_bits(svm, GHCB_MSR_PREF_GPA_NONE, GHCB_MSR_GPA_VALUE_MASK,
+ GHCB_MSR_GPA_VALUE_POS);
+ set_ghcb_msr_bits(svm, GHCB_MSR_PREF_GPA_RESP, GHCB_MSR_INFO_MASK,
+ GHCB_MSR_INFO_POS);
+ break;
+ }
+ case GHCB_MSR_REG_GPA_REQ: {
+ u64 gfn;
+
+ gfn = get_ghcb_msr_bits(svm, GHCB_MSR_GPA_VALUE_MASK,
+ GHCB_MSR_GPA_VALUE_POS);
+
+ svm->sev_es.ghcb_registered_gpa = gfn_to_gpa(gfn);
+
+ set_ghcb_msr_bits(svm, gfn, GHCB_MSR_GPA_VALUE_MASK,
+ GHCB_MSR_GPA_VALUE_POS);
+ set_ghcb_msr_bits(svm, GHCB_MSR_REG_GPA_RESP, GHCB_MSR_INFO_MASK,
+ GHCB_MSR_INFO_POS);
+ break;
+ }
case GHCB_MSR_TERM_REQ: {
u64 reason_set, reason_code;

@@ -3381,6 +3402,12 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
return 1;
}

+ /* SEV-SNP guest requires that the GHCB GPA must be registered */
+ if (sev_snp_guest(svm->vcpu.kvm) && !ghcb_gpa_is_registered(svm, ghcb_gpa)) {
+ vcpu_unimpl(&svm->vcpu, "vmgexit: GHCB GPA [%#llx] is not registered.\n", ghcb_gpa);
+ return -EINVAL;
+ }
+
ret = sev_es_validate_vmgexit(svm, &exit_code);
if (ret)
return ret;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index c80352c9c0d6..54ff56cb6125 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -206,6 +206,8 @@ struct vcpu_sev_es_state {
*/
u64 ghcb_sw_exit_info_1;
u64 ghcb_sw_exit_info_2;
+
+ u64 ghcb_registered_gpa;
};

struct vcpu_svm {
@@ -334,6 +336,11 @@ static inline bool sev_snp_guest(struct kvm *kvm)
return sev_es_guest(kvm) && sev->snp_active;
}

+static inline bool ghcb_gpa_is_registered(struct vcpu_svm *svm, u64 val)
+{
+ return svm->sev_es.ghcb_registered_gpa == val;
+}
+
static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
{
vmcb->control.clean = 0;
--
2.25.1

2022-06-20 23:16:30

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 33/49] KVM: x86: Update page-fault trace to log full 64-bit error code

From: Brijesh Singh <[email protected]>

The #NPF error code is a 64-bit value, but the trace prints only the
lower 32 bits. Some of the fault error code bits (e.g. PFERR_GUEST_FINAL_MASK)
are only available in the upper 32 bits.
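A small standalone illustration of the truncation being fixed; the bit
position of PFERR_GUEST_FINAL_MASK is an assumption here, used only to show
how a 32-bit field silently drops the upper half of the error code:

  /* Illustrative only: a 32-bit field drops the upper #NPF error bits. */
  #include <stdint.h>
  #include <stdio.h>

  #define PFERR_GUEST_FINAL_MASK (1ULL << 32)   /* assumed bit position */

  int main(void)
  {
          uint64_t error_code = PFERR_GUEST_FINAL_MASK | 0x4; /* write + final */
          uint32_t truncated  = (uint32_t)error_code;         /* old trace field */

          printf("full=%#llx truncated=%#x\n",
                 (unsigned long long)error_code, truncated);  /* 0x100000004 vs 0x4 */
          return 0;
  }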

Cc: <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/kvm/trace.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index e3a24b8f04be..9b9bc5468103 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -383,12 +383,12 @@ TRACE_EVENT(kvm_inj_exception,
* Tracepoint for page fault.
*/
TRACE_EVENT(kvm_page_fault,
- TP_PROTO(unsigned long fault_address, unsigned int error_code),
+ TP_PROTO(unsigned long fault_address, u64 error_code),
TP_ARGS(fault_address, error_code),

TP_STRUCT__entry(
__field( unsigned long, fault_address )
- __field( unsigned int, error_code )
+ __field( u64, error_code )
),

TP_fast_assign(
@@ -396,7 +396,7 @@ TRACE_EVENT(kvm_page_fault,
__entry->error_code = error_code;
),

- TP_printk("address %lx error_code %x",
+ TP_printk("address %lx error_code %llx",
__entry->fault_address, __entry->error_code)
);

--
2.25.1

2022-06-20 23:16:31

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 35/49] KVM: SVM: Remove the long-lived GHCB host map

From: Brijesh Singh <[email protected]>

On VMGEXIT, sev_handle_vmgexit() creates a host mapping for the GHCB GPA
and unmaps it just before VM-entry. This long-lived GHCB map is used by
the VMGEXIT handler through accessors such as ghcb_{get,set}_xxx().

A long-lived GHCB map can cause issues when SEV-SNP is enabled, because
the mapped GPA then needs to be protected against a page state change.

To eliminate the long-lived GHCB mapping, update the GHCB sync operations
to explicitly map the GHCB before access and unmap it after the access is
complete. This requires that the setting of the GHCB's sw_exit_info_{1,2}
fields be done during sev_es_sync_to_ghcb(), so create two new fields in
the vcpu_sev_es_state struct to hold these values when they need to be set
outside of the GHCB mapping.
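As a rough sketch (not part of the patch), the access pattern after this
change looks like the following, using the svm_map_ghcb()/svm_unmap_ghcb()
helpers introduced below:

  /* Sketch only: every GHCB access is now bracketed by a map/unmap pair. */
  static int example_ghcb_access(struct vcpu_svm *svm)
  {
          struct kvm_host_map map;
          struct ghcb *ghcb;

          if (svm_map_ghcb(svm, &map))
                  return -EFAULT;

          ghcb = map.hva;
          /* ... read or write GHCB fields while the mapping is held ... */

          svm_unmap_ghcb(svm, &map);
          return 0;
  }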

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/kvm/svm/sev.c | 131 ++++++++++++++++++++++++++---------------
arch/x86/kvm/svm/svm.c | 12 ++--
arch/x86/kvm/svm/svm.h | 24 +++++++-
3 files changed, 111 insertions(+), 56 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 01ea257e17d6..c70f3f7e06a8 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2823,15 +2823,40 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
kvfree(svm->sev_es.ghcb_sa);
}

+static inline int svm_map_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map)
+{
+ struct vmcb_control_area *control = &svm->vmcb->control;
+ u64 gfn = gpa_to_gfn(control->ghcb_gpa);
+
+ if (kvm_vcpu_map(&svm->vcpu, gfn, map)) {
+ /* Unable to map GHCB from guest */
+ pr_err("error mapping GHCB GFN [%#llx] from guest\n", gfn);
+ return -EFAULT;
+ }
+
+ return 0;
+}
+
+static inline void svm_unmap_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map)
+{
+ kvm_vcpu_unmap(&svm->vcpu, map, true);
+}
+
static void dump_ghcb(struct vcpu_svm *svm)
{
- struct ghcb *ghcb = svm->sev_es.ghcb;
+ struct kvm_host_map map;
unsigned int nbits;
+ struct ghcb *ghcb;
+
+ if (svm_map_ghcb(svm, &map))
+ return;
+
+ ghcb = map.hva;

/* Re-use the dump_invalid_vmcb module parameter */
if (!dump_invalid_vmcb) {
pr_warn_ratelimited("set kvm_amd.dump_invalid_vmcb=1 to dump internal KVM state.\n");
- return;
+ goto e_unmap;
}

nbits = sizeof(ghcb->save.valid_bitmap) * 8;
@@ -2846,12 +2871,21 @@ static void dump_ghcb(struct vcpu_svm *svm)
pr_err("%-20s%016llx is_valid: %u\n", "sw_scratch",
ghcb->save.sw_scratch, ghcb_sw_scratch_is_valid(ghcb));
pr_err("%-20s%*pb\n", "valid_bitmap", nbits, ghcb->save.valid_bitmap);
+
+e_unmap:
+ svm_unmap_ghcb(svm, &map);
}

-static void sev_es_sync_to_ghcb(struct vcpu_svm *svm)
+static bool sev_es_sync_to_ghcb(struct vcpu_svm *svm)
{
struct kvm_vcpu *vcpu = &svm->vcpu;
- struct ghcb *ghcb = svm->sev_es.ghcb;
+ struct kvm_host_map map;
+ struct ghcb *ghcb;
+
+ if (svm_map_ghcb(svm, &map))
+ return false;
+
+ ghcb = map.hva;

/*
* The GHCB protocol so far allows for the following data
@@ -2865,13 +2899,24 @@ static void sev_es_sync_to_ghcb(struct vcpu_svm *svm)
ghcb_set_rbx(ghcb, vcpu->arch.regs[VCPU_REGS_RBX]);
ghcb_set_rcx(ghcb, vcpu->arch.regs[VCPU_REGS_RCX]);
ghcb_set_rdx(ghcb, vcpu->arch.regs[VCPU_REGS_RDX]);
+
+ /*
+ * Copy the return values from the exit_info_{1,2}.
+ */
+ ghcb_set_sw_exit_info_1(ghcb, svm->sev_es.ghcb_sw_exit_info_1);
+ ghcb_set_sw_exit_info_2(ghcb, svm->sev_es.ghcb_sw_exit_info_2);
+
+ trace_kvm_vmgexit_exit(svm->vcpu.vcpu_id, ghcb);
+
+ svm_unmap_ghcb(svm, &map);
+
+ return true;
}

-static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
+static void sev_es_sync_from_ghcb(struct vcpu_svm *svm, struct ghcb *ghcb)
{
struct vmcb_control_area *control = &svm->vmcb->control;
struct kvm_vcpu *vcpu = &svm->vcpu;
- struct ghcb *ghcb = svm->sev_es.ghcb;
u64 exit_code;

/*
@@ -2915,20 +2960,25 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
}

-static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
+static int sev_es_validate_vmgexit(struct vcpu_svm *svm, u64 *exit_code)
{
- struct kvm_vcpu *vcpu;
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct kvm_host_map map;
struct ghcb *ghcb;
- u64 exit_code;
u64 reason;

- ghcb = svm->sev_es.ghcb;
+ if (svm_map_ghcb(svm, &map))
+ return -EFAULT;
+
+ ghcb = map.hva;
+
+ trace_kvm_vmgexit_enter(vcpu->vcpu_id, ghcb);

/*
* Retrieve the exit code now even though it may not be marked valid
* as it could help with debugging.
*/
- exit_code = ghcb_get_sw_exit_code(ghcb);
+ *exit_code = ghcb_get_sw_exit_code(ghcb);

/* Only GHCB Usage code 0 is supported */
if (ghcb->ghcb_usage) {
@@ -3021,6 +3071,9 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
goto vmgexit_err;
}

+ sev_es_sync_from_ghcb(svm, ghcb);
+
+ svm_unmap_ghcb(svm, &map);
return 0;

vmgexit_err:
@@ -3031,10 +3084,10 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
ghcb->ghcb_usage);
} else if (reason == GHCB_ERR_INVALID_EVENT) {
vcpu_unimpl(vcpu, "vmgexit: exit code %#llx is not valid\n",
- exit_code);
+ *exit_code);
} else {
vcpu_unimpl(vcpu, "vmgexit: exit code %#llx input is not valid\n",
- exit_code);
+ *exit_code);
dump_ghcb(svm);
}

@@ -3044,6 +3097,8 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
ghcb_set_sw_exit_info_1(ghcb, 2);
ghcb_set_sw_exit_info_2(ghcb, reason);

+ svm_unmap_ghcb(svm, &map);
+
/* Resume the guest to "return" the error code. */
return 1;
}
@@ -3053,23 +3108,20 @@ void sev_es_unmap_ghcb(struct vcpu_svm *svm)
/* Clear any indication that the vCPU is in a type of AP Reset Hold */
svm->sev_es.ap_reset_hold_type = AP_RESET_HOLD_NONE;

- if (!svm->sev_es.ghcb)
+ if (!svm->sev_es.ghcb_in_use)
return;

/* Sync the scratch buffer area. */
if (svm->sev_es.ghcb_sa_sync) {
kvm_write_guest(svm->vcpu.kvm,
- ghcb_get_sw_scratch(svm->sev_es.ghcb),
+ svm->sev_es.ghcb_sa_gpa,
svm->sev_es.ghcb_sa, svm->sev_es.ghcb_sa_len);
svm->sev_es.ghcb_sa_sync = false;
}

- trace_kvm_vmgexit_exit(svm->vcpu.vcpu_id, svm->sev_es.ghcb);
-
sev_es_sync_to_ghcb(svm);

- kvm_vcpu_unmap(&svm->vcpu, &svm->sev_es.ghcb_map, true);
- svm->sev_es.ghcb = NULL;
+ svm->sev_es.ghcb_in_use = false;
}

void pre_sev_run(struct vcpu_svm *svm, int cpu)
@@ -3099,7 +3151,6 @@ void pre_sev_run(struct vcpu_svm *svm, int cpu)
static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len)
{
struct vmcb_control_area *control = &svm->vmcb->control;
- struct ghcb *ghcb = svm->sev_es.ghcb;
u64 ghcb_scratch_beg, ghcb_scratch_end;
u64 scratch_gpa_beg, scratch_gpa_end;

@@ -3178,8 +3229,8 @@ static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len)
return 0;

e_scratch:
- ghcb_set_sw_exit_info_1(ghcb, 2);
- ghcb_set_sw_exit_info_2(ghcb, GHCB_ERR_INVALID_SCRATCH_AREA);
+ svm_set_ghcb_sw_exit_info_1(&svm->vcpu, 2);
+ svm_set_ghcb_sw_exit_info_2(&svm->vcpu, GHCB_ERR_INVALID_SCRATCH_AREA);

return 1;
}
@@ -3316,7 +3367,6 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
struct vcpu_svm *svm = to_svm(vcpu);
struct vmcb_control_area *control = &svm->vmcb->control;
u64 ghcb_gpa, exit_code;
- struct ghcb *ghcb;
int ret;

/* Validate the GHCB */
@@ -3331,29 +3381,14 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
return 1;
}

- if (kvm_vcpu_map(vcpu, ghcb_gpa >> PAGE_SHIFT, &svm->sev_es.ghcb_map)) {
- /* Unable to map GHCB from guest */
- vcpu_unimpl(vcpu, "vmgexit: error mapping GHCB [%#llx] from guest\n",
- ghcb_gpa);
-
- /* Without a GHCB, just return right back to the guest */
- return 1;
- }
-
- svm->sev_es.ghcb = svm->sev_es.ghcb_map.hva;
- ghcb = svm->sev_es.ghcb_map.hva;
-
- trace_kvm_vmgexit_enter(vcpu->vcpu_id, ghcb);
-
- exit_code = ghcb_get_sw_exit_code(ghcb);
-
- ret = sev_es_validate_vmgexit(svm);
+ ret = sev_es_validate_vmgexit(svm, &exit_code);
if (ret)
return ret;

- sev_es_sync_from_ghcb(svm);
- ghcb_set_sw_exit_info_1(ghcb, 0);
- ghcb_set_sw_exit_info_2(ghcb, 0);
+ svm->sev_es.ghcb_in_use = true;
+
+ svm_set_ghcb_sw_exit_info_1(vcpu, 0);
+ svm_set_ghcb_sw_exit_info_2(vcpu, 0);

switch (exit_code) {
case SVM_VMGEXIT_MMIO_READ:
@@ -3393,20 +3428,20 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
break;
case 1:
/* Get AP jump table address */
- ghcb_set_sw_exit_info_2(ghcb, sev->ap_jump_table);
+ svm_set_ghcb_sw_exit_info_2(vcpu, sev->ap_jump_table);
break;
default:
pr_err("svm: vmgexit: unsupported AP jump table request - exit_info_1=%#llx\n",
control->exit_info_1);
- ghcb_set_sw_exit_info_1(ghcb, 2);
- ghcb_set_sw_exit_info_2(ghcb, GHCB_ERR_INVALID_INPUT);
+ svm_set_ghcb_sw_exit_info_1(vcpu, 2);
+ svm_set_ghcb_sw_exit_info_2(vcpu, GHCB_ERR_INVALID_INPUT);
}

ret = 1;
break;
}
case SVM_VMGEXIT_HV_FEATURES: {
- ghcb_set_sw_exit_info_2(ghcb, GHCB_HV_FT_SUPPORTED);
+ svm_set_ghcb_sw_exit_info_2(vcpu, GHCB_HV_FT_SUPPORTED);

ret = 1;
break;
@@ -3537,7 +3572,7 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
* Return from an AP Reset Hold VMGEXIT, where the guest will
* set the CS and RIP. Set SW_EXIT_INFO_2 to a non-zero value.
*/
- ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
+ svm_set_ghcb_sw_exit_info_2(vcpu, 1);
break;
case AP_RESET_HOLD_MSR_PROTO:
/*
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 18e2cd4d9559..b24e0171cbf2 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2720,14 +2720,14 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
static int svm_complete_emulated_msr(struct kvm_vcpu *vcpu, int err)
{
struct vcpu_svm *svm = to_svm(vcpu);
- if (!err || !sev_es_guest(vcpu->kvm) || WARN_ON_ONCE(!svm->sev_es.ghcb))
+ if (!err || !sev_es_guest(vcpu->kvm) || WARN_ON_ONCE(!svm->sev_es.ghcb_in_use))
return kvm_complete_insn_gp(vcpu, err);

- ghcb_set_sw_exit_info_1(svm->sev_es.ghcb, 1);
- ghcb_set_sw_exit_info_2(svm->sev_es.ghcb,
- X86_TRAP_GP |
- SVM_EVTINJ_TYPE_EXEPT |
- SVM_EVTINJ_VALID);
+ svm_set_ghcb_sw_exit_info_1(vcpu, 1);
+ svm_set_ghcb_sw_exit_info_2(vcpu,
+ X86_TRAP_GP |
+ SVM_EVTINJ_TYPE_EXEPT |
+ SVM_EVTINJ_VALID);
return 1;
}

diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index bd0db4d4a61e..c80352c9c0d6 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -189,8 +189,7 @@ struct svm_nested_state {
struct vcpu_sev_es_state {
/* SEV-ES support */
struct sev_es_save_area *vmsa;
- struct ghcb *ghcb;
- struct kvm_host_map ghcb_map;
+ bool ghcb_in_use;
bool received_first_sipi;
unsigned int ap_reset_hold_type;

@@ -200,6 +199,13 @@ struct vcpu_sev_es_state {
u64 ghcb_sa_gpa;
u32 ghcb_sa_alloc_len;
bool ghcb_sa_sync;
+
+ /*
+ * SEV-ES support to hold the sw_exit_info return values to be
+ * sync'ed to the GHCB when mapped.
+ */
+ u64 ghcb_sw_exit_info_1;
+ u64 ghcb_sw_exit_info_2;
};

struct vcpu_svm {
@@ -614,6 +620,20 @@ void nested_sync_control_from_vmcb02(struct vcpu_svm *svm);
void nested_vmcb02_compute_g_pat(struct vcpu_svm *svm);
void svm_switch_vmcb(struct vcpu_svm *svm, struct kvm_vmcb_info *target_vmcb);

+static inline void svm_set_ghcb_sw_exit_info_1(struct kvm_vcpu *vcpu, u64 val)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ svm->sev_es.ghcb_sw_exit_info_1 = val;
+}
+
+static inline void svm_set_ghcb_sw_exit_info_2(struct kvm_vcpu *vcpu, u64 val)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ svm->sev_es.ghcb_sw_exit_info_2 = val;
+}
+
extern struct kvm_x86_nested_ops svm_nested_ops;

/* avic.c */
--
2.25.1

2022-06-20 23:16:35

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 38/49] KVM: SVM: Add support to handle Page State Change VMGEXIT

From: Brijesh Singh <[email protected]>

SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change NAE event
as defined in the GHCB specification version 2.
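The status returned to the guest in SW_EXITINFO2 packs an error set in the
upper 32 bits and an error code in the low bits. A standalone, abridged
sketch of the mapping performed by map_to_psc_vmgexit_code() below (using
unsigned long long so the shifts are well defined on any host; the overlap
case is omitted here):

  /* Illustrative only: abridged PSC error packing for SW_EXITINFO2. */
  #define PSC_INVALID_HDR   1
  #define PSC_INVALID_ENTRY 2
  #define PSC_UNDEF_ERR     3

  static unsigned long long psc_exitinfo2(int rc)
  {
          switch (rc) {
          case 0:                 return 0;                  /* success */
          case PSC_INVALID_HDR:   return (1ull << 32) | 1;
          case PSC_INVALID_ENTRY: return (1ull << 32) | 2;
          default:                return  4ull << 32;        /* undefined error */
          }
  }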

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/sev-common.h | 7 +++
arch/x86/kvm/svm/sev.c | 79 +++++++++++++++++++++++++++++--
2 files changed, 81 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index ee38f7408470..1b111cde8c82 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -130,6 +130,13 @@ enum psc_op {
/* SNP Page State Change NAE event */
#define VMGEXIT_PSC_MAX_ENTRY 253

+/* The page state change hdr structure is not valid */
+#define PSC_INVALID_HDR 1
+/* The hdr.cur_entry or hdr.end_entry is not valid */
+#define PSC_INVALID_ENTRY 2
+/* Page state change encountered undefined error */
+#define PSC_UNDEF_ERR 3
+
struct psc_hdr {
u16 cur_entry;
u16 end_entry;
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 15900c2f30fc..cb2d1bbb862b 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3066,6 +3066,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm, u64 *exit_code)
case SVM_VMGEXIT_AP_JUMP_TABLE:
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
case SVM_VMGEXIT_HV_FEATURES:
+ case SVM_VMGEXIT_PSC:
break;
default:
reason = GHCB_ERR_INVALID_EVENT;
@@ -3351,13 +3352,13 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,
*/
rc = snp_check_and_build_npt(vcpu, gpa, level);
if (rc)
- return -EINVAL;
+ return PSC_UNDEF_ERR;

if (op == SNP_PAGE_STATE_PRIVATE) {
hva_t hva;

if (snp_gpa_to_hva(kvm, gpa, &hva))
- return -EINVAL;
+ return PSC_UNDEF_ERR;

/*
* Verify that the hva range is registered. This enforcement is
@@ -3369,7 +3370,7 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,
rc = is_hva_registered(kvm, hva, page_level_size(level));
mutex_unlock(&kvm->lock);
if (!rc)
- return -EINVAL;
+ return PSC_UNDEF_ERR;

/*
 * Mark the userspace range unmergeable before adding the pages
@@ -3379,7 +3380,7 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,
rc = snp_mark_unmergable(kvm, hva, page_level_size(level));
mmap_write_unlock(kvm->mm);
if (rc)
- return -EINVAL;
+ return PSC_UNDEF_ERR;
}

write_lock(&kvm->mmu_lock);
@@ -3410,7 +3411,7 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,
rc = rmp_make_private(pfn, gpa, level, sev->asid, false);
break;
default:
- rc = -EINVAL;
+ rc = PSC_INVALID_ENTRY;
break;
}

@@ -3428,6 +3429,65 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,
return 0;
}

+static inline unsigned long map_to_psc_vmgexit_code(int rc)
+{
+ switch (rc) {
+ case PSC_INVALID_HDR:
+ return ((1ul << 32) | 1);
+ case PSC_INVALID_ENTRY:
+ return ((1ul << 32) | 2);
+ case RMPUPDATE_FAIL_OVERLAP:
+ return ((3ul << 32) | 2);
+ default: return (4ul << 32);
+ }
+}
+
+static unsigned long snp_handle_page_state_change(struct vcpu_svm *svm)
+{
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ int level, op, rc = PSC_UNDEF_ERR;
+ struct snp_psc_desc *info;
+ struct psc_entry *entry;
+ u16 cur, end;
+ gpa_t gpa;
+
+ if (!sev_snp_guest(vcpu->kvm))
+ return PSC_INVALID_HDR;
+
+ if (setup_vmgexit_scratch(svm, true, sizeof(*info))) {
+ pr_err("vmgexit: scratch area is not setup.\n");
+ return PSC_INVALID_HDR;
+ }
+
+ info = (struct snp_psc_desc *)svm->sev_es.ghcb_sa;
+ cur = info->hdr.cur_entry;
+ end = info->hdr.end_entry;
+
+ if (cur >= VMGEXIT_PSC_MAX_ENTRY ||
+ end >= VMGEXIT_PSC_MAX_ENTRY || cur > end)
+ return PSC_INVALID_ENTRY;
+
+ for (; cur <= end; cur++) {
+ entry = &info->entries[cur];
+ gpa = gfn_to_gpa(entry->gfn);
+ level = RMP_TO_X86_PG_LEVEL(entry->pagesize);
+ op = entry->operation;
+
+ if (!IS_ALIGNED(gpa, page_level_size(level))) {
+ rc = PSC_INVALID_ENTRY;
+ goto out;
+ }
+
+ rc = __snp_handle_page_state_change(vcpu, op, gpa, level);
+ if (rc)
+ goto out;
+ }
+
+out:
+ info->hdr.cur_entry = cur;
+ return rc ? map_to_psc_vmgexit_code(rc) : 0;
+}
+
static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
{
struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3670,6 +3730,15 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
ret = 1;
break;
}
+ case SVM_VMGEXIT_PSC: {
+ unsigned long rc;
+
+ ret = 1;
+
+ rc = snp_handle_page_state_change(svm);
+ svm_set_ghcb_sw_exit_info_2(vcpu, rc);
+ break;
+ }
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
vcpu_unimpl(vcpu,
"vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
--
2.25.1

2022-06-20 23:16:57

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 34/49] KVM: SVM: Do not use long-lived GHCB map while setting scratch area

From: Brijesh Singh <[email protected]>

The setup_vmgexit_scratch() function may rely on the long-lived GHCB
mapping if the GHCB shared buffer area was used for the scratch area.
In preparation for eliminating the long-lived GHCB mapping, always
allocate a buffer for the scratch area so that it can be accessed without
the GHCB mapping.
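As a rough userspace analogue (not kernel code), the reuse pattern
introduced here only reallocates when the requested length outgrows the
current buffer, and otherwise keeps reusing the existing allocation:

  /* Illustrative only: grow-only scratch buffer reuse. */
  #include <stdlib.h>

  struct scratch {
          void   *buf;
          size_t  alloc_len;
  };

  static int scratch_reserve(struct scratch *s, size_t len)
  {
          if (s->alloc_len < len) {
                  void *n = calloc(1, len);

                  if (!n)
                          return -1;
                  free(s->buf);          /* drop the smaller buffer */
                  s->buf = n;
                  s->alloc_len = len;
          }
          return 0;
  }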

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/kvm/svm/sev.c | 74 +++++++++++++++++++-----------------------
arch/x86/kvm/svm/svm.h | 3 +-
2 files changed, 36 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 91d3d24e60d2..01ea257e17d6 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2820,8 +2820,7 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
__free_page(virt_to_page(svm->sev_es.vmsa));

skip_vmsa_free:
- if (svm->sev_es.ghcb_sa_free)
- kvfree(svm->sev_es.ghcb_sa);
+ kvfree(svm->sev_es.ghcb_sa);
}

static void dump_ghcb(struct vcpu_svm *svm)
@@ -2909,6 +2908,9 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
control->exit_info_1 = ghcb_get_sw_exit_info_1(ghcb);
control->exit_info_2 = ghcb_get_sw_exit_info_2(ghcb);

+ /* Copy the GHCB scratch area GPA */
+ svm->sev_es.ghcb_sa_gpa = ghcb_get_sw_scratch(ghcb);
+
/* Clear the valid entries fields */
memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
}
@@ -3054,23 +3056,12 @@ void sev_es_unmap_ghcb(struct vcpu_svm *svm)
if (!svm->sev_es.ghcb)
return;

- if (svm->sev_es.ghcb_sa_free) {
- /*
- * The scratch area lives outside the GHCB, so there is a
- * buffer that, depending on the operation performed, may
- * need to be synced, then freed.
- */
- if (svm->sev_es.ghcb_sa_sync) {
- kvm_write_guest(svm->vcpu.kvm,
- ghcb_get_sw_scratch(svm->sev_es.ghcb),
- svm->sev_es.ghcb_sa,
- svm->sev_es.ghcb_sa_len);
- svm->sev_es.ghcb_sa_sync = false;
- }
-
- kvfree(svm->sev_es.ghcb_sa);
- svm->sev_es.ghcb_sa = NULL;
- svm->sev_es.ghcb_sa_free = false;
+ /* Sync the scratch buffer area. */
+ if (svm->sev_es.ghcb_sa_sync) {
+ kvm_write_guest(svm->vcpu.kvm,
+ ghcb_get_sw_scratch(svm->sev_es.ghcb),
+ svm->sev_es.ghcb_sa, svm->sev_es.ghcb_sa_len);
+ svm->sev_es.ghcb_sa_sync = false;
}

trace_kvm_vmgexit_exit(svm->vcpu.vcpu_id, svm->sev_es.ghcb);
@@ -3111,9 +3102,8 @@ static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len)
struct ghcb *ghcb = svm->sev_es.ghcb;
u64 ghcb_scratch_beg, ghcb_scratch_end;
u64 scratch_gpa_beg, scratch_gpa_end;
- void *scratch_va;

- scratch_gpa_beg = ghcb_get_sw_scratch(ghcb);
+ scratch_gpa_beg = svm->sev_es.ghcb_sa_gpa;
if (!scratch_gpa_beg) {
pr_err("vmgexit: scratch gpa not provided\n");
goto e_scratch;
@@ -3143,9 +3133,6 @@ static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len)
scratch_gpa_beg, scratch_gpa_end);
goto e_scratch;
}
-
- scratch_va = (void *)svm->sev_es.ghcb;
- scratch_va += (scratch_gpa_beg - control->ghcb_gpa);
} else {
/*
* The guest memory must be read into a kernel buffer, so
@@ -3156,29 +3143,36 @@ static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len)
len, GHCB_SCRATCH_AREA_LIMIT);
goto e_scratch;
}
- scratch_va = kvzalloc(len, GFP_KERNEL_ACCOUNT);
- if (!scratch_va)
- return -ENOMEM;
+ }

- if (kvm_read_guest(svm->vcpu.kvm, scratch_gpa_beg, scratch_va, len)) {
- /* Unable to copy scratch area from guest */
- pr_err("vmgexit: kvm_read_guest for scratch area failed\n");
+ if (svm->sev_es.ghcb_sa_alloc_len < len) {
+ void *scratch_va = kvzalloc(len, GFP_KERNEL_ACCOUNT);

- kvfree(scratch_va);
- return -EFAULT;
- }
+ if (!scratch_va)
+ return -ENOMEM;

/*
- * The scratch area is outside the GHCB. The operation will
- * dictate whether the buffer needs to be synced before running
- * the vCPU next time (i.e. a read was requested so the data
- * must be written back to the guest memory).
+ * Free the old scratch area and switch to using the newly
+ * allocated buffer.
*/
- svm->sev_es.ghcb_sa_sync = sync;
- svm->sev_es.ghcb_sa_free = true;
+ kvfree(svm->sev_es.ghcb_sa);
+
+ svm->sev_es.ghcb_sa_alloc_len = len;
+ svm->sev_es.ghcb_sa = scratch_va;
}

- svm->sev_es.ghcb_sa = scratch_va;
+ if (kvm_read_guest(svm->vcpu.kvm, scratch_gpa_beg, svm->sev_es.ghcb_sa, len)) {
+ /* Unable to copy scratch area from guest */
+ pr_err("vmgexit: kvm_read_guest for scratch area failed\n");
+ return -EFAULT;
+ }
+
+ /*
+ * The operation will dictate whether the buffer needs to be synced
+ * before running the vCPU next time (i.e. a read was requested so
+ * the data must be written back to the guest memory).
+ */
+ svm->sev_es.ghcb_sa_sync = sync;
svm->sev_es.ghcb_sa_len = len;

return 0;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 7782312a1cda..bd0db4d4a61e 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -197,8 +197,9 @@ struct vcpu_sev_es_state {
/* SEV-ES scratch area support */
void *ghcb_sa;
u32 ghcb_sa_len;
+ u64 ghcb_sa_gpa;
+ u32 ghcb_sa_alloc_len;
bool ghcb_sa_sync;
- bool ghcb_sa_free;
};

struct vcpu_svm {
--
2.25.1

2022-06-20 23:17:22

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 37/49] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT

From: Brijesh Singh <[email protected]>

SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change MSR protocol
as defined in the GHCB specification.

Before changing the page state in the RMP entry, look up the page in the
NPT to make sure that there is a valid mapping for it. If the mapping
exists, then try to find a workable page level between the NPT and RMP for
the page. If the page is not mapped in the NPT, then create a fault such
that it gets mapped before we change the page state in the RMP entry.
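For reference, a standalone sketch of how a guest packs a PSC request into
the GHCB MSR using the bit positions added below (the helper name is
illustrative, not part of the patch):

  /* Illustrative only: encode a Page State Change request for the MSR protocol. */
  #include <stdint.h>

  #define GHCB_MSR_PSC_REQ      0x014ULL
  #define GHCB_MSR_PSC_GFN_POS  12
  #define GHCB_MSR_PSC_GFN_MASK 0xffffffffffULL   /* GENMASK_ULL(39, 0) */
  #define GHCB_MSR_PSC_OP_POS   52
  #define GHCB_MSR_PSC_OP_MASK  0xfULL

  static uint64_t psc_msr_req(uint64_t gfn, uint8_t op)
  {
          return (((uint64_t)op & GHCB_MSR_PSC_OP_MASK) << GHCB_MSR_PSC_OP_POS) |
                 ((gfn & GHCB_MSR_PSC_GFN_MASK) << GHCB_MSR_PSC_GFN_POS) |
                 GHCB_MSR_PSC_REQ;
  }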

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/sev-common.h | 9 ++
arch/x86/kvm/svm/sev.c | 197 ++++++++++++++++++++++++++++++
arch/x86/kvm/trace.h | 34 ++++++
arch/x86/kvm/x86.c | 1 +
4 files changed, 241 insertions(+)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 0a9055cdfae2..ee38f7408470 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -93,6 +93,10 @@ enum psc_op {
};

#define GHCB_MSR_PSC_REQ 0x014
+#define GHCB_MSR_PSC_GFN_POS 12
+#define GHCB_MSR_PSC_GFN_MASK GENMASK_ULL(39, 0)
+#define GHCB_MSR_PSC_OP_POS 52
+#define GHCB_MSR_PSC_OP_MASK 0xf
#define GHCB_MSR_PSC_REQ_GFN(gfn, op) \
/* GHCBData[55:52] */ \
(((u64)((op) & 0xf) << 52) | \
@@ -102,6 +106,11 @@ enum psc_op {
GHCB_MSR_PSC_REQ)

#define GHCB_MSR_PSC_RESP 0x015
+#define GHCB_MSR_PSC_ERROR_POS 32
+#define GHCB_MSR_PSC_ERROR_MASK GENMASK_ULL(31, 0)
+#define GHCB_MSR_PSC_ERROR GENMASK_ULL(31, 0)
+#define GHCB_MSR_PSC_RSVD_POS 12
+#define GHCB_MSR_PSC_RSVD_MASK GENMASK_ULL(19, 0)
#define GHCB_MSR_PSC_RESP_VAL(val) \
/* GHCBData[63:32] */ \
(((u64)(val) & GENMASK_ULL(63, 32)) >> 32)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 6de48130e414..15900c2f30fc 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -32,6 +32,7 @@
#include "svm_ops.h"
#include "cpuid.h"
#include "trace.h"
+#include "mmu.h"

#ifndef CONFIG_KVM_AMD_SEV
/*
@@ -3252,6 +3253,181 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
svm->vmcb->control.ghcb_gpa = value;
}

+static int snp_rmptable_psmash(struct kvm *kvm, kvm_pfn_t pfn)
+{
+ pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+
+ return psmash(pfn);
+}
+
+static int snp_make_page_shared(struct kvm *kvm, gpa_t gpa, kvm_pfn_t pfn, int level)
+{
+ int rc, rmp_level;
+
+ rc = snp_lookup_rmpentry(pfn, &rmp_level);
+ if (rc < 0)
+ return -EINVAL;
+
+ /* If page is not assigned then do nothing */
+ if (!rc)
+ return 0;
+
+ /*
+ * Is the page part of an existing 2MB RMP entry? Split the 2MB entry
+ * into multiple 4K pages before making the memory shared.
+ */
+ if (level == PG_LEVEL_4K && rmp_level == PG_LEVEL_2M) {
+ rc = snp_rmptable_psmash(kvm, pfn);
+ if (rc)
+ return rc;
+ }
+
+ return rmp_make_shared(pfn, level);
+}
+
+static int snp_check_and_build_npt(struct kvm_vcpu *vcpu, gpa_t gpa, int level)
+{
+ struct kvm *kvm = vcpu->kvm;
+ int rc, npt_level;
+ kvm_pfn_t pfn;
+
+ /*
+ * Get the pfn and level for the gpa from the nested page table.
+ *
+ * If the tdp walk fails, then it's safe to say that there is no
+ * valid mapping for this gpa. Create a fault to build the map.
+ */
+ write_lock(&kvm->mmu_lock);
+ rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
+ write_unlock(&kvm->mmu_lock);
+ if (!rc) {
+ pfn = kvm_mmu_map_tdp_page(vcpu, gpa, PFERR_USER_MASK, level);
+ if (is_error_noslot_pfn(pfn))
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int snp_gpa_to_hva(struct kvm *kvm, gpa_t gpa, hva_t *hva)
+{
+ struct kvm_memory_slot *slot;
+ gfn_t gfn = gpa_to_gfn(gpa);
+ int idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+ slot = gfn_to_memslot(kvm, gfn);
+ if (!slot) {
+ srcu_read_unlock(&kvm->srcu, idx);
+ return -EINVAL;
+ }
+
+ /*
+ * Note, using the __gfn_to_hva_memslot() is not solely for performance,
+ * it's also necessary to avoid the "writable" check in __gfn_to_hva_many(),
+ * which will always fail on read-only memslots due to gfn_to_hva() assuming
+ * writes.
+ */
+ *hva = __gfn_to_hva_memslot(slot, gfn);
+ srcu_read_unlock(&kvm->srcu, idx);
+
+ return 0;
+}
+
+static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op, gpa_t gpa,
+ int level)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info;
+ struct kvm *kvm = vcpu->kvm;
+ int rc, npt_level;
+ kvm_pfn_t pfn;
+ gpa_t gpa_end;
+
+ gpa_end = gpa + page_level_size(level);
+
+ while (gpa < gpa_end) {
+ /*
+ * If the gpa is not present in the NPT then build the NPT.
+ */
+ rc = snp_check_and_build_npt(vcpu, gpa, level);
+ if (rc)
+ return -EINVAL;
+
+ if (op == SNP_PAGE_STATE_PRIVATE) {
+ hva_t hva;
+
+ if (snp_gpa_to_hva(kvm, gpa, &hva))
+ return -EINVAL;
+
+ /*
+ * Verify that the hva range is registered. This enforcement is
+ * required to avoid the cases where a page is marked private
+ * in the RMP table but never gets cleaned up during the VM
+ * termination path.
+ */
+ mutex_lock(&kvm->lock);
+ rc = is_hva_registered(kvm, hva, page_level_size(level));
+ mutex_unlock(&kvm->lock);
+ if (!rc)
+ return -EINVAL;
+
+ /*
+ * Mark the userspace range unmergeable before adding the pages
+ * in the RMP table.
+ */
+ mmap_write_lock(kvm->mm);
+ rc = snp_mark_unmergable(kvm, hva, page_level_size(level));
+ mmap_write_unlock(kvm->mm);
+ if (rc)
+ return -EINVAL;
+ }
+
+ write_lock(&kvm->mmu_lock);
+
+ rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
+ if (!rc) {
+ /*
+ * This may happen if another vCPU unmapped the page
+ * before we acquire the lock. Retry the PSC.
+ */
+ write_unlock(&kvm->mmu_lock);
+ return 0;
+ }
+
+ /*
+ * Adjust the level so that we don't go higher than the backing
+ * page level.
+ */
+ level = min_t(size_t, level, npt_level);
+
+ trace_kvm_snp_psc(vcpu->vcpu_id, pfn, gpa, op, level);
+
+ switch (op) {
+ case SNP_PAGE_STATE_SHARED:
+ rc = snp_make_page_shared(kvm, gpa, pfn, level);
+ break;
+ case SNP_PAGE_STATE_PRIVATE:
+ rc = rmp_make_private(pfn, gpa, level, sev->asid, false);
+ break;
+ default:
+ rc = -EINVAL;
+ break;
+ }
+
+ write_unlock(&kvm->mmu_lock);
+
+ if (rc) {
+ pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n",
+ op, gpa, pfn, level, rc);
+ return rc;
+ }
+
+ gpa = gpa + page_level_size(level);
+ }
+
+ return 0;
+}
+
static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
{
struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3352,6 +3528,27 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
GHCB_MSR_INFO_POS);
break;
}
+ case GHCB_MSR_PSC_REQ: {
+ gfn_t gfn;
+ int ret;
+ enum psc_op op;
+
+ gfn = get_ghcb_msr_bits(svm, GHCB_MSR_PSC_GFN_MASK, GHCB_MSR_PSC_GFN_POS);
+ op = get_ghcb_msr_bits(svm, GHCB_MSR_PSC_OP_MASK, GHCB_MSR_PSC_OP_POS);
+
+ ret = __snp_handle_page_state_change(vcpu, op, gfn_to_gpa(gfn), PG_LEVEL_4K);
+
+ if (ret)
+ set_ghcb_msr_bits(svm, GHCB_MSR_PSC_ERROR,
+ GHCB_MSR_PSC_ERROR_MASK, GHCB_MSR_PSC_ERROR_POS);
+ else
+ set_ghcb_msr_bits(svm, 0,
+ GHCB_MSR_PSC_ERROR_MASK, GHCB_MSR_PSC_ERROR_POS);
+
+ set_ghcb_msr_bits(svm, 0, GHCB_MSR_PSC_RSVD_MASK, GHCB_MSR_PSC_RSVD_POS);
+ set_ghcb_msr_bits(svm, GHCB_MSR_PSC_RESP, GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
+ break;
+ }
case GHCB_MSR_TERM_REQ: {
u64 reason_set, reason_code;

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 9b9bc5468103..79801e50344a 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -7,6 +7,7 @@
#include <asm/svm.h>
#include <asm/clocksource.h>
#include <asm/pvclock-abi.h>
+#include <asm/sev-common.h>

#undef TRACE_SYSTEM
#define TRACE_SYSTEM kvm
@@ -1755,6 +1756,39 @@ TRACE_EVENT(kvm_vmgexit_msr_protocol_exit,
__entry->vcpu_id, __entry->ghcb_gpa, __entry->result)
);

+/*
+ * Tracepoint for the SEV-SNP page state change processing
+ */
+#define psc_operation \
+ {SNP_PAGE_STATE_PRIVATE, "private"}, \
+ {SNP_PAGE_STATE_SHARED, "shared"} \
+
+TRACE_EVENT(kvm_snp_psc,
+ TP_PROTO(unsigned int vcpu_id, u64 pfn, u64 gpa, u8 op, int level),
+ TP_ARGS(vcpu_id, pfn, gpa, op, level),
+
+ TP_STRUCT__entry(
+ __field(int, vcpu_id)
+ __field(u64, pfn)
+ __field(u64, gpa)
+ __field(u8, op)
+ __field(int, level)
+ ),
+
+ TP_fast_assign(
+ __entry->vcpu_id = vcpu_id;
+ __entry->pfn = pfn;
+ __entry->gpa = gpa;
+ __entry->op = op;
+ __entry->level = level;
+ ),
+
+ TP_printk("vcpu %u, pfn %llx, gpa %llx, op %s, level %d",
+ __entry->vcpu_id, __entry->pfn, __entry->gpa,
+ __print_symbolic(__entry->op, psc_operation),
+ __entry->level)
+);
+
#endif /* _TRACE_KVM_H */

#undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 50fff5202e7e..4a1d16231e30 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13066,6 +13066,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_enter);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_exit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_enter);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_snp_psc);

static int __init kvm_x86_init(void)
{
--
2.25.1

2022-06-20 23:17:31

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 39/49] KVM: SVM: Introduce ops for the post gfn map and unmap

From: Brijesh Singh <[email protected]>

When SEV-SNP is enabled in the guest VM, the guest memory pages can
be either private or shared. A write from the hypervisor goes through
the RMP checks. If hardware sees that the hypervisor is attempting to
write to a guest private page, it triggers an RMP violation #PF.

To avoid RMP violations on GHCB pages, add new post_{map,unmap}_gfn
functions to verify whether it is safe to map GHCB pages. A spinlock is
used to protect against a page state change for pages that are already
mapped.

Generic post_{map,unmap}_gfn() ops are needed so that it can be verified
that it is safe to map a given guest page in the hypervisor.

This patch will need to be revisited later, after consensus is reached on
how to manage guest private memory, as UPM private memslots will probably
be able to handle this page state change more gracefully.

Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 3 ++
arch/x86/kvm/svm/sev.c | 48 ++++++++++++++++++++++++++++--
arch/x86/kvm/svm/svm.c | 3 ++
arch/x86/kvm/svm/svm.h | 11 +++++++
5 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index e0068e702692..2dd2bc0cf4c3 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -130,6 +130,7 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP(alloc_apic_backing_page)
KVM_X86_OP_OPTIONAL(rmp_page_level_adjust)
+KVM_X86_OP(update_protected_guest_state)

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 49b217dc8d7e..8abc0e724f5c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1522,7 +1522,10 @@ struct kvm_x86_ops {
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);

void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
+
void (*rmp_page_level_adjust)(struct kvm *kvm, kvm_pfn_t pfn, int *level);
+
+ int (*update_protected_guest_state)(struct kvm_vcpu *vcpu);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index cb2d1bbb862b..4ed90331bca0 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -341,6 +341,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
if (ret)
goto e_free;

+ spin_lock_init(&sev->psc_lock);
ret = sev_snp_init(&argp->error);
} else {
ret = sev_platform_init(&argp->error);
@@ -2828,19 +2829,28 @@ static inline int svm_map_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map)
{
struct vmcb_control_area *control = &svm->vmcb->control;
u64 gfn = gpa_to_gfn(control->ghcb_gpa);
+ struct kvm_vcpu *vcpu = &svm->vcpu;

- if (kvm_vcpu_map(&svm->vcpu, gfn, map)) {
+ if (kvm_vcpu_map(vcpu, gfn, map)) {
/* Unable to map GHCB from guest */
pr_err("error mapping GHCB GFN [%#llx] from guest\n", gfn);
return -EFAULT;
}

+ if (sev_post_map_gfn(vcpu->kvm, map->gfn, map->pfn)) {
+ kvm_vcpu_unmap(vcpu, map, false);
+ return -EBUSY;
+ }
+
return 0;
}

static inline void svm_unmap_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map)
{
- kvm_vcpu_unmap(&svm->vcpu, map, true);
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+
+ kvm_vcpu_unmap(vcpu, map, true);
+ sev_post_unmap_gfn(vcpu->kvm, map->gfn, map->pfn);
}

static void dump_ghcb(struct vcpu_svm *svm)
@@ -3383,6 +3393,8 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,
return PSC_UNDEF_ERR;
}

+ spin_lock(&sev->psc_lock);
+
write_lock(&kvm->mmu_lock);

rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
@@ -3417,6 +3429,8 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,

write_unlock(&kvm->mmu_lock);

+ spin_unlock(&sev->psc_lock);
+
if (rc) {
pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n",
op, gpa, pfn, level, rc);
@@ -3965,3 +3979,33 @@ void sev_rmp_page_level_adjust(struct kvm *kvm, kvm_pfn_t pfn, int *level)
/* Adjust the level to keep the NPT and RMP in sync */
*level = min_t(size_t, *level, rmp_level);
}
+
+int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+ int level;
+
+ if (!sev_snp_guest(kvm))
+ return 0;
+
+ spin_lock(&sev->psc_lock);
+
+ /* If pfn is not added as private then fail */
+ if (snp_lookup_rmpentry(pfn, &level) == 1) {
+ spin_unlock(&sev->psc_lock);
+ pr_err_ratelimited("failed to map private gfn 0x%llx pfn 0x%llx\n", gfn, pfn);
+ return -EBUSY;
+ }
+
+ return 0;
+}
+
+void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+
+ if (!sev_snp_guest(kvm))
+ return;
+
+ spin_unlock(&sev->psc_lock);
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b24e0171cbf2..1c8e035ba011 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4734,7 +4734,10 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,

.alloc_apic_backing_page = svm_alloc_apic_backing_page,
+
.rmp_page_level_adjust = sev_rmp_page_level_adjust,
+
+ .update_protected_guest_state = sev_snp_update_protected_guest_state,
};

/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 54ff56cb6125..3fd95193ed8d 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -79,19 +79,25 @@ struct kvm_sev_info {
bool active; /* SEV enabled guest */
bool es_active; /* SEV-ES enabled guest */
bool snp_active; /* SEV-SNP enabled guest */
+
unsigned int asid; /* ASID used for this guest */
unsigned int handle; /* SEV firmware handle */
int fd; /* SEV device fd */
+
unsigned long pages_locked; /* Number of pages locked */
struct list_head regions_list; /* List of registered regions */
+
u64 ap_jump_table; /* SEV-ES AP Jump Table address */
+
struct kvm *enc_context_owner; /* Owner of copied encryption context */
struct list_head mirror_vms; /* List of VMs mirroring */
struct list_head mirror_entry; /* Use as a list entry of mirrors */
struct misc_cg *misc_cg; /* For misc cgroup accounting */
atomic_t migration_in_progress;
+
u64 snp_init_flags;
void *snp_context; /* SNP guest context page */
+ spinlock_t psc_lock;
};

struct kvm_svm {
@@ -702,6 +708,11 @@ void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa);
void sev_es_unmap_ghcb(struct vcpu_svm *svm);
struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
void sev_rmp_page_level_adjust(struct kvm *kvm, kvm_pfn_t pfn, int *level);
+int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn);
+void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn);
+void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
+void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
+int sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu);

/* vmenter.S */

--
2.25.1

2022-06-20 23:18:11

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 40/49] KVM: x86: Export the kvm_zap_gfn_range() for the SNP use

From: Brijesh Singh <[email protected]>

While resolving an RMP page fault, we may run into cases where the page
level between the RMP entry and the TDP does not match and the 2M RMP entry
must be split into 4K RMP entries, or a 2M TDP page needs to be broken
into multiple 4K pages.

To keep the RMP and TDP page levels in sync, zap the gfn range after
splitting the pages in the RMP entry. The zap should force the TDP
mapping to get rebuilt with the new page level.
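A tiny standalone illustration of the range that gets zapped: the faulting
gfn is rounded down to a 2M boundary and the following 512 4K frames are
invalidated (the 512-frame constant is assumed here for x86-64):

  /* Illustrative only: compute the 2M-aligned gfn range to zap. */
  #include <stdint.h>
  #include <stdio.h>

  #define PAGES_PER_2M 512ULL   /* assumed: 2M / 4K */

  int main(void)
  {
          uint64_t gfn   = 0x12345;                    /* faulting gfn */
          uint64_t start = gfn & ~(PAGES_PER_2M - 1);  /* round down to 2M */
          uint64_t end   = start + PAGES_PER_2M;       /* exclusive */

          printf("zap [%#llx, %#llx)\n",
                 (unsigned long long)start, (unsigned long long)end);
          return 0;
  }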

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu.h | 2 --
arch/x86/kvm/mmu/mmu.c | 1 +
3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8abc0e724f5c..1db4d178eb1d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1627,6 +1627,8 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
void kvm_mmu_zap_all(struct kvm *kvm);
void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
+

int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index d55b5166389a..c5044958a0fa 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -267,8 +267,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
return -(u32)fault & errcode;
}

-void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
-
int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);

int kvm_mmu_post_init_vm(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c1ac486e096e..67120bfeb667 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6084,6 +6084,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,

return need_tlb_flush;
}
+EXPORT_SYMBOL_GPL(kvm_zap_gfn_range);

void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
const struct kvm_memory_slot *slot)
--
2.25.1

2022-06-20 23:18:13

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 41/49] KVM: SVM: Add support to handle the RMP nested page fault

From: Brijesh Singh <[email protected]>

When SEV-SNP is enabled in the guest, the hardware places restrictions on
all memory accesses based on the contents of the RMP table. When hardware
encounters an RMP check failure caused by a guest memory access, it raises
a #NPF. The error code contains additional information on the access
type. See the APM volume 2 for additional information.
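For orientation, a hedged sketch of how the handler classifies the fault
from the error code; the bit positions are defined earlier in this series
and the values used here are assumptions for illustration only:

  /* Illustrative only: classify an RMP #NPF from the error code bits. */
  #include <stdint.h>
  #include <stdbool.h>

  /* Assumed bit positions, for illustration only. */
  #define PFERR_GUEST_RMP_MASK   (1ULL << 31)
  #define PFERR_GUEST_ENC_MASK   (1ULL << 34)
  #define PFERR_GUEST_SIZEM_MASK (1ULL << 35)

  struct rmp_fault {
          bool rmp;       /* fault caused by an RMP check */
          bool private;   /* access was to encrypted (private) memory */
          bool sizem;     /* page-size mismatch between NPT and RMP */
  };

  static struct rmp_fault classify(uint64_t error_code)
  {
          return (struct rmp_fault){
                  .rmp     = error_code & PFERR_GUEST_RMP_MASK,
                  .private = error_code & PFERR_GUEST_ENC_MASK,
                  .sizem   = error_code & PFERR_GUEST_SIZEM_MASK,
          };
  }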

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/kvm/svm/sev.c | 76 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.c | 14 +++++---
2 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 4ed90331bca0..7fc0fad87054 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -4009,3 +4009,79 @@ void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)

spin_unlock(&sev->psc_lock);
}
+
+void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
+{
+ int rmp_level, npt_level, rc, assigned;
+ struct kvm *kvm = vcpu->kvm;
+ gfn_t gfn = gpa_to_gfn(gpa);
+ bool need_psc = false;
+ enum psc_op psc_op;
+ kvm_pfn_t pfn;
+ bool private;
+
+ write_lock(&kvm->mmu_lock);
+
+ if (unlikely(!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level)))
+ goto unlock;
+
+ assigned = snp_lookup_rmpentry(pfn, &rmp_level);
+ if (unlikely(assigned < 0))
+ goto unlock;
+
+ private = !!(error_code & PFERR_GUEST_ENC_MASK);
+
+ /*
+ * If the fault was due to a size mismatch, or the NPT and RMP page levels
+ * are not in sync, then use PSMASH to split the RMP entry into 4K.
+ */
+ if ((error_code & PFERR_GUEST_SIZEM_MASK) ||
+ (npt_level == PG_LEVEL_4K && rmp_level == PG_LEVEL_2M && private)) {
+ rc = snp_rmptable_psmash(kvm, pfn);
+ if (rc)
+ pr_err_ratelimited("psmash failed, gpa 0x%llx pfn 0x%llx rc %d\n",
+ gpa, pfn, rc);
+ goto out;
+ }
+
+ /*
+ * If it's a private access, and the page is not assigned in the
+ * RMP table, create a new private RMP entry. This can happen if
+ * guest did not use the PSC VMGEXIT to transition the page state
+ * before the access.
+ */
+ if (!assigned && private) {
+ need_psc = 1;
+ psc_op = SNP_PAGE_STATE_PRIVATE;
+ goto out;
+ }
+
+ /*
+ * If it's a shared access, but the page is private in the RMP table
+ * then make the page shared in the RMP table. This can happen if
+ * the guest did not use the PSC VMGEXIT to transition the page
+ * state before the access.
+ */
+ if (assigned && !private) {
+ need_psc = 1;
+ psc_op = SNP_PAGE_STATE_SHARED;
+ }
+
+out:
+ write_unlock(&kvm->mmu_lock);
+
+ if (need_psc)
+ rc = __snp_handle_page_state_change(vcpu, psc_op, gpa, PG_LEVEL_4K);
+
+ /*
+ * The fault handler has updated the RMP pagesize, zap the existing
+ * rmaps for large entry ranges so that nested page table gets rebuilt
+ * with the updated RMP pagesize.
+ */
+ gfn = gpa_to_gfn(gpa) & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+ kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
+ return;
+
+unlock:
+ write_unlock(&kvm->mmu_lock);
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 1c8e035ba011..7742bc986afc 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1866,15 +1866,21 @@ static int pf_interception(struct kvm_vcpu *vcpu)
static int npf_interception(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
+ int rc;

u64 fault_address = svm->vmcb->control.exit_info_2;
u64 error_code = svm->vmcb->control.exit_info_1;

trace_kvm_page_fault(fault_address, error_code);
- return kvm_mmu_page_fault(vcpu, fault_address, error_code,
- static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
- svm->vmcb->control.insn_bytes : NULL,
- svm->vmcb->control.insn_len);
+ rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
+ static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
+ svm->vmcb->control.insn_bytes : NULL,
+ svm->vmcb->control.insn_len);
+
+ if (error_code & PFERR_GUEST_RMP_MASK)
+ handle_rmp_page_fault(vcpu, fault_address, error_code);
+
+ return rc;
}

static int db_interception(struct kvm_vcpu *vcpu)
--
2.25.1

2022-06-20 23:19:36

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 42/49] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event

From: Brijesh Singh <[email protected]>

Version 2 of the GHCB specification added support for two SNP Guest
Request Message NAE events. These events allow an SEV-SNP guest to
make requests to the SEV-SNP firmware through the hypervisor using the
SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification.

SNP_EXT_GUEST_REQUEST is similar to SNP_GUEST_REQUEST, with the
difference of an additional certificate blob that can be passed through
the SNP_SET_CONFIG ioctl defined in the CCP driver. The CCP driver
provides snp_guest_ext_guest_request(), which is used by KVM to get
both the report and the certificate data at once.
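One detail worth calling out is the certificate length negotiation in the
extended request: the guest passes the data GPA in RAX and the number of
pages in RBX; if the supplied buffer is too small, the firmware error
SNP_GUEST_REQ_INVALID_LEN is returned and the required number of pages is
written back to RBX so the guest can retry. A minimal, hedged sketch of
that negotiation (names and return values illustrative, not the actual
firmware interface):

  /* Illustrative only: report the needed size when the caller's buffer is short. */
  static int certs_negotiate(unsigned long *npages, unsigned long needed)
  {
          if (*npages < needed) {
                  *npages = needed;   /* reported back to the guest in RBX */
                  return -1;          /* maps to SNP_GUEST_REQ_INVALID_LEN */
          }
          return 0;
  }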

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/kvm/svm/sev.c | 196 +++++++++++++++++++++++++++++++++++++++--
arch/x86/kvm/svm/svm.h | 2 +
2 files changed, 192 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 7fc0fad87054..089af21a4efe 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -343,6 +343,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)

spin_lock_init(&sev->psc_lock);
ret = sev_snp_init(&argp->error);
+ mutex_init(&sev->guest_req_lock);
} else {
ret = sev_platform_init(&argp->error);
}
@@ -1884,23 +1885,39 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)

static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
{
+ void *context = NULL, *certs_data = NULL, *resp_page = NULL;
+ struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
struct sev_data_snp_gctx_create data = {};
- void *context;
int rc;

+ /* Allocate memory used for the certs data in SNP guest request */
+ certs_data = kmalloc(SEV_FW_BLOB_MAX_SIZE, GFP_KERNEL_ACCOUNT);
+ if (!certs_data)
+ return NULL;
+
/* Allocate memory for context page */
context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
if (!context)
- return NULL;
+ goto e_free;
+
+ /* Allocate a firmware buffer used during the guest command handling. */
+ resp_page = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
+ if (!resp_page)
+ goto e_free;

data.gctx_paddr = __psp_pa(context);
rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
- if (rc) {
- snp_free_firmware_page(context);
- return NULL;
- }
+ if (rc)
+ goto e_free;
+
+ sev->snp_certs_data = certs_data;

return context;
+
+e_free:
+ snp_free_firmware_page(context);
+ kfree(certs_data);
+ return NULL;
}

static int snp_bind_asid(struct kvm *kvm, int *error)
@@ -2565,6 +2582,8 @@ static int snp_decommission_context(struct kvm *kvm)
snp_free_firmware_page(sev->snp_context);
sev->snp_context = NULL;

+ kfree(sev->snp_certs_data);
+
return 0;
}

@@ -3077,6 +3096,8 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm, u64 *exit_code)
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
case SVM_VMGEXIT_HV_FEATURES:
case SVM_VMGEXIT_PSC:
+ case SVM_VMGEXIT_GUEST_REQUEST:
+ case SVM_VMGEXIT_EXT_GUEST_REQUEST:
break;
default:
reason = GHCB_ERR_INVALID_EVENT;
@@ -3502,6 +3523,155 @@ static unsigned long snp_handle_page_state_change(struct vcpu_svm *svm)
return rc ? map_to_psc_vmgexit_code(rc) : 0;
}

+static unsigned long snp_setup_guest_buf(struct vcpu_svm *svm,
+ struct sev_data_snp_guest_request *data,
+ gpa_t req_gpa, gpa_t resp_gpa)
+{
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct kvm *kvm = vcpu->kvm;
+ kvm_pfn_t req_pfn, resp_pfn;
+ struct kvm_sev_info *sev;
+
+ sev = &to_kvm_svm(kvm)->sev_info;
+
+ if (!IS_ALIGNED(req_gpa, PAGE_SIZE) || !IS_ALIGNED(resp_gpa, PAGE_SIZE))
+ return SEV_RET_INVALID_PARAM;
+
+ req_pfn = gfn_to_pfn(kvm, gpa_to_gfn(req_gpa));
+ if (is_error_noslot_pfn(req_pfn))
+ return SEV_RET_INVALID_ADDRESS;
+
+ resp_pfn = gfn_to_pfn(kvm, gpa_to_gfn(resp_gpa));
+ if (is_error_noslot_pfn(resp_pfn))
+ return SEV_RET_INVALID_ADDRESS;
+
+ if (rmp_make_private(resp_pfn, 0, PG_LEVEL_4K, 0, true))
+ return SEV_RET_INVALID_ADDRESS;
+
+ data->gctx_paddr = __psp_pa(sev->snp_context);
+ data->req_paddr = __sme_set(req_pfn << PAGE_SHIFT);
+ data->res_paddr = __sme_set(resp_pfn << PAGE_SHIFT);
+
+ return 0;
+}
+
+static void snp_cleanup_guest_buf(struct sev_data_snp_guest_request *data, unsigned long *rc)
+{
+ u64 pfn = __sme_clr(data->res_paddr) >> PAGE_SHIFT;
+ int ret;
+
+ ret = snp_page_reclaim(pfn);
+ if (ret)
+ *rc = SEV_RET_INVALID_ADDRESS;
+
+ ret = rmp_make_shared(pfn, PG_LEVEL_4K);
+ if (ret)
+ *rc = SEV_RET_INVALID_ADDRESS;
+}
+
+static void snp_handle_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa)
+{
+ struct sev_data_snp_guest_request data = {0};
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_sev_info *sev;
+ unsigned long rc;
+ int err;
+
+ if (!sev_snp_guest(vcpu->kvm)) {
+ rc = SEV_RET_INVALID_GUEST;
+ goto e_fail;
+ }
+
+ sev = &to_kvm_svm(kvm)->sev_info;
+
+ mutex_lock(&sev->guest_req_lock);
+
+ rc = snp_setup_guest_buf(svm, &data, req_gpa, resp_gpa);
+ if (rc)
+ goto unlock;
+
+ rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err);
+ if (rc)
+ /* use the firmware error code */
+ rc = err;
+
+ snp_cleanup_guest_buf(&data, &rc);
+
+unlock:
+ mutex_unlock(&sev->guest_req_lock);
+
+e_fail:
+ svm_set_ghcb_sw_exit_info_2(vcpu, rc);
+}
+
+static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa)
+{
+ struct sev_data_snp_guest_request req = {0};
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct kvm *kvm = vcpu->kvm;
+ unsigned long data_npages;
+ struct kvm_sev_info *sev;
+ unsigned long rc, err;
+ u64 data_gpa;
+
+ if (!sev_snp_guest(vcpu->kvm)) {
+ rc = SEV_RET_INVALID_GUEST;
+ goto e_fail;
+ }
+
+ sev = &to_kvm_svm(kvm)->sev_info;
+
+ data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
+ data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
+
+ if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
+ rc = SEV_RET_INVALID_ADDRESS;
+ goto e_fail;
+ }
+
+ /* Verify that requested blob will fit in certificate buffer */
+ if ((data_npages << PAGE_SHIFT) > SEV_FW_BLOB_MAX_SIZE) {
+ rc = SEV_RET_INVALID_PARAM;
+ goto e_fail;
+ }
+
+ mutex_lock(&sev->guest_req_lock);
+
+ rc = snp_setup_guest_buf(svm, &req, req_gpa, resp_gpa);
+ if (rc)
+ goto unlock;
+
+ rc = snp_guest_ext_guest_request(&req, (unsigned long)sev->snp_certs_data,
+ &data_npages, &err);
+ if (rc) {
+ /*
+ * If buffer length is small then return the expected
+ * length in rbx.
+ */
+ if (err == SNP_GUEST_REQ_INVALID_LEN)
+ vcpu->arch.regs[VCPU_REGS_RBX] = data_npages;
+
+ /* pass the firmware error code */
+ rc = err;
+ goto cleanup;
+ }
+
+ /* Copy the certificate blob in the guest memory */
+ if (data_npages &&
+ kvm_write_guest(kvm, data_gpa, sev->snp_certs_data, data_npages << PAGE_SHIFT))
+ rc = SEV_RET_INVALID_ADDRESS;
+
+cleanup:
+ snp_cleanup_guest_buf(&req, &rc);
+
+unlock:
+ mutex_unlock(&sev->guest_req_lock);
+
+e_fail:
+ svm_set_ghcb_sw_exit_info_2(vcpu, rc);
+}
+
static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
{
struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3753,6 +3923,20 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
svm_set_ghcb_sw_exit_info_2(vcpu, rc);
break;
}
+ case SVM_VMGEXIT_GUEST_REQUEST: {
+ snp_handle_guest_request(svm, control->exit_info_1, control->exit_info_2);
+
+ ret = 1;
+ break;
+ }
+ case SVM_VMGEXIT_EXT_GUEST_REQUEST: {
+ snp_handle_ext_guest_request(svm,
+ control->exit_info_1,
+ control->exit_info_2);
+
+ ret = 1;
+ break;
+ }
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
vcpu_unimpl(vcpu,
"vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 3fd95193ed8d..3be24da1a743 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -98,6 +98,8 @@ struct kvm_sev_info {
u64 snp_init_flags;
void *snp_context; /* SNP guest context page */
spinlock_t psc_lock;
+ void *snp_certs_data;
+ struct mutex guest_req_lock;
};

struct kvm_svm {
--
2.25.1

2022-06-20 23:19:44

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 43/49] KVM: SVM: Use a VMSA physical address variable for populating VMCB

From: Tom Lendacky <[email protected]>

In preparation to support SEV-SNP AP Creation, use a variable that holds
the VMSA physical address rather than converting the virtual address.
This will allow SEV-SNP AP Creation to set the new physical address that
will be used should the vCPU reset path be taken.

Signed-off-by: Tom Lendacky <[email protected]>
---
arch/x86/kvm/svm/sev.c | 5 ++---
arch/x86/kvm/svm/svm.c | 9 ++++++++-
arch/x86/kvm/svm/svm.h | 1 +
3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 089af21a4efe..d5584551f3dd 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3980,10 +3980,9 @@ void sev_es_init_vmcb(struct vcpu_svm *svm)

/*
* An SEV-ES guest requires a VMSA area that is a separate from the
- * VMCB page. Do not include the encryption mask on the VMSA physical
- * address since hardware will access it using the guest key.
+ * VMCB page.
*/
- svm->vmcb->control.vmsa_pa = __pa(svm->sev_es.vmsa);
+ svm->vmcb->control.vmsa_pa = svm->sev_es.vmsa_pa;

/* Can't intercept CR register access, HV can't modify CR registers */
svm_clr_intercept(svm, INTERCEPT_CR0_READ);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7742bc986afc..f7155abe7567 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1296,9 +1296,16 @@ static int svm_vcpu_create(struct kvm_vcpu *vcpu)
svm->vmcb01.pa = __sme_set(page_to_pfn(vmcb01_page) << PAGE_SHIFT);
svm_switch_vmcb(svm, &svm->vmcb01);

- if (vmsa_page)
+ if (vmsa_page) {
svm->sev_es.vmsa = page_address(vmsa_page);

+ /*
+ * Do not include the encryption mask on the VMSA physical
+ * address since hardware will access it using the guest key.
+ */
+ svm->sev_es.vmsa_pa = __pa(svm->sev_es.vmsa);
+ }
+
svm->guest_state_loaded = false;

return 0;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 3be24da1a743..46790bab07a8 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -197,6 +197,7 @@ struct svm_nested_state {
struct vcpu_sev_es_state {
/* SEV-ES support */
struct sev_es_save_area *vmsa;
+ hpa_t vmsa_pa;
bool ghcb_in_use;
bool received_first_sipi;
unsigned int ap_reset_hold_type;
--
2.25.1

2022-06-20 23:19:51

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 44/49] KVM: SVM: Support SEV-SNP AP Creation NAE event

From: Tom Lendacky <[email protected]>

Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
guests to alter the register state of the APs on their own, giving the
guest a way of simulating INIT-SIPI.

A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
so as to avoid updating the VMSA pointer while the vCPU is running.

For CREATE:
The guest supplies the GPA of the VMSA to be used for the vCPU with
the specified APIC ID. The GPA is saved in the svm struct of the
target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
to the vCPU and then the vCPU is kicked.

For CREATE_ON_INIT:
The guest supplies the GPA of the VMSA to be used for the vCPU with
the specified APIC ID the next time an INIT is performed. The GPA is
saved in the svm struct of the target vCPU.

For DESTROY:
The guest indicates it wishes to stop the vCPU. The GPA is cleared
from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is
added to the vCPU and then the vCPU is kicked.

The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked
as a result of the event or as a result of an INIT. The handler sets the
vCPU to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will
leave the vCPU as not runnable. Any previous VMSA pages that were
installed as part of an SEV-SNP AP Creation NAE event are un-pinned. If
a new VMSA is to be installed, the VMSA guest page is pinned and set as
the VMSA in the vCPU VMCB and the vCPU state is set to
KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is
cleared in the vCPU VMCB and the vCPU state is left as
KVM_MP_STATE_UNINITIALIZED to prevent it from being run.
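
For reference, a rough guest-side sketch of how the request is packed into
the GHCB exit fields that the handler below decodes. The helper usage and
surrounding guest code are illustrative only; the actual guest side is part
of the SNP guest support series.

  /*
   * Illustrative guest-side encoding of an AP CREATE request, mirroring
   * the host-side decoding in sev_snp_ap_creation():
   *   exit_info_1[31:0]  - request type (CREATE, CREATE_ON_INIT, DESTROY)
   *   exit_info_1[63:32] - APIC ID of the target vCPU
   *   exit_info_2        - GPA of the new VMSA page
   *   RAX                - SEV features; the interrupt injection mode bits
   *                        must match those recorded at VMSA creation
   */
  ghcb_set_rax(ghcb, vmsa->sev_features);
  ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_AP_CREATION);
  ghcb_set_sw_exit_info_1(ghcb, ((u64)apic_id << 32) | SVM_VMGEXIT_AP_CREATE);
  ghcb_set_sw_exit_info_2(ghcb, __pa(vmsa));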

Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 -
arch/x86/include/asm/kvm_host.h | 3 +-
arch/x86/include/asm/svm.h | 7 +-
arch/x86/kvm/svm/sev.c | 197 +++++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.c | 5 +-
arch/x86/kvm/svm/svm.h | 6 +
arch/x86/kvm/x86.c | 9 +-
7 files changed, 221 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 2dd2bc0cf4c3..e0068e702692 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -130,7 +130,6 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP(alloc_apic_backing_page)
KVM_X86_OP_OPTIONAL(rmp_page_level_adjust)
-KVM_X86_OP(update_protected_guest_state)

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1db4d178eb1d..660cf39344fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -105,6 +105,7 @@
KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_MMU_FREE_OBSOLETE_ROOTS \
KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE KVM_ARCH_REQ(32)

#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -1524,8 +1525,6 @@ struct kvm_x86_ops {
void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);

void (*rmp_page_level_adjust)(struct kvm *kvm, kvm_pfn_t pfn, int *level);
-
- int (*update_protected_guest_state)(struct kvm_vcpu *vcpu);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 284a8113227e..a69b6da71a65 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -263,7 +263,12 @@ enum avic_ipi_failure_cause {
#define AVIC_HPA_MASK ~((0xFFFULL << 52) | 0xFFF)
#define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL

-#define SVM_SEV_FEAT_SNP_ACTIVE BIT(0)
+#define SVM_SEV_FEAT_SNP_ACTIVE BIT(0)
+#define SVM_SEV_FEAT_RESTRICTED_INJECTION BIT(3)
+#define SVM_SEV_FEAT_ALTERNATE_INJECTION BIT(4)
+#define SVM_SEV_FEAT_INT_INJ_MODES \
+ (SVM_SEV_FEAT_RESTRICTED_INJECTION | \
+ SVM_SEV_FEAT_ALTERNATE_INJECTION)

struct vmcb_seg {
u16 selector;
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index d5584551f3dd..bb7d4547df81 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -657,6 +657,7 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)

static int sev_es_sync_vmsa(struct vcpu_svm *svm)
{
+ struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
struct sev_es_save_area *save = svm->sev_es.vmsa;

/* Check some debug related fields before encrypting the VMSA */
@@ -702,6 +703,12 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm)
if (sev_snp_guest(svm->vcpu.kvm))
save->sev_features |= SVM_SEV_FEAT_SNP_ACTIVE;

+ /*
+ * Save the VMSA synced SEV features. For now, they are the same for
+ * all vCPUs, so just save each time.
+ */
+ sev->sev_features = save->sev_features;
+
return 0;
}

@@ -3090,6 +3097,10 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm, u64 *exit_code)
if (!ghcb_sw_scratch_is_valid(ghcb))
goto vmgexit_err;
break;
+ case SVM_VMGEXIT_AP_CREATION:
+ if (!ghcb_rax_is_valid(ghcb))
+ goto vmgexit_err;
+ break;
case SVM_VMGEXIT_NMI_COMPLETE:
case SVM_VMGEXIT_AP_HLT_LOOP:
case SVM_VMGEXIT_AP_JUMP_TABLE:
@@ -3672,6 +3683,178 @@ static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gp
svm_set_ghcb_sw_exit_info_2(vcpu, rc);
}

+static int __sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ kvm_pfn_t pfn;
+ hpa_t cur_pa;
+
+ WARN_ON(!mutex_is_locked(&svm->sev_es.snp_vmsa_mutex));
+
+ /* Save off the current VMSA PA for later checks */
+ cur_pa = svm->sev_es.vmsa_pa;
+
+ /* Mark the vCPU as offline and not runnable */
+ vcpu->arch.pv.pv_unhalted = false;
+ vcpu->arch.mp_state = KVM_MP_STATE_STOPPED;
+
+ /* Clear use of the VMSA */
+ svm->sev_es.vmsa_pa = INVALID_PAGE;
+ svm->vmcb->control.vmsa_pa = INVALID_PAGE;
+
+ if (cur_pa != __pa(svm->sev_es.vmsa) && VALID_PAGE(cur_pa)) {
+ /*
+ * The svm->sev_es.vmsa_pa field holds the hypervisor physical
+ * address of the about to be replaced VMSA which will no longer
+ * be used or referenced, so un-pin it.
+ */
+ kvm_release_pfn_dirty(__phys_to_pfn(cur_pa));
+ }
+
+ if (VALID_PAGE(svm->sev_es.snp_vmsa_gpa)) {
+ /*
+ * The VMSA is referenced by the hypervisor physical address,
+ * so retrieve the PFN and pin it.
+ */
+ pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(svm->sev_es.snp_vmsa_gpa));
+ if (is_error_pfn(pfn))
+ return -EINVAL;
+
+ /* Use the new VMSA */
+ svm->sev_es.vmsa_pa = pfn_to_hpa(pfn);
+ svm->vmcb->control.vmsa_pa = svm->sev_es.vmsa_pa;
+
+ /* Mark the vCPU as runnable */
+ vcpu->arch.pv.pv_unhalted = false;
+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+
+ svm->sev_es.snp_vmsa_gpa = INVALID_PAGE;
+ }
+
+ /*
+ * When replacing the VMSA during SEV-SNP AP creation,
+ * mark the VMCB dirty so that full state is always reloaded.
+ */
+ vmcb_mark_all_dirty(svm->vmcb);
+
+ return 0;
+}
+
+/*
+ * Invoked as part of svm_vcpu_reset() processing of an init event.
+ */
+void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ int ret;
+
+ if (!sev_snp_guest(vcpu->kvm))
+ return;
+
+ mutex_lock(&svm->sev_es.snp_vmsa_mutex);
+
+ if (!svm->sev_es.snp_ap_create)
+ goto unlock;
+
+ svm->sev_es.snp_ap_create = false;
+
+ ret = __sev_snp_update_protected_guest_state(vcpu);
+ if (ret)
+ vcpu_unimpl(vcpu, "snp: AP state update on init failed\n");
+
+unlock:
+ mutex_unlock(&svm->sev_es.snp_vmsa_mutex);
+}
+
+static int sev_snp_ap_creation(struct vcpu_svm *svm)
+{
+ struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct kvm_vcpu *target_vcpu;
+ struct vcpu_svm *target_svm;
+ unsigned int request;
+ unsigned int apic_id;
+ bool kick;
+ int ret;
+
+ request = lower_32_bits(svm->vmcb->control.exit_info_1);
+ apic_id = upper_32_bits(svm->vmcb->control.exit_info_1);
+
+ /* Validate the APIC ID */
+ target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id);
+ if (!target_vcpu) {
+ vcpu_unimpl(vcpu, "vmgexit: invalid AP APIC ID [%#x] from guest\n",
+ apic_id);
+ return -EINVAL;
+ }
+
+ ret = 0;
+
+ target_svm = to_svm(target_vcpu);
+
+ /*
+ * We have a valid target vCPU, so the vCPU will be kicked unless the
+ * request is for CREATE_ON_INIT. For any errors at this stage, the
+ * kick will place the vCPU in a non-runnable state.
+ */
+ kick = true;
+
+ mutex_lock(&target_svm->sev_es.snp_vmsa_mutex);
+
+ target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE;
+ target_svm->sev_es.snp_ap_create = true;
+
+ /* Interrupt injection mode shouldn't change for AP creation */
+ if (request < SVM_VMGEXIT_AP_DESTROY) {
+ u64 sev_features;
+
+ sev_features = vcpu->arch.regs[VCPU_REGS_RAX];
+ sev_features ^= sev->sev_features;
+ if (sev_features & SVM_SEV_FEAT_INT_INJ_MODES) {
+ vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n",
+ vcpu->arch.regs[VCPU_REGS_RAX]);
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+
+ switch (request) {
+ case SVM_VMGEXIT_AP_CREATE_ON_INIT:
+ kick = false;
+ fallthrough;
+ case SVM_VMGEXIT_AP_CREATE:
+ if (!page_address_valid(vcpu, svm->vmcb->control.exit_info_2)) {
+ vcpu_unimpl(vcpu, "vmgexit: invalid AP VMSA address [%#llx] from guest\n",
+ svm->vmcb->control.exit_info_2);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ target_svm->sev_es.snp_vmsa_gpa = svm->vmcb->control.exit_info_2;
+ break;
+ case SVM_VMGEXIT_AP_DESTROY:
+ break;
+ default:
+ vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n",
+ request);
+ ret = -EINVAL;
+ break;
+ }
+
+out:
+ if (kick) {
+ if (target_vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)
+ target_vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+
+ kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, target_vcpu);
+ kvm_vcpu_kick(target_vcpu);
+ }
+
+ mutex_unlock(&target_svm->sev_es.snp_vmsa_mutex);
+
+ return ret;
+}
+
static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
{
struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3937,6 +4120,18 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
ret = 1;
break;
}
+ case SVM_VMGEXIT_AP_CREATION:
+ ret = sev_snp_ap_creation(svm);
+ if (ret) {
+ svm_set_ghcb_sw_exit_info_1(vcpu, 1);
+ svm_set_ghcb_sw_exit_info_2(vcpu,
+ X86_TRAP_GP |
+ SVM_EVTINJ_TYPE_EXEPT |
+ SVM_EVTINJ_VALID);
+ }
+
+ ret = 1;
+ break;
case SVM_VMGEXIT_UNSUPPORTED_EVENT:
vcpu_unimpl(vcpu,
"vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
@@ -4024,6 +4219,8 @@ void sev_es_vcpu_reset(struct vcpu_svm *svm)
set_ghcb_msr(svm, GHCB_MSR_SEV_INFO(GHCB_VERSION_MAX,
GHCB_VERSION_MIN,
sev_enc_bit));
+
+ mutex_init(&svm->sev_es.snp_vmsa_mutex);
}

void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index f7155abe7567..fced6ea423ad 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1237,6 +1237,9 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
svm->spec_ctrl = 0;
svm->virt_spec_ctrl = 0;

+ if (init_event)
+ sev_snp_init_protected_guest_state(vcpu);
+
init_vmcb(vcpu);

if (!init_event)
@@ -4749,8 +4752,6 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.alloc_apic_backing_page = svm_alloc_apic_backing_page,

.rmp_page_level_adjust = sev_rmp_page_level_adjust,
-
- .update_protected_guest_state = sev_snp_update_protected_guest_state,
};

/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 46790bab07a8..971ff4e949fd 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -100,6 +100,8 @@ struct kvm_sev_info {
spinlock_t psc_lock;
void *snp_certs_data;
struct mutex guest_req_lock;
+
+ u64 sev_features; /* Features set at VMSA creation */
};

struct kvm_svm {
@@ -217,6 +219,10 @@ struct vcpu_sev_es_state {
u64 ghcb_sw_exit_info_2;

u64 ghcb_registered_gpa;
+
+ struct mutex snp_vmsa_mutex;
+ gpa_t snp_vmsa_gpa;
+ bool snp_ap_create;
};

struct vcpu_svm {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4a1d16231e30..c649d15efae3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10095,6 +10095,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)

if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
+
+ if (kvm_check_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, vcpu)) {
+ kvm_vcpu_reset(vcpu, true);
+ if (vcpu->arch.mp_state != KVM_MP_STATE_RUNNABLE)
+ goto out;
+ }
}

if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
@@ -12219,7 +12225,8 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
if (!list_empty_careful(&vcpu->async_pf.done))
return true;

- if (kvm_apic_has_events(vcpu))
+ if (kvm_apic_has_events(vcpu) ||
+ kvm_test_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, vcpu))
return true;

if (vcpu->arch.pv.pv_unhalted)
--
2.25.1

2022-06-20 23:20:15

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 46/49] ccp: add support to decrypt the page

From: Brijesh Singh <[email protected]>

Add support to decrypt guest encrypted memory. These API interfaces can be
used, for example, to dump VMCBs on SNP guest exit.
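
As a hedged illustration of how a caller might use the new helper (the
caller, the source of gctx_pfn/src_pfn and the error handling below are
assumptions, not part of this patch):

  /* Decrypt one guest page into a freshly allocated host page. */
  struct page *dst = alloc_page(GFP_KERNEL);
  u64 gctx_pfn = __pa(sev->snp_context) >> PAGE_SHIFT;
  int rc, fw_err;

  if (!dst)
          return -ENOMEM;

  rc = snp_guest_dbg_decrypt_page(gctx_pfn, src_pfn, page_to_pfn(dst), &fw_err);
  if (rc)
          pr_err("SNP_DBG_DECRYPT failed, rc %d fw_err %d\n", rc, fw_err);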

Signed-off-by: Brijesh Singh <[email protected]>
---
drivers/crypto/ccp/sev-dev.c | 33 ++++++++++++++++++++++++++++++---
include/linux/psp-sev.h | 6 +++---
2 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index f6306b820b86..9896350e7f56 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1852,11 +1852,38 @@ int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error)
}
EXPORT_SYMBOL_GPL(snp_guest_page_reclaim);

-int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
+int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn, int *error)
{
- return sev_do_cmd(SEV_CMD_SNP_DBG_DECRYPT, data, error);
+ struct sev_data_snp_dbg data = {0};
+ struct sev_device *sev;
+ int ret;
+
+ if (!psp_master || !psp_master->sev_data)
+ return -ENODEV;
+
+ sev = psp_master->sev_data;
+
+ if (!sev->snp_inited)
+ return -EINVAL;
+
+ data.gctx_paddr = sme_me_mask | (gctx_pfn << PAGE_SHIFT);
+ data.src_addr = sme_me_mask | (src_pfn << PAGE_SHIFT);
+ data.dst_addr = sme_me_mask | (dst_pfn << PAGE_SHIFT);
+ data.len = PAGE_SIZE;
+
+ /* The destination page must be in the firmware state. */
+ if (snp_set_rmp_state(data.dst_addr, 1, true, false, false))
+ return -EIO;
+
+ ret = sev_do_cmd(SEV_CMD_SNP_DBG_DECRYPT, &data, error);
+
+ /* Restore the page state */
+ if (snp_set_rmp_state(data.dst_addr, 1, false, false, true))
+ ret = -EIO;
+
+ return ret;
}
-EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt);
+EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt_page);

int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
unsigned long vaddr, unsigned long *npages, unsigned long *fw_err)
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index cd37ccd1fa1f..8d2565c70c39 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -928,7 +928,7 @@ int snp_guest_decommission(struct sev_data_snp_decommission *data, int *error);
int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error);

/**
- * snp_guest_dbg_decrypt - perform SEV SNP_DBG_DECRYPT command
+ * snp_guest_dbg_decrypt_page - perform SEV SNP_DBG_DECRYPT command
*
* @sev_ret: sev command return code
*
@@ -939,7 +939,7 @@ int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error);
* -%ETIMEDOUT if the sev command timed out
* -%EIO if the sev returned a non-zero return code
*/
-int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error);
+int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn, int *error);

void *psp_copy_user_blob(u64 uaddr, u32 len);
void *snp_alloc_firmware_page(gfp_t mask);
@@ -997,7 +997,7 @@ static inline int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data,
return -ENODEV;
}

-static inline int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
+static inline int snp_guest_dbg_decrypt_page(u64 gctx_pfn, u64 src_pfn, u64 dst_pfn, int *error)
{
return -ENODEV;
}
--
2.25.1

2022-06-20 23:20:43

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 45/49] KVM: SVM: Add module parameter to enable the SEV-SNP

From: Brijesh Singh <[email protected]>

Add a module parameter that can be used to enable or disable the SEV-SNP
feature. Now that KVM contains the support for SNP, set the GHCB
hypervisor feature flag to indicate that SNP is supported.
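
For example, assuming the kvm_amd module name, SEV-SNP support can be
disabled at load time with "modprobe kvm_amd sev_snp=0"; the parameter is
read-only at runtime (mode 0444).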

Signed-off-by: Brijesh Singh <[email protected]>
---
arch/x86/kvm/svm/sev.c | 7 ++++---
arch/x86/kvm/svm/svm.h | 2 +-
2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index bb7d4547df81..2c88215a111f 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -57,14 +57,15 @@ module_param_named(sev, sev_enabled, bool, 0444);
/* enable/disable SEV-ES support */
static bool sev_es_enabled = true;
module_param_named(sev_es, sev_es_enabled, bool, 0444);
+
+/* enable/disable SEV-SNP support */
+static bool sev_snp_enabled = true;
+module_param_named(sev_snp, sev_snp_enabled, bool, 0444);
#else
#define sev_enabled false
#define sev_es_enabled false
#endif /* CONFIG_KVM_AMD_SEV */

-/* enable/disable SEV-SNP support */
-static bool sev_snp_enabled;
-
#define AP_RESET_HOLD_NONE 0
#define AP_RESET_HOLD_NAE_EVENT 1
#define AP_RESET_HOLD_MSR_PROTO 2
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 971ff4e949fd..7b14b5ef1f8c 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -688,7 +688,7 @@ unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu);
#define GHCB_VERSION_MAX 2ULL
#define GHCB_VERSION_MIN 1ULL

-#define GHCB_HV_FT_SUPPORTED 0
+#define GHCB_HV_FT_SUPPORTED (GHCB_HV_FT_SNP | GHCB_HV_FT_SNP_AP_CREATION)

extern unsigned int max_sev_asid;

--
2.25.1

2022-06-20 23:21:00

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 48/49] *debug: warn and retry failed rmpupdates

From: Michael Roth <[email protected]>

In some cases, B0 hardware exhibits something like the following
behavior (where M < 512):

Guest A                         | Guest B
--------------------------------|----------------------------------
                                | rc = rmpupdate pfn=N*512,4K,priv
rmpupdate pfn=N*512+M,4K,priv   |
rc = FAIL_OVERLAP               | rc = SUCCESS

The FAIL_OVERLAP might possibly be the result of hardware temporarily
treating Guest B's rmpupdate for pfn=N*512 as a 2M update, causing the
subsequent update from Guest A for pfn=N*512+M to report FAIL_OVERLAP
at that particular instant. Retrying the update for N*512+M immediately
afterward, however, seems to resolve the FAIL_OVERLAP issue reliably.

A similar failure has also been observed when transitioning pages back
to shared during VM destroy. In this case repeating the rmpupdate does
not always seem to resolve the failure immediately.

Both situations are much more likely to occur if THP is disabled, or
if it is enabled/disabled while guests are actively being
started/stopped.

Include some debug/error information to get a better idea of the
behavior on different hardware, and add the rmpupdate retry as a
workaround for Milan B0 testing.

Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kernel/sev.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 6640a639fffc..5ae8c9f853c8 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -2530,6 +2530,7 @@ static int rmpupdate(u64 pfn, struct rmpupdate *val)
{
unsigned long paddr = pfn << PAGE_SHIFT;
int ret, level, npages;
+ int retries = 0;

if (!pfn_valid(pfn))
return -EINVAL;
@@ -2552,12 +2553,26 @@ static int rmpupdate(u64 pfn, struct rmpupdate *val)
}
}

+retry:
/* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
: "=a"(ret)
: "a"(paddr), "c"((unsigned long)val)
: "memory", "cc");

+ if (ret) {
+ if (!retries) {
+ pr_err("rmpupdate failed, ret: %d, pfn: %llx, npages: %d, level: %d, retrying (max: %d)...\n",
+ ret, pfn, npages, level, 2 * num_present_cpus());
+ dump_stack();
+ }
+ retries++;
+ if (retries < 2 * num_present_cpus())
+ goto retry;
+ } else if (retries > 0) {
+ pr_err("rmpupdate for pfn %llx succeeded after %d retries\n", pfn, retries);
+ }
+
/*
* Restore the direct map after the page is removed from the RMP table.
*/
--
2.25.1

2022-06-20 23:21:05

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 47/49] *fix for stale per-cpu pointer due to cond_resched during ghcb mapping

From: Michael Roth <[email protected]>

Signed-off-by: Michael Roth <[email protected]>
---
arch/x86/kvm/svm/svm.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index fced6ea423ad..f78e3b1bde0e 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1352,7 +1352,7 @@ static void svm_vcpu_free(struct kvm_vcpu *vcpu)
static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
- struct svm_cpu_data *sd = per_cpu(svm_data, vcpu->cpu);
+ struct svm_cpu_data *sd;

if (sev_es_guest(vcpu->kvm))
sev_es_unmap_ghcb(svm);
@@ -1360,6 +1360,10 @@ static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
if (svm->guest_state_loaded)
return;

+ /* sev_es_unmap_ghcb() can resched, so grab per-cpu pointer afterward. */
+ barrier();
+ sd = per_cpu(svm_data, vcpu->cpu);
+
/*
* Save additional host state that will be restored on VMEXIT (sev-es)
* or subsequent vmload of host save area.
--
2.25.1

2022-06-20 23:22:38

by Ashish Kalra

[permalink] [raw]
Subject: [PATCH Part2 v6 49/49] KVM: SVM: Sync the GHCB scratch buffer using already mapped ghcb

From: Ashish Kalra <[email protected]>

Using kvm_write_guest() to sync the GHCB scratch buffer can fail
due to the host mapping being 2M while the RMP entry is 4K. The page fault
handling in do_user_addr_fault() fails to split the 2M page to handle the
RMP fault because it is called here in a non-preemptible context. Instead,
use the already kernel-mapped GHCB to sync the scratch buffer when the
scratch buffer is contained within the GHCB.
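
As a sketch of what "contained" means here (variable names are illustrative;
the full check lives in setup_vmgexit_scratch(), only partially visible in
the hunk below):

  /*
   * The scratch area is "contained" when it lies entirely inside the GHCB
   * shared buffer, so it can be synced through the kernel mapping of the
   * GHCB instead of kvm_write_guest().
   */
  u64 ghcb_scratch_beg = control->ghcb_gpa +
                         offsetof(struct ghcb, shared_buffer);
  u64 ghcb_scratch_end = ghcb_scratch_beg +
                         sizeof_field(struct ghcb, shared_buffer);
  bool contained = scratch_gpa_beg >= ghcb_scratch_beg &&
                   scratch_gpa_end <= ghcb_scratch_end;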

Signed-off-by: Ashish Kalra <[email protected]>
---
arch/x86/kvm/svm/sev.c | 29 +++++++++++++++++++++--------
arch/x86/kvm/svm/svm.h | 2 ++
2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 2c88215a111f..e1dd67e12774 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2944,6 +2944,24 @@ static bool sev_es_sync_to_ghcb(struct vcpu_svm *svm)
ghcb_set_sw_exit_info_1(ghcb, svm->sev_es.ghcb_sw_exit_info_1);
ghcb_set_sw_exit_info_2(ghcb, svm->sev_es.ghcb_sw_exit_info_2);

+ /* Sync the scratch buffer area. */
+ if (svm->sev_es.ghcb_sa_sync) {
+ if (svm->sev_es.ghcb_sa_contained) {
+ memcpy(ghcb->shared_buffer + svm->sev_es.ghcb_sa_offset,
+ svm->sev_es.ghcb_sa, svm->sev_es.ghcb_sa_len);
+ } else {
+ int ret;
+
+ ret = kvm_write_guest(svm->vcpu.kvm,
+ svm->sev_es.ghcb_sa_gpa,
+ svm->sev_es.ghcb_sa, svm->sev_es.ghcb_sa_len);
+ if (ret)
+ pr_warn_ratelimited("unmap_ghcb: kvm_write_guest failed while syncing scratch area, gpa: %llx, ret: %d\n",
+ svm->sev_es.ghcb_sa_gpa, ret);
+ }
+ svm->sev_es.ghcb_sa_sync = false;
+ }
+
trace_kvm_vmgexit_exit(svm->vcpu.vcpu_id, ghcb);

svm_unmap_ghcb(svm, &map);
@@ -3156,14 +3174,6 @@ void sev_es_unmap_ghcb(struct vcpu_svm *svm)
if (!svm->sev_es.ghcb_in_use)
return;

- /* Sync the scratch buffer area. */
- if (svm->sev_es.ghcb_sa_sync) {
- kvm_write_guest(svm->vcpu.kvm,
- svm->sev_es.ghcb_sa_gpa,
- svm->sev_es.ghcb_sa, svm->sev_es.ghcb_sa_len);
- svm->sev_es.ghcb_sa_sync = false;
- }
-
sev_es_sync_to_ghcb(svm);

svm->sev_es.ghcb_in_use = false;
@@ -3229,6 +3239,8 @@ static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len)
scratch_gpa_beg, scratch_gpa_end);
goto e_scratch;
}
+ svm->sev_es.ghcb_sa_contained = true;
+ svm->sev_es.ghcb_sa_offset = scratch_gpa_beg - ghcb_scratch_beg;
} else {
/*
* The guest memory must be read into a kernel buffer, so
@@ -3239,6 +3251,7 @@ static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len)
len, GHCB_SCRATCH_AREA_LIMIT);
goto e_scratch;
}
+ svm->sev_es.ghcb_sa_contained = false;
}

if (svm->sev_es.ghcb_sa_alloc_len < len) {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 7b14b5ef1f8c..2cdfc79bf2cf 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -210,6 +210,8 @@ struct vcpu_sev_es_state {
u64 ghcb_sa_gpa;
u32 ghcb_sa_alloc_len;
bool ghcb_sa_sync;
+ bool ghcb_sa_contained;
+ u32 ghcb_sa_offset;

/*
* SEV-ES support to hold the sw_exit_info return values to be
--
2.25.1

2022-06-21 15:53:42

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 03/49] x86/sev: Add the host SEV-SNP initialization support

On Mon, Jun 20, 2022 at 5:02 PM Ashish Kalra <[email protected]> wrote:
>
> From: Brijesh Singh <[email protected]>
>
> The memory integrity guarantees of SEV-SNP are enforced through a new
> structure called the Reverse Map Table (RMP). The RMP is a single data
> structure shared across the system that contains one entry for every 4K
> page of DRAM that may be used by SEV-SNP VMs. The goal of RMP is to
> track the owner of each page of memory. Pages of memory can be owned by
> the hypervisor, owned by a specific VM or owned by the AMD-SP. See APM2
> section 15.36.3 for more detail on RMP.
>
> The RMP table is used to enforce access control to memory. The table itself
> is not directly writable by the software. New CPU instructions (RMPUPDATE,
> PVALIDATE, RMPADJUST) are used to manipulate the RMP entries.
>
> Based on the platform configuration, the BIOS reserves the memory used
> for the RMP table. The start and end address of the RMP table must be
> queried by reading the RMP_BASE and RMP_END MSRs. If the RMP_BASE and
> RMP_END are not set then disable the SEV-SNP feature.
>
> The SEV-SNP feature is enabled only after the RMP table is successfully
> initialized.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/include/asm/disabled-features.h | 8 +-
> arch/x86/include/asm/msr-index.h | 6 +
> arch/x86/kernel/sev.c | 144 +++++++++++++++++++++++
> 3 files changed, 157 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
> index 36369e76cc63..c1be3091a383 100644
> --- a/arch/x86/include/asm/disabled-features.h
> +++ b/arch/x86/include/asm/disabled-features.h
> @@ -68,6 +68,12 @@
> # define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
> #endif
>
> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +# define DISABLE_SEV_SNP 0
> +#else
> +# define DISABLE_SEV_SNP (1 << (X86_FEATURE_SEV_SNP & 31))
> +#endif
> +
> /*
> * Make sure to add features to the correct mask
> */
> @@ -91,7 +97,7 @@
> DISABLE_ENQCMD)
> #define DISABLED_MASK17 0
> #define DISABLED_MASK18 0
> -#define DISABLED_MASK19 0
> +#define DISABLED_MASK19 (DISABLE_SEV_SNP)
> #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 20)
>
> #endif /* _ASM_X86_DISABLED_FEATURES_H */
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 9e2e7185fc1d..57a8280e283a 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -507,6 +507,8 @@
> #define MSR_AMD64_SEV_ENABLED BIT_ULL(MSR_AMD64_SEV_ENABLED_BIT)
> #define MSR_AMD64_SEV_ES_ENABLED BIT_ULL(MSR_AMD64_SEV_ES_ENABLED_BIT)
> #define MSR_AMD64_SEV_SNP_ENABLED BIT_ULL(MSR_AMD64_SEV_SNP_ENABLED_BIT)
> +#define MSR_AMD64_RMP_BASE 0xc0010132
> +#define MSR_AMD64_RMP_END 0xc0010133
>
> #define MSR_AMD64_VIRT_SPEC_CTRL 0xc001011f
>
> @@ -581,6 +583,10 @@
> #define MSR_AMD64_SYSCFG 0xc0010010
> #define MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT 23
> #define MSR_AMD64_SYSCFG_MEM_ENCRYPT BIT_ULL(MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT)
> +#define MSR_AMD64_SYSCFG_SNP_EN_BIT 24
> +#define MSR_AMD64_SYSCFG_SNP_EN BIT_ULL(MSR_AMD64_SYSCFG_SNP_EN_BIT)
> +#define MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT 25
> +#define MSR_AMD64_SYSCFG_SNP_VMPL_EN BIT_ULL(MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT)
> #define MSR_K8_INT_PENDING_MSG 0xc0010055
> /* C1E active bits in int pending message */
> #define K8_INTP_C1E_ACTIVE_MASK 0x18000000
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index f01f4550e2c6..3a233b5d47c5 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -22,6 +22,8 @@
> #include <linux/efi.h>
> #include <linux/platform_device.h>
> #include <linux/io.h>
> +#include <linux/cpumask.h>
> +#include <linux/iommu.h>
>
> #include <asm/cpu_entry_area.h>
> #include <asm/stacktrace.h>
> @@ -38,6 +40,7 @@
> #include <asm/apic.h>
> #include <asm/cpuid.h>
> #include <asm/cmdline.h>
> +#include <asm/iommu.h>
>
> #define DR7_RESET_VALUE 0x400
>
> @@ -57,6 +60,12 @@
> #define AP_INIT_CR0_DEFAULT 0x60000010
> #define AP_INIT_MXCSR_DEFAULT 0x1f80
>
> +/*
> + * The first 16KB from the RMP_BASE is used by the processor for the
> + * bookkeeping, the range need to be added during the RMP entry lookup.
> + */
> +#define RMPTABLE_CPU_BOOKKEEPING_SZ 0x4000
> +
> /* For early boot hypervisor communication in SEV-ES enabled guests */
> static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>
> @@ -69,6 +78,10 @@ static struct ghcb *boot_ghcb __section(".data");
> /* Bitmap of SEV features supported by the hypervisor */
> static u64 sev_hv_features __ro_after_init;
>
> +static unsigned long rmptable_start __ro_after_init;
> +static unsigned long rmptable_end __ro_after_init;
> +
> +
> /* #VC handler runtime per-CPU data */
> struct sev_es_runtime_data {
> struct ghcb ghcb_page;
> @@ -2218,3 +2231,134 @@ static int __init snp_init_platform_device(void)
> return 0;
> }
> device_initcall(snp_init_platform_device);
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt) "SEV-SNP: " fmt
> +
> +static int __snp_enable(unsigned int cpu)
> +{
> + u64 val;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> + return 0;
> +
> + rdmsrl(MSR_AMD64_SYSCFG, val);
> +
> + val |= MSR_AMD64_SYSCFG_SNP_EN;
> + val |= MSR_AMD64_SYSCFG_SNP_VMPL_EN;
> +
> + wrmsrl(MSR_AMD64_SYSCFG, val);
> +
> + return 0;
> +}
> +
> +static __init void snp_enable(void *arg)
> +{
> + __snp_enable(smp_processor_id());
> +}
> +
> +static bool get_rmptable_info(u64 *start, u64 *len)
> +{
> + u64 calc_rmp_sz, rmp_sz, rmp_base, rmp_end, nr_pages;
> +
> + rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
> + rdmsrl(MSR_AMD64_RMP_END, rmp_end);
> +
> + if (!rmp_base || !rmp_end) {
> + pr_info("Memory for the RMP table has not been reserved by BIOS\n");
> + return false;
> + }
> +
> + rmp_sz = rmp_end - rmp_base + 1;
> +
> + /*
> + * Calculate the amount the memory that must be reserved by the BIOS to
> + * address the full system RAM. The reserved memory should also cover the
> + * RMP table itself.
> + *
> + * See PPR Family 19h Model 01h, Revision B1 section 2.1.4.2 for more
> + * information on memory requirement.
> + */
> + nr_pages = totalram_pages();
> + calc_rmp_sz = (((rmp_sz >> PAGE_SHIFT) + nr_pages) << 4) + RMPTABLE_CPU_BOOKKEEPING_SZ;
> +
> + if (calc_rmp_sz > rmp_sz) {
> + pr_info("Memory reserved for the RMP table does not cover full system RAM (expected 0x%llx got 0x%llx)\n",
> + calc_rmp_sz, rmp_sz);
> + return false;
> + }
> +
> + *start = rmp_base;
> + *len = rmp_sz;
> +
> + pr_info("RMP table physical address 0x%016llx - 0x%016llx\n", rmp_base, rmp_end);
> +
> + return true;
> +}
> +
> +static __init int __snp_rmptable_init(void)
> +{
> + u64 rmp_base, sz;
> + void *start;
> + u64 val;
> +
> + if (!get_rmptable_info(&rmp_base, &sz))
> + return 1;
> +
> + start = memremap(rmp_base, sz, MEMREMAP_WB);
> + if (!start) {
> + pr_err("Failed to map RMP table 0x%llx+0x%llx\n", rmp_base, sz);
> + return 1;
> + }
> +
> + /*
> + * Check if SEV-SNP is already enabled, this can happen if we are coming from
> + * kexec boot.
> + */
> + rdmsrl(MSR_AMD64_SYSCFG, val);
> + if (val & MSR_AMD64_SYSCFG_SNP_EN)
> + goto skip_enable;
> +
> + /* Initialize the RMP table to zero */
> + memset(start, 0, sz);
> +
> + /* Flush the caches to ensure that data is written before SNP is enabled. */
> + wbinvd_on_all_cpus();
> +
> + /* Enable SNP on all CPUs. */
> + on_each_cpu(snp_enable, NULL, 1);
> +
> +skip_enable:
> + rmptable_start = (unsigned long)start;
> + rmptable_end = rmptable_start + sz;

Since in get_rmptable_info() `rmp_sz = rmp_end - rmp_base + 1;` should
this be `rmptable_end = rmptable_start + sz - 1;`?
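
(Illustrative arithmetic: if RMP_BASE = 0x100000000 and RMP_END = 0x100ffffff,
then sz = rmp_end - rmp_base + 1 = 0x1000000, so the inclusive end of the
mapped table is rmptable_start + sz - 1; rmptable_start + sz points one byte
past it.)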

> +
> + return 0;
> +}
> +
> +static int __init snp_rmptable_init(void)
> +{
> + if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
> + return 0;
> +
> + if (!iommu_sev_snp_supported())
> + goto nosnp;
> +
> + if (__snp_rmptable_init())
> + goto nosnp;
> +
> + cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/rmptable_init:online", __snp_enable, NULL);
> +
> + return 0;
> +
> +nosnp:
> + setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
> + return 1;
> +}
> +
> +/*
> + * This must be called after the PCI subsystem. This is because before enabling
> + * the SNP feature we need to ensure that IOMMU supports the SEV-SNP feature.
> + * The iommu_sev_snp_support() is used for checking the feature, and it is
> + * available after subsys_initcall().
> + */
> +fs_initcall(snp_rmptable_init);
> --
> 2.25.1
>

2022-06-21 17:40:35

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

[AMD Official Use Only - General]

Hello Dave,

>> /*
>> * The RMP entry format is not architectural. The format is defined
>> in PPR @@ -126,6 +128,15 @@ struct snp_guest_platform_data {
>> u64 secrets_gpa;
>> };
>>
>> +struct rmpupdate {
>> + u64 gpa;
>> + u8 assigned;
>> + u8 pagesize;
>> + u8 immutable;
>> + u8 rsvd;
>> + u32 asid;
>> +} __packed;

>I see above it says the RMP entry format isn't architectural; is this 'rmpupdate' structure? If not how is this going to get handled when we have a couple of SNP capable CPUs with different layouts?

Architectural implies that it is defined in the APM and shouldn't change in a way that breaks backward compatibility.
I think the wording here should probably be "architecture independent" or, more precisely, "platform independent".

Thanks,
Ashish

2022-06-21 18:08:03

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 03/49] x86/sev: Add the host SEV-SNP initialization support

[Public]

Hello Peter,

>> +static __init int __snp_rmptable_init(void) {
>> + u64 rmp_base, sz;
>> + void *start;
>> + u64 val;
>> +
>> + if (!get_rmptable_info(&rmp_base, &sz))
>> + return 1;
>> +
>> + start = memremap(rmp_base, sz, MEMREMAP_WB);
>> + if (!start) {
>> + pr_err("Failed to map RMP table 0x%llx+0x%llx\n", rmp_base, sz);
>> + return 1;
>> + }
>> +
>> + /*
>> + * Check if SEV-SNP is already enabled, this can happen if we are coming from
>> + * kexec boot.
>> + */
>> + rdmsrl(MSR_AMD64_SYSCFG, val);
>> + if (val & MSR_AMD64_SYSCFG_SNP_EN)
>> + goto skip_enable;
>> +
>> + /* Initialize the RMP table to zero */
>> + memset(start, 0, sz);
>> +
>> + /* Flush the caches to ensure that data is written before SNP is enabled. */
>> + wbinvd_on_all_cpus();
>> +
>> + /* Enable SNP on all CPUs. */
>> + on_each_cpu(snp_enable, NULL, 1);
>> +
>> +skip_enable:
>> + rmptable_start = (unsigned long)start;
>> + rmptable_end = rmptable_start + sz;

> Since in get_rmptable_info() `rmp_sz = rmp_end - rmp_base + 1;` should this be `rmptable_end = rmptable_start + sz - 1;`?

Yes, it should be.

Thanks,
Ashish

2022-06-21 18:13:33

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Mon, Jun 20, 2022 at 5:05 PM Ashish Kalra <[email protected]> wrote:
>
> From: Brijesh Singh <[email protected]>
>
> The behavior and requirement for the SEV-legacy command is altered when
> the SNP firmware is in the INIT state. See SEV-SNP firmware specification
> for more details.
>
> Allocate the Trusted Memory Region (TMR) as a 2mb sized/aligned region
> when SNP is enabled to satify new requirements for the SNP. Continue

satisfy

> allocating a 1mb region for !SNP configuration.
>
> While at it, provide API that can be used by others to allocate a page
> that can be used by the firmware. The immediate user for this API will
> be the KVM driver. The KVM driver to need to allocate a firmware context
> page during the guest creation. The context page need to be updated
> by the firmware. See the SEV-SNP specification for further details.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> drivers/crypto/ccp/sev-dev.c | 173 +++++++++++++++++++++++++++++++++--
> include/linux/psp-sev.h | 11 +++
> 2 files changed, 178 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index 35d76333e120..0dbd99f29b25 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -79,6 +79,14 @@ static void *sev_es_tmr;
> #define NV_LENGTH (32 * 1024)
> static void *sev_init_ex_buffer;
>
> +/* When SEV-SNP is enabled the TMR needs to be 2MB aligned and 2MB size. */
> +#define SEV_SNP_ES_TMR_SIZE (2 * 1024 * 1024)
> +
> +static size_t sev_es_tmr_size = SEV_ES_TMR_SIZE;

Why not keep all this TMR stuff together near the SEV_ES_TMR_SIZE define?

> +
> +static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret);
> +static int sev_do_cmd(int cmd, void *data, int *psp_ret);
> +
> static inline bool sev_version_greater_or_equal(u8 maj, u8 min)
> {
> struct sev_device *sev = psp_master->sev_data;
> @@ -177,11 +185,161 @@ static int sev_cmd_buffer_len(int cmd)
> return 0;
> }
>
> +static void snp_leak_pages(unsigned long pfn, unsigned int npages)
> +{
> + WARN(1, "psc failed, pfn 0x%lx pages %d (leaking)\n", pfn, npages);
> + while (npages--) {
> + memory_failure(pfn, 0);
> + dump_rmpentry(pfn);
> + pfn++;
> + }
> +}
> +
> +static int snp_reclaim_pages(unsigned long pfn, unsigned int npages, bool locked)
> +{
> + struct sev_data_snp_page_reclaim data;
> + int ret, err, i, n = 0;
> +
> + for (i = 0; i < npages; i++) {

What about setting |n| here too, also the other increments.

for (i = 0, n = 0; i < npages; i++, n++, pfn++)

> + memset(&data, 0, sizeof(data));
> + data.paddr = pfn << PAGE_SHIFT;
> +
> + if (locked)
> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
> + else
> + ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);

Can we change `sev_cmd_mutex` to some sort of nesting lock type? That
could clean up this if (locked) code.

> + if (ret)
> + goto cleanup;
> +
> + ret = rmp_make_shared(pfn, PG_LEVEL_4K);
> + if (ret)
> + goto cleanup;
> +
> + pfn++;
> + n++;
> + }
> +
> + return 0;
> +
> +cleanup:
> + /*
> + * If failed to reclaim the page then page is no longer safe to
> + * be released, leak it.
> + */
> + snp_leak_pages(pfn, npages - n);
> + return ret;
> +}
> +
> +static inline int rmp_make_firmware(unsigned long pfn, int level)
> +{
> + return rmp_make_private(pfn, 0, level, 0, true);
> +}
> +
> +static int snp_set_rmp_state(unsigned long paddr, unsigned int npages, bool to_fw, bool locked,
> + bool need_reclaim)

This function can do a lot, and when I read the call sites it's hard to
see what it's doing since we have a combination of arguments which tell
us what behavior is happening, some of which are not valid (ex: to_fw
== true and need_reclaim == true is an invalid argument combination).
Also, this for loop over |npages| is duplicated from
snp_reclaim_pages(). One improvement here is that in the current
snp_reclaim_pages(), if we fail to reclaim a page we assume we cannot
reclaim the next pages; this may cause us to snp_leak_pages() more
pages than we actually need to.

What about something like this?

static snp_leak_page(u64 pfn, enum pg_level level)
{
memory_failure(pfn, 0);
dump_rmpentry(pfn);
}

static int snp_reclaim_page(u64 pfn, enum pg_level level)
{
int ret;
struct sev_data_snp_page_reclaim data;

ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
if (ret)
goto cleanup;

ret = rmp_make_shared(pfn, level);
if (ret)
goto cleanup;

return 0;

cleanup:
snp_leak_page(pfn, level)
}

typedef int (*rmp_state_change_func) (u64 pfn, enum pg_level level);

static int snp_set_rmp_state(unsigned long paddr, unsigned int npages,
rmp_state_change_func state_change, rmp_state_change_func cleanup)
{
struct sev_data_snp_page_reclaim data;
int ret, err, i, n = 0;

for (i = 0, n = 0; i < npages; i++, n++, pfn++) {
ret = state_change(pfn, PG_LEVEL_4K)
if (ret)
goto cleanup;
}

return 0;

cleanup:
for (; i>= 0; i--, n--, pfn--) {
cleanup(pfn, PG_LEVEL_4K);
}

return ret;
}

Then inside of __snp_alloc_firmware_pages():

snp_set_rmp_state(paddr, npages, rmp_make_firmware, snp_reclaim_page);

And inside of __snp_free_firmware_pages():

snp_set_rmp_state(paddr, npages, snp_reclaim_page, snp_leak_page);

Just a suggestion, feel free to ignore. The readability comment could
be addressed much less invasively by just making separate functions
for each valid combination of arguments here. Like
snp_set_rmp_fw_state(), snp_set_rmp_shared_state(),
snp_set_rmp_release_state() or something.

> +{
> + unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT; /* Cbit maybe set in the paddr */
> + int rc, n = 0, i;
> +
> + for (i = 0; i < npages; i++) {
> + if (to_fw)
> + rc = rmp_make_firmware(pfn, PG_LEVEL_4K);
> + else
> + rc = need_reclaim ? snp_reclaim_pages(pfn, 1, locked) :
> + rmp_make_shared(pfn, PG_LEVEL_4K);
> + if (rc)
> + goto cleanup;
> +
> + pfn++;
> + n++;
> + }
> +
> + return 0;
> +
> +cleanup:
> + /* Try unrolling the firmware state changes */
> + if (to_fw) {
> + /*
> + * Reclaim the pages which were already changed to the
> + * firmware state.
> + */
> + snp_reclaim_pages(paddr >> PAGE_SHIFT, n, locked);
> +
> + return rc;
> + }
> +
> + /*
> + * If failed to change the page state to shared, then its not safe
> + * to release the page back to the system, leak it.
> + */
> + snp_leak_pages(pfn, npages - n);
> +
> + return rc;
> +}
> +
> +static struct page *__snp_alloc_firmware_pages(gfp_t gfp_mask, int order, bool locked)
> +{
> + unsigned long npages = 1ul << order, paddr;
> + struct sev_device *sev;
> + struct page *page;
> +
> + if (!psp_master || !psp_master->sev_data)
> + return NULL;
> +
> + page = alloc_pages(gfp_mask, order);
> + if (!page)
> + return NULL;
> +
> + /* If SEV-SNP is initialized then add the page in RMP table. */
> + sev = psp_master->sev_data;
> + if (!sev->snp_inited)
> + return page;
> +
> + paddr = __pa((unsigned long)page_address(page));
> + if (snp_set_rmp_state(paddr, npages, true, locked, false))
> + return NULL;

So what about the case where snp_set_rmp_state() fails but we were
able to reclaim all the pages? Should we be able to signal that to
callers so that we could free |page| here? But given this is an error
path already, maybe we can optimize this in a follow-up series.

> +
> + return page;
> +}
> +
> +void *snp_alloc_firmware_page(gfp_t gfp_mask)
> +{
> + struct page *page;
> +
> + page = __snp_alloc_firmware_pages(gfp_mask, 0, false);
> +
> + return page ? page_address(page) : NULL;
> +}
> +EXPORT_SYMBOL_GPL(snp_alloc_firmware_page);
> +
> +static void __snp_free_firmware_pages(struct page *page, int order, bool locked)
> +{
> + unsigned long paddr, npages = 1ul << order;
> +
> + if (!page)
> + return;
> +
> + paddr = __pa((unsigned long)page_address(page));
> + if (snp_set_rmp_state(paddr, npages, false, locked, true))
> + return;

Here we may be able to free some of |page| depending on where inside
snp_set_rmp_state() we failed. But again, given this is an error
path already, maybe we can optimize this in a follow-up series.



> +
> + __free_pages(page, order);
> +}
> +
> +void snp_free_firmware_page(void *addr)
> +{
> + if (!addr)
> + return;
> +
> + __snp_free_firmware_pages(virt_to_page(addr), 0, false);
> +}
> +EXPORT_SYMBOL(snp_free_firmware_page);
> +
> static void *sev_fw_alloc(unsigned long len)
> {
> struct page *page;
>
> - page = alloc_pages(GFP_KERNEL, get_order(len));
> + page = __snp_alloc_firmware_pages(GFP_KERNEL, get_order(len), false);
> if (!page)
> return NULL;
>
> @@ -393,7 +551,7 @@ static int __sev_init_locked(int *error)
> data.tmr_address = __pa(sev_es_tmr);
>
> data.flags |= SEV_INIT_FLAGS_SEV_ES;
> - data.tmr_len = SEV_ES_TMR_SIZE;
> + data.tmr_len = sev_es_tmr_size;
> }
>
> return __sev_do_cmd_locked(SEV_CMD_INIT, &data, error);
> @@ -421,7 +579,7 @@ static int __sev_init_ex_locked(int *error)
> data.tmr_address = __pa(sev_es_tmr);
>
> data.flags |= SEV_INIT_FLAGS_SEV_ES;
> - data.tmr_len = SEV_ES_TMR_SIZE;
> + data.tmr_len = sev_es_tmr_size;
> }
>
> return __sev_do_cmd_locked(SEV_CMD_INIT_EX, &data, error);
> @@ -818,6 +976,8 @@ static int __sev_snp_init_locked(int *error)
> sev->snp_inited = true;
> dev_dbg(sev->dev, "SEV-SNP firmware initialized\n");
>
> + sev_es_tmr_size = SEV_SNP_ES_TMR_SIZE;
> +
> return rc;
> }
>
> @@ -1341,8 +1501,9 @@ static void sev_firmware_shutdown(struct sev_device *sev)
> /* The TMR area was encrypted, flush it from the cache */
> wbinvd_on_all_cpus();
>
> - free_pages((unsigned long)sev_es_tmr,
> - get_order(SEV_ES_TMR_SIZE));
> + __snp_free_firmware_pages(virt_to_page(sev_es_tmr),
> + get_order(sev_es_tmr_size),
> + false);
> sev_es_tmr = NULL;
> }
>
> @@ -1430,7 +1591,7 @@ void sev_pci_init(void)
> }
>
> /* Obtain the TMR memory area for SEV-ES use */
> - sev_es_tmr = sev_fw_alloc(SEV_ES_TMR_SIZE);
> + sev_es_tmr = sev_fw_alloc(sev_es_tmr_size);
> if (!sev_es_tmr)
> dev_warn(sev->dev,
> "SEV: TMR allocation failed, SEV-ES support unavailable\n");
> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
> index 9f921d221b75..a3bb792bb842 100644
> --- a/include/linux/psp-sev.h
> +++ b/include/linux/psp-sev.h
> @@ -12,6 +12,8 @@
> #ifndef __PSP_SEV_H__
> #define __PSP_SEV_H__
>
> +#include <linux/sev.h>
> +
> #include <uapi/linux/psp-sev.h>
>
> #ifdef CONFIG_X86
> @@ -940,6 +942,8 @@ int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error);
> int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error);
>
> void *psp_copy_user_blob(u64 uaddr, u32 len);
> +void *snp_alloc_firmware_page(gfp_t mask);
> +void snp_free_firmware_page(void *addr);
>
> #else /* !CONFIG_CRYPTO_DEV_SP_PSP */
>
> @@ -981,6 +985,13 @@ static inline int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *erro
> return -ENODEV;
> }
>
> +static inline void *snp_alloc_firmware_page(gfp_t mask)
> +{
> + return NULL;
> +}
> +
> +static inline void snp_free_firmware_page(void *addr) { }
> +
> #endif /* CONFIG_CRYPTO_DEV_SP_PSP */
>
> #endif /* __PSP_SEV_H__ */
> --
> 2.25.1
>

2022-06-21 20:18:32

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

[Public]

Hello Peter,

>> +static int snp_reclaim_pages(unsigned long pfn, unsigned int npages,
>> +bool locked) {
>> + struct sev_data_snp_page_reclaim data;
>> + int ret, err, i, n = 0;
>> +
>> + for (i = 0; i < npages; i++) {

>What about setting |n| here too, also the other increments.

>for (i = 0, n = 0; i < npages; i++, n++, pfn++)

Yes that is simpler.

>> + memset(&data, 0, sizeof(data));
>> + data.paddr = pfn << PAGE_SHIFT;
>> +
>> + if (locked)
>> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
>> + else
>> + ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM,
>> + &data, &err);

> Can we change `sev_cmd_mutex` to some sort of nesting lock type? That could clean up this if (locked) code.

> +static inline int rmp_make_firmware(unsigned long pfn, int level) {
> + return rmp_make_private(pfn, 0, level, 0, true); }
> +
> +static int snp_set_rmp_state(unsigned long paddr, unsigned int npages, bool to_fw, bool locked,
> + bool need_reclaim)

>This function can do a lot and when I read the call sites its hard to see what its doing since we have a combination of arguments which tell us what behavior is happening, some of which are not valid (ex: to_fw == true and need_reclaim == true is an >invalid argument combination).

to_fw is used to make a firmware page and need_reclaim is for freeing the firmware page, so they are going to be mutually exclusive.

It actually maps quite logically to the callers:
snp_alloc_firmware_pages() will call with to_fw = true and need_reclaim = false,
and snp_free_firmware_pages() will do the opposite, to_fw = false and need_reclaim = true.

That seems straightforward to look at.

>Also this for loop over |npages| is duplicated from snp_reclaim_pages(). One improvement here is that on the current
>snp_reclaim_pages() if we fail to reclaim a page we assume we cannot reclaim the next pages, this may cause us to snp_leak_pages() more pages than we actually need too.

Yes that is true.

>What about something like this?

>static snp_leak_page(u64 pfn, enum pg_level level) {
> memory_failure(pfn, 0);
> dump_rmpentry(pfn);
>}

>static int snp_reclaim_page(u64 pfn, enum pg_level level) {
> int ret;
> struct sev_data_snp_page_reclaim data;

> ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
> if (ret)
> goto cleanup;

> ret = rmp_make_shared(pfn, level);
> if (ret)
> goto cleanup;

> return 0;

>cleanup:
> snp_leak_page(pfn, level)
>}

>typedef int (*rmp_state_change_func) (u64 pfn, enum pg_level level);

>static int snp_set_rmp_state(unsigned long paddr, unsigned int npages, rmp_state_change_func state_change, rmp_state_change_func cleanup) {
> struct sev_data_snp_page_reclaim data;
> int ret, err, i, n = 0;

> for (i = 0, n = 0; i < npages; i++, n++, pfn++) {
> ret = state_change(pfn, PG_LEVEL_4K)
> if (ret)
> goto cleanup;
> }

> return 0;

> cleanup:
> for (; i>= 0; i--, n--, pfn--) {
> cleanup(pfn, PG_LEVEL_4K);
> }

> return ret;
>}

>Then inside of __snp_alloc_firmware_pages():

>snp_set_rmp_state(paddr, npages, rmp_make_firmware, snp_reclaim_page);

>And inside of __snp_free_firmware_pages():

>snp_set_rmp_state(paddr, npages, snp_reclaim_page, snp_leak_page);

>Just a suggestion feel free to ignore. The readability comment could be addressed much less invasively by just making separate functions for each valid combination of arguments here. Like snp_set_rmp_fw_state(), snp_set_rmp_shared_state(),
>snp_set_rmp_release_state() or something.

>> +static struct page *__snp_alloc_firmware_pages(gfp_t gfp_mask, int
>> +order, bool locked) {
>> + unsigned long npages = 1ul << order, paddr;
>> + struct sev_device *sev;
>> + struct page *page;
>> +
>> + if (!psp_master || !psp_master->sev_data)
>> + return NULL;
>> +
>> + page = alloc_pages(gfp_mask, order);
>> + if (!page)
>> + return NULL;
>> +
>> + /* If SEV-SNP is initialized then add the page in RMP table. */
>> + sev = psp_master->sev_data;
>> + if (!sev->snp_inited)
>> + return page;
>> +
>> + paddr = __pa((unsigned long)page_address(page));
>> + if (snp_set_rmp_state(paddr, npages, true, locked, false))
>> + return NULL;

>So what about the case where snp_set_rmp_state() fails but we were able to reclaim all the pages? Should we be able to signal that to callers so that we could free |page| here? But given this is an error path already maybe we can optimize this in a >follow up series.

Yes, we should actually tie this to the success or failure of snp_reclaim_pages() here, in case we were able to successfully unroll some or all of the firmware state changes.

> +
> + return page;
> +}
> +
> +void *snp_alloc_firmware_page(gfp_t gfp_mask) {
> + struct page *page;
> +
> + page = __snp_alloc_firmware_pages(gfp_mask, 0, false);
> +
> + return page ? page_address(page) : NULL; }
> +EXPORT_SYMBOL_GPL(snp_alloc_firmware_page);
> +
> +static void __snp_free_firmware_pages(struct page *page, int order,
> +bool locked) {
> + unsigned long paddr, npages = 1ul << order;
> +
> + if (!page)
> + return;
> +
> + paddr = __pa((unsigned long)page_address(page));
> + if (snp_set_rmp_state(paddr, npages, false, locked, true))
> + return;

> Here we may be able to free some of |page| depending how where inside of snp_set_rmp_state() we failed. But again given this is an error path already maybe we can optimize this in a follow up series.

Yes, we probably should be able to free some of the page(s) depending on how many page(s) got reclaimed in snp_set_rmp_state().
But these reclamation failures should not be very common, and any failure is indicative of a bigger issue. When a single page fails to reclaim, the same will likely happen for all subsequent
pages, so it is simpler to follow a basic recovery procedure than to handle a more complex recovery where one chunk of pages is reclaimed and another chunk is not.

Thanks,
Ashish



2022-06-21 21:44:34

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 13/49] crypto:ccp: Provide APIs to issue SEV-SNP commands

On Mon, Jun 20, 2022 at 5:05 PM Ashish Kalra <[email protected]> wrote:
>
> From: Brijesh Singh <[email protected]>
>
> Provide the APIs for the hypervisor to manage an SEV-SNP guest. The
> commands for SEV-SNP is defined in the SEV-SNP firmware specification.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> drivers/crypto/ccp/sev-dev.c | 24 ++++++++++++
> include/linux/psp-sev.h | 73 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 97 insertions(+)
>
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index f1173221d0b9..35d76333e120 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -1205,6 +1205,30 @@ int sev_guest_df_flush(int *error)
> }
> EXPORT_SYMBOL_GPL(sev_guest_df_flush);
>
> +int snp_guest_decommission(struct sev_data_snp_decommission *data, int *error)
> +{
> + return sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, data, error);
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_decommission);
> +
> +int snp_guest_df_flush(int *error)
> +{
> + return sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, error);
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_df_flush);

Why not instead change sev_guest_df_flush() to be SNP aware? That way
callers get the right behavior without having to know if SNP is
enabled or not.

int sev_guest_df_flush(int *error)
{
if (!psp_master || !psp_master->sev_data)
return -EINVAL;

if (psp_master->sev_data->snp_inited)
return sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, error);

return sev_do_cmd(SEV_CMD_DF_FLUSH, NULL, error);
}

> +int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error)
> +{
> + return sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, data, error);
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_page_reclaim);
> +
> +int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
> +{
> + return sev_do_cmd(SEV_CMD_SNP_DBG_DECRYPT, data, error);
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt);
> +
> static void sev_exit(struct kref *ref)
> {
> misc_deregister(&misc_dev->misc);
> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
> index ef4d42e8c96e..9f921d221b75 100644
> --- a/include/linux/psp-sev.h
> +++ b/include/linux/psp-sev.h
> @@ -881,6 +881,64 @@ int sev_guest_df_flush(int *error);
> */
> int sev_guest_decommission(struct sev_data_decommission *data, int *error);
>
> +/**
> + * snp_guest_df_flush - perform SNP DF_FLUSH command
> + *
> + * @sev_ret: sev command return code
> + *
> + * Returns:
> + * 0 if the sev successfully processed the command
> + * -%ENODEV if the sev device is not available
> + * -%ENOTSUPP if the sev does not support SEV
> + * -%ETIMEDOUT if the sev command timed out
> + * -%EIO if the sev returned a non-zero return code
> + */
> +int snp_guest_df_flush(int *error);
> +
> +/**
> + * snp_guest_decommission - perform SNP_DECOMMISSION command
> + *
> + * @decommission: sev_data_decommission structure to be processed
> + * @sev_ret: sev command return code
> + *
> + * Returns:
> + * 0 if the sev successfully processed the command
> + * -%ENODEV if the sev device is not available
> + * -%ENOTSUPP if the sev does not support SEV
> + * -%ETIMEDOUT if the sev command timed out
> + * -%EIO if the sev returned a non-zero return code
> + */
> +int snp_guest_decommission(struct sev_data_snp_decommission *data, int *error);
> +
> +/**
> + * snp_guest_page_reclaim - perform SNP_PAGE_RECLAIM command
> + *
> + * @decommission: sev_snp_page_reclaim structure to be processed
> + * @sev_ret: sev command return code
> + *
> + * Returns:
> + * 0 if the sev successfully processed the command
> + * -%ENODEV if the sev device is not available
> + * -%ENOTSUPP if the sev does not support SEV
> + * -%ETIMEDOUT if the sev command timed out
> + * -%EIO if the sev returned a non-zero return code
> + */
> +int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error);
> +
> +/**
> + * snp_guest_dbg_decrypt - perform SEV SNP_DBG_DECRYPT command
> + *
> + * @sev_ret: sev command return code
> + *
> + * Returns:
> + * 0 if the sev successfully processed the command
> + * -%ENODEV if the sev device is not available
> + * -%ENOTSUPP if the sev does not support SEV
> + * -%ETIMEDOUT if the sev command timed out
> + * -%EIO if the sev returned a non-zero return code
> + */
> +int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error);
> +
> void *psp_copy_user_blob(u64 uaddr, u32 len);
>
> #else /* !CONFIG_CRYPTO_DEV_SP_PSP */
> @@ -908,6 +966,21 @@ sev_issue_cmd_external_user(struct file *filep, unsigned int id, void *data, int
>
> static inline void *psp_copy_user_blob(u64 __user uaddr, u32 len) { return ERR_PTR(-EINVAL); }
>
> +static inline int
> +snp_guest_decommission(struct sev_data_snp_decommission *data, int *error) { return -ENODEV; }
> +
> +static inline int snp_guest_df_flush(int *error) { return -ENODEV; }
> +
> +static inline int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error)
> +{
> + return -ENODEV;
> +}
> +
> +static inline int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
> +{
> + return -ENODEV;
> +}
> +
> #endif /* CONFIG_CRYPTO_DEV_SP_PSP */
>
> #endif /* __PSP_SEV_H__ */
> --
> 2.25.1
>

2022-06-21 22:18:46

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 17/49] crypto: ccp: Add the SNP_{SET,GET}_EXT_CONFIG command

On Mon, Jun 20, 2022 at 5:06 PM Ashish Kalra <[email protected]> wrote:
>
> From: Brijesh Singh <[email protected]>
>
> The SEV-SNP firmware provides the SNP_CONFIG command used to set the
> system-wide configuration value for SNP guests. The information includes
> the TCB version string to be reported in guest attestation reports.
>
> Version 2 of the GHCB specification adds an NAE (SNP extended guest
> request) that a guest can use to query the reports that include additional
> certificates.
>
> In both cases, userspace-provided additional data is included in the
> attestation reports. The userspace will use the SNP_SET_EXT_CONFIG
> command to give the certificate blob and the reported TCB version string
> at once. Note that the specification defines the certificate blob with a
> specific GUID format; the userspace is responsible for building the
> proper certificate blob. The ioctl treats it as an opaque blob.
>
> While it is not defined in the spec, let's also add an SNP_GET_EXT_CONFIG
> command that can be used to obtain the data programmed through
> SNP_SET_EXT_CONFIG.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> Documentation/virt/coco/sevguest.rst | 27 +++++++
> drivers/crypto/ccp/sev-dev.c | 115 +++++++++++++++++++++++++++
> drivers/crypto/ccp/sev-dev.h | 3 +
> include/uapi/linux/psp-sev.h | 17 ++++
> 4 files changed, 162 insertions(+)
>
> diff --git a/Documentation/virt/coco/sevguest.rst b/Documentation/virt/coco/sevguest.rst
> index 11ea67c944df..3014de47e4ce 100644
> --- a/Documentation/virt/coco/sevguest.rst
> +++ b/Documentation/virt/coco/sevguest.rst
> @@ -145,6 +145,33 @@ The SNP_PLATFORM_STATUS command is used to query the SNP platform status. The
> status includes API major, minor version and more. See the SEV-SNP
> specification for further details.
>
> +2.5 SNP_SET_EXT_CONFIG
> +----------------------
> +:Technology: sev-snp
> +:Type: hypervisor ioctl cmd
> +:Parameters (in): struct sev_data_snp_ext_config
> +:Returns (out): 0 on success, -negative on error
> +
> +The SNP_SET_EXT_CONFIG is used to set the system-wide configuration such as
> +reported TCB version in the attestation report. The command is similar to
> +SNP_CONFIG command defined in the SEV-SNP spec. The main difference is the
> +command also accepts an additional certificate blob defined in the GHCB
> +specification.
> +
> +If the certs_address is zero, then previous certificate blob will deleted.

... then the previous certificate blob will be deleted.

> +For more information on the certificate blob layout, see the GHCB spec
> +(extended guest request message).
> +
> +2.6 SNP_GET_EXT_CONFIG
> +----------------------
> +:Technology: sev-snp
> +:Type: hypervisor ioctl cmd
> +:Parameters (in): struct sev_data_snp_ext_config
> +:Returns (out): 0 on success, -negative on error
> +
> +The SNP_GET_EXT_CONFIG is used to query the system-wide configuration set
> +through the SNP_SET_EXT_CONFIG.
> +
> 3. SEV-SNP CPUID Enforcement
> ============================
>
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index b9b6fab31a82..97b479d5aa86 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -1312,6 +1312,10 @@ static int __sev_snp_shutdown_locked(int *error)
> if (!sev->snp_inited)
> return 0;
>
> + /* Free the memory used for caching the certificate data */
> + kfree(sev->snp_certs_data);
> + sev->snp_certs_data = NULL;
> +
> /* SHUTDOWN requires the DF_FLUSH */
> wbinvd_on_all_cpus();
> __sev_do_cmd_locked(SEV_CMD_SNP_DF_FLUSH, NULL, NULL);
> @@ -1616,6 +1620,111 @@ static int sev_ioctl_snp_platform_status(struct sev_issue_cmd *argp)
> return ret;
> }
>
> +static int sev_ioctl_snp_get_config(struct sev_issue_cmd *argp)
> +{
> + struct sev_device *sev = psp_master->sev_data;
> + struct sev_user_data_ext_snp_config input;

Let's memset |input| to zero to avoid leaking kernel memory, see
"crypto: ccp - Use kzalloc for sev ioctl interfaces to prevent kernel
memory leak"
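
i.e. something along these lines (untested sketch), so any unused or
padding bytes are zeroed before the struct is copied back out at e_done:

	struct sev_user_data_ext_snp_config input;

	memset(&input, 0, sizeof(input));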

> + int ret;
> +
> + if (!sev->snp_inited || !argp->data)
> + return -EINVAL;
> +
> + if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
> + return -EFAULT;
> +
> + /* Copy the TCB version programmed through the SET_CONFIG to userspace */
> + if (input.config_address) {
> + if (copy_to_user((void * __user)input.config_address,
> + &sev->snp_config, sizeof(struct sev_user_data_snp_config)))
> + return -EFAULT;
> + }
> +
> + /* Copy the extended certs programmed through the SNP_SET_CONFIG */
> + if (input.certs_address && sev->snp_certs_data) {
> + if (input.certs_len < sev->snp_certs_len) {
> + /* Return the certs length to userspace */
> + input.certs_len = sev->snp_certs_len;
> +
> + ret = -ENOSR;
> + goto e_done;
> + }
> +
> + if (copy_to_user((void * __user)input.certs_address,
> + sev->snp_certs_data, sev->snp_certs_len))
> + return -EFAULT;
> + }
> +
> + ret = 0;
> +
> +e_done:
> + if (copy_to_user((void __user *)argp->data, &input, sizeof(input)))
> + ret = -EFAULT;
> +
> + return ret;
> +}
> +
> +static int sev_ioctl_snp_set_config(struct sev_issue_cmd *argp, bool writable)
> +{
> + struct sev_device *sev = psp_master->sev_data;
> + struct sev_user_data_ext_snp_config input;
> + struct sev_user_data_snp_config config;
> + void *certs = NULL;
> + int ret = 0;
> +
> + if (!sev->snp_inited || !argp->data)
> + return -EINVAL;
> +
> + if (!writable)
> + return -EPERM;
> +
> + if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
> + return -EFAULT;
> +
> + /* Copy the certs from userspace */
> + if (input.certs_address) {
> + if (!input.certs_len || !IS_ALIGNED(input.certs_len, PAGE_SIZE))
> + return -EINVAL;
> +
> + certs = psp_copy_user_blob(input.certs_address, input.certs_len);

I see that psp_copy_user_blob() uses memdup_user(), which accounts the
allocated memory as GFP_USER. Given this memory is long lived and now
belongs to the PSP driver in perpetuity, should it be allocated with
GFP_KERNEL?
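
One option (sketch only, not tested) would be to skip memdup_user() here
and allocate the long-lived copy explicitly with GFP_KERNEL:

	certs = kmalloc(input.certs_len, GFP_KERNEL);
	if (!certs)
		return -ENOMEM;

	if (copy_from_user(certs, (void __user *)input.certs_address,
			   input.certs_len)) {
		ret = -EFAULT;
		goto e_free;
	}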

> + if (IS_ERR(certs))
> + return PTR_ERR(certs);
> + }
> +
> + /* Issue the PSP command to update the TCB version using the SNP_CONFIG. */
> + if (input.config_address) {
> + if (copy_from_user(&config,
> + (void __user *)input.config_address, sizeof(config))) {
> + ret = -EFAULT;
> + goto e_free;
> + }
> +
> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_CONFIG, &config, &argp->error);
> + if (ret)
> + goto e_free;
> +
> + memcpy(&sev->snp_config, &config, sizeof(config));
> + }
> +
> + /*
> + * If the new certs are passed then cache it else free the old certs.
> + */
> + if (certs) {
> + kfree(sev->snp_certs_data);
> + sev->snp_certs_data = certs;
> + sev->snp_certs_len = input.certs_len;
> + } else {
> + kfree(sev->snp_certs_data);
> + sev->snp_certs_data = NULL;
> + sev->snp_certs_len = 0;
> + }

Do we need another lock here? When I look at 18/49, it seems like
snp_guest_ext_guest_request() could race with this path for
|sev->snp_certs_data|.
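
For example (hypothetical, the lock name is made up): add a mutex to
struct sev_device, initialize it once at probe time, and take it around
both the update here and the length-check/memcpy in
snp_guest_ext_guest_request():

	mutex_lock(&sev->snp_certs_lock);
	kfree(sev->snp_certs_data);
	sev->snp_certs_data = certs;		/* may be NULL */
	sev->snp_certs_len = certs ? input.certs_len : 0;
	mutex_unlock(&sev->snp_certs_lock);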

> +
> + return 0;
> +
> +e_free:
> + kfree(certs);
> + return ret;
> +}
> +
> static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
> {
> void __user *argp = (void __user *)arg;
> @@ -1670,6 +1779,12 @@ static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
> case SNP_PLATFORM_STATUS:
> ret = sev_ioctl_snp_platform_status(&input);
> break;
> + case SNP_SET_EXT_CONFIG:
> + ret = sev_ioctl_snp_set_config(&input, writable);
> + break;
> + case SNP_GET_EXT_CONFIG:
> + ret = sev_ioctl_snp_get_config(&input);
> + break;
> default:
> ret = -EINVAL;
> goto out;
> diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
> index fe5d7a3ebace..d2fe1706311a 100644
> --- a/drivers/crypto/ccp/sev-dev.h
> +++ b/drivers/crypto/ccp/sev-dev.h
> @@ -66,6 +66,9 @@ struct sev_device {
>
> bool snp_inited;
> struct snp_host_map snp_host_map[MAX_SNP_HOST_MAP_BUFS];
> + void *snp_certs_data;
> + u32 snp_certs_len;
> + struct sev_user_data_snp_config snp_config;

Since this gets copy_to_user'd, can we memset this to 0 to prevent
leaking uninitialized kernel memory? Similar to recent patches with
kzalloc and __GFP_ZERO usage.
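
For instance (illustrative only, assuming the containing struct is not
already zeroed by the allocator):

	memset(&sev->snp_config, 0, sizeof(sev->snp_config));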


> };
>
> int sev_dev_init(struct psp_device *psp);
> diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
> index ffd60e8b0a31..60e7a8d1a18e 100644
> --- a/include/uapi/linux/psp-sev.h
> +++ b/include/uapi/linux/psp-sev.h
> @@ -29,6 +29,8 @@ enum {
> SEV_GET_ID, /* This command is deprecated, use SEV_GET_ID2 */
> SEV_GET_ID2,
> SNP_PLATFORM_STATUS,
> + SNP_SET_EXT_CONFIG,
> + SNP_GET_EXT_CONFIG,
>
> SEV_MAX,
> };
> @@ -190,6 +192,21 @@ struct sev_user_data_snp_config {
> __u8 rsvd[52];
> } __packed;
>
> +/**
> + * struct sev_data_snp_ext_config - system wide configuration value for SNP.
> + *
> + * @config_address: address of the struct sev_user_data_snp_config or 0 when
> + * reported_tcb does not need to be updated.
> + * @certs_address: address of extended guest request certificate chain or
> + * 0 when previous certificate should be removed on SNP_SET_EXT_CONFIG.
> + * @certs_len: length of the certs
> + */
> +struct sev_user_data_ext_snp_config {
> + __u64 config_address; /* In */
> + __u64 certs_address; /* In */
> + __u32 certs_len; /* In */
> +};
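
(For context, userspace would drive this through the usual SEV_ISSUE_CMD
ioctl on /dev/sev; rough, untested sketch, with sev_fd/certs/certs_len
assumed to be set up by the caller and certs_len a multiple of PAGE_SIZE:)

	struct sev_user_data_snp_config cfg = { 0 };	/* reported TCB etc. filled in by the caller */
	struct sev_user_data_ext_snp_config ext = {
		.config_address = (__u64)(uintptr_t)&cfg,
		.certs_address  = (__u64)(uintptr_t)certs,
		.certs_len      = certs_len,
	};
	struct sev_issue_cmd cmd = {
		.cmd  = SNP_SET_EXT_CONFIG,
		.data = (__u64)(uintptr_t)&ext,
	};

	ret = ioctl(sev_fd, SEV_ISSUE_CMD, &cmd);
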
> +
> /**
> * struct sev_issue_cmd - SEV ioctl parameters
> *
> --
> 2.25.1
>

2022-06-21 22:34:10

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 18/49] crypto: ccp: Provide APIs to query extended attestation report

On Mon, Jun 20, 2022 at 5:06 PM Ashish Kalra <[email protected]> wrote:
>
> From: Brijesh Singh <[email protected]>
>
> Version 2 of the GHCB specification defines VMGEXIT that is used to get
> the extended attestation report. The extended attestation report includes
> the certificate blobs provided through the SNP_SET_EXT_CONFIG.
>
> The snp_guest_ext_guest_request() will be used by the hypervisor to get
> the extended attestation report. See the GHCB specification for more
> details.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> drivers/crypto/ccp/sev-dev.c | 43 ++++++++++++++++++++++++++++++++++++
> include/linux/psp-sev.h | 24 ++++++++++++++++++++
> 2 files changed, 67 insertions(+)
>
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index 97b479d5aa86..f6306b820b86 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -25,6 +25,7 @@
> #include <linux/fs.h>
>
> #include <asm/smp.h>
> +#include <asm/sev.h>
>
> #include "psp-dev.h"
> #include "sev-dev.h"
> @@ -1857,6 +1858,48 @@ int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
> }
> EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt);
>
> +int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
> + unsigned long vaddr, unsigned long *npages, unsigned long *fw_err)
> +{
> + unsigned long expected_npages;
> + struct sev_device *sev;
> + int rc;
> +
> + if (!psp_master || !psp_master->sev_data)
> + return -ENODEV;
> +
> + sev = psp_master->sev_data;
> +
> + if (!sev->snp_inited)
> + return -EINVAL;
> +
> + /*
> + * Check if there is enough space to copy the certificate chain. Otherwise
> + * return ERROR code defined in the GHCB specification.
> + */
> + expected_npages = sev->snp_certs_len >> PAGE_SHIFT;
> + if (*npages < expected_npages) {
> + *npages = expected_npages;
> + *fw_err = SNP_GUEST_REQ_INVALID_LEN;
> + return -EINVAL;
> + }
> +
> + rc = sev_do_cmd(SEV_CMD_SNP_GUEST_REQUEST, data, (int *)&fw_err);

We can just pass |fw_err| here (with the cast), right? No need
to do &fw_err.

rc = sev_do_cmd(SEV_CMD_SNP_GUEST_REQUEST, data, (int *)fw_err);

> + if (rc)
> + return rc;
> +
> + /* Copy the certificate blob */
> + if (sev->snp_certs_data) {
> + *npages = expected_npages;
> + memcpy((void *)vaddr, sev->snp_certs_data, *npages << PAGE_SHIFT);

Why don't we just make |vaddr| into a void* instead of an unsigned long?
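
i.e. something like (sketch):

	int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
					void *certs_buf, unsigned long *npages,
					unsigned long *fw_err);

so the memcpy() above can drop the (void *) cast.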

> + } else {
> + *npages = 0;
> + }
> +
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_ext_guest_request);
> +
> static void sev_exit(struct kref *ref)
> {
> misc_deregister(&misc_dev->misc);
> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
> index a3bb792bb842..cd37ccd1fa1f 100644
> --- a/include/linux/psp-sev.h
> +++ b/include/linux/psp-sev.h
> @@ -945,6 +945,23 @@ void *psp_copy_user_blob(u64 uaddr, u32 len);
> void *snp_alloc_firmware_page(gfp_t mask);
> void snp_free_firmware_page(void *addr);
>
> +/**
> + * snp_guest_ext_guest_request - perform the SNP extended guest request command
> + * defined in the GHCB specification.
> + *
> + * @data: the input guest request structure
> + * @vaddr: address where the certificate blob needs to be copied.
> + * @npages: number of pages for the certificate blob.
> + * If the specified page count is less than the certificate blob size, then the
> + * required page count is returned with error code defined in the GHCB spec.
> + * If the specified page count is more than the certificate blob size, then
> + * page count is updated to reflect the amount of valid data copied in the
> + * vaddr.
> + */
> +int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
> + unsigned long vaddr, unsigned long *npages,
> + unsigned long *error);
> +
> #else /* !CONFIG_CRYPTO_DEV_SP_PSP */
>
> static inline int
> @@ -992,6 +1009,13 @@ static inline void *snp_alloc_firmware_page(gfp_t mask)
>
> static inline void snp_free_firmware_page(void *addr) { }
>
> +static inline int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
> + unsigned long vaddr, unsigned long *n,
> + unsigned long *error)
> +{
> + return -ENODEV;
> +}
> +
> #endif /* CONFIG_CRYPTO_DEV_SP_PSP */
>
> #endif /* __PSP_SEV_H__ */
> --
> 2.25.1
>

2022-06-22 01:48:12

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 13/49] crypto:ccp: Provide APIs to issue SEV-SNP commands

[Public]

>> +EXPORT_SYMBOL_GPL(snp_guest_decommission);
>> +
>> +int snp_guest_df_flush(int *error)
>> +{
>> + return sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, error); }
>> +EXPORT_SYMBOL_GPL(snp_guest_df_flush);

>Why not instead change sev_guest_df_flush() to be SNP aware? That way callers get the right behavior without having to know if SNP is enabled or not.

It can be done, and actually both DF_FLUSH commands do exactly the same thing.

But as with the other API interfaces here, I think it is better to differentiate between the SNP and SEV API interfaces and have the callers be aware of
which interface they are invoking.

Thanks,
Ashish

2022-06-22 10:33:52

by Dr. David Alan Gilbert

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 27/49] KVM: SVM: Mark the private vma unmerable for SEV-SNP guests

* Ashish Kalra ([email protected]) wrote:
> From: Brijesh Singh <[email protected]>
>
> When SEV-SNP is enabled, the guest private pages are added in the RMP
> table; while adding the pages, the rmp_make_private() unmaps the pages
> from the direct map. If KSM attempts to access those unmapped pages then
> it will trigger #PF (page-not-present).
>
> Encrypted guest pages cannot be shared between the process, so an
> userspace should not mark the region mergeable but to be safe, mark the
> process vma unmerable before adding the pages in the RMP table.
^
Typo 'unmergable' (also in title)

> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 32 ++++++++++++++++++++++++++++++++
> 1 file changed, 32 insertions(+)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index b5f0707d7ed6..a9461d352eda 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -19,11 +19,13 @@
> #include <linux/trace_events.h>
> #include <linux/hugetlb.h>
> #include <linux/sev.h>
> +#include <linux/ksm.h>
>
> #include <asm/pkru.h>
> #include <asm/trapnr.h>
> #include <asm/fpu/xcr.h>
> #include <asm/sev.h>
> +#include <asm/mman.h>
>
> #include "x86.h"
> #include "svm.h"
> @@ -1965,6 +1967,30 @@ static bool is_hva_registered(struct kvm *kvm, hva_t hva, size_t len)
> return false;
> }
>
> +static int snp_mark_unmergable(struct kvm *kvm, u64 start, u64 size)
> +{
> + struct vm_area_struct *vma;
> + u64 end = start + size;
> + int ret;
> +
> + do {
> + vma = find_vma_intersection(kvm->mm, start, end);
> + if (!vma) {
> + ret = -EINVAL;
> + break;
> + }
> +
> + ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> + MADV_UNMERGEABLE, &vma->vm_flags);
> + if (ret)
> + break;
> +
> + start = vma->vm_end;
> + } while (end > vma->vm_end);
> +
> + return ret;
> +}
> +
> static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> {
> struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> @@ -1989,6 +2015,12 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> if (!is_hva_registered(kvm, params.uaddr, params.len))
> return -EINVAL;
>
> + mmap_write_lock(kvm->mm);
> + ret = snp_mark_unmergable(kvm, params.uaddr, params.len);
> + mmap_write_unlock(kvm->mm);
> + if (ret)
> + return -EFAULT;
> +
> /*
> * The userspace memory is already locked so technically we don't
> * need to lock it again. Later part of the function needs to know
> --
> 2.25.1
>
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK

2022-06-22 14:15:33

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On 6/20/22 16:02, Ashish Kalra wrote:
> +/*
> + * The RMP entry format is not architectural. The format is defined in PPR
> + * Family 19h Model 01h, Rev B1 processor.
> + */

Let's say that Family 20h comes out and has a new RMP entry format.
What keeps an old kernel from attempting to use this old format on that
new CPU?

2022-06-22 14:29:55

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

[AMD Official Use Only - General]

>> +/*
>> + * The RMP entry format is not architectural. The format is defined
>> +in PPR
>> + * Family 19h Model 01h, Rev B1 processor.
>> + */

>Let's say that Family 20h comes out and has a new RMP entry format.
>What keeps an old kernel from attempting to use this old format on that new CPU?

As I replied previously on the same subject:
Architectural implies that it is defined in the APM and shouldn't change in such a way as to not be backward compatible.
I probably think the wording here should be architecture independent or more precisely platform independent.

Thanks,
Ashish

2022-06-22 14:30:02

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

On 6/20/22 16:02, Ashish Kalra wrote:
> +int psmash(u64 pfn)
> +{
> + unsigned long paddr = pfn << PAGE_SHIFT;
> + int ret;
> +
> + if (!pfn_valid(pfn))
> + return -EINVAL;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> + return -ENXIO;
> +
> + /* Binutils version 2.36 supports the PSMASH mnemonic. */
> + asm volatile(".byte 0xF3, 0x0F, 0x01, 0xFF"
> + : "=a"(ret)
> + : "a"(paddr)
> + : "memory", "cc");
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(psmash);

If a function gets an EXPORT_SYMBOL_GPL(), the least we can do is
reasonably document it. We don't need full kerneldoc nonsense, but a
one-line about what this does would be quite helpful. That goes for all
the functions here.

It would also be extremely helpful to have the changelog explain why
these functions are exported and how the exports will be used.
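
For example, even something as simple as this (wording illustrative) above
the definition would help readers of the exported API:

	/* Split a 2MB RMP entry into the corresponding 512 4KB RMP entries (PSMASH). */
	int psmash(u64 pfn)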

As a general rule, please push cpu_feature_enabled() checks as early as
you reasonably can. They are *VERY* cheap and can even enable the
compiler to completely zap code like an #ifdef.

There also seem to be a lot of pfn_valid() checks in here that aren't
very well thought out. For instance, there's a pfn_valid() check here:


+int rmp_make_shared(u64 pfn, enum pg_level level)
+{
+ struct rmpupdate val;
+
+ if (!pfn_valid(pfn))
+ return -EINVAL;
...
+ return rmpupdate(pfn, &val);
+}

and in rmpupdate():

+static int rmpupdate(u64 pfn, struct rmpupdate *val)
+{
+ unsigned long paddr = pfn << PAGE_SHIFT;
+ int ret;
+
+ if (!pfn_valid(pfn))
+ return -EINVAL;
...


This is (at best) wasteful. Could it be refactored?
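
One way to restructure it (sketch only; assuming the same RMPUPDATE
encoding as in the patch, with the shared-state field setup elided): do
the cheap feature check first and keep the pfn_valid() check in one place,
so the wrappers stay thin:

	static int rmpupdate(u64 pfn, struct rmpupdate *val)
	{
		unsigned long paddr = pfn << PAGE_SHIFT;
		int ret;

		if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
			return -ENXIO;

		if (!pfn_valid(pfn))
			return -EINVAL;

		/* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
		asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
			     : "=a"(ret)
			     : "a"(paddr), "c"((unsigned long)val)
			     : "memory", "cc");

		return ret;
	}

	int rmp_make_shared(u64 pfn, enum pg_level level)
	{
		struct rmpupdate val = {};	/* pagesize setup elided */

		return rmpupdate(pfn, &val);	/* no duplicate pfn_valid() here */
	}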

2022-06-22 14:32:42

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On 6/22/22 07:22, Kalra, Ashish wrote:
> As I replied previously on the same subject: Architectural implies
> that it is defined in the APM and shouldn't change in such a way as
> to not be backward compatible. I probably think the wording here
> should be architecture independent or more precisely platform
> independent.
Yeah, arch-independent and non-architectural are quite different concepts.

At Intel, at least, when someone says "not architectural" it means that the
behavior is implementation-specific. That, combined with the
model/family/stepping, gave me the wrong impression about what was going on.

Some more clarity would be greatly appreciated.

2022-06-22 14:33:01

by Jeremi Piotrowski

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Mon, Jun 20, 2022 at 11:03:43PM +0000, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> When SEV-SNP is enabled globally, a write from the host goes through the
> RMP check. When the host writes to pages, hardware checks the following
> conditions at the end of page walk:
>
> 1. Assigned bit in the RMP table is zero (i.e page is shared).
> 2. If the page table entry that gives the sPA indicates that the target
> page size is a large page, then all RMP entries for the 4KB
> constituting pages of the target must have the assigned bit 0.
> 3. Immutable bit in the RMP table is not zero.
>
> The hardware will raise page fault if one of the above conditions is not
> met. Try resolving the fault instead of taking fault again and again. If
> the host attempts to write to the guest private memory then send the
> SIGBUS signal to kill the process. If the page level between the host and
> RMP entry does not match, then split the address to keep the RMP and host
> page levels in sync.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/mm/fault.c | 66 ++++++++++++++++++++++++++++++++++++++++
> include/linux/mm.h | 3 +-
> include/linux/mm_types.h | 3 ++
> mm/memory.c | 13 ++++++++
> 4 files changed, 84 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index a4c270e99f7f..f5de9673093a 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -19,6 +19,7 @@
> #include <linux/uaccess.h> /* faulthandler_disabled() */
> #include <linux/efi.h> /* efi_crash_gracefully_on_page_fault()*/
> #include <linux/mm_types.h>
> +#include <linux/sev.h> /* snp_lookup_rmpentry() */
>
> #include <asm/cpufeature.h> /* boot_cpu_has, ... */
> #include <asm/traps.h> /* dotraplinkage, ... */
> @@ -1209,6 +1210,60 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> }
> NOKPROBE_SYMBOL(do_kern_addr_fault);
>
> +static inline size_t pages_per_hpage(int level)
> +{
> + return page_level_size(level) / PAGE_SIZE;
> +}
> +
> +/*
> + * Return 1 if the caller needs to retry, 0 if the address needs to be split
> + * in order to resolve the fault.
> + */
> +static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
> + unsigned long address)
> +{
> + int rmp_level, level;
> + pte_t *pte;
> + u64 pfn;
> +
> + pte = lookup_address_in_mm(current->mm, address, &level);
> +
> + /*
> + * It can happen if there was a race between an unmap event and
> + * the RMP fault delivery.
> + */
> + if (!pte || !pte_present(*pte))
> + return 1;
> +
> + pfn = pte_pfn(*pte);
> +
> + /* If it's a large page then calculate the fault pfn */
> + if (level > PG_LEVEL_4K) {
> + unsigned long mask;
> +
> + mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
> + pfn |= (address >> PAGE_SHIFT) & mask;
> + }
> +
> + /*
> + * If it's a guest private page, then the fault cannot be resolved.
> + * Send a SIGBUS to terminate the process.
> + */
> + if (snp_lookup_rmpentry(pfn, &rmp_level)) {

snp_lookup_rmpentry returns 0, 1 or -errno, so this should likely be:

if (snp_lookup_rmpentry(pfn, &rmp_level) != 1) {

> + do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
> + return 1;
> + }
> +
> + /*
> + * The backing page level is higher than the RMP page level, request
> + * to split the page.
> + */
> + if (level > rmp_level)
> + return 0;
> +
> + return 1;
> +}
> +
> /*
> * Handle faults in the user portion of the address space. Nothing in here
> * should check X86_PF_USER without a specific justification: for almost
> @@ -1306,6 +1361,17 @@ void do_user_addr_fault(struct pt_regs *regs,
> if (error_code & X86_PF_INSTR)
> flags |= FAULT_FLAG_INSTRUCTION;
>
> + /*
> + * If it's an RMP violation, try resolving it.
> + */
> + if (error_code & X86_PF_RMP) {
> + if (handle_user_rmp_page_fault(regs, error_code, address))
> + return;
> +
> + /* Ask to split the page */
> + flags |= FAULT_FLAG_PAGE_SPLIT;
> + }
> +
> #ifdef CONFIG_X86_64
> /*
> * Faults in the vsyscall page might need emulation. The
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index de32c0383387..2ccc562d166f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -463,7 +463,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
> { FAULT_FLAG_USER, "USER" }, \
> { FAULT_FLAG_REMOTE, "REMOTE" }, \
> { FAULT_FLAG_INSTRUCTION, "INSTRUCTION" }, \
> - { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
> + { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
> + { FAULT_FLAG_PAGE_SPLIT, "PAGESPLIT" }
>
> /*
> * vm_fault is filled by the pagefault handler and passed to the vma's
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6dfaf271ebf8..aa2d8d48ce3e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -818,6 +818,8 @@ typedef struct {
> * mapped R/O.
> * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
> * We should only access orig_pte if this flag set.
> + * @FAULT_FLAG_PAGE_SPLIT: The fault was due to a page size mismatch, split the
> + * region to smaller page size and retry.
> *
> * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
> * whether we would allow page faults to retry by specifying these two
> @@ -855,6 +857,7 @@ enum fault_flag {
> FAULT_FLAG_INTERRUPTIBLE = 1 << 9,
> FAULT_FLAG_UNSHARE = 1 << 10,
> FAULT_FLAG_ORIG_PTE_VALID = 1 << 11,
> + FAULT_FLAG_PAGE_SPLIT = 1 << 12,
> };
>
> typedef unsigned int __bitwise zap_flags_t;
> diff --git a/mm/memory.c b/mm/memory.c
> index 7274f2b52bca..c2187ffcbb8e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4945,6 +4945,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> return 0;
> }
>
> +static int handle_split_page_fault(struct vm_fault *vmf)
> +{
> + if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
> + return VM_FAULT_SIGBUS;
> +
> + __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
> + return 0;
> +}
> +
> /*
> * By the time we get here, we already hold the mm semaphore
> *
> @@ -5024,6 +5033,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> pmd_migration_entry_wait(mm, vmf.pmd);
> return 0;
> }
> +
> + if (flags & FAULT_FLAG_PAGE_SPLIT)
> + return handle_split_page_fault(&vmf);
> +
> if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
> if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
> return do_huge_pmd_numa_page(&vmf);
> --
> 2.25.1
>

2022-06-22 14:34:28

by Jeremi Piotrowski

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 10/49] x86/fault: Add support to dump RMP entry on fault

On Mon, Jun 20, 2022 at 11:03:58PM +0000, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> When SEV-SNP is enabled globally, a write from the host goes through the
> RMP check. If the hardware encounters the check failure, then it raises
> the #PF (with RMP set). Dump the RMP entry at the faulting pfn to help
> the debug.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/include/asm/sev.h | 7 +++++++
> arch/x86/kernel/sev.c | 43 ++++++++++++++++++++++++++++++++++++++
> arch/x86/mm/fault.c | 17 +++++++++++----
> include/linux/sev.h | 2 ++
> 4 files changed, 65 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
> index 6ab872311544..c0c4df817159 100644
> --- a/arch/x86/include/asm/sev.h
> +++ b/arch/x86/include/asm/sev.h
> @@ -113,6 +113,11 @@ struct __packed rmpentry {
>
> #define rmpentry_assigned(x) ((x)->info.assigned)
> #define rmpentry_pagesize(x) ((x)->info.pagesize)
> +#define rmpentry_vmsa(x) ((x)->info.vmsa)
> +#define rmpentry_asid(x) ((x)->info.asid)
> +#define rmpentry_validated(x) ((x)->info.validated)
> +#define rmpentry_gpa(x) ((unsigned long)(x)->info.gpa)
> +#define rmpentry_immutable(x) ((x)->info.immutable)
>
> #define RMPADJUST_VMSA_PAGE_BIT BIT(16)
>
> @@ -205,6 +210,7 @@ void snp_set_wakeup_secondary_cpu(void);
> bool snp_init(struct boot_params *bp);
> void snp_abort(void);
> int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
> +void dump_rmpentry(u64 pfn);
> #else
> static inline void sev_es_ist_enter(struct pt_regs *regs) { }
> static inline void sev_es_ist_exit(void) { }
> @@ -229,6 +235,7 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
> {
> return -ENOTTY;
> }
> +static inline void dump_rmpentry(u64 pfn) {}
> #endif
>
> #endif
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 734cddd837f5..6640a639fffc 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -2414,6 +2414,49 @@ static struct rmpentry *__snp_lookup_rmpentry(u64 pfn, int *level)
> return entry;
> }
>
> +void dump_rmpentry(u64 pfn)
> +{
> + unsigned long pfn_end;
> + struct rmpentry *e;
> + int level;
> +
> + e = __snp_lookup_rmpentry(pfn, &level);
> + if (!e) {

__snp_lookup_rmpentry may return -errno so this should be:

if (e != 1)

> + pr_alert("failed to read RMP entry pfn 0x%llx\n", pfn);
> + return;
> + }
> +
> + if (rmpentry_assigned(e)) {
> + pr_alert("RMPEntry paddr 0x%llx [assigned=%d immutable=%d pagesize=%d gpa=0x%lx"
> + " asid=%d vmsa=%d validated=%d]\n", pfn << PAGE_SHIFT,
> + rmpentry_assigned(e), rmpentry_immutable(e), rmpentry_pagesize(e),
> + rmpentry_gpa(e), rmpentry_asid(e), rmpentry_vmsa(e),
> + rmpentry_validated(e));
> + return;
> + }
> +
> + /*
> + * If the RMP entry at the faulting pfn was not assigned, then we do not
> + * know what caused the RMP violation. To get some useful debug information,
> + * let's iterate through the entire 2MB region, and dump the RMP entries if
> + * one of the bits in the RMP entry is set.
> + */
> + pfn = pfn & ~(PTRS_PER_PMD - 1);
> + pfn_end = pfn + PTRS_PER_PMD;
> +
> + while (pfn < pfn_end) {
> + e = __snp_lookup_rmpentry(pfn, &level);
> + if (!e)

if (e != 1)

> + return;
> +
> + if (e->low || e->high)
> + pr_alert("RMPEntry paddr 0x%llx: [high=0x%016llx low=0x%016llx]\n",
> + pfn << PAGE_SHIFT, e->high, e->low);
> + pfn++;
> + }
> +}
> +EXPORT_SYMBOL_GPL(dump_rmpentry);
> +
> /*
> * Return 1 if the RMP entry is assigned, 0 if it exists but is not assigned,
> * and -errno if there is no corresponding RMP entry.
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f5de9673093a..25896a6ba04a 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -34,6 +34,7 @@
> #include <asm/kvm_para.h> /* kvm_handle_async_pf */
> #include <asm/vdso.h> /* fixup_vdso_exception() */
> #include <asm/irq_stack.h>
> +#include <asm/sev.h> /* dump_rmpentry() */
>
> #define CREATE_TRACE_POINTS
> #include <asm/trace/exceptions.h>
> @@ -290,7 +291,7 @@ static bool low_pfn(unsigned long pfn)
> return pfn < max_low_pfn;
> }
>
> -static void dump_pagetable(unsigned long address)
> +static void dump_pagetable(unsigned long address, bool show_rmpentry)
> {
> pgd_t *base = __va(read_cr3_pa());
> pgd_t *pgd = &base[pgd_index(address)];
> @@ -346,10 +347,11 @@ static int bad_address(void *p)
> return get_kernel_nofault(dummy, (unsigned long *)p);
> }
>
> -static void dump_pagetable(unsigned long address)
> +static void dump_pagetable(unsigned long address, bool show_rmpentry)
> {
> pgd_t *base = __va(read_cr3_pa());
> pgd_t *pgd = base + pgd_index(address);
> + unsigned long pfn;
> p4d_t *p4d;
> pud_t *pud;
> pmd_t *pmd;
> @@ -367,6 +369,7 @@ static void dump_pagetable(unsigned long address)
> if (bad_address(p4d))
> goto bad;
>
> + pfn = p4d_pfn(*p4d);
> pr_cont("P4D %lx ", p4d_val(*p4d));
> if (!p4d_present(*p4d) || p4d_large(*p4d))
> goto out;
> @@ -375,6 +378,7 @@ static void dump_pagetable(unsigned long address)
> if (bad_address(pud))
> goto bad;
>
> + pfn = pud_pfn(*pud);
> pr_cont("PUD %lx ", pud_val(*pud));
> if (!pud_present(*pud) || pud_large(*pud))
> goto out;
> @@ -383,6 +387,7 @@ static void dump_pagetable(unsigned long address)
> if (bad_address(pmd))
> goto bad;
>
> + pfn = pmd_pfn(*pmd);
> pr_cont("PMD %lx ", pmd_val(*pmd));
> if (!pmd_present(*pmd) || pmd_large(*pmd))
> goto out;
> @@ -391,9 +396,13 @@ static void dump_pagetable(unsigned long address)
> if (bad_address(pte))
> goto bad;
>
> + pfn = pte_pfn(*pte);
> pr_cont("PTE %lx", pte_val(*pte));
> out:
> pr_cont("\n");
> +
> + if (show_rmpentry)
> + dump_rmpentry(pfn);
> return;
> bad:
> pr_info("BAD\n");
> @@ -579,7 +588,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
> show_ldttss(&gdt, "TR", tr);
> }
>
> - dump_pagetable(address);
> + dump_pagetable(address, error_code & X86_PF_RMP);
> }
>
> static noinline void
> @@ -596,7 +605,7 @@ pgtable_bad(struct pt_regs *regs, unsigned long error_code,
>
> printk(KERN_ALERT "%s: Corrupted page table at address %lx\n",
> tsk->comm, address);
> - dump_pagetable(address);
> + dump_pagetable(address, false);
>
> if (__die("Bad pagetable", regs, error_code))
> sig = 0;
> diff --git a/include/linux/sev.h b/include/linux/sev.h
> index 1a68842789e1..734b13a69c54 100644
> --- a/include/linux/sev.h
> +++ b/include/linux/sev.h
> @@ -16,6 +16,7 @@ int snp_lookup_rmpentry(u64 pfn, int *level);
> int psmash(u64 pfn);
> int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid, bool immutable);
> int rmp_make_shared(u64 pfn, enum pg_level level);
> +void dump_rmpentry(u64 pfn);
> #else
> static inline int snp_lookup_rmpentry(u64 pfn, int *level) { return 0; }
> static inline int psmash(u64 pfn) { return -ENXIO; }
> @@ -25,6 +26,7 @@ static inline int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int as
> return -ENODEV;
> }
> static inline int rmp_make_shared(u64 pfn, enum pg_level level) { return -ENODEV; }
> +static inline void dump_rmpentry(u64 pfn) { }
>
> #endif /* CONFIG_AMD_MEM_ENCRYPT */
> #endif /* __LINUX_SEV_H */
> --
> 2.25.1
>

2022-06-22 14:50:28

by Jeremi Piotrowski

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 10/49] x86/fault: Add support to dump RMP entry on fault

On Wed, Jun 22, 2022 at 04:33:04PM +0200, Jeremi Piotrowski wrote:
> On Mon, Jun 20, 2022 at 11:03:58PM +0000, Ashish Kalra wrote:
> > From: Brijesh Singh <[email protected]>
> >
> > When SEV-SNP is enabled globally, a write from the host goes through the
> > RMP check. If the hardware encounters the check failure, then it raises
> > the #PF (with RMP set). Dump the RMP entry at the faulting pfn to help
> > the debug.
> >
> > Signed-off-by: Brijesh Singh <[email protected]>
> > ---
> > arch/x86/include/asm/sev.h | 7 +++++++
> > arch/x86/kernel/sev.c | 43 ++++++++++++++++++++++++++++++++++++++
> > arch/x86/mm/fault.c | 17 +++++++++++----
> > include/linux/sev.h | 2 ++
> > 4 files changed, 65 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
> > index 6ab872311544..c0c4df817159 100644
> > --- a/arch/x86/include/asm/sev.h
> > +++ b/arch/x86/include/asm/sev.h
> > @@ -113,6 +113,11 @@ struct __packed rmpentry {
> >
> > #define rmpentry_assigned(x) ((x)->info.assigned)
> > #define rmpentry_pagesize(x) ((x)->info.pagesize)
> > +#define rmpentry_vmsa(x) ((x)->info.vmsa)
> > +#define rmpentry_asid(x) ((x)->info.asid)
> > +#define rmpentry_validated(x) ((x)->info.validated)
> > +#define rmpentry_gpa(x) ((unsigned long)(x)->info.gpa)
> > +#define rmpentry_immutable(x) ((x)->info.immutable)
> >
> > #define RMPADJUST_VMSA_PAGE_BIT BIT(16)
> >
> > @@ -205,6 +210,7 @@ void snp_set_wakeup_secondary_cpu(void);
> > bool snp_init(struct boot_params *bp);
> > void snp_abort(void);
> > int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
> > +void dump_rmpentry(u64 pfn);
> > #else
> > static inline void sev_es_ist_enter(struct pt_regs *regs) { }
> > static inline void sev_es_ist_exit(void) { }
> > @@ -229,6 +235,7 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
> > {
> > return -ENOTTY;
> > }
> > +static inline void dump_rmpentry(u64 pfn) {}
> > #endif
> >
> > #endif
> > diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> > index 734cddd837f5..6640a639fffc 100644
> > --- a/arch/x86/kernel/sev.c
> > +++ b/arch/x86/kernel/sev.c
> > @@ -2414,6 +2414,49 @@ static struct rmpentry *__snp_lookup_rmpentry(u64 pfn, int *level)
> > return entry;
> > }
> >
> > +void dump_rmpentry(u64 pfn)
> > +{
> > + unsigned long pfn_end;
> > + struct rmpentry *e;
> > + int level;
> > +
> > + e = __snp_lookup_rmpentry(pfn, &level);
> > + if (!e) {
>
> __snp_lookup_rmpentry may return -errno so this should be:
>
> if (e != 1)

Sorry, actually it should be:

if (IS_ERR_OR_NULL(e)) {

>
> > + pr_alert("failed to read RMP entry pfn 0x%llx\n", pfn);
> > + return;
> > + }
> > +
> > + if (rmpentry_assigned(e)) {
> > + pr_alert("RMPEntry paddr 0x%llx [assigned=%d immutable=%d pagesize=%d gpa=0x%lx"
> > + " asid=%d vmsa=%d validated=%d]\n", pfn << PAGE_SHIFT,
> > + rmpentry_assigned(e), rmpentry_immutable(e), rmpentry_pagesize(e),
> > + rmpentry_gpa(e), rmpentry_asid(e), rmpentry_vmsa(e),
> > + rmpentry_validated(e));
> > + return;
> > + }
> > +
> > + /*
> > + * If the RMP entry at the faulting pfn was not assigned, then we do not
> > + * know what caused the RMP violation. To get some useful debug information,
> > + * let's iterate through the entire 2MB region, and dump the RMP entries if
> > + * one of the bits in the RMP entry is set.
> > + */
> > + pfn = pfn & ~(PTRS_PER_PMD - 1);
> > + pfn_end = pfn + PTRS_PER_PMD;
> > +
> > + while (pfn < pfn_end) {
> > + e = __snp_lookup_rmpentry(pfn, &level);
> > + if (!e)
>
> if (e != 1)
>

and this too:

if (IS_ERR_OR_NULL(e))


> > + return;
> > +
> > + if (e->low || e->high)
> > + pr_alert("RMPEntry paddr 0x%llx: [high=0x%016llx low=0x%016llx]\n",
> > + pfn << PAGE_SHIFT, e->high, e->low);
> > + pfn++;
> > + }
> > +}
> > +EXPORT_SYMBOL_GPL(dump_rmpentry);
> > +
> > /*
> > * Return 1 if the RMP entry is assigned, 0 if it exists but is not assigned,
> > * and -errno if there is no corresponding RMP entry.
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index f5de9673093a..25896a6ba04a 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -34,6 +34,7 @@
> > #include <asm/kvm_para.h> /* kvm_handle_async_pf */
> > #include <asm/vdso.h> /* fixup_vdso_exception() */
> > #include <asm/irq_stack.h>
> > +#include <asm/sev.h> /* dump_rmpentry() */
> >
> > #define CREATE_TRACE_POINTS
> > #include <asm/trace/exceptions.h>
> > @@ -290,7 +291,7 @@ static bool low_pfn(unsigned long pfn)
> > return pfn < max_low_pfn;
> > }
> >
> > -static void dump_pagetable(unsigned long address)
> > +static void dump_pagetable(unsigned long address, bool show_rmpentry)
> > {
> > pgd_t *base = __va(read_cr3_pa());
> > pgd_t *pgd = &base[pgd_index(address)];
> > @@ -346,10 +347,11 @@ static int bad_address(void *p)
> > return get_kernel_nofault(dummy, (unsigned long *)p);
> > }
> >
> > -static void dump_pagetable(unsigned long address)
> > +static void dump_pagetable(unsigned long address, bool show_rmpentry)
> > {
> > pgd_t *base = __va(read_cr3_pa());
> > pgd_t *pgd = base + pgd_index(address);
> > + unsigned long pfn;
> > p4d_t *p4d;
> > pud_t *pud;
> > pmd_t *pmd;
> > @@ -367,6 +369,7 @@ static void dump_pagetable(unsigned long address)
> > if (bad_address(p4d))
> > goto bad;
> >
> > + pfn = p4d_pfn(*p4d);
> > pr_cont("P4D %lx ", p4d_val(*p4d));
> > if (!p4d_present(*p4d) || p4d_large(*p4d))
> > goto out;
> > @@ -375,6 +378,7 @@ static void dump_pagetable(unsigned long address)
> > if (bad_address(pud))
> > goto bad;
> >
> > + pfn = pud_pfn(*pud);
> > pr_cont("PUD %lx ", pud_val(*pud));
> > if (!pud_present(*pud) || pud_large(*pud))
> > goto out;
> > @@ -383,6 +387,7 @@ static void dump_pagetable(unsigned long address)
> > if (bad_address(pmd))
> > goto bad;
> >
> > + pfn = pmd_pfn(*pmd);
> > pr_cont("PMD %lx ", pmd_val(*pmd));
> > if (!pmd_present(*pmd) || pmd_large(*pmd))
> > goto out;
> > @@ -391,9 +396,13 @@ static void dump_pagetable(unsigned long address)
> > if (bad_address(pte))
> > goto bad;
> >
> > + pfn = pte_pfn(*pte);
> > pr_cont("PTE %lx", pte_val(*pte));
> > out:
> > pr_cont("\n");
> > +
> > + if (show_rmpentry)
> > + dump_rmpentry(pfn);
> > return;
> > bad:
> > pr_info("BAD\n");
> > @@ -579,7 +588,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
> > show_ldttss(&gdt, "TR", tr);
> > }
> >
> > - dump_pagetable(address);
> > + dump_pagetable(address, error_code & X86_PF_RMP);
> > }
> >
> > static noinline void
> > @@ -596,7 +605,7 @@ pgtable_bad(struct pt_regs *regs, unsigned long error_code,
> >
> > printk(KERN_ALERT "%s: Corrupted page table at address %lx\n",
> > tsk->comm, address);
> > - dump_pagetable(address);
> > + dump_pagetable(address, false);
> >
> > if (__die("Bad pagetable", regs, error_code))
> > sig = 0;
> > diff --git a/include/linux/sev.h b/include/linux/sev.h
> > index 1a68842789e1..734b13a69c54 100644
> > --- a/include/linux/sev.h
> > +++ b/include/linux/sev.h
> > @@ -16,6 +16,7 @@ int snp_lookup_rmpentry(u64 pfn, int *level);
> > int psmash(u64 pfn);
> > int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid, bool immutable);
> > int rmp_make_shared(u64 pfn, enum pg_level level);
> > +void dump_rmpentry(u64 pfn);
> > #else
> > static inline int snp_lookup_rmpentry(u64 pfn, int *level) { return 0; }
> > static inline int psmash(u64 pfn) { return -ENXIO; }
> > @@ -25,6 +26,7 @@ static inline int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int as
> > return -ENODEV;
> > }
> > static inline int rmp_make_shared(u64 pfn, enum pg_level level) { return -ENODEV; }
> > +static inline void dump_rmpentry(u64 pfn) { }
> >
> > #endif /* CONFIG_AMD_MEM_ENCRYPT */
> > #endif /* __LINUX_SEV_H */
> > --
> > 2.25.1
> >

2022-06-22 18:11:22

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

[AMD Official Use Only - General]

>> +int psmash(u64 pfn)
>> +{
>> + unsigned long paddr = pfn << PAGE_SHIFT;
>> + int ret;
>> +
>> + if (!pfn_valid(pfn))
>> + return -EINVAL;
>> +
>> + if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
>> + return -ENXIO;
>> +
>> + /* Binutils version 2.36 supports the PSMASH mnemonic. */
>> + asm volatile(".byte 0xF3, 0x0F, 0x01, 0xFF"
>> + : "=a"(ret)
>> + : "a"(paddr)
>> + : "memory", "cc");
>> +
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(psmash);

>If a function gets an EXPORT_SYMBOL_GPL(), the least we can do is reasonably document it. We don't need full kerneldoc nonsense, but a one-line about what this does would be quite helpful. That goes for all the functions here.

>It would also be extremely helpful to have the changelog explain why these functions are exported and how the exports will be used.

I will add basic descriptions for all these exported functions.

Thanks,
Ashish

>As a general rule, please push cpu_feature_enabled() checks as early as you reasonably can. They are *VERY* cheap and can even enable the compiler to completely zap code like an #ifdef.

There also seem to be a lot of pfn_valid() checks in here that aren't very well thought out. For instance, there's a pfn_valid() check here:


+int rmp_make_shared(u64 pfn, enum pg_level level) {
+ struct rmpupdate val;
+
+ if (!pfn_valid(pfn))
+ return -EINVAL;
...
+ return rmpupdate(pfn, &val);
+}

and in rmpupdate():

+static int rmpupdate(u64 pfn, struct rmpupdate *val) {
+ unsigned long paddr = pfn << PAGE_SHIFT;
+ int ret;
+
+ if (!pfn_valid(pfn))
+ return -EINVAL;
...


This is (at best) wasteful. Could it be refactored?

2022-06-22 18:11:42

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 10/49] x86/fault: Add support to dump RMP entry on fault

[AMD Official Use Only - General]

>> > +void dump_rmpentry(u64 pfn)
>> > +{
>> > + unsigned long pfn_end;
>> > + struct rmpentry *e;
>> > + int level;
>> > +
>> > + e = __snp_lookup_rmpentry(pfn, &level);
>> > + if (!e) {
>>
>> __snp_lookup_rmpentry may return -errno so this should be:
>>
>> if (e != 1)

>Sorry, actually it should be:

> if (IS_ERR_OR_NULL(e)) {

I will fix this accordingly.

>> > +
>> > + while (pfn < pfn_end) {
>> > + e = __snp_lookup_rmpentry(pfn, &level);
>> > + if (!e)
>>
>> if (e != 1)
>>

>and this too:

> if (IS_ERR_OR_NULL(e))

Same here.

Thanks,
Ashish

2022-06-22 18:15:56

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

[Public]

On 6/22/22 07:22, Kalra, Ashish wrote:
>> As I replied previously on the same subject: Architectural implies
>> that it is defined in the APM and shouldn't change in such a way as to
>> not be backward compatible. I probably think the wording here should
>> be architecture independent or more precisely platform independent.
>Yeah, arch-independent and non-architectural are quite different concepts.

>At Intel, at least, when someone says "not architectural" it means that the behavior is implementation-specific. That, combined with the model/family/stepping, gave me the wrong impression about what was going on.

>Some more clarity would be greatly appreciated.

Actually, the PPR for family 19h Model 01h, Rev B1 defines the RMP entry format as below:

2.1.4.2 RMP Entry Format
Architecturally the format of RMP entries are not specified in APM. In order to assist software, the following table specifies select portions of the RMP entry format for this specific product. Each RMP entry is 16B in size and is formatted as follows. Software should not rely on any field definitions not specified in this table and the format of an RMP entry may change in future processors.

Architectural implies that it is defined in the APM and shouldn't change in such a way as to not be backward compatible. So non-architectural in this context means that it is only defined in our PPR.

So actually this RMP entry definition is platform dependent and will need to be changed for different AMD processors and that change has to be handled correspondingly in the dump_rmpentry() code.

Thanks,
Ashish

2022-06-22 18:18:02

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

[AMD Official Use Only - General]

>>> /*
>>> * The RMP entry format is not architectural. The format is defined
>>> in PPR @@ -126,6 +128,15 @@ struct snp_guest_platform_data {
>>> u64 secrets_gpa;
>>> };
>>>
>>> +struct rmpupdate {
>>> + u64 gpa;
>>> + u8 assigned;
>>> + u8 pagesize;
>>> + u8 immutable;
>>> + u8 rsvd;
>>> + u32 asid;
>>> +} __packed;

>>I see above it says the RMP entry format isn't architectural; is this 'rmpupdate' structure? If not how is this going to get handled when we have a couple >of SNP capable CPUs with different layouts?

>Architectural implies that it is defined in the APM and shouldn't change in such a way as to not be backward compatible.
>I probably think the wording here should be architecture independent or more precisely platform independent.

Some more clarity on this:

Actually, the PPR for family 19h Model 01h, Rev B1 defines the RMP entry format as below:

2.1.4.2 RMP Entry Format
Architecturally the format of RMP entries are not specified in APM. In order to assist software, the following table specifies select portions of the RMP entry format for this specific product. Each RMP entry is 16B in size and is formatted as follows. Software should not rely on any field definitions not specified in this table and the format of an RMP entry may change in future processors.

Architectural implies that it is defined in the APM and shouldn't change in such a way as to not be backward compatible. So non-architectural in this context means that it is only defined in our PPR.

So actually this RMP entry definition is platform dependent and will need to be changed for different AMD processors and that change has to be handled correspondingly in the dump_rmpentry() code.

Thanks,
Ashish

2022-06-22 18:47:30

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

[AMD Official Use Only - General]

-----Original Message-----
From: Dave Hansen <[email protected]>
Sent: Wednesday, June 22, 2022 1:18 PM
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On 6/22/22 11:15, Kalra, Ashish wrote:
> So actually this RMP entry definition is platform dependent and will
> need to be changed for different AMD processors and that change has to
> be handled correspondingly in the dump_rmpentry() code.

>So, if the RMP entry format changes in future processors, how do we make sure that the kernel does not try to use *this* code on those processors?

Functions snp_lookup_rmpentry() and dump_rmpentry() which rely on this structure definition will need to handle it accordingly.

Thanks,
Ashish

2022-06-22 18:47:38

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On 6/22/22 11:15, Kalra, Ashish wrote:
> So actually this RMP entry definition is platform dependent and will
> need to be changed for different AMD processors and that change has
> to be handled correspondingly in the dump_rmpentry() code.

So, if the RMP entry format changes in future processors, how do we make
sure that the kernel does not try to use *this* code on those processors?

2022-06-22 18:49:21

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On 6/22/22 11:34, Kalra, Ashish wrote:
>> So, if the RMP entry format changes in future processors, how do we
>> make sure that the kernel does not try to use *this* code on those
>> processors?
> Functions snp_lookup_rmpentry() and dump_rmpentry() which rely on
> this structure definition will need to handle it accordingly.

In other words, old kernels will break on new hardware?

I think that needs to be fixed. It should be as simple as a
model/family check, though. If someone (for example) attempts to use
SNP (and thus snp_lookup_rmpentry() and dump_rmpentry()) code on a newer
CPU, the kernel should refuse.

2022-06-22 19:44:58

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

[AMD Official Use Only - General]


-----Original Message-----
From: Dave Hansen <[email protected]>
Sent: Wednesday, June 22, 2022 1:43 PM
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On 6/22/22 11:34, Kalra, Ashish wrote:
>>> So, if the RMP entry format changes in future processors, how do we
>>> make sure that the kernel does not try to use *this* code on those
>>> processors?
>> Functions snp_lookup_rmpentry() and dump_rmpentry() which rely on this
>> structure definition will need to handle it accordingly.

>In other words, old kernels will break on new hardware?

>I think that needs to be fixed. It should be as simple as a model/family check, though. If someone (for example) attempts to use SNP (and thus snp_lookup_rmpentry() and dump_rmpentry()) code on a newer CPU, the kernel should refuse.

More specifically I am thinking of adding RMP entry field accessors so that they can do this cpu model/family check and return the correct field as per processor architecture.
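
e.g. something along these lines (hypothetical, names made up):

	static unsigned int rmpentry_get_asid(struct rmpentry *e)
	{
		/* Family 19h layout is the only format known today. */
		if (boot_cpu_data.x86 == 0x19)
			return e->info.asid;

		WARN_ONCE(1, "unknown RMP entry format\n");
		return 0;
	}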

Thanks,
Ashish

2022-06-22 20:10:20

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On 6/22/22 12:43, Kalra, Ashish wrote:
>> I think that needs to be fixed. It should be as simple as a
>> model/family check, though. If someone (for example) attempts to
>> use SNP (and thus snp_lookup_rmpentry() and dump_rmpentry()) code
>> on a newer CPU, the kernel should refuse.
> More specifically I am thinking of adding RMP entry field accessors
> so that they can do this cpu model/family check and return the
> correct field as per processor architecture.

That will be helpful down the road when there's more than one format.
But, the real issue is that the kernel doesn't *support* a different RMP
format. So, the SNP support should be disabled when encountering a
model/family other than the known good one.

2022-06-22 20:26:06

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

[AMD Official Use Only - General]


From: Dave Hansen <[email protected]>
Sent: Wednesday, June 22, 2022 2:50 PM
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On 6/22/22 12:43, Kalra, Ashish wrote:
>>> I think that needs to be fixed. It should be as simple as a
>>> model/family check, though. If someone (for example) attempts to use
>>> SNP (and thus snp_lookup_rmpentry() and dump_rmpentry()) code on a
>>> newer CPU, the kernel should refuse.
>> More specifically I am thinking of adding RMP entry field accessors so
>> that they can do this cpu model/family check and return the correct
>> field as per processor architecture.

>That will be helpful down the road when there's more than one format.
>But, the real issue is that the kernel doesn't *support* a different RMP format. So, the SNP support should be disabled when encountering a model/family other than the known good one.

Yes, that makes sense, will add an additional check in snp_rmptable_init().
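
e.g. something along the lines of (illustrative only; the exact bail-out
path depends on how snp_rmptable_init() is structured):

	/*
	 * The RMP entry format used by this code is the family 19h one,
	 * so refuse to set up SNP on anything else.
	 */
	if (boot_cpu_data.x86 != 0x19) {
		pr_info("SEV-SNP: unsupported CPU family, not enabling SNP\n");
		/* bail out of SNP setup here, leaving the feature disabled */
		return 0;
	}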

Thanks,
Ashish

2022-06-22 21:00:11

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

[Public]

From: Dave Hansen <[email protected]>
Sent: Wednesday, June 22, 2022 2:50 PM
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On 6/22/22 12:43, Kalra, Ashish wrote:
>>> I think that needs to be fixed. It should be as simple as a
>>> model/family check, though. If someone (for example) attempts to
>>> use SNP (and thus snp_lookup_rmpentry() and dump_rmpentry()) code on
>>> a newer CPU, the kernel should refuse.
>> More specifically I am thinking of adding RMP entry field accessors
>> so that they can do this cpu model/family check and return the
>> correct field as per processor architecture.

>That will be helpful down the road when there's more than one format.
>But, the real issue is that the kernel doesn't *support* a different RMP format. So, the SNP support should be disabled when encountering a model/family other than the known good one.

>Yes, that makes sense, will add an additional check in snp_rmptable_init().

To add here: we may also create an architectural way to read the RMP entry in the future.

Thanks,
Ashish

2022-06-23 20:50:26

by Marc Orr

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 03/49] x86/sev: Add the host SEV-SNP initialization support

On Mon, Jun 20, 2022 at 4:02 PM Ashish Kalra <[email protected]> wrote:
>
> From: Brijesh Singh <[email protected]>
>
> The memory integrity guarantees of SEV-SNP are enforced through a new
> structure called the Reverse Map Table (RMP). The RMP is a single data
> structure shared across the system that contains one entry for every 4K
> page of DRAM that may be used by SEV-SNP VMs. The goal of RMP is to
> track the owner of each page of memory. Pages of memory can be owned by
> the hypervisor, owned by a specific VM or owned by the AMD-SP. See APM2
> section 15.36.3 for more detail on RMP.
>
> The RMP table is used to enforce access control to memory. The table itself
> is not directly writable by the software. New CPU instructions (RMPUPDATE,
> PVALIDATE, RMPADJUST) are used to manipulate the RMP entries.
>
> Based on the platform configuration, the BIOS reserves the memory used
> for the RMP table. The start and end address of the RMP table must be
> queried by reading the RMP_BASE and RMP_END MSRs. If the RMP_BASE and
> RMP_END are not set then disable the SEV-SNP feature.
>
> The SEV-SNP feature is enabled only after the RMP table is successfully
> initialized.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/include/asm/disabled-features.h | 8 +-
> arch/x86/include/asm/msr-index.h | 6 +
> arch/x86/kernel/sev.c | 144 +++++++++++++++++++++++
> 3 files changed, 157 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
> index 36369e76cc63..c1be3091a383 100644
> --- a/arch/x86/include/asm/disabled-features.h
> +++ b/arch/x86/include/asm/disabled-features.h
> @@ -68,6 +68,12 @@
> # define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
> #endif
>
> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +# define DISABLE_SEV_SNP 0
> +#else
> +# define DISABLE_SEV_SNP (1 << (X86_FEATURE_SEV_SNP & 31))
> +#endif
> +
> /*
> * Make sure to add features to the correct mask
> */
> @@ -91,7 +97,7 @@
> DISABLE_ENQCMD)
> #define DISABLED_MASK17 0
> #define DISABLED_MASK18 0
> -#define DISABLED_MASK19 0
> +#define DISABLED_MASK19 (DISABLE_SEV_SNP)
> #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 20)
>
> #endif /* _ASM_X86_DISABLED_FEATURES_H */
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 9e2e7185fc1d..57a8280e283a 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -507,6 +507,8 @@
> #define MSR_AMD64_SEV_ENABLED BIT_ULL(MSR_AMD64_SEV_ENABLED_BIT)
> #define MSR_AMD64_SEV_ES_ENABLED BIT_ULL(MSR_AMD64_SEV_ES_ENABLED_BIT)
> #define MSR_AMD64_SEV_SNP_ENABLED BIT_ULL(MSR_AMD64_SEV_SNP_ENABLED_BIT)
> +#define MSR_AMD64_RMP_BASE 0xc0010132
> +#define MSR_AMD64_RMP_END 0xc0010133
>
> #define MSR_AMD64_VIRT_SPEC_CTRL 0xc001011f
>
> @@ -581,6 +583,10 @@
> #define MSR_AMD64_SYSCFG 0xc0010010
> #define MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT 23
> #define MSR_AMD64_SYSCFG_MEM_ENCRYPT BIT_ULL(MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT)
> +#define MSR_AMD64_SYSCFG_SNP_EN_BIT 24
> +#define MSR_AMD64_SYSCFG_SNP_EN BIT_ULL(MSR_AMD64_SYSCFG_SNP_EN_BIT)
> +#define MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT 25
> +#define MSR_AMD64_SYSCFG_SNP_VMPL_EN BIT_ULL(MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT)

nit: The alignment here looks off. The rest of the file left-aligns
the macro definition column under a comment header. The bad alignment
can be viewed on the github version of this patch:
https://github.com/AMDESE/linux/commit/5101daef92f448c046207b701c0c420b1fce3eaf

> #define MSR_K8_INT_PENDING_MSG 0xc0010055
> /* C1E active bits in int pending message */
> #define K8_INTP_C1E_ACTIVE_MASK 0x18000000
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index f01f4550e2c6..3a233b5d47c5 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -22,6 +22,8 @@
> #include <linux/efi.h>
> #include <linux/platform_device.h>
> #include <linux/io.h>
> +#include <linux/cpumask.h>
> +#include <linux/iommu.h>
>
> #include <asm/cpu_entry_area.h>
> #include <asm/stacktrace.h>
> @@ -38,6 +40,7 @@
> #include <asm/apic.h>
> #include <asm/cpuid.h>
> #include <asm/cmdline.h>
> +#include <asm/iommu.h>
>
> #define DR7_RESET_VALUE 0x400
>
> @@ -57,6 +60,12 @@
> #define AP_INIT_CR0_DEFAULT 0x60000010
> #define AP_INIT_MXCSR_DEFAULT 0x1f80
>
> +/*
> + * The first 16KB from the RMP_BASE is used by the processor for the
> + * bookkeeping, the range need to be added during the RMP entry lookup.
> + */
> +#define RMPTABLE_CPU_BOOKKEEPING_SZ 0x4000
> +
> /* For early boot hypervisor communication in SEV-ES enabled guests */
> static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>
> @@ -69,6 +78,10 @@ static struct ghcb *boot_ghcb __section(".data");
> /* Bitmap of SEV features supported by the hypervisor */
> static u64 sev_hv_features __ro_after_init;
>
> +static unsigned long rmptable_start __ro_after_init;
> +static unsigned long rmptable_end __ro_after_init;
> +
> +
> /* #VC handler runtime per-CPU data */
> struct sev_es_runtime_data {
> struct ghcb ghcb_page;
> @@ -2218,3 +2231,134 @@ static int __init snp_init_platform_device(void)
> return 0;
> }
> device_initcall(snp_init_platform_device);
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt) "SEV-SNP: " fmt
> +
> +static int __snp_enable(unsigned int cpu)
> +{
> + u64 val;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> + return 0;
> +
> + rdmsrl(MSR_AMD64_SYSCFG, val);
> +
> + val |= MSR_AMD64_SYSCFG_SNP_EN;
> + val |= MSR_AMD64_SYSCFG_SNP_VMPL_EN;
> +
> + wrmsrl(MSR_AMD64_SYSCFG, val);
> +
> + return 0;
> +}
> +
> +static __init void snp_enable(void *arg)
> +{
> + __snp_enable(smp_processor_id());
> +}
> +
> +static bool get_rmptable_info(u64 *start, u64 *len)
> +{
> + u64 calc_rmp_sz, rmp_sz, rmp_base, rmp_end, nr_pages;
> +
> + rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
> + rdmsrl(MSR_AMD64_RMP_END, rmp_end);
> +
> + if (!rmp_base || !rmp_end) {
> + pr_info("Memory for the RMP table has not been reserved by BIOS\n");
> + return false;
> + }
> +
> + rmp_sz = rmp_end - rmp_base + 1;
> +
> + /*
> + * Calculate the amount the memory that must be reserved by the BIOS to
> + * address the full system RAM. The reserved memory should also cover the
> + * RMP table itself.
> + *
> + * See PPR Family 19h Model 01h, Revision B1 section 2.1.4.2 for more
> + * information on memory requirement.
> + */
> + nr_pages = totalram_pages();
> + calc_rmp_sz = (((rmp_sz >> PAGE_SHIFT) + nr_pages) << 4) + RMPTABLE_CPU_BOOKKEEPING_SZ;
> +
> + if (calc_rmp_sz > rmp_sz) {
> + pr_info("Memory reserved for the RMP table does not cover full system RAM (expected 0x%llx got 0x%llx)\n",
> + calc_rmp_sz, rmp_sz);
> + return false;
> + }
> +
> + *start = rmp_base;
> + *len = rmp_sz;
> +
> + pr_info("RMP table physical address 0x%016llx - 0x%016llx\n", rmp_base, rmp_end);
> +
> + return true;
> +}
> +
> +static __init int __snp_rmptable_init(void)
> +{
> + u64 rmp_base, sz;
> + void *start;
> + u64 val;
> +
> + if (!get_rmptable_info(&rmp_base, &sz))
> + return 1;
> +
> + start = memremap(rmp_base, sz, MEMREMAP_WB);
> + if (!start) {
> + pr_err("Failed to map RMP table 0x%llx+0x%llx\n", rmp_base, sz);
> + return 1;
> + }
> +
> + /*
> + * Check if SEV-SNP is already enabled, this can happen if we are coming from
> + * kexec boot.
> + */
> + rdmsrl(MSR_AMD64_SYSCFG, val);
> + if (val & MSR_AMD64_SYSCFG_SNP_EN)
> + goto skip_enable;
> +
> + /* Initialize the RMP table to zero */
> + memset(start, 0, sz);
> +
> + /* Flush the caches to ensure that data is written before SNP is enabled. */
> + wbinvd_on_all_cpus();
> +
> + /* Enable SNP on all CPUs. */
> + on_each_cpu(snp_enable, NULL, 1);
> +
> +skip_enable:
> + rmptable_start = (unsigned long)start;
> + rmptable_end = rmptable_start + sz;
> +
> + return 0;
> +}
> +
> +static int __init snp_rmptable_init(void)
> +{
> + if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
> + return 0;
> +
> + if (!iommu_sev_snp_supported())
> + goto nosnp;
> +
> + if (__snp_rmptable_init())
> + goto nosnp;
> +
> + cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/rmptable_init:online", __snp_enable, NULL);
> +
> + return 0;
> +
> +nosnp:
> + setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
> + return 1;

Seems odd that we're returning 1 here, rather than 0. I tried to
figure out how the initcall return values are used and failed. My
impression was 0 means success and a negative number means failure.
But maybe this is normal.

> +}
> +
> +/*
> + * This must be called after the PCI subsystem. This is because before enabling
> + * the SNP feature we need to ensure that IOMMU supports the SEV-SNP feature.
> + * The iommu_sev_snp_support() is used for checking the feature, and it is
> + * available after subsys_initcall().
> + */
> +fs_initcall(snp_rmptable_init);
> --
> 2.25.1
>

2022-06-23 21:02:44

by Marc Orr

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 04/49] x86/sev: set SYSCFG.MFMD

On Mon, Jun 20, 2022 at 4:02 PM Ashish Kalra <[email protected]> wrote:
>
> From: Brijesh Singh <[email protected]>
>
> SEV-SNP FW >= 1.51 requires that SYSCFG.MFMD must be set.
>
> Subsequent CCP patches will require 1.51 as the minimum SEV-SNP
> firmware version.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/include/asm/msr-index.h | 3 +++
> arch/x86/kernel/sev.c | 24 ++++++++++++++++++++++++
> 2 files changed, 27 insertions(+)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 57a8280e283a..1e36f16daa56 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -587,6 +587,9 @@
> #define MSR_AMD64_SYSCFG_SNP_EN BIT_ULL(MSR_AMD64_SYSCFG_SNP_EN_BIT)
> #define MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT 25
> #define MSR_AMD64_SYSCFG_SNP_VMPL_EN BIT_ULL(MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT)
> +#define MSR_AMD64_SYSCFG_MFDM_BIT 19
> +#define MSR_AMD64_SYSCFG_MFDM BIT_ULL(MSR_AMD64_SYSCFG_MFDM_BIT)

nit: Similar to the previous patch, the alignment here doesn't look
right. The bad alignment can be viewed on the github version of this
patch:
https://github.com/AMDESE/linux/commit/6d4469b86f90e67119ff110230857788a0d9dbd0

> +
> #define MSR_K8_INT_PENDING_MSG 0xc0010055
> /* C1E active bits in int pending message */
> #define K8_INTP_C1E_ACTIVE_MASK 0x18000000
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 3a233b5d47c5..25c7feb367f6 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -2257,6 +2257,27 @@ static __init void snp_enable(void *arg)
> __snp_enable(smp_processor_id());
> }
>
> +static int __mfdm_enable(unsigned int cpu)
> +{
> + u64 val;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> + return 0;
> +
> + rdmsrl(MSR_AMD64_SYSCFG, val);
> +
> + val |= MSR_AMD64_SYSCFG_MFDM;

Can we do this inside `__snp_enable()`, above? Then, we'll execute if
a hotplug event happens as well.

static int __snp_enable(unsigned int cpu)
{
u64 val;

if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
return 0;

rdmsrl(MSR_AMD64_SYSCFG, val);

val |= MSR_AMD64_SYSCFG_SNP_EN;
val |= MSR_AMD64_SYSCFG_SNP_VMPL_EN;
val |= MSR_AMD64_SYSCFG_MFDM;

wrmsrl(MSR_AMD64_SYSCFG, val);

return 0;
}

> +
> + wrmsrl(MSR_AMD64_SYSCFG, val);
> +
> + return 0;
> +}
> +
> +static __init void mfdm_enable(void *arg)
> +{
> + __mfdm_enable(smp_processor_id());
> +}
> +
> static bool get_rmptable_info(u64 *start, u64 *len)
> {
> u64 calc_rmp_sz, rmp_sz, rmp_base, rmp_end, nr_pages;
> @@ -2325,6 +2346,9 @@ static __init int __snp_rmptable_init(void)
> /* Flush the caches to ensure that data is written before SNP is enabled. */
> wbinvd_on_all_cpus();
>
> + /* MFDM must be enabled on all the CPUs prior to enabling SNP. */
> + on_each_cpu(mfdm_enable, NULL, 1);
> +
> /* Enable SNP on all CPUs. */
> on_each_cpu(snp_enable, NULL, 1);
>
> --
> 2.25.1
>

2022-06-23 21:38:49

by Marc Orr

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On Mon, Jun 20, 2022 at 4:02 PM Ashish Kalra <[email protected]> wrote:
>
> From: Brijesh Singh <[email protected]>
>
> The snp_lookup_page_in_rmptable() can be used by the host to read the RMP
> entry for a given page. The RMP entry format is documented in AMD PPR, see
> https://bugzilla.kernel.org/attachment.cgi?id=296015.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/include/asm/sev.h | 27 ++++++++++++++++++++++++
> arch/x86/kernel/sev.c | 43 ++++++++++++++++++++++++++++++++++++++
> include/linux/sev.h | 30 ++++++++++++++++++++++++++
> 3 files changed, 100 insertions(+)
> create mode 100644 include/linux/sev.h
>
> diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
> index 9c2d33f1cfee..cb16f0e5b585 100644
> --- a/arch/x86/include/asm/sev.h
> +++ b/arch/x86/include/asm/sev.h
> @@ -9,6 +9,7 @@
> #define __ASM_ENCRYPTED_STATE_H
>
> #include <linux/types.h>
> +#include <linux/sev.h>
> #include <asm/insn.h>
> #include <asm/sev-common.h>
> #include <asm/bootparam.h>
> @@ -84,6 +85,32 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
>
> /* RMP page size */
> #define RMP_PG_SIZE_4K 0
> +#define RMP_TO_X86_PG_LEVEL(level) (((level) == RMP_PG_SIZE_4K) ? PG_LEVEL_4K : PG_LEVEL_2M)
> +
> +/*
> + * The RMP entry format is not architectural. The format is defined in PPR
> + * Family 19h Model 01h, Rev B1 processor.
> + */
> +struct __packed rmpentry {
> + union {
> + struct {
> + u64 assigned : 1,
> + pagesize : 1,
> + immutable : 1,
> + rsvd1 : 9,
> + gpa : 39,
> + asid : 10,
> + vmsa : 1,
> + validated : 1,
> + rsvd2 : 1;
> + } info;
> + u64 low;
> + };
> + u64 high;
> +};
> +
> +#define rmpentry_assigned(x) ((x)->info.assigned)
> +#define rmpentry_pagesize(x) ((x)->info.pagesize)
>
> #define RMPADJUST_VMSA_PAGE_BIT BIT(16)
>
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 25c7feb367f6..59e7ec6b0326 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -65,6 +65,8 @@
> * bookkeeping, the range need to be added during the RMP entry lookup.
> */
> #define RMPTABLE_CPU_BOOKKEEPING_SZ 0x4000
> +#define RMPENTRY_SHIFT 8
> +#define rmptable_page_offset(x) (RMPTABLE_CPU_BOOKKEEPING_SZ + (((unsigned long)x) >> RMPENTRY_SHIFT))
>
> /* For early boot hypervisor communication in SEV-ES enabled guests */
> static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
> @@ -2386,3 +2388,44 @@ static int __init snp_rmptable_init(void)
> * available after subsys_initcall().
> */
> fs_initcall(snp_rmptable_init);
> +
> +static struct rmpentry *__snp_lookup_rmpentry(u64 pfn, int *level)
> +{
> + unsigned long vaddr, paddr = pfn << PAGE_SHIFT;
> + struct rmpentry *entry, *large_entry;
> +
> + if (!pfn_valid(pfn))
> + return ERR_PTR(-EINVAL);
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> + return ERR_PTR(-ENXIO);

nit: I think we should check if SNP is enabled first, before doing
anything else. In other words, I think we should move this check above
the `!pfn_valid()` check.

> +
> + vaddr = rmptable_start + rmptable_page_offset(paddr);
> + if (unlikely(vaddr > rmptable_end))
> + return ERR_PTR(-ENXIO);

nit: It would be nice to use a different error code here, from the SNP
feature check. That way, if this function fails, it's easier to
diagnose where the function failed from the error code.
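
For example (a sketch only; -EFAULT here is just an arbitrary code distinct from the -ENXIO used for the feature check):

        if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
                return ERR_PTR(-ENXIO);

        if (!pfn_valid(pfn))
                return ERR_PTR(-EINVAL);

        vaddr = rmptable_start + rmptable_page_offset(paddr);
        if (unlikely(vaddr > rmptable_end))
                return ERR_PTR(-EFAULT);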

> +
> + entry = (struct rmpentry *)vaddr;
> +
> + /* Read a large RMP entry to get the correct page level used in RMP entry. */
> + vaddr = rmptable_start + rmptable_page_offset(paddr & PMD_MASK);
> + large_entry = (struct rmpentry *)vaddr;
> + *level = RMP_TO_X86_PG_LEVEL(rmpentry_pagesize(large_entry));
> +
> + return entry;
> +}
> +
> +/*
> + * Return 1 if the RMP entry is assigned, 0 if it exists but is not assigned,
> + * and -errno if there is no corresponding RMP entry.
> + */
> +int snp_lookup_rmpentry(u64 pfn, int *level)
> +{
> + struct rmpentry *e;
> +
> + e = __snp_lookup_rmpentry(pfn, level);
> + if (IS_ERR(e))
> + return PTR_ERR(e);
> +
> + return !!rmpentry_assigned(e);
> +}
> +EXPORT_SYMBOL_GPL(snp_lookup_rmpentry);
> diff --git a/include/linux/sev.h b/include/linux/sev.h
> new file mode 100644
> index 000000000000..1a68842789e1
> --- /dev/null
> +++ b/include/linux/sev.h
> @@ -0,0 +1,30 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * AMD Secure Encrypted Virtualization
> + *
> + * Author: Brijesh Singh <[email protected]>
> + */
> +
> +#ifndef __LINUX_SEV_H
> +#define __LINUX_SEV_H
> +
> +/* RMUPDATE detected 4K page and 2MB page overlap. */
> +#define RMPUPDATE_FAIL_OVERLAP 7
> +
> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +int snp_lookup_rmpentry(u64 pfn, int *level);
> +int psmash(u64 pfn);
> +int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid, bool immutable);
> +int rmp_make_shared(u64 pfn, enum pg_level level);

nit: I think the declarations for `psmash()`, `rmp_make_private()`,
and `rmp_make_shared()` should be introduced in the patches that have
their definitions.

> +#else
> +static inline int snp_lookup_rmpentry(u64 pfn, int *level) { return 0; }
> +static inline int psmash(u64 pfn) { return -ENXIO; }
> +static inline int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid,
> + bool immutable)
> +{
> + return -ENODEV;
> +}
> +static inline int rmp_make_shared(u64 pfn, enum pg_level level) { return -ENODEV; }
> +
> +#endif /* CONFIG_AMD_MEM_ENCRYPT */
> +#endif /* __LINUX_SEV_H */
> --
> 2.25.1
>

2022-06-23 22:24:38

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 03/49] x86/sev: Add the host SEV-SNP initialization support

[AMD Official Use Only - General]

>> +static int __init snp_rmptable_init(void) {
>> + if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
>> + return 0;
>> +
>> + if (!iommu_sev_snp_supported())
>> + goto nosnp;
>> +
>> + if (__snp_rmptable_init())
>> + goto nosnp;
>> +
>> + cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
>> + "x86/rmptable_init:online", __snp_enable, NULL);
>> +
>> + return 0;
>> +
>> +nosnp:
>> + setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
>> + return 1;

>Seems odd that we're returning 1 here, rather than 0. I tried to figure out how the initcall return values are used and failed. My impression was 0 means success and a negative number means failure.
>But maybe this is normal.

I think that initcall return values are typically ignored, but the function should still return 0 on success and a negative value on error. So we should probably fix this to return something like -ENOSYS instead of 1.
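
i.e. something like:

nosnp:
        setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
        return -ENOSYS;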

Thanks,
Ashish

2022-06-23 22:40:10

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On Wed, Jun 22, 2022, Kalra, Ashish wrote:
> On 6/22/22 12:43, Kalra, Ashish wrote:
> >>> I think that needs to be fixed. It should be as simple as a
> >>> model/family check, though. If someone (for example) attempts to use
> >>> SNP (and thus snp_lookup_rmpentry() and dump_rmpentry()) code on a
> >>> newer CPU, the kernel should refuse.
> >> More specifically I am thinking of adding RMP entry field accessors so
> >> that they can do this cpu model/family check and return the correct
> >> field as per processor architecture.
>
> >That will be helpful down the road when there's more than one format. But,
> >the real issue is that the kernel doesn't *support* a different RMP format.
> >So, the SNP support should be disabled when encountering a model/family
> >other than the known good one.
>
> Yes, that makes sense, will add an additional check in snp_rmptable_init().

And as I suggested in v5[*], bury the microarchitectural struct in sev.c so that
nothing outside of the few bits of SNP code that absolutely need to know the layout
of the struct should even be aware that there's a struct overlay for RMP entries.

[*] https://lore.kernel.org/all/[email protected]

2022-06-23 22:43:59

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

[AMD Official Use Only - General]

>> On 6/22/22 12:43, Kalra, Ashish wrote:
>> >>> I think that needs to be fixed. It should be as simple as a
>> >>> model/family check, though. If someone (for example) attempts to
>> >>> use SNP (and thus snp_lookup_rmpentry() and dump_rmpentry()) code
>> >>> on a newer CPU, the kernel should refuse.
>> >> More specifically I am thinking of adding RMP entry field accessors
>> >> so that they can do this cpu model/family check and return the
>> >> correct field as per processor architecture.
>>
>> >That will be helpful down the road when there's more than one format.
>> >But, the real issue is that the kernel doesn't *support* a different RMP format.
>> >So, the SNP support should be disabled when encountering a
>> >model/family other than the known good one.
>>
>> Yes, that makes sense, will add an additional check in snp_rmptable_init().

>And as I suggested in v5[*], bury the microarchitectural struct in sev.c so that nothing outside of the few bits of SNP code that absolutely need to know the layout of the struct should even be aware that there's a struct overlay for RMP entries.

Yes, that's a nice way to hide it from the rest of the kernel, which does not need access to this structure anyway; in essence, it becomes a private structure.

Thanks,
Ashish

>[*] https://lore.kernel.org/all/[email protected]

2022-06-24 14:20:55

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Tue, Jun 21, 2022 at 2:17 PM Kalra, Ashish <[email protected]> wrote:
>
> [Public]
>
> Hello Peter,
>
> >> +static int snp_reclaim_pages(unsigned long pfn, unsigned int npages,
> >> +bool locked) {
> >> + struct sev_data_snp_page_reclaim data;
> >> + int ret, err, i, n = 0;
> >> +
> >> + for (i = 0; i < npages; i++) {
>
> >What about setting |n| here too, also the other increments.
>
> >for (i = 0, n = 0; i < npages; i++, n++, pfn++)
>
> Yes that is simpler.
>
> >> + memset(&data, 0, sizeof(data));
> >> + data.paddr = pfn << PAGE_SHIFT;
> >> +
> >> + if (locked)
> >> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
> >> + else
> >> + ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM,
> >> + &data, &err);
>
> > Can we change `sev_cmd_mutex` to some sort of nesting lock type? That could clean up this if (locked) code.
>
> > +static inline int rmp_make_firmware(unsigned long pfn, int level) {
> > + return rmp_make_private(pfn, 0, level, 0, true); }
> > +
> > +static int snp_set_rmp_state(unsigned long paddr, unsigned int npages, bool to_fw, bool locked,
> > + bool need_reclaim)
>
> >This function can do a lot, and when I read the call sites it's hard to see what it's doing since we have a combination of arguments which tell us what behavior is happening, some of which are not valid (ex: to_fw == true and need_reclaim == true is an invalid argument combination).
>
> to_fw is used to make a firmware page and need_reclaim is for freeing the firmware page, so they are going to be mutually exclusive.
>
> I can actually connect it quite logically with the callers:
> snp_alloc_firmware_pages will call with to_fw = true and need_reclaim = false
> and snp_free_firmware_pages will do the opposite, to_fw = false and need_reclaim = true.
>
> That seems straightforward to look at.

This might be a preference thing but I find it not straightforward.
When I am reading through unmap_firmware_writeable() and I see

/* Transition the pre-allocated buffer to the firmware state. */
if (snp_set_rmp_state(__pa(map->host), npages, true, true, false))
return -EFAULT;

I don't actually know what snp_set_rmp_state() is doing unless I go
look at the definition and see what all those booleans mean. This is
unlike the rmp_make_shared() and rmp_make_private() functions, each of
which tells me a lot more about what the function will do just from
the name.


>
> >Also this for loop over |npages| is duplicated from snp_reclaim_pages(). One improvement here is that on the current
> >snp_reclaim_pages() if we fail to reclaim a page we assume we cannot reclaim the next pages; this may cause us to snp_leak_pages() more pages than we actually need to.
>
> Yes that is true.
>
> >What about something like this?
>
> >static snp_leak_page(u64 pfn, enum pg_level level) {
> > memory_failure(pfn, 0);
> > dump_rmpentry(pfn);
> >}
>
> >static int snp_reclaim_page(u64 pfn, enum pg_level level) {
> > int ret;
> > struct sev_data_snp_page_reclaim data;
>
> > ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
> > if (ret)
> > goto cleanup;
>
> > ret = rmp_make_shared(pfn, level);
> > if (ret)
> > goto cleanup;
>
> > return 0;
>
> >cleanup:
> > snp_leak_page(pfn, level)
> >}
>
> >typedef int (*rmp_state_change_func) (u64 pfn, enum pg_level level);
>
> >static int snp_set_rmp_state(unsigned long paddr, unsigned int npages, rmp_state_change_func state_change, rmp_state_change_func cleanup) {
> > struct sev_data_snp_page_reclaim data;
> > int ret, err, i, n = 0;
>
> > for (i = 0, n = 0; i < npages; i++, n++, pfn++) {
> > ret = state_change(pfn, PG_LEVEL_4K)
> > if (ret)
> > goto cleanup;
> > }
>
> > return 0;
>
> > cleanup:
> > for (; i>= 0; i--, n--, pfn--) {
> > cleanup(pfn, PG_LEVEL_4K);
> > }
>
> > return ret;
> >}
>
> >Then inside of __snp_alloc_firmware_pages():
>
> >snp_set_rmp_state(paddr, npages, rmp_make_firmware, snp_reclaim_page);
>
> >And inside of __snp_free_firmware_pages():
>
> >snp_set_rmp_state(paddr, npages, snp_reclaim_page, snp_leak_page);
>
> >Just a suggestion feel free to ignore. The readability comment could be addressed much less invasively by just making separate functions for each valid combination of arguments here. Like snp_set_rmp_fw_state(), snp_set_rmp_shared_state(),
> >snp_set_rmp_release_state() or something.
>
> >> +static struct page *__snp_alloc_firmware_pages(gfp_t gfp_mask, int
> >> +order, bool locked) {
> >> + unsigned long npages = 1ul << order, paddr;
> >> + struct sev_device *sev;
> >> + struct page *page;
> >> +
> >> + if (!psp_master || !psp_master->sev_data)
> >> + return NULL;
> >> +
> >> + page = alloc_pages(gfp_mask, order);
> >> + if (!page)
> >> + return NULL;
> >> +
> >> + /* If SEV-SNP is initialized then add the page in RMP table. */
> >> + sev = psp_master->sev_data;
> >> + if (!sev->snp_inited)
> >> + return page;
> >> +
> >> + paddr = __pa((unsigned long)page_address(page));
> >> + if (snp_set_rmp_state(paddr, npages, true, locked, false))
> >> + return NULL;
>
> >So what about the case where snp_set_rmp_state() fails but we were able to reclaim all the pages? Should we be able to signal that to callers so that we could free |page| here? But given this is an error path already maybe we can optimize this in a follow up series.
>
> Yes, we should actually tie in to snp_reclaim_pages() success or failure here in the case we were able to successfully unroll some or all of the firmware state change.
>
> > +
> > + return page;
> > +}
> > +
> > +void *snp_alloc_firmware_page(gfp_t gfp_mask) {
> > + struct page *page;
> > +
> > + page = __snp_alloc_firmware_pages(gfp_mask, 0, false);
> > +
> > + return page ? page_address(page) : NULL; }
> > +EXPORT_SYMBOL_GPL(snp_alloc_firmware_page);
> > +
> > +static void __snp_free_firmware_pages(struct page *page, int order,
> > +bool locked) {
> > + unsigned long paddr, npages = 1ul << order;
> > +
> > + if (!page)
> > + return;
> > +
> > + paddr = __pa((unsigned long)page_address(page));
> > + if (snp_set_rmp_state(paddr, npages, false, locked, true))
> > + return;
>
> > Here we may be able to free some of |page| depending on where inside snp_set_rmp_state() we failed. But again, given this is an error path already, maybe we can optimize this in a follow up series.
>
> Yes, we probably should be able to free some of the page(s) depending on how many page(s) got reclaimed in snp_set_rmp_state().
> But these reclamation failures may not be very common, so any failure is indicative of a bigger issue; if a single page reclamation fails, it will likely fail for all the subsequent
> pages too, so it is simpler to follow one recovery procedure rather than handling a more complex recovery for a chunk of pages being reclaimed and another chunk not.
>
> Thanks,
> Ashish
>
>
>

2022-06-24 14:35:00

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 26/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command

On Mon, Jun 20, 2022 at 5:08 PM Ashish Kalra <[email protected]> wrote:
>
> From: Brijesh Singh <[email protected]>
>
> The KVM_SEV_SNP_LAUNCH_UPDATE command can be used to insert data into the
> guest's memory. The data is encrypted with the cryptographic context
> created with the KVM_SEV_SNP_LAUNCH_START.
>
> In addition to inserting data, it can insert two special pages
> into the guest's memory: the secrets page and the CPUID page.
>
> While terminating the guest, reclaim the guest pages added in the RMP
> table. If the reclaim fails, then the page is no longer safe to be
> released back to the system and leak them.
>
> For more information see the SEV-SNP specification.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> .../virt/kvm/x86/amd-memory-encryption.rst | 29 +++
> arch/x86/kvm/svm/sev.c | 187 ++++++++++++++++++
> include/uapi/linux/kvm.h | 19 ++
> 3 files changed, 235 insertions(+)
>
> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> index 878711f2dca6..62abd5c1f72b 100644
> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> @@ -486,6 +486,35 @@ Returns: 0 on success, -negative on error
>
> See the SEV-SNP specification for further detail on the launch input.
>
> +20. KVM_SNP_LAUNCH_UPDATE
> +-------------------------
> +
> +The KVM_SNP_LAUNCH_UPDATE is used for encrypting a memory region. It also
> +calculates a measurement of the memory contents. The measurement is a signature
> +of the memory contents that can be sent to the guest owner as an attestation
> +that the memory was encrypted correctly by the firmware.
> +
> +Parameters (in): struct kvm_snp_launch_update
> +
> +Returns: 0 on success, -negative on error
> +
> +::
> +
> + struct kvm_sev_snp_launch_update {
> + __u64 start_gfn; /* Guest page number to start from. */
> + __u64 uaddr; /* userspace address need to be encrypted */
> + __u32 len; /* length of memory region */
> + __u8 imi_page; /* 1 if memory is part of the IMI */
> + __u8 page_type; /* page type */
> + __u8 vmpl3_perms; /* VMPL3 permission mask */
> + __u8 vmpl2_perms; /* VMPL2 permission mask */
> + __u8 vmpl1_perms; /* VMPL1 permission mask */
> + };
> +
> +See the SEV-SNP spec for further details on how to build the VMPL permission
> +mask and page type.
> +
> +
> References
> ==========
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 41b83aa6b5f4..b5f0707d7ed6 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -18,6 +18,7 @@
> #include <linux/processor.h>
> #include <linux/trace_events.h>
> #include <linux/hugetlb.h>
> +#include <linux/sev.h>
>
> #include <asm/pkru.h>
> #include <asm/trapnr.h>
> @@ -233,6 +234,49 @@ static void sev_decommission(unsigned int handle)
> sev_guest_decommission(&decommission, NULL);
> }
>
> +static inline void snp_leak_pages(u64 pfn, enum pg_level level)
> +{
> + unsigned int npages = page_level_size(level) >> PAGE_SHIFT;
> +
> + WARN(1, "psc failed pfn 0x%llx pages %d (leaking)\n", pfn, npages);
> +
> + while (npages) {
> + memory_failure(pfn, 0);
> + dump_rmpentry(pfn);
> + npages--;
> + pfn++;
> + }
> +}

Should this be deduplicated with the snp_leak_pages() in "crypto: ccp:
Handle the legacy TMR allocation when SNP is enabled" ?

> +
> +static int snp_page_reclaim(u64 pfn)
> +{
> + struct sev_data_snp_page_reclaim data = {0};
> + int err, rc;
> +
> + data.paddr = __sme_set(pfn << PAGE_SHIFT);
> + rc = snp_guest_page_reclaim(&data, &err);
> + if (rc) {
> + /*
> + * If the reclaim failed, then page is no longer safe
> + * to use.
> + */
> + snp_leak_pages(pfn, PG_LEVEL_4K);
> + }
> +
> + return rc;
> +}
> +
> +static int host_rmp_make_shared(u64 pfn, enum pg_level level, bool leak)
> +{
> + int rc;
> +
> + rc = rmp_make_shared(pfn, level);
> + if (rc && leak)
> + snp_leak_pages(pfn, level);
> +
> + return rc;
> +}
> +
> static void sev_unbind_asid(struct kvm *kvm, unsigned int handle)
> {
> struct sev_data_deactivate deactivate;
> @@ -1902,6 +1946,123 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
> return rc;
> }
>
> +static bool is_hva_registered(struct kvm *kvm, hva_t hva, size_t len)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct list_head *head = &sev->regions_list;
> + struct enc_region *i;
> +
> + lockdep_assert_held(&kvm->lock);
> +
> + list_for_each_entry(i, head, list) {
> + u64 start = i->uaddr;
> + u64 end = start + i->size;
> +
> + if (start <= hva && end >= (hva + len))
> + return true;
> + }

Given that userspace could load sev->regions_list with any number of
regions of any size, should we add a cond_resched() like in
sev_vm_destroy()?
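
For example (sketch):

        list_for_each_entry(i, head, list) {
                u64 start = i->uaddr;
                u64 end = start + i->size;

                /* regions_list can be arbitrarily long, don't hog the CPU. */
                cond_resched();

                if (start <= hva && end >= (hva + len))
                        return true;
        }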

> +
> + return false;
> +}
> +
> +static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_launch_update data = {0};
> + struct kvm_sev_snp_launch_update params;
> + unsigned long npages, pfn, n = 0;
> + int *error = &argp->error;
> + struct page **inpages;
> + int ret, i, level;
> + u64 gfn;
> +
> + if (!sev_snp_guest(kvm))
> + return -ENOTTY;
> +
> + if (!sev->snp_context)
> + return -EINVAL;
> +
> + if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
> + return -EFAULT;
> +
> + /* Verify that the specified address range is registered. */
> + if (!is_hva_registered(kvm, params.uaddr, params.len))
> + return -EINVAL;
> +
> + /*
> + * The userspace memory is already locked so technically we don't
> + * need to lock it again. Later part of the function needs to know
> + * pfn so call the sev_pin_memory() so that we can get the list of
> + * pages to iterate through.
> + */
> + inpages = sev_pin_memory(kvm, params.uaddr, params.len, &npages, 1);
> + if (!inpages)
> + return -ENOMEM;
> +
> + /*
> + * Verify that all the pages are marked shared in the RMP table before
> + * going further. This is avoid the cases where the userspace may try
> + * updating the same page twice.
> + */
> + for (i = 0; i < npages; i++) {
> + if (snp_lookup_rmpentry(page_to_pfn(inpages[i]), &level) != 0) {
> + sev_unpin_memory(kvm, inpages, npages);
> + return -EFAULT;
> + }
> + }
> +
> + gfn = params.start_gfn;
> + level = PG_LEVEL_4K;
> + data.gctx_paddr = __psp_pa(sev->snp_context);
> +
> + for (i = 0; i < npages; i++) {
> + pfn = page_to_pfn(inpages[i]);
> +
> + ret = rmp_make_private(pfn, gfn << PAGE_SHIFT, level, sev_get_asid(kvm), true);
> + if (ret) {
> + ret = -EFAULT;
> + goto e_unpin;
> + }
> +
> + n++;
> + data.address = __sme_page_pa(inpages[i]);
> + data.page_size = X86_TO_RMP_PG_LEVEL(level);
> + data.page_type = params.page_type;
> + data.vmpl3_perms = params.vmpl3_perms;
> + data.vmpl2_perms = params.vmpl2_perms;
> + data.vmpl1_perms = params.vmpl1_perms;
> + ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, &data, error);
> + if (ret) {
> + /*
> + * If the command failed then need to reclaim the page.
> + */
> + snp_page_reclaim(pfn);
> + goto e_unpin;
> + }
> +
> + gfn++;
> + }
> +
> +e_unpin:
> + /* Content of memory is updated, mark pages dirty */
> + for (i = 0; i < n; i++) {

Since |n| is not only a loop variable but actually carries the number
of private pages over to e_unpin can we use a more descriptive name?
How about something like 'nprivate_pages'?

> + set_page_dirty_lock(inpages[i]);
> + mark_page_accessed(inpages[i]);
> +
> + /*
> + * If its an error, then update RMP entry to change page ownership
> + * to the hypervisor.
> + */
> + if (ret)
> + host_rmp_make_shared(pfn, level, true);
> + }
> +
> + /* Unlock the user pages */
> + sev_unpin_memory(kvm, inpages, npages);
> +
> + return ret;
> +}
> +
> int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_sev_cmd sev_cmd;
> @@ -1995,6 +2156,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> case KVM_SEV_SNP_LAUNCH_START:
> r = snp_launch_start(kvm, &sev_cmd);
> break;
> + case KVM_SEV_SNP_LAUNCH_UPDATE:
> + r = snp_launch_update(kvm, &sev_cmd);
> + break;
> default:
> r = -EINVAL;
> goto out;
> @@ -2113,6 +2277,29 @@ find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
> static void __unregister_enc_region_locked(struct kvm *kvm,
> struct enc_region *region)
> {
> + unsigned long i, pfn;
> + int level;
> +
> + /*
> + * The guest memory pages are assigned in the RMP table. Unassign it
> + * before releasing the memory.
> + */
> + if (sev_snp_guest(kvm)) {
> + for (i = 0; i < region->npages; i++) {
> + pfn = page_to_pfn(region->pages[i]);
> +
> + if (!snp_lookup_rmpentry(pfn, &level))
> + continue;
> +
> + cond_resched();
> +
> + if (level > PG_LEVEL_4K)
> + pfn &= ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
> +
> + host_rmp_make_shared(pfn, level, true);
> + }
> + }
> +
> sev_unpin_memory(kvm, region->pages, region->npages);
> list_del(&region->list);
> kfree(region);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 0cb119d66ae5..9b36b07414ea 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1813,6 +1813,7 @@ enum sev_cmd_id {
> /* SNP specific commands */
> KVM_SEV_SNP_INIT,
> KVM_SEV_SNP_LAUNCH_START,
> + KVM_SEV_SNP_LAUNCH_UPDATE,
>
> KVM_SEV_NR_MAX,
> };
> @@ -1929,6 +1930,24 @@ struct kvm_sev_snp_launch_start {
> __u8 pad[6];
> };
>
> +#define KVM_SEV_SNP_PAGE_TYPE_NORMAL 0x1
> +#define KVM_SEV_SNP_PAGE_TYPE_VMSA 0x2
> +#define KVM_SEV_SNP_PAGE_TYPE_ZERO 0x3
> +#define KVM_SEV_SNP_PAGE_TYPE_UNMEASURED 0x4
> +#define KVM_SEV_SNP_PAGE_TYPE_SECRETS 0x5
> +#define KVM_SEV_SNP_PAGE_TYPE_CPUID 0x6
> +
> +struct kvm_sev_snp_launch_update {
> + __u64 start_gfn;
> + __u64 uaddr;
> + __u32 len;
> + __u8 imi_page;
> + __u8 page_type;
> + __u8 vmpl3_perms;
> + __u8 vmpl2_perms;
> + __u8 vmpl1_perms;
> +};
> +
> #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
> #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
> #define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
> --
> 2.25.1
>

2022-06-24 14:45:31

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 24/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command

>
> +19. KVM_SNP_LAUNCH_START
> +------------------------
> +
> +The KVM_SNP_LAUNCH_START command is used for creating the memory encryption
> +context for the SEV-SNP guest. To create the encryption context, user must
> +provide a guest policy, migration agent (if any) and guest OS visible
> +workarounds value as defined SEV-SNP specification.
> +
> +Parameters (in): struct kvm_snp_launch_start
> +
> +Returns: 0 on success, -negative on error
> +
> +::
> +
> + struct kvm_sev_snp_launch_start {
> + __u64 policy; /* Guest policy to use. */
> + __u64 ma_uaddr; /* userspace address of migration agent */
> + __u8 ma_en; /* 1 if the migtation agent is enabled */

migration

> + __u8 imi_en; /* set IMI to 1. */
> + __u8 gosvw[16]; /* guest OS visible workarounds */
> + };
> +
> +See the SEV-SNP specification for further detail on the launch input.
> +
> References
> ==========
>

>
> +static int snp_decommission_context(struct kvm *kvm)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_decommission data = {};
> + int ret;
> +
> + /* If context is not created then do nothing */
> + if (!sev->snp_context)
> + return 0;
> +
> + data.gctx_paddr = __sme_pa(sev->snp_context);
> + ret = snp_guest_decommission(&data, NULL);

Do we have a similar race like in sev_unbind_asid() with DEACTIVATE
and WBINVD/DF_FLUSH? The SNP_DECOMMISSION spec looks quite similar to
DEACTIVATE.

> + if (WARN_ONCE(ret, "failed to release guest context"))
> + return ret;
> +
> + /* free the context page now */
> + snp_free_firmware_page(sev->snp_context);
> + sev->snp_context = NULL;
> +
> + return 0;
> +}
> +
> void sev_vm_destroy(struct kvm *kvm)
> {
> struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;

2022-06-24 15:24:01

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 35/49] KVM: SVM: Remove the long-lived GHCB host map

On Mon, Jun 20, 2022 at 5:11 PM Ashish Kalra <[email protected]> wrote:
>
> From: Brijesh Singh <[email protected]>
>
> On VMGEXIT, sev_handle_vmgexit() creates a host mapping for the GHCB GPA,
> and unmaps it just before VM-entry. This long-lived GHCB map is used by
> the VMGEXIT handler through accessors such as ghcb_{set_get}_xxx().
>
> A long-lived GHCB map can cause issue when SEV-SNP is enabled. When
> SEV-SNP is enabled the mapped GPA needs to be protected against a page
> state change.
>
> To eliminate the long-lived GHCB mapping, update the GHCB sync operations
> to explicitly map the GHCB before access and unmap it after access is
> complete. This requires that the setting of the GHCBs sw_exit_info_{1,2}
> fields be done during sev_es_sync_to_ghcb(), so create two new fields in
> the vcpu_svm struct to hold these values when required to be set outside
> of the GHCB mapping.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 131 ++++++++++++++++++++++++++---------------
> arch/x86/kvm/svm/svm.c | 12 ++--
> arch/x86/kvm/svm/svm.h | 24 +++++++-
> 3 files changed, 111 insertions(+), 56 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 01ea257e17d6..c70f3f7e06a8 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2823,15 +2823,40 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
> kvfree(svm->sev_es.ghcb_sa);
> }
>
> +static inline int svm_map_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map)
> +{
> + struct vmcb_control_area *control = &svm->vmcb->control;
> + u64 gfn = gpa_to_gfn(control->ghcb_gpa);
> +
> + if (kvm_vcpu_map(&svm->vcpu, gfn, map)) {
> + /* Unable to map GHCB from guest */
> + pr_err("error mapping GHCB GFN [%#llx] from guest\n", gfn);
> + return -EFAULT;
> + }
> +
> + return 0;
> +}

There is a perf cost to this suggestion but it might make accessing
the GHCB safer for KVM. Have you thought about just using
kvm_read_guest() or copy_from_user() to fully copy out the GHCB into a
KVM-owned buffer, then copying it back before the VMRUN? That way
KVM doesn't need to guard against page state changes on the GHCBs;
that could be a perf improvement in a follow up.

Since we cannot unmap GHCBs I don't think UPM will help here so we
probably want to make these patches safe against malicious guests
making GHCBs private. But maybe UPM does help?
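
Roughly what I have in mind (just a sketch; ghcb_copy would be a new KVM-owned per-vCPU buffer, not something that exists in this series):

static int svm_copy_in_ghcb(struct vcpu_svm *svm)
{
        struct vmcb_control_area *control = &svm->vmcb->control;

        /* Snapshot the guest's GHCB into a KVM-owned buffer before using it. */
        return kvm_read_guest(svm->vcpu.kvm, control->ghcb_gpa,
                              svm->sev_es.ghcb_copy, sizeof(struct ghcb));
}

The mirror copy back to the guest with kvm_write_guest() would then happen just before VMRUN.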

2022-06-24 16:32:19

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 42/49] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event

On Mon, Jun 20, 2022 at 5:13 PM Ashish Kalra <[email protected]> wrote:
>
> From: Brijesh Singh <[email protected]>
>
> Version 2 of GHCB specification added the support for two SNP Guest
> Request Message NAE events. The events allows for an SEV-SNP guest to
> make request to the SEV-SNP firmware through hypervisor using the
> SNP_GUEST_REQUEST API define in the SEV-SNP firmware specification.
>
> The SNP_EXT_GUEST_REQUEST is similar to SNP_GUEST_REQUEST with the
> difference of an additional certificate blob that can be passed through
> the SNP_SET_CONFIG ioctl defined in the CCP driver. The CCP driver
> provides snp_guest_ext_guest_request() that is used by the KVM to get
> both the report and certificate data at once.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 196 +++++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/svm/svm.h | 2 +
> 2 files changed, 192 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 7fc0fad87054..089af21a4efe 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -343,6 +343,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
>
> spin_lock_init(&sev->psc_lock);
> ret = sev_snp_init(&argp->error);
> + mutex_init(&sev->guest_req_lock);
> } else {
> ret = sev_platform_init(&argp->error);
> }
> @@ -1884,23 +1885,39 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
>
> static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
> {
> + void *context = NULL, *certs_data = NULL, *resp_page = NULL;

Is the NULL setting here unnecessary since all of these are set via
functions snp_alloc_firmware_page(), kmalloc(), and
snp_alloc_firmware_page() respectively?

> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> struct sev_data_snp_gctx_create data = {};
> - void *context;
> int rc;
>
> + /* Allocate memory used for the certs data in SNP guest request */
> + certs_data = kmalloc(SEV_FW_BLOB_MAX_SIZE, GFP_KERNEL_ACCOUNT);
> + if (!certs_data)
> + return NULL;

I think we want to use kzalloc() here to ensure we never give the
guest uninitialized kernel memory.
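
i.e. (sketch):

        certs_data = kzalloc(SEV_FW_BLOB_MAX_SIZE, GFP_KERNEL_ACCOUNT);
        if (!certs_data)
                return NULL;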

> +
> /* Allocate memory for context page */
> context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
> if (!context)
> - return NULL;
> + goto e_free;
> +
> + /* Allocate a firmware buffer used during the guest command handling. */
> + resp_page = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
> + if (!resp_page)
> + goto e_free;

|resp_page| doesn't appear to be used anywhere?

>
> data.gctx_paddr = __psp_pa(context);
> rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
> - if (rc) {
> - snp_free_firmware_page(context);
> - return NULL;
> - }
> + if (rc)
> + goto e_free;
> +
> + sev->snp_certs_data = certs_data;
>
> return context;
> +
> +e_free:
> + snp_free_firmware_page(context);
> + kfree(certs_data);
> + return NULL;
> }
>
> static int snp_bind_asid(struct kvm *kvm, int *error)
> @@ -2565,6 +2582,8 @@ static int snp_decommission_context(struct kvm *kvm)
> snp_free_firmware_page(sev->snp_context);
> sev->snp_context = NULL;
>
> + kfree(sev->snp_certs_data);
> +
> return 0;
> }
>
> @@ -3077,6 +3096,8 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm, u64 *exit_code)
> case SVM_VMGEXIT_UNSUPPORTED_EVENT:
> case SVM_VMGEXIT_HV_FEATURES:
> case SVM_VMGEXIT_PSC:
> + case SVM_VMGEXIT_GUEST_REQUEST:
> + case SVM_VMGEXIT_EXT_GUEST_REQUEST:
> break;
> default:
> reason = GHCB_ERR_INVALID_EVENT;
> @@ -3502,6 +3523,155 @@ static unsigned long snp_handle_page_state_change(struct vcpu_svm *svm)
> return rc ? map_to_psc_vmgexit_code(rc) : 0;
> }
>
> +static unsigned long snp_setup_guest_buf(struct vcpu_svm *svm,
> + struct sev_data_snp_guest_request *data,
> + gpa_t req_gpa, gpa_t resp_gpa)
> +{
> + struct kvm_vcpu *vcpu = &svm->vcpu;
> + struct kvm *kvm = vcpu->kvm;
> + kvm_pfn_t req_pfn, resp_pfn;
> + struct kvm_sev_info *sev;
> +
> + sev = &to_kvm_svm(kvm)->sev_info;

This is normally done at declaration in this file. Why not here?

struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;

> +
> + if (!IS_ALIGNED(req_gpa, PAGE_SIZE) || !IS_ALIGNED(resp_gpa, PAGE_SIZE))
> + return SEV_RET_INVALID_PARAM;
> +
> + req_pfn = gfn_to_pfn(kvm, gpa_to_gfn(req_gpa));
> + if (is_error_noslot_pfn(req_pfn))
> + return SEV_RET_INVALID_ADDRESS;
> +
> + resp_pfn = gfn_to_pfn(kvm, gpa_to_gfn(resp_gpa));
> + if (is_error_noslot_pfn(resp_pfn))
> + return SEV_RET_INVALID_ADDRESS;
> +
> + if (rmp_make_private(resp_pfn, 0, PG_LEVEL_4K, 0, true))
> + return SEV_RET_INVALID_ADDRESS;
> +
> + data->gctx_paddr = __psp_pa(sev->snp_context);
> + data->req_paddr = __sme_set(req_pfn << PAGE_SHIFT);
> + data->res_paddr = __sme_set(resp_pfn << PAGE_SHIFT);
> +
> + return 0;
> +}
> +
> +static void snp_cleanup_guest_buf(struct sev_data_snp_guest_request *data, unsigned long *rc)
> +{
> + u64 pfn = __sme_clr(data->res_paddr) >> PAGE_SHIFT;
> + int ret;
> +
> + ret = snp_page_reclaim(pfn);
> + if (ret)
> + *rc = SEV_RET_INVALID_ADDRESS;

Do we need a diff error code here? This means the page the guest gives
us is now "stuck" in the FW owned state. How would the guest know this
is the case? We return the exact same error in snp_setup_guest_buf()
if the resp_gpa isn't page aligned so now if the guest ever sees a
SEV_RET_INVALID_ADDRESS I think its only safe option is to either try
and page_state_change it to a known state or mark it as unusable
memory.

> +
> + ret = rmp_make_shared(pfn, PG_LEVEL_4K);
> + if (ret)
> + *rc = SEV_RET_INVALID_ADDRESS;

Ditto here I think we need some way to signal to the guest what state
this page is on return to guest execution.

Also these errors shadow over FW successes; this means the guest's
guest-request sequence numbers are now out of sync, meaning this VMPCK
is unusable unless the guest risks reusing the AES IV (which would break
the confidentiality/integrity). Should we have a way to signal to the
guest the FW has successfully run your command but we could not change
the page states back correctly, so the guest should increment their
sequence numbers.

> +}
> +
> +static void snp_handle_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa)
> +{
> + struct sev_data_snp_guest_request data = {0};
> + struct kvm_vcpu *vcpu = &svm->vcpu;
> + struct kvm *kvm = vcpu->kvm;
> + struct kvm_sev_info *sev;
> + unsigned long rc;
> + int err;
> +
> + if (!sev_snp_guest(vcpu->kvm)) {
> + rc = SEV_RET_INVALID_GUEST;
> + goto e_fail;
> + }
> +
> + sev = &to_kvm_svm(kvm)->sev_info;

Ditto, why not do this above?

> +
> + mutex_lock(&sev->guest_req_lock);
> +
> + rc = snp_setup_guest_buf(svm, &data, req_gpa, resp_gpa);
> + if (rc)
> + goto unlock;
> +
> + rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err);
> + if (rc)
> + /* use the firmware error code */
> + rc = err;
> +
> + snp_cleanup_guest_buf(&data, &rc);
> +
> +unlock:
> + mutex_unlock(&sev->guest_req_lock);
> +
> +e_fail:
> + svm_set_ghcb_sw_exit_info_2(vcpu, rc);
> +}
> +
> +static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa)
> +{
> + struct sev_data_snp_guest_request req = {0};
> + struct kvm_vcpu *vcpu = &svm->vcpu;
> + struct kvm *kvm = vcpu->kvm;
> + unsigned long data_npages;
> + struct kvm_sev_info *sev;
> + unsigned long rc, err;
> + u64 data_gpa;
> +
> + if (!sev_snp_guest(vcpu->kvm)) {
> + rc = SEV_RET_INVALID_GUEST;
> + goto e_fail;
> + }
> +
> + sev = &to_kvm_svm(kvm)->sev_info;
> +
> + data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
> + data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
> +
> + if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
> + rc = SEV_RET_INVALID_ADDRESS;
> + goto e_fail;
> + }
> +
> + /* Verify that requested blob will fit in certificate buffer */
> + if ((data_npages << PAGE_SHIFT) > SEV_FW_BLOB_MAX_SIZE) {
> + rc = SEV_RET_INVALID_PARAM;
> + goto e_fail;
> + }
> +
> + mutex_lock(&sev->guest_req_lock);
> +
> + rc = snp_setup_guest_buf(svm, &req, req_gpa, resp_gpa);
> + if (rc)
> + goto unlock;
> +
> + rc = snp_guest_ext_guest_request(&req, (unsigned long)sev->snp_certs_data,
> + &data_npages, &err);
> + if (rc) {
> + /*
> + * If buffer length is small then return the expected
> + * length in rbx.
> + */
> + if (err == SNP_GUEST_REQ_INVALID_LEN)
> + vcpu->arch.regs[VCPU_REGS_RBX] = data_npages;
> +
> + /* pass the firmware error code */
> + rc = err;
> + goto cleanup;
> + }
> +
> + /* Copy the certificate blob in the guest memory */
> + if (data_npages &&
> + kvm_write_guest(kvm, data_gpa, sev->snp_certs_data, data_npages << PAGE_SHIFT))
> + rc = SEV_RET_INVALID_ADDRESS;

Since at this point the PSP FW has correctly executed the command and
incremented the VMPCK sequence number, I think we need a different error
signal here; this one will tell the guest the PSP had an error, so it
will not know whether the VMPCK sequence number should be incremented.

> +
> +cleanup:
> + snp_cleanup_guest_buf(&req, &rc);
> +
> +unlock:
> + mutex_unlock(&sev->guest_req_lock);
> +
> +e_fail:
> + svm_set_ghcb_sw_exit_info_2(vcpu, rc);
> +}
> +
> static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
> {
> struct vmcb_control_area *control = &svm->vmcb->control;
> @@ -3753,6 +3923,20 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
> svm_set_ghcb_sw_exit_info_2(vcpu, rc);
> break;
> }
> + case SVM_VMGEXIT_GUEST_REQUEST: {
> + snp_handle_guest_request(svm, control->exit_info_1, control->exit_info_2);
> +
> + ret = 1;
> + break;
> + }
> + case SVM_VMGEXIT_EXT_GUEST_REQUEST: {
> + snp_handle_ext_guest_request(svm,
> + control->exit_info_1,
> + control->exit_info_2);
> +
> + ret = 1;
> + break;
> + }
> case SVM_VMGEXIT_UNSUPPORTED_EVENT:
> vcpu_unimpl(vcpu,
> "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 3fd95193ed8d..3be24da1a743 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -98,6 +98,8 @@ struct kvm_sev_info {
> u64 snp_init_flags;
> void *snp_context; /* SNP guest context page */
> spinlock_t psc_lock;
> + void *snp_certs_data;
> + struct mutex guest_req_lock;
> };
>
> struct kvm_svm {
> --
> 2.25.1
>

2022-06-24 16:37:00

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 47/49] *fix for stale per-cpu pointer due to cond_resched during ghcb mapping

On Mon, Jun 20, 2022 at 5:15 PM Ashish Kalra <[email protected]> wrote:
>
> From: Michael Roth <[email protected]>
>
> Signed-off-by: Michael Roth <[email protected]>

Can you add a commit description here? Is this a fix for existing
SEV-ES support or should this be incorporated into a patch in this
series which adds this issue?

> ---
> arch/x86/kvm/svm/svm.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index fced6ea423ad..f78e3b1bde0e 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1352,7 +1352,7 @@ static void svm_vcpu_free(struct kvm_vcpu *vcpu)
> static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
> - struct svm_cpu_data *sd = per_cpu(svm_data, vcpu->cpu);
> + struct svm_cpu_data *sd;
>
> if (sev_es_guest(vcpu->kvm))
> sev_es_unmap_ghcb(svm);
> @@ -1360,6 +1360,10 @@ static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> if (svm->guest_state_loaded)
> return;
>
> + /* sev_es_unmap_ghcb() can resched, so grab per-cpu pointer afterward. */
> + barrier();
> + sd = per_cpu(svm_data, vcpu->cpu);
> +
> /*
> * Save additional host state that will be restored on VMEXIT (sev-es)
> * or subsequent vmload of host save area.
> --
> 2.25.1
>

2022-06-24 16:50:27

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 47/49] *fix for stale per-cpu pointer due to cond_resched during ghcb mapping

[AMD Official Use Only - General]

Hello Peter,
>>
>> From: Michael Roth <[email protected]>
>>
>> Signed-off-by: Michael Roth <[email protected]>

>Can you add a commit description here? Is this a fix for existing SEV-ES support or should this be incorporated into a patch in this series which adds this issue?

This actually fixes an issue caused by preemption in svm_prepare_switch_to_guest() when kvm_vcpu_map() is called to map in the GHCB before
entering the guest.

This is a temporary fix; what we really need is to avoid getting preempted after vcpu_enter_guest() has disabled preemption. We have some ideas about
using the gfn_to_pfn_cache() infrastructure to re-use the already mapped GHCB at guest exit, so that we can avoid calling kvm_vcpu_map() to re-map the
GHCB.

Thanks,
Ashish

> ---
> arch/x86/kvm/svm/svm.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index
> fced6ea423ad..f78e3b1bde0e 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1352,7 +1352,7 @@ static void svm_vcpu_free(struct kvm_vcpu *vcpu)
> static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {
> struct vcpu_svm *svm = to_svm(vcpu);
> - struct svm_cpu_data *sd = per_cpu(svm_data, vcpu->cpu);
> + struct svm_cpu_data *sd;
>
> if (sev_es_guest(vcpu->kvm))
> sev_es_unmap_ghcb(svm); @@ -1360,6 +1360,10 @@ static
> void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> if (svm->guest_state_loaded)
> return;
>
> + /* sev_es_unmap_ghcb() can resched, so grab per-cpu pointer afterward. */
> + barrier();
> + sd = per_cpu(svm_data, vcpu->cpu);
> +
> /*
> * Save additional host state that will be restored on VMEXIT (sev-es)
> * or subsequent vmload of host save area.
> --
> 2.25.1
>

2022-06-24 18:20:37

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 24/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command

[AMD Official Use Only - General]

>> +static int snp_decommission_context(struct kvm *kvm) {
>> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> + struct sev_data_snp_decommission data = {};
>> + int ret;
>> +
>> + /* If context is not created then do nothing */
>> + if (!sev->snp_context)
>> + return 0;
>> +
>> + data.gctx_paddr = __sme_pa(sev->snp_context);
>> + ret = snp_guest_decommission(&data, NULL);

>Do we have a similar race like in sev_unbind_asid() with DEACTIVATE and WBINVD/DF_FLUSH? The SNP_DECOMMISSION spec looks quite similar to DEACTIVATE.

Yes, SNP_DECOMMISSION also marks the ASID as invalid and requires a WBINVD/DF_FLUSH before the ASID is re-used/re-cycled, so we need to guard against
DECOMMISSION and ASID re-cycling happening at the same time. We can reuse the same RWSEM (sev_deactivate_lock) here too.
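
As a rough sketch (illustration only, not the actual patch; the exact placement inside snp_decommission_context() is an assumption), the reuse could follow the existing sev_unbind_asid() pattern:

	/*
	 * Guard SNP_DECOMMISSION with sev_deactivate_lock so it cannot race
	 * with the WBINVD/DF_FLUSH path (which takes the rwsem for write)
	 * while the ASID is being recycled.
	 */
	down_read(&sev_deactivate_lock);
	ret = snp_guest_decommission(&data, NULL);
	up_read(&sev_deactivate_lock);
	if (WARN_ONCE(ret, "failed to release guest context"))
		return ret;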

Thanks,
Ashish

> + if (WARN_ONCE(ret, "failed to release guest context"))
> + return ret;
> +
> + /* free the context page now */
> + snp_free_firmware_page(sev->snp_context);
> + sev->snp_context = NULL;
> +
> + return 0;
> +}
> +
> void sev_vm_destroy(struct kvm *kvm)
> {
> struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;

2022-06-24 20:16:11

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 35/49] KVM: SVM: Remove the long-lived GHCB host map

[AMD Official Use Only - General]

Hello Peter,

>> From: Brijesh Singh <[email protected]>
>>
>> On VMGEXIT, sev_handle_vmgexit() creates a host mapping for the GHCB
>> GPA, and unmaps it just before VM-entry. This long-lived GHCB map is
>> used by the VMGEXIT handler through accessors such as ghcb_{set_get}_xxx().
>>
>> A long-lived GHCB map can cause issue when SEV-SNP is enabled. When
>> SEV-SNP is enabled the mapped GPA needs to be protected against a page
>> state change.
>>
>> To eliminate the long-lived GHCB mapping, update the GHCB sync
>> operations to explicitly map the GHCB before access and unmap it after
>> access is complete. This requires that the setting of the GHCBs
>> sw_exit_info_{1,2} fields be done during sev_es_sync_to_ghcb(), so
>> create two new fields in the vcpu_svm struct to hold these values when
>> required to be set outside of the GHCB mapping.
>>
>> Signed-off-by: Brijesh Singh <[email protected]>
>> ---
>> arch/x86/kvm/svm/sev.c | 131
>> ++++++++++++++++++++++++++---------------
>> arch/x86/kvm/svm/svm.c | 12 ++--
>> arch/x86/kvm/svm/svm.h | 24 +++++++-
>> 3 files changed, 111 insertions(+), 56 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index
>> 01ea257e17d6..c70f3f7e06a8 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -2823,15 +2823,40 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
>> kvfree(svm->sev_es.ghcb_sa);
>> }
>>
>> +static inline int svm_map_ghcb(struct vcpu_svm *svm, struct
>> +kvm_host_map *map) {
>> + struct vmcb_control_area *control = &svm->vmcb->control;
>> + u64 gfn = gpa_to_gfn(control->ghcb_gpa);
>> +
>> + if (kvm_vcpu_map(&svm->vcpu, gfn, map)) {
>> + /* Unable to map GHCB from guest */
>> + pr_err("error mapping GHCB GFN [%#llx] from guest\n", gfn);
>> + return -EFAULT;
>> + }
>> +
>> + return 0;
>> +}

>There is a perf cost to this suggestion but it might make accessing the GHCB safer for KVM. Have you thought about just using
>kvm_read_guest() or copy_from_user() to fully copy out the GHCB into a KVM-owned buffer, then copying it back before the VMRUN? That way KVM doesn't need to guard against page_state_changes on the GHCBs; that could be a perf improvement in a follow up.

Along with the performance costs you mentioned, the main concern here will be the GHCB write-back path (copying it back) before VMRUN: this will again hit the issue we currently have with
kvm_write_guest() / copy_to_user() when we use it to sync the scratch buffer back to the GHCB. This can fail if guest RAM is mapped using huge-page(s) while the RMP is 4K. Please refer to the patch/fix
mentioned below; kvm_write_guest() can potentially fail before VMRUN in the SNP case:

commit 94ed878c2669532ebae8eb9b4503f19aa33cd7aa
Author: Ashish Kalra <[email protected]>
Date: Mon Jun 6 22:28:01 2022 +0000

KVM: SVM: Sync the GHCB scratch buffer using already mapped ghcb

Using kvm_write_guest() to sync the GHCB scratch buffer can fail
due to host mapping being 2M, but RMP being 4K. The page fault handling
in do_user_addr_fault() fails to split the 2M page to handle RMP fault due
to it being called here in a non-preemptible context. Instead use
the already kernel mapped ghcb to sync the scratch buffer when the
scratch buffer is contained within the GHCB.
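
As a rough illustration of that fix (a sketch only, not the actual patch: the helper name below and the bounds check are assumptions, while sw_scratch, ghcb_sa/ghcb_sa_len and control->ghcb_gpa are the existing fields), the sync-back path looks roughly like:

static void sev_es_sync_scratch_to_ghcb(struct vcpu_svm *svm, struct ghcb *ghcb)
{
	u64 scratch_gpa = ghcb_get_sw_scratch(ghcb);
	u64 ghcb_gpa = svm->vmcb->control.ghcb_gpa;
	u64 len = svm->sev_es.ghcb_sa_len;

	if (scratch_gpa >= ghcb_gpa && scratch_gpa + len <= ghcb_gpa + PAGE_SIZE) {
		/* Scratch area lives inside the GHCB: reuse the kernel mapping. */
		memcpy((void *)ghcb + (scratch_gpa - ghcb_gpa),
		       svm->sev_es.ghcb_sa, len);
	} else {
		/*
		 * Scratch area is outside the GHCB; kvm_write_guest() is the
		 * path that can hit the 2M host mapping vs. 4K RMP fault
		 * described above.
		 */
		kvm_write_guest(svm->vcpu.kvm, scratch_gpa,
				svm->sev_es.ghcb_sa, len);
	}
}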

Thanks,
Ashish

>Since we cannot unmap GHCBs I don't think UPM will help here so we probably want to make these patches safe against malicious guests making GHCBs private. But maybe UPM does help?

2022-06-27 19:07:32

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 42/49] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event

[Public]

Hello Peter,

-----Original Message-----
From: Peter Gonda <[email protected]>
Sent: Friday, June 24, 2022 11:25 AM
To: Kalra, Ashish <[email protected]>
Subject: Re: [PATCH Part2 v6 42/49] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event

On Mon, Jun 20, 2022 at 5:13 PM Ashish Kalra <[email protected]> wrote:
>
> From: Brijesh Singh <[email protected]>
>
> Version 2 of GHCB specification added the support for two SNP Guest
> Request Message NAE events. The events allows for an SEV-SNP guest to
> make request to the SEV-SNP firmware through hypervisor using the
> SNP_GUEST_REQUEST API define in the SEV-SNP firmware specification.
>
> The SNP_EXT_GUEST_REQUEST is similar to SNP_GUEST_REQUEST with the
> difference of an additional certificate blob that can be passed
> through the SNP_SET_CONFIG ioctl defined in the CCP driver. The CCP
> driver provides snp_guest_ext_guest_request() that is used by the KVM
> to get both the report and certificate data at once.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 196 +++++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/svm/svm.h | 2 +
> 2 files changed, 192 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index
> 7fc0fad87054..089af21a4efe 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -343,6 +343,7 @@ static int sev_guest_init(struct kvm *kvm, struct
> kvm_sev_cmd *argp)
>
> spin_lock_init(&sev->psc_lock);
> ret = sev_snp_init(&argp->error);
> + mutex_init(&sev->guest_req_lock);
> } else {
> ret = sev_platform_init(&argp->error);
> }
> @@ -1884,23 +1885,39 @@ int sev_vm_move_enc_context_from(struct kvm
> *kvm, unsigned int source_fd)
>
> static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd
> *argp) {
> + void *context = NULL, *certs_data = NULL, *resp_page = NULL;

>Is the NULL setting here unnecessary since all of these are set via functions snp_alloc_firmware_page(), kmalloc(), and
>snp_alloc_firmware_page() respectively?

Yes, they don't need to be set to NULL.

> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> struct sev_data_snp_gctx_create data = {};
> - void *context;
> int rc;
>
> + /* Allocate memory used for the certs data in SNP guest request */
> + certs_data = kmalloc(SEV_FW_BLOB_MAX_SIZE, GFP_KERNEL_ACCOUNT);
> + if (!certs_data)
> + return NULL;

>I think we want to use kzalloc() here to ensure we never give the guest uninitialized kernel memory.

Yes.

> +
> /* Allocate memory for context page */
> context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
> if (!context)
> - return NULL;
> + goto e_free;
> +
> + /* Allocate a firmware buffer used during the guest command handling. */
> + resp_page = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
> + if (!resp_page)
> + goto e_free;

>|resp_page| doesn't appear to be used anywhere?

>
> data.gctx_paddr = __psp_pa(context);
> rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
> - if (rc) {
> - snp_free_firmware_page(context);
> - return NULL;
> - }
> + if (rc)
> + goto e_free;
> +
> + sev->snp_certs_data = certs_data;
>
> return context;
> +
> +e_free:
> + snp_free_firmware_page(context);
> + kfree(certs_data);
> + return NULL;
> }
>
> static int snp_bind_asid(struct kvm *kvm, int *error) @@ -2565,6
> +2582,8 @@ static int snp_decommission_context(struct kvm *kvm)
> snp_free_firmware_page(sev->snp_context);
> sev->snp_context = NULL;
>
> + kfree(sev->snp_certs_data);
> +
> return 0;
> }
>
> @@ -3077,6 +3096,8 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm, u64 *exit_code)
> case SVM_VMGEXIT_UNSUPPORTED_EVENT:
> case SVM_VMGEXIT_HV_FEATURES:
> case SVM_VMGEXIT_PSC:
> + case SVM_VMGEXIT_GUEST_REQUEST:
> + case SVM_VMGEXIT_EXT_GUEST_REQUEST:
> break;
> default:
> reason = GHCB_ERR_INVALID_EVENT; @@ -3502,6 +3523,155
> @@ static unsigned long snp_handle_page_state_change(struct vcpu_svm *svm)
> return rc ? map_to_psc_vmgexit_code(rc) : 0; }
>
> +static unsigned long snp_setup_guest_buf(struct vcpu_svm *svm,
> + struct sev_data_snp_guest_request *data,
> + gpa_t req_gpa, gpa_t
> +resp_gpa) {
> + struct kvm_vcpu *vcpu = &svm->vcpu;
> + struct kvm *kvm = vcpu->kvm;
> + kvm_pfn_t req_pfn, resp_pfn;
> + struct kvm_sev_info *sev;
> +
> + sev = &to_kvm_svm(kvm)->sev_info;

>This is normally done at declaration in this file. Why not here?

> struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
Ok.

> +
> + if (!IS_ALIGNED(req_gpa, PAGE_SIZE) || !IS_ALIGNED(resp_gpa, PAGE_SIZE))
> + return SEV_RET_INVALID_PARAM;
> +
> + req_pfn = gfn_to_pfn(kvm, gpa_to_gfn(req_gpa));
> + if (is_error_noslot_pfn(req_pfn))
> + return SEV_RET_INVALID_ADDRESS;
> +
> + resp_pfn = gfn_to_pfn(kvm, gpa_to_gfn(resp_gpa));
> + if (is_error_noslot_pfn(resp_pfn))
> + return SEV_RET_INVALID_ADDRESS;
> +
> + if (rmp_make_private(resp_pfn, 0, PG_LEVEL_4K, 0, true))
> + return SEV_RET_INVALID_ADDRESS;
> +
> + data->gctx_paddr = __psp_pa(sev->snp_context);
> + data->req_paddr = __sme_set(req_pfn << PAGE_SHIFT);
> + data->res_paddr = __sme_set(resp_pfn << PAGE_SHIFT);
> +
> + return 0;
> +}
> +
> +static void snp_cleanup_guest_buf(struct sev_data_snp_guest_request
> +*data, unsigned long *rc) {
> + u64 pfn = __sme_clr(data->res_paddr) >> PAGE_SHIFT;
> + int ret;
> +
> + ret = snp_page_reclaim(pfn);
> + if (ret)
> + *rc = SEV_RET_INVALID_ADDRESS;

>Do we need a different error code here? This means the page the guest gives us is now "stuck" in the FW-owned state. How would the guest know this is the case? We return the exact same error in snp_setup_guest_buf() if the resp_gpa isn't page aligned, so now if the guest ever sees a SEV_RET_INVALID_ADDRESS I think its only safe option is to either try to page_state_change it to a known state or mark it as unusable memory.

If snp_page_reclaim() fails, it will invoke snp_leak_pages(), which reports a memory failure and triggers the memory recovery mechanisms; those should drop the pages or mark them unusable.

> +
> + ret = rmp_make_shared(pfn, PG_LEVEL_4K);
> + if (ret)
> + *rc = SEV_RET_INVALID_ADDRESS;

>Ditto here, I think we need some way to signal to the guest what state this page is in on return to guest execution.

>Also these errors shadow FW successes: this means the guest's guest-request sequence numbers are now out of sync, meaning this VMPCK is unusable lest the guest risk reusing the AES IV (which would break the confidentiality/integrity). Should we have a way to signal to the guest that the FW has successfully run the command but we could not change the page states back correctly, so the guest should increment its sequence numbers?

Yes, that is an important observation, and the sequence numbers are now out of sync.
But as this is an error path, what's the guarantee that the next guest message request will succeed completely? Isn't it better to let the
FW reject any subsequent guest messages once it has detected that the sequence numbers are out of sync?

> +}
> +
> +static void snp_handle_guest_request(struct vcpu_svm *svm, gpa_t
> +req_gpa, gpa_t resp_gpa) {
> + struct sev_data_snp_guest_request data = {0};
> + struct kvm_vcpu *vcpu = &svm->vcpu;
> + struct kvm *kvm = vcpu->kvm;
> + struct kvm_sev_info *sev;
> + unsigned long rc;
> + int err;
> +
> + if (!sev_snp_guest(vcpu->kvm)) {
> + rc = SEV_RET_INVALID_GUEST;
> + goto e_fail;
> + }
> +
> + sev = &to_kvm_svm(kvm)->sev_info;

>Ditto, why not do this above?
Ok.

> +
> + mutex_lock(&sev->guest_req_lock);
> +
> + rc = snp_setup_guest_buf(svm, &data, req_gpa, resp_gpa);
> + if (rc)
> + goto unlock;
> +
> + rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err);
> + if (rc)
> + /* use the firmware error code */
> + rc = err;
> +
> + snp_cleanup_guest_buf(&data, &rc);
> +
> +unlock:
> + mutex_unlock(&sev->guest_req_lock);
> +
> +e_fail:
> + svm_set_ghcb_sw_exit_info_2(vcpu, rc); }
> +
> +static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t
> +req_gpa, gpa_t resp_gpa) {
> + struct sev_data_snp_guest_request req = {0};
> + struct kvm_vcpu *vcpu = &svm->vcpu;
> + struct kvm *kvm = vcpu->kvm;
> + unsigned long data_npages;
> + struct kvm_sev_info *sev;
> + unsigned long rc, err;
> + u64 data_gpa;
> +
> + if (!sev_snp_guest(vcpu->kvm)) {
> + rc = SEV_RET_INVALID_GUEST;
> + goto e_fail;
> + }
> +
> + sev = &to_kvm_svm(kvm)->sev_info;
> +
> + data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
> + data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
> +
> + if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
> + rc = SEV_RET_INVALID_ADDRESS;
> + goto e_fail;
> + }
> +
> + /* Verify that requested blob will fit in certificate buffer */
> + if ((data_npages << PAGE_SHIFT) > SEV_FW_BLOB_MAX_SIZE) {
> + rc = SEV_RET_INVALID_PARAM;
> + goto e_fail;
> + }
> +
> + mutex_lock(&sev->guest_req_lock);
> +
> + rc = snp_setup_guest_buf(svm, &req, req_gpa, resp_gpa);
> + if (rc)
> + goto unlock;
> +
> + rc = snp_guest_ext_guest_request(&req, (unsigned long)sev->snp_certs_data,
> + &data_npages, &err);
> + if (rc) {
> + /*
> + * If buffer length is small then return the expected
> + * length in rbx.
> + */
> + if (err == SNP_GUEST_REQ_INVALID_LEN)
> + vcpu->arch.regs[VCPU_REGS_RBX] = data_npages;
> +
> + /* pass the firmware error code */
> + rc = err;
> + goto cleanup;
> + }
> +
> + /* Copy the certificate blob in the guest memory */
> + if (data_npages &&
> + kvm_write_guest(kvm, data_gpa, sev->snp_certs_data, data_npages << PAGE_SHIFT))
> + rc = SEV_RET_INVALID_ADDRESS;

>Since at this point the PSP FW has correctly executed the command and incremented the VMPCK sequence number, I think we need a different error signal here: this error tells the guest the PSP had an error, so the guest will not know whether the VMPCK sequence number should be incremented.

Similarly to the above: as this is an error path, what's the guarantee that the next guest message request will succeed completely? Isn't it better to let the
FW reject any subsequent guest messages once it has detected that the sequence numbers are out of sync?

> +
> +cleanup:
> + snp_cleanup_guest_buf(&req, &rc);
> +
> +unlock:
> + mutex_unlock(&sev->guest_req_lock);
> +
> +e_fail:
> + svm_set_ghcb_sw_exit_info_2(vcpu, rc); }
> +
> static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) {
> struct vmcb_control_area *control = &svm->vmcb->control; @@
> -3753,6 +3923,20 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
> svm_set_ghcb_sw_exit_info_2(vcpu, rc);
> break;
> }
> + case SVM_VMGEXIT_GUEST_REQUEST: {
> + snp_handle_guest_request(svm, control->exit_info_1,
> + control->exit_info_2);
> +
> + ret = 1;
> + break;
> + }
> + case SVM_VMGEXIT_EXT_GUEST_REQUEST: {
> + snp_handle_ext_guest_request(svm,
> + control->exit_info_1,
> + control->exit_info_2);
> +
> + ret = 1;
> + break;
> + }
> case SVM_VMGEXIT_UNSUPPORTED_EVENT:
> vcpu_unimpl(vcpu,
> "vmgexit: unsupported event -
> exit_info_1=%#llx, exit_info_2=%#llx\n", diff --git
> a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index
> 3fd95193ed8d..3be24da1a743 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -98,6 +98,8 @@ struct kvm_sev_info {
> u64 snp_init_flags;
> void *snp_context; /* SNP guest context page */
> spinlock_t psc_lock;
> + void *snp_certs_data;
> + struct mutex guest_req_lock;
> };
>
> struct kvm_svm {
> --
> 2.25.1
>

2022-06-28 10:52:49

by Dr. David Alan Gilbert

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

* Kalra, Ashish ([email protected]) wrote:
> [AMD Official Use Only - General]
>
> >>> /*
> >>> * The RMP entry format is not architectural. The format is defined
> >>> in PPR @@ -126,6 +128,15 @@ struct snp_guest_platform_data {
> >>> u64 secrets_gpa;
> >>> };
> >>>
> >>> +struct rmpupdate {
> >>> + u64 gpa;
> >>> + u8 assigned;
> >>> + u8 pagesize;
> >>> + u8 immutable;
> >>> + u8 rsvd;
> >>> + u32 asid;
> >>> +} __packed;
>
> >>I see above it says the RMP entry format isn't architectural; is this 'rmpupdate' structure? If not, how is this going to get handled when we have a couple of SNP capable CPUs with different layouts?
>
> >Architectural implies that it is defined in the APM and shouldn't change in such a way as to not be backward compatible.
> >I probably think the wording here should be architecture independent or more precisely platform independent.
>
> Some more clarity on this:
>
> Actually, the PPR for family 19h Model 01h, Rev B1 defines the RMP entry format as below:
>
> 2.1.4.2 RMP Entry Format
> Architecturally the format of RMP entries are not specified in APM. In order to assist software, the following table specifies select portions of the RMP entry format for this specific product. Each RMP entry is 16B in size and is formatted as follows. Software should not rely on any field definitions not specified in this table and the format of an RMP entry may change in future processors.
>
> Architectural implies that it is defined in the APM and shouldn't change in such a way as to not be backward compatible. So non-architectural in this context means that it is only defined in our PPR.
>
> So actually this RMP entry definition is platform dependent and will need to be changed for different AMD processors, and that change has to be handled correspondingly in the dump_rmpentry() code.

You'll need a way to make that fail cleanly when run on a newer CPU
with different layout, and a way to build kernels that can handle
more than one layout.

Dave

> Thanks,
> Ashish
>
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK

2022-06-28 13:33:30

by Dr. David Alan Gilbert

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 36/49] KVM: SVM: Add support to handle GHCB GPA register VMGEXIT

* Ashish Kalra ([email protected]) wrote:
> From: Brijesh Singh <[email protected]>
>
> SEV-SNP guests are required to perform a GHCB GPA registration. Before
> using a GHCB GPA for a vCPU the first time, a guest must register the
> vCPU GHCB GPA. If hypervisor can work with the guest requested GPA then
> it must respond back with the same GPA otherwise return -1.
>
> On VMEXIT, Verify that GHCB GPA matches with the registered value. If a
> mismatch is detected then abort the guest.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/include/asm/sev-common.h | 8 ++++++++
> arch/x86/kvm/svm/sev.c | 27 +++++++++++++++++++++++++++
> arch/x86/kvm/svm/svm.h | 7 +++++++
> 3 files changed, 42 insertions(+)
>
> diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
> index 539de6b93420..0a9055cdfae2 100644
> --- a/arch/x86/include/asm/sev-common.h
> +++ b/arch/x86/include/asm/sev-common.h
> @@ -59,6 +59,14 @@
> #define GHCB_MSR_AP_RESET_HOLD_RESULT_POS 12
> #define GHCB_MSR_AP_RESET_HOLD_RESULT_MASK GENMASK_ULL(51, 0)
>
> +/* Preferred GHCB GPA Request */
> +#define GHCB_MSR_PREF_GPA_REQ 0x010
> +#define GHCB_MSR_GPA_VALUE_POS 12
> +#define GHCB_MSR_GPA_VALUE_MASK GENMASK_ULL(51, 0)

Are the magic 51's in here fixed?

Dave

> +#define GHCB_MSR_PREF_GPA_RESP 0x011
> +#define GHCB_MSR_PREF_GPA_NONE 0xfffffffffffff
> +
> /* GHCB GPA Register */
> #define GHCB_MSR_REG_GPA_REQ 0x012
> #define GHCB_MSR_REG_GPA_REQ_VAL(v) \
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index c70f3f7e06a8..6de48130e414 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3331,6 +3331,27 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
> GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
> break;
> }
> + case GHCB_MSR_PREF_GPA_REQ: {
> + set_ghcb_msr_bits(svm, GHCB_MSR_PREF_GPA_NONE, GHCB_MSR_GPA_VALUE_MASK,
> + GHCB_MSR_GPA_VALUE_POS);
> + set_ghcb_msr_bits(svm, GHCB_MSR_PREF_GPA_RESP, GHCB_MSR_INFO_MASK,
> + GHCB_MSR_INFO_POS);
> + break;
> + }
> + case GHCB_MSR_REG_GPA_REQ: {
> + u64 gfn;
> +
> + gfn = get_ghcb_msr_bits(svm, GHCB_MSR_GPA_VALUE_MASK,
> + GHCB_MSR_GPA_VALUE_POS);
> +
> + svm->sev_es.ghcb_registered_gpa = gfn_to_gpa(gfn);
> +
> + set_ghcb_msr_bits(svm, gfn, GHCB_MSR_GPA_VALUE_MASK,
> + GHCB_MSR_GPA_VALUE_POS);
> + set_ghcb_msr_bits(svm, GHCB_MSR_REG_GPA_RESP, GHCB_MSR_INFO_MASK,
> + GHCB_MSR_INFO_POS);
> + break;
> + }
> case GHCB_MSR_TERM_REQ: {
> u64 reason_set, reason_code;
>
> @@ -3381,6 +3402,12 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
> return 1;
> }
>
> + /* SEV-SNP guest requires that the GHCB GPA must be registered */
> + if (sev_snp_guest(svm->vcpu.kvm) && !ghcb_gpa_is_registered(svm, ghcb_gpa)) {
> + vcpu_unimpl(&svm->vcpu, "vmgexit: GHCB GPA [%#llx] is not registered.\n", ghcb_gpa);
> + return -EINVAL;
> + }
> +
> ret = sev_es_validate_vmgexit(svm, &exit_code);
> if (ret)
> return ret;
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index c80352c9c0d6..54ff56cb6125 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -206,6 +206,8 @@ struct vcpu_sev_es_state {
> */
> u64 ghcb_sw_exit_info_1;
> u64 ghcb_sw_exit_info_2;
> +
> + u64 ghcb_registered_gpa;
> };
>
> struct vcpu_svm {
> @@ -334,6 +336,11 @@ static inline bool sev_snp_guest(struct kvm *kvm)
> return sev_es_guest(kvm) && sev->snp_active;
> }
>
> +static inline bool ghcb_gpa_is_registered(struct vcpu_svm *svm, u64 val)
> +{
> + return svm->sev_es.ghcb_registered_gpa == val;
> +}
> +
> static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
> {
> vmcb->control.clean = 0;
> --
> 2.25.1
>
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK

2022-06-28 17:59:30

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

[AMD Official Use Only - General]

Hello Dave,

-----Original Message-----
From: Dr. David Alan Gilbert <[email protected]>
Sent: Tuesday, June 28, 2022 5:51 AM
To: Kalra, Ashish <[email protected]>
Subject: Re: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

* Kalra, Ashish ([email protected]) wrote:
> [AMD Official Use Only - General]
>
> >>> /*
> >>> * The RMP entry format is not architectural. The format is
> >>> defined in PPR @@ -126,6 +128,15 @@ struct snp_guest_platform_data {
> >>> u64 secrets_gpa;
> >>> };
> >>>
> >>> +struct rmpupdate {
> >>> + u64 gpa;
> >>> + u8 assigned;
> >>> + u8 pagesize;
> >>> + u8 immutable;
> >>> + u8 rsvd;
> >>> + u32 asid;
> >>> +} __packed;
>
> >>I see above it says the RMP entry format isn't architectural; is this 'rmpupdate' structure? If not how is this going to get handled when we have a couple >of SNP capable CPUs with different layouts?
>
> >Architectural implies that it is defined in the APM and shouldn't change in such a way as to not be backward compatible.
> >I probably think the wording here should be architecture independent or more precisely platform independent.
>
> Some more clarity on this:
>
> Actually, the PPR for family 19h Model 01h, Rev B1 defines the RMP entry format as below:
>
> 2.1.4.2 RMP Entry Format
> Architecturally the format of RMP entries are not specified in APM. In order to assist software, the following table specifies select portions of the RMP entry format for this specific product. Each RMP entry is 16B in size and is formatted as follows. Software should not rely on any field definitions not specified in this table and the format of an RMP entry may change in future processors.
>
> Architectural implies that it is defined in the APM and shouldn't change in such a way as to not be backward compatible. So non-architectural in this context means that it is only defined in our PPR.
>
> So actually this RPM entry definition is platform dependent and will need to be changed for different AMD processors and that change has to be handled correspondingly in the dump_rmpentry() code.

> You'll need a way to make that fail cleanly when run on a newer CPU with different layout, and a way to build kernels that can handle more than one layout.

Yes, I will be adding a check for the CPU family/model as follows:

static int __init snp_rmptable_init(void)
{
+ int family, model;

if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
return 0;

+ family = boot_cpu_data.x86;
+ model = boot_cpu_data.x86_model;

+ /*
+ * RMP table entry format is not architectural and it can vary by processor and
+ * is defined by the per-processor PPR. Restrict SNP support on the known CPU
+ * model and family for which the RMP table entry format is currently defined for.
+ */
+ if (family != 0x19 || model > 0xaf)
+ goto nosnp;
+

This way SNP will only be enabled specifically on the platforms for which this RMP entry
format is defined in those processors' PPRs. This will work for Milan and Genoa as of now.

Additionally as per Sean's suggestion, I will be moving the RMP structure definition to sev.c,
which will make it a private structure and not exposed to other parts of the kernel.

Also, in the future we will have an architectural interface to read the RMP table entry;
we will first check for its availability and, if it is not available, fall back to the RMP table
entry structure definition.

Thanks,
Ashish

2022-06-28 19:01:36

by Dr. David Alan Gilbert

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

* Kalra, Ashish ([email protected]) wrote:
> [AMD Official Use Only - General]
>
> Hello Dave,
>
> -----Original Message-----
> From: Dr. David Alan Gilbert <[email protected]>
> Sent: Tuesday, June 28, 2022 5:51 AM
> To: Kalra, Ashish <[email protected]>
> Subject: Re: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction
>
> * Kalra, Ashish ([email protected]) wrote:
> > [AMD Official Use Only - General]
> >
> > >>> /*
> > >>> * The RMP entry format is not architectural. The format is
> > >>> defined in PPR @@ -126,6 +128,15 @@ struct snp_guest_platform_data {
> > >>> u64 secrets_gpa;
> > >>> };
> > >>>
> > >>> +struct rmpupdate {
> > >>> + u64 gpa;
> > >>> + u8 assigned;
> > >>> + u8 pagesize;
> > >>> + u8 immutable;
> > >>> + u8 rsvd;
> > >>> + u32 asid;
> > >>> +} __packed;
> >
> > >>I see above it says the RMP entry format isn't architectural; is this 'rmpupdate' structure? If not how is this going to get handled when we have a couple >of SNP capable CPUs with different layouts?
> >
> > >Architectural implies that it is defined in the APM and shouldn't change in such a way as to not be backward compatible.
> > >I probably think the wording here should be architecture independent or more precisely platform independent.
> >
> > Some more clarity on this:
> >
> > Actually, the PPR for family 19h Model 01h, Rev B1 defines the RMP entry format as below:
> >
> > 2.1.4.2 RMP Entry Format
> > Architecturally the format of RMP entries are not specified in APM. In order to assist software, the following table specifies select portions of the RMP entry format for this specific product. Each RMP entry is 16B in size and is formatted as follows. Software should not rely on any field definitions not specified in this table and the format of an RMP entry may change in future processors.
> >
> > Architectural implies that it is defined in the APM and shouldn't change in such a way as to not be backward compatible. So non-architectural in this context means that it is only defined in our PPR.
> >
> > So actually this RPM entry definition is platform dependent and will need to be changed for different AMD processors and that change has to be handled correspondingly in the dump_rmpentry() code.
>
> > You'll need a way to make that fail cleanly when run on a newer CPU with different layout, and a way to build kernels that can handle more than one layout.
>
> Yes, I will be adding a check for CPU family/model as following :
>
> static int __init snp_rmptable_init(void)
> {
> + int family, model;
>
> if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
> return 0;
>
> + family = boot_cpu_data.x86;
> + model = boot_cpu_data.x86_model;
>
> + /*
> + * RMP table entry format is not architectural and it can vary by processor and
> + * is defined by the per-processor PPR. Restrict SNP support on the known CPU
> + * model and family for which the RMP table entry format is currently defined for.
> + */
> + if (family != 0x19 || model > 0xaf)
> + goto nosnp;

please add a print there to say why you're not enabling SNP.

It would be great if your firmware could give you an 'rmpentry version'; then,
if a new model came out that happened to have the same layout,
everything would just carry on working by checking that rather than
the actual family/model.

> +
>
> This way SNP will only be enabled specifically on the platforms for which this RMP entry
> format is defined in those processor's PPR. This will work for Milan and Genoa as of now.
>
> Additionally as per Sean's suggestion, I will be moving the RMP structure definition to sev.c,
> which will make it a private structure and not exposed to other parts of the kernel.
>
> Also in the future we will have an architectural interface to read the RMP table entry,
> we will first check for it's availability and if not available fall back to the RMP table
> entry structure definition.

Dave

> Thanks,
> Ashish
>
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK

2022-06-28 19:06:44

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

On 6/28/22 10:57, Kalra, Ashish wrote:
> + /*
> + * RMP table entry format is not architectural and it can vary by processor and
> + * is defined by the per-processor PPR. Restrict SNP support on the known CPU
> + * model and family for which the RMP table entry format is currently defined for.
> + */
> + if (family != 0x19 || model > 0xaf)
> + goto nosnp;
> +
>
> This way SNP will only be enabled specifically on the platforms for which this RMP entry
> format is defined in those processor's PPR. This will work for Milan and Genoa as of now.

At some point, it would be really nice if the AMD side of things could
work to kick the magic number habit on these things. This:

arch/x86/include/asm/intel-family.h

has been really handy. It lets you do things like

grep INTEL_FAM6_SKYLAKE arch/x86

That's a *LOT* more precise than:

egrep -i '0x5E|94' arch/x86
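
A minimal sketch of what that could look like on the AMD side, combining the explanatory print requested earlier in the thread with a named constant in place of the magic family number (AMD_FAM19H and the header it would live in are hypothetical, not existing kernel symbols):

#define AMD_FAM19H	0x19	/* would live in e.g. asm/amd-family.h */

	if (boot_cpu_data.x86 != AMD_FAM19H || boot_cpu_data.x86_model > 0xaf) {
		pr_info("SEV-SNP: no known RMP entry format for family 0x%x model 0x%x, not enabling SNP\n",
			boot_cpu_data.x86, boot_cpu_data.x86_model);
		goto nosnp;
	}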

2022-06-29 18:18:56

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 26/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command

[Public]

>> +static inline void snp_leak_pages(u64 pfn, enum pg_level level) {
>> + unsigned int npages = page_level_size(level) >> PAGE_SHIFT;
>> +
>> + WARN(1, "psc failed pfn 0x%llx pages %d (leaking)\n", pfn,
>> + npages);
>> +
>> + while (npages) {
>> + memory_failure(pfn, 0);
>> + dump_rmpentry(pfn);
>> + npages--;
>> + pfn++;
>> + }
>> +}

>Should this be deduplicated with the snp_leak_pages() in "crypto: ccp:
>Handle the legacy TMR allocation when SNP is enabled" ?

Yes, probably should.

>> +static bool is_hva_registered(struct kvm *kvm, hva_t hva, size_t len)
>> +{
>> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> + struct list_head *head = &sev->regions_list;
>> + struct enc_region *i;
>> +
>> + lockdep_assert_held(&kvm->lock);
>> +
>> + list_for_each_entry(i, head, list) {
>> + u64 start = i->uaddr;
>> + u64 end = start + i->size;
>> +
>> + if (start <= hva && end >= (hva + len))
>> + return true;
>> + }

>Given that userspace could load sev->regions_list with any number of regions of any size, should we add a cond_resched() like in sev_vm_destroy()?

Actually, is_hva_registered() is also called from the PSC handler with the kvm->lock mutex held. Even though it is a mutex, I am not really sure
it is a good idea to do cond_resched() with the kvm->lock mutex held.

>> +e_unpin:
>> + /* Content of memory is updated, mark pages dirty */
>> + for (i = 0; i < n; i++) {

>Since |n| is not only a loop variable but actually carries the number of private pages over to e_unpin, can we use a more descriptive name?
>How about something like 'nprivate_pages'?

Yes.

Thanks,
Ashish

2022-06-29 19:20:32

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 42/49] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event

[Public]


>> +static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t
>> +req_gpa, gpa_t resp_gpa) {
>> + struct sev_data_snp_guest_request req = {0};
>> + struct kvm_vcpu *vcpu = &svm->vcpu;
>> + struct kvm *kvm = vcpu->kvm;
>> + unsigned long data_npages;
>> + struct kvm_sev_info *sev;
>> + unsigned long rc, err;
>> + u64 data_gpa;
>> +
>> + if (!sev_snp_guest(vcpu->kvm)) {
>> + rc = SEV_RET_INVALID_GUEST;
>> + goto e_fail;
>> + }
>> +
>> + sev = &to_kvm_svm(kvm)->sev_info;
>> +
>> + data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
>> + data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
>> +
>> + if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
>> + rc = SEV_RET_INVALID_ADDRESS;
>> + goto e_fail;
>> + }
>> +
>> + /* Verify that requested blob will fit in certificate buffer */
>> + if ((data_npages << PAGE_SHIFT) > SEV_FW_BLOB_MAX_SIZE) {
>> + rc = SEV_RET_INVALID_PARAM;
>> + goto e_fail;
>> + }
>> +
>> + mutex_lock(&sev->guest_req_lock);
>> +
>> + rc = snp_setup_guest_buf(svm, &req, req_gpa, resp_gpa);
>> + if (rc)
>> + goto unlock;
>> +
>> + rc = snp_guest_ext_guest_request(&req, (unsigned long)sev->snp_certs_data,
>> + &data_npages, &err);
>> + if (rc) {
>> + /*
>> + * If buffer length is small then return the expected
>> + * length in rbx.
>> + */
>> + if (err == SNP_GUEST_REQ_INVALID_LEN)
>> + vcpu->arch.regs[VCPU_REGS_RBX] = data_npages;
>> +
>> + /* pass the firmware error code */
>> + rc = err;
>> + goto cleanup;
>> + }
>> +
>> + /* Copy the certificate blob in the guest memory */
>> + if (data_npages &&
>> + kvm_write_guest(kvm, data_gpa, sev->snp_certs_data, data_npages << PAGE_SHIFT))
>> + rc = SEV_RET_INVALID_ADDRESS;

>>Since at this point the PSP FW has correctly executed the command and incremented the VMPCK sequence number I think we need another error signal here since this will tell the guest the PSP had an error so it will not know if the VMPCK sequence >number should be incremented.

>Similarly as above, as this is an error path, so what's the guarantee that the next guest message request will succeed completely, isn’t it better to let the
>FW reject any subsequent guest messages once it has detected that the sequence numbers are out of sync ?

Alternatively, we can probably return SEV_RET_INVALID_PAGE_STATE/SEV_RET_INVALID_PAGE_OWNER here, but that still does not indicate to the guest
that the FW has successfully executed the command, that the error occurred during the cleanup/result phase, and that it needs to increment the VMPCK sequence number. There is nothing defined in the SNP FW API spec to indicate this kind of failure to the guest. As I mentioned earlier, this is probably indicative of
a bigger system failure, and it is better to let the FW reject subsequent guest messages/requests once it has detected that the sequence numbers are out of sync.

Thanks,
Ashish

2022-07-07 20:09:11

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 35/49] KVM: SVM: Remove the long-lived GHCB host map

On Fri, Jun 24, 2022 at 2:14 PM Kalra, Ashish <[email protected]> wrote:
>
> [AMD Official Use Only - General]
>
> Hello Peter,
>
> >> From: Brijesh Singh <[email protected]>
> >>
> >> On VMGEXIT, sev_handle_vmgexit() creates a host mapping for the GHCB
> >> GPA, and unmaps it just before VM-entry. This long-lived GHCB map is
> >> used by the VMGEXIT handler through accessors such as ghcb_{set_get}_xxx().
> >>
> >> A long-lived GHCB map can cause issue when SEV-SNP is enabled. When
> >> SEV-SNP is enabled the mapped GPA needs to be protected against a page
> >> state change.
> >>
> >> To eliminate the long-lived GHCB mapping, update the GHCB sync
> >> operations to explicitly map the GHCB before access and unmap it after
> >> access is complete. This requires that the setting of the GHCBs
> >> sw_exit_info_{1,2} fields be done during sev_es_sync_to_ghcb(), so
> >> create two new fields in the vcpu_svm struct to hold these values when
> >> required to be set outside of the GHCB mapping.
> >>
> >> Signed-off-by: Brijesh Singh <[email protected]>
> >> ---
> >> arch/x86/kvm/svm/sev.c | 131
> >> ++++++++++++++++++++++++++---------------
> >> arch/x86/kvm/svm/svm.c | 12 ++--
> >> arch/x86/kvm/svm/svm.h | 24 +++++++-
> >> 3 files changed, 111 insertions(+), 56 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index
> >> 01ea257e17d6..c70f3f7e06a8 100644
> >> --- a/arch/x86/kvm/svm/sev.c
> >> +++ b/arch/x86/kvm/svm/sev.c
> >> @@ -2823,15 +2823,40 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
> >> kvfree(svm->sev_es.ghcb_sa);
> >> }
> >>
> >> +static inline int svm_map_ghcb(struct vcpu_svm *svm, struct
> >> +kvm_host_map *map) {
> >> + struct vmcb_control_area *control = &svm->vmcb->control;
> >> + u64 gfn = gpa_to_gfn(control->ghcb_gpa);
> >> +
> >> + if (kvm_vcpu_map(&svm->vcpu, gfn, map)) {
> >> + /* Unable to map GHCB from guest */
> >> + pr_err("error mapping GHCB GFN [%#llx] from guest\n", gfn);
> >> + return -EFAULT;
> >> + }
> >> +
> >> + return 0;
> >> +}
>
> >There is a perf cost to this suggestion but it might make accessing the GHCB safer for KVM. Have you thought about just using
> >kvm_read_guest() or copy_from_user() to fully copy out the GCHB into a KVM owned buffer, then copying it back before the VMRUN. That way the KVM doesn't need to guard against page_state_changes on the GHCBs, that could be a perf ?>improvement in a follow up.
>
> Along with the performance costs you mentioned, the main concern here will be the GHCB write-back path (copying it back) before VMRUN: this will again hit the issue we have currently with
> kvm_write_guest() / copy_to_user(), when we use it to sync the scratch buffer back to GHCB. This can fail if guest RAM is mapped using huge-page(s) and RMP is 4K. Please refer to the patch/fix
> mentioned below, kvm_write_guest() potentially can fail before VMRUN in case of SNP :
>
> commit 94ed878c2669532ebae8eb9b4503f19aa33cd7aa
> Author: Ashish Kalra <[email protected]>
> Date: Mon Jun 6 22:28:01 2022 +0000
>
> KVM: SVM: Sync the GHCB scratch buffer using already mapped ghcb
>
> Using kvm_write_guest() to sync the GHCB scratch buffer can fail
> due to host mapping being 2M, but RMP being 4K. The page fault handling
> in do_user_addr_fault() fails to split the 2M page to handle RMP fault due
> to it being called here in a non-preemptible context. Instead use
> the already kernel mapped ghcb to sync the scratch buffer when the
> scratch buffer is contained within the GHCB.

Ah I didn't see that issue thanks for the pointer.

The patch description says "When SEV-SNP is enabled the mapped GPA
needs to be protected against a page state change." So if the guest
were to convert the GHCB page to private while the host is using the
GHCB, the host could get an RMP violation, right? That RMP violation
would cause the host to crash unless we use some copy_to_user() type
protections. I don't see any mechanism in this patch to add the
page state change protection discussed. Can't another vCPU still
convert the GHCB to private?

I was wrong about the importance of this, though; seanjc@ walked me
through how UPM will solve this issue, so no worries about this until
the series is rebased onto UPM.

>
> Thanks,
> Ashish
>
> >Since we cannot unmap GHCBs I don't think UPM will help here so we probably want to make these patches safe against malicious guests making GHCBs private. But maybe UPM does help?

2022-07-07 20:42:44

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 35/49] KVM: SVM: Remove the long-lived GHCB host map

[AMD Official Use Only - General]

Hello Peter,

>> >There is a perf cost to this suggestion but it might make accessing
>> >the GHCB safer for KVM. Have you thought about just using
>> >kvm_read_guest() or copy_from_user() to fully copy out the GCHB into a KVM owned buffer, then copying it back before the VMRUN. That way the KVM doesn't need to guard against page_state_changes on the GHCBs, that could be a perf ?>>improvement in a follow up.
>>
>> Along with the performance costs you mentioned, the main concern here
>> will be the GHCB write-back path (copying it back) before VMRUN: this
>> will again hit the issue we have currently with
>> kvm_write_guest() / copy_to_user(), when we use it to sync the scratch
>> buffer back to GHCB. This can fail if guest RAM is mapped using huge-page(s) and RMP is 4K. Please refer to the patch/fix mentioned below, kvm_write_guest() potentially can fail before VMRUN in case of SNP :
>>
>> commit 94ed878c2669532ebae8eb9b4503f19aa33cd7aa
>> Author: Ashish Kalra <[email protected]>
>> Date: Mon Jun 6 22:28:01 2022 +0000
>>
>> KVM: SVM: Sync the GHCB scratch buffer using already mapped ghcb
>>
>> Using kvm_write_guest() to sync the GHCB scratch buffer can fail
>> due to host mapping being 2M, but RMP being 4K. The page fault handling
>> in do_user_addr_fault() fails to split the 2M page to handle RMP fault due
>> to it being called here in a non-preemptible context. Instead use
>> the already kernel mapped ghcb to sync the scratch buffer when the
>> scratch buffer is contained within the GHCB.

>Ah I didn't see that issue thanks for the pointer.

>The patch description says "When SEV-SNP is enabled the mapped GPA needs to be protected against a page state change." since if the guest were to convert the GHCB page to private when the host is using the GHCB the host could get an RMP violation right?

Right.

>That RMP violation would cause the host to crash unless we use some copy_to_user() type protections.

As such copy_to_user() will only swallow the RMP violation and return failure, so the host can retry the write.

> I don't see anything mechanism for this patch to add the page state change protection discussed. Can't another vCPU still convert the GHCB to private?

We do have protections specifically against the GHCB getting mapped to private: there are new post_{map|unmap}_gfn functions added to verify whether it is safe to map
GHCB pages, and a PSC spinlock added which protects against page state changes for these mapped pages.
Below is the reference to this patch:
https://lore.kernel.org/lkml/[email protected]/T/#mafcaac7296eb9a92c0ea58730dbd3ca47a8e0756

But do note that there is protection only for GHCB pages; we still need to add generic post_{map,unmap}_gfn() ops that can be used to verify
that it's safe to map a given guest page in the hypervisor. This is a TODO right now, and probably something which UPM can address more cleanly.
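
A rough sketch of the kind of guard being described (the function names loosely follow the referenced patch, but the signatures, the snp_gfn_is_private() helper and the exact check are assumptions for illustration):

/* Called after mapping a guest page (e.g. the GHCB) in the host. */
static int snp_post_map_gfn(struct kvm *kvm, gfn_t gfn)
{
	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;

	spin_lock(&sev->psc_lock);

	/* Refuse the map if the gfn is currently guest-owned (private). */
	if (snp_gfn_is_private(kvm, gfn)) {	/* hypothetical helper */
		spin_unlock(&sev->psc_lock);
		return -EBUSY;
	}

	/* Keep psc_lock held so a PSC cannot flip the page while it is mapped. */
	return 0;
}

static void snp_post_unmap_gfn(struct kvm *kvm, gfn_t gfn)
{
	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;

	spin_unlock(&sev->psc_lock);
}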

>I was wrong about the importance of this though seanjc@ walked me through how UPM will solve this issue so no worries about this until the series is rebased on to UPM.

Thanks,
Ashish

2022-07-08 15:47:44

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 42/49] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event

On Wed, Jun 29, 2022 at 1:15 PM Kalra, Ashish <[email protected]> wrote:
>
> [Public]
>
>
> >> +static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t
> >> +req_gpa, gpa_t resp_gpa) {
> >> + struct sev_data_snp_guest_request req = {0};
> >> + struct kvm_vcpu *vcpu = &svm->vcpu;
> >> + struct kvm *kvm = vcpu->kvm;
> >> + unsigned long data_npages;
> >> + struct kvm_sev_info *sev;
> >> + unsigned long rc, err;
> >> + u64 data_gpa;
> >> +
> >> + if (!sev_snp_guest(vcpu->kvm)) {
> >> + rc = SEV_RET_INVALID_GUEST;
> >> + goto e_fail;
> >> + }
> >> +
> >> + sev = &to_kvm_svm(kvm)->sev_info;
> >> +
> >> + data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
> >> + data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
> >> +
> >> + if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
> >> + rc = SEV_RET_INVALID_ADDRESS;
> >> + goto e_fail;
> >> + }
> >> +
> >> + /* Verify that requested blob will fit in certificate buffer */
> >> + if ((data_npages << PAGE_SHIFT) > SEV_FW_BLOB_MAX_SIZE) {
> >> + rc = SEV_RET_INVALID_PARAM;
> >> + goto e_fail;
> >> + }
> >> +
> >> + mutex_lock(&sev->guest_req_lock);
> >> +
> >> + rc = snp_setup_guest_buf(svm, &req, req_gpa, resp_gpa);
> >> + if (rc)
> >> + goto unlock;
> >> +
> >> + rc = snp_guest_ext_guest_request(&req, (unsigned long)sev->snp_certs_data,
> >> + &data_npages, &err);
> >> + if (rc) {
> >> + /*
> >> + * If buffer length is small then return the expected
> >> + * length in rbx.
> >> + */
> >> + if (err == SNP_GUEST_REQ_INVALID_LEN)
> >> + vcpu->arch.regs[VCPU_REGS_RBX] = data_npages;
> >> +
> >> + /* pass the firmware error code */
> >> + rc = err;
> >> + goto cleanup;
> >> + }
> >> +
> >> + /* Copy the certificate blob in the guest memory */
> >> + if (data_npages &&
> >> + kvm_write_guest(kvm, data_gpa, sev->snp_certs_data, data_npages << PAGE_SHIFT))
> >> + rc = SEV_RET_INVALID_ADDRESS;
>
> >>Since at this point the PSP FW has correctly executed the command and incremented the VMPCK sequence number I think we need another error signal here since this will tell the guest the PSP had an error so it will not know if the VMPCK sequence >number should be incremented.
>
> >Similarly as above, as this is an error path, so what's the guarantee that the next guest message request will succeed completely, isn’t it better to let the
> >FW reject any subsequent guest messages once it has detected that the sequence numbers are out of sync ?
>
> Alternately, we probably can return SEV_RET_INVALID_PAGE_STATE/SEV_RET_INVALID_PAGE_OWNER here, but that still does not indicate to the guest
> that the FW has successfully executed the command and the error occurred during cleanup/result phase and it needs to increment the VMPCK sequence number. There is nothing as such defined in SNP FW API specs to indicate such kind of failures to guest. As I mentioned earlier, this is probably indicative of
> a bigger system failure and it is better to let the FW reject subsequent guest messages/requests once it has detected that the sequence numbers are out of sync.

Hmm, I think the guest must be careful here because the guest cannot
trust the hypervisor to be truthful about the sequence numbers
incrementing. That's unfortunate, since it means that if these operations
do fail with a well-behaved hypervisor the guest cannot use that VMPCK
again. But there is no harm in the guest re-issuing the
SNP_GUEST_REQUEST (or extended version) with the exact same request,
just at a different address. The GHCB spec actually calls this out:
"It is recommended that the hypervisor validate the guest physical
address of the response page before invoking the SNP_GUEST_REQUEST API
so that the sequence numbers do not get out of sync for the guest,
possibly resulting in all successive requests failing".

Currently SVM_VMGEXIT_GUEST_REQUEST and SVM_VMGEXIT_EXT_GUEST_REQUEST
have different hypervisor -> guest usage for SW_EXITINFO2. I think
they both should be defined as what SVM_VMGEXIT_EXT_GUEST_REQUEST is
now: the high 32 bits are the hypervisor error code, the low 32 bits are
the FW error code. This would allow both NAEs to have some signal
to the guest, say SEV_RET_INVALID_REQ_ADDRESS. The hypervisor can use
this error code when doing the validation on the request and response
regions; if something is wrong with them, the guest can retry with the exact
same request (so no IV reuse) in a corrected region.

But another reason I think SVM_VMGEXIT_GUEST_REQUEST SW_EXITINFO2
hypervisor->guest state should include this change is because in this
patch we are currently overloading the lower 32bits with hypervisor
error codes. In snp_handle_guest_request() if sev_snp_guest(),
snp_setup_guest_buf(), or snp_cleanup_guest_buf() fails we use the low
32bits of SW_EXITINFO2 to return hypervisor errors to the guest.
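
A minimal sketch of the encoding being proposed (the macro names here are made up for illustration; only the high/low 32-bit split itself comes from the discussion above):

/* Upper 32 bits: hypervisor error; lower 32 bits: firmware status. */
#define SNP_GUEST_VMM_ERR(x)		((u64)(x) << 32)
#define SNP_GUEST_FW_ERR(x)		((u64)(x) & 0xffffffffULL)
#define SNP_GUEST_ERR(vmm_err, fw_err)	(SNP_GUEST_VMM_ERR(vmm_err) | \
					 SNP_GUEST_FW_ERR(fw_err))

	/* e.g. if the hypervisor rejects the request/response region: */
	svm_set_ghcb_sw_exit_info_2(vcpu, SNP_GUEST_ERR(SEV_RET_INVALID_ADDRESS, 0));

In the failure-after-firmware-success case discussed above, the hypervisor could then report a non-zero upper half together with a successful firmware status, telling the guest that the request was consumed and the sequence number should be incremented.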

>
> Thanks,
> Ashish

2022-07-08 15:59:09

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 35/49] KVM: SVM: Remove the long-lived GHCB host map

On Thu, Jul 7, 2022 at 2:31 PM Kalra, Ashish <[email protected]> wrote:
>
> [AMD Official Use Only - General]
>
> Hello Peter,
>
> >> >There is a perf cost to this suggestion but it might make accessing
> >> >the GHCB safer for KVM. Have you thought about just using
> >> >kvm_read_guest() or copy_from_user() to fully copy out the GCHB into a KVM owned buffer, then copying it back before the VMRUN. That way the KVM doesn't need to guard against page_state_changes on the GHCBs, that could be a perf ?>>improvement in a follow up.
> >>
> >> Along with the performance costs you mentioned, the main concern here
> >> will be the GHCB write-back path (copying it back) before VMRUN: this
> >> will again hit the issue we have currently with
> >> kvm_write_guest() / copy_to_user(), when we use it to sync the scratch
> >> buffer back to GHCB. This can fail if guest RAM is mapped using huge-page(s) and RMP is 4K. Please refer to the patch/fix mentioned below, kvm_write_guest() potentially can fail before VMRUN in case of SNP :
> >>
> >> commit 94ed878c2669532ebae8eb9b4503f19aa33cd7aa
> >> Author: Ashish Kalra <[email protected]>
> >> Date: Mon Jun 6 22:28:01 2022 +0000
> >>
> >> KVM: SVM: Sync the GHCB scratch buffer using already mapped ghcb
> >>
> >> Using kvm_write_guest() to sync the GHCB scratch buffer can fail
> >> due to host mapping being 2M, but RMP being 4K. The page fault handling
> >> in do_user_addr_fault() fails to split the 2M page to handle RMP fault due
> >> to it being called here in a non-preemptible context. Instead use
> >> the already kernel mapped ghcb to sync the scratch buffer when the
> >> scratch buffer is contained within the GHCB.
>
> >Ah I didn't see that issue thanks for the pointer.
>
> >The patch description says "When SEV-SNP is enabled the mapped GPA needs to be protected against a page state change." since if the guest were to convert the GHCB page to private when the host is using the GHCB the host could get an RMP violation right?
>
> Right.
>
> >That RMP violation would cause the host to crash unless we use some copy_to_user() type protections.
>
> As such copy_to_user() will only swallow the RMP violation and return failure, so the host can retry the write.
>
> > I don't see anything mechanism for this patch to add the page state change protection discussed. Can't another vCPU still convert the GHCB to private?
>
> We do have the protections for GHCB getting mapped to private specifically, there are new post_{map|unmap}_gfn functions added to verify if it is safe to map
> GHCB pages. There is a PSC spinlock added which protects again page state change for these mapped pages.
> Below is the reference to this patch:
> https://lore.kernel.org/lkml/[email protected]/T/#mafcaac7296eb9a92c0ea58730dbd3ca47a8e0756
>
> But do note that there is protection only for GHCB pages and there is a need to add generic post_{map,unmap}_gfn() ops that can be used to verify
> that it's safe to map a given guest page in the hypervisor. This is a TODO right now and probably this is something which UPM can address more cleanly.

Thank you Ashish. I had missed that.

Can you help me understand why it's OK to use kvm_write_guest() for
|snp_certs_data| inside of snp_handle_ext_guest_request() in patch
42/49? I would have thought we'd have the same 2M vs 4K mapping
issues.

>
> >I was wrong about the importance of this though seanjc@ walked me through how UPM will solve this issue so no worries about this until the series is rebased on to UPM.
>
> Thanks,
> Ashish

2022-07-08 16:01:33

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 35/49] KVM: SVM: Remove the long-lived GHCB host map

[AMD Official Use Only - General]

Hello Peter,

>> > I don't see any mechanism in this patch to add the page state change protection discussed. Can't another vCPU still convert the GHCB to private?
>>
>> We do have the protections for GHCB getting mapped to private
>> specifically, there are new post_{map|unmap}_gfn functions added to verify if it is safe to map GHCB pages. There is a PSC spinlock added which protects against page state changes for these mapped pages.
>> Below is the reference to this patch:
>> https://lore.kernel.org/lkml/[email protected]/T/#mafcaac7296eb9a92c0ea58730dbd3ca47a8e0756
>>
>> But do note that there is protection only for GHCB pages and there is
>> a need to add generic post_{map,unmap}_gfn() ops that can be used to verify that it's safe to map a given guest page in the hypervisor. This is a TODO right now and probably this is something which UPM can address more cleanly.

>Thank you Ashish. I had missed that.

>Can you help me understand why it's OK to use kvm_write_guest() for the
>|snp_certs_data| inside of snp_handle_ext_guest_request() in patch
>42/49? I would have thought we'd have the same 2M vs 4K mapping issues.

Preemption is not disabled there, hence the RMP page fault handler can do
the split of 2M to 4K on host pages without any issues.
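
As a rough illustration (a hedged sketch, not the actual patch; the flag and variable names are simplified placeholders), the GHCB scratch-buffer sync ends up making this distinction:

        /*
         * Sketch only: when the scratch area lies inside the GHCB, copy
         * through the existing kernel mapping of the GHCB. kvm_write_guest()
         * is avoided there because a 2M host mapping backed by 4K RMP
         * entries can fault, and the fault cannot be resolved by splitting
         * the page while preemption is disabled.
         */
        if (scratch_is_inside_ghcb)                       /* hypothetical flag */
                memcpy((u8 *)svm->sev_es.ghcb + scratch_off, buf, len);
        else
                /* preemptible path: the RMP #PF handler can split 2M to 4K */
                ret = kvm_write_guest(kvm, scratch_gpa, buf, len);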

Thanks,
Ashish

2022-07-11 14:07:49

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 28/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command

On Mon, Jun 20, 2022 at 5:08 PM Ashish Kalra <[email protected]> wrote:
>
> From: Brijesh Singh <[email protected]>
>
> The KVM_SEV_SNP_LAUNCH_FINISH finalize the cryptographic digest and stores
> it as the measurement of the guest at launch.
>
> While finalizing the launch flow, it also issues the LAUNCH_UPDATE command
> to encrypt the VMSA pages.
>
> If its an SNP guest, then VMSA was added in the RMP entry as
> a guest owned page and also removed from the kernel direct map
> so flush it later after it is transitioned back to hypervisor
> state and restored in the direct map.

Given the guest uses the SNP NAE AP boot protocol we were expecting
that there would be some option to add vCPUs to the VM but mark them
as being in a "pending AP boot creation protocol" state. This would allow
the LaunchDigest of a VM to stay the same even when its vCPU count
changes. Would it be possible to add a new argument to
KVM_SNP_LAUNCH_FINISH to tell it which vCPUs to LAUNCH_UPDATE VMSA
pages for, or similarly a new argument for KVM_CREATE_VCPU?

>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> ---
> .../virt/kvm/x86/amd-memory-encryption.rst | 22 ++++
> arch/x86/kvm/svm/sev.c | 119 ++++++++++++++++++
> include/uapi/linux/kvm.h | 14 +++
> 3 files changed, 155 insertions(+)
>
> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> index 62abd5c1f72b..750162cff87b 100644
> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> @@ -514,6 +514,28 @@ Returns: 0 on success, -negative on error
> See the SEV-SNP spec for further details on how to build the VMPL permission
> mask and page type.
>
> +21. KVM_SNP_LAUNCH_FINISH
> +-------------------------
> +
> +After completion of the SNP guest launch flow, the KVM_SNP_LAUNCH_FINISH command can be
> +issued to make the guest ready for the execution.
> +
> +Parameters (in): struct kvm_sev_snp_launch_finish
> +
> +Returns: 0 on success, -negative on error
> +
> +::
> +
> + struct kvm_sev_snp_launch_finish {
> + __u64 id_block_uaddr;
> + __u64 id_auth_uaddr;
> + __u8 id_block_en;
> + __u8 auth_key_en;
> + __u8 host_data[32];
> + };
> +
> +
> +See SEV-SNP specification for further details on launch finish input parameters.
>
> References
> ==========
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index a9461d352eda..a5b90469683f 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2095,6 +2095,106 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> return ret;
> }
>
> +static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_launch_update data = {};
> + int i, ret;
> +
> + data.gctx_paddr = __psp_pa(sev->snp_context);
> + data.page_type = SNP_PAGE_TYPE_VMSA;
> +
> + for (i = 0; i < kvm->created_vcpus; i++) {
> + struct vcpu_svm *svm = to_svm(xa_load(&kvm->vcpu_array, i));

Why are we iterating over |created_vcpus| rather than using kvm_for_each_vcpu?

> + u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
> +
> + /* Perform some pre-encryption checks against the VMSA */
> + ret = sev_es_sync_vmsa(svm);
> + if (ret)
> + return ret;

Do we need to take the 'vcpu->mutex' lock before modifying the
vcpu,like we do for SEV-ES in sev_launch_update_vmsa()?

> +
> + /* Transition the VMSA page to a firmware state. */
> + ret = rmp_make_private(pfn, -1, PG_LEVEL_4K, sev->asid, true);
> + if (ret)
> + return ret;
> +
> + /* Issue the SNP command to encrypt the VMSA */
> + data.address = __sme_pa(svm->sev_es.vmsa);
> + ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
> + &data, &argp->error);
> + if (ret) {
> + snp_page_reclaim(pfn);
> + return ret;
> + }
> +
> + svm->vcpu.arch.guest_state_protected = true;
> + }
> +
> + return 0;
> +}
> +
> +static int snp_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_launch_finish *data;
> + void *id_block = NULL, *id_auth = NULL;
> + struct kvm_sev_snp_launch_finish params;
> + int ret;
> +
> + if (!sev_snp_guest(kvm))
> + return -ENOTTY;
> +
> + if (!sev->snp_context)
> + return -EINVAL;
> +
> + if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
> + return -EFAULT;
> +
> + /* Measure all vCPUs using LAUNCH_UPDATE before we finalize the launch flow. */
> + ret = snp_launch_update_vmsa(kvm, argp);
> + if (ret)
> + return ret;
> +
> + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
> + if (!data)
> + return -ENOMEM;
> +
> + if (params.id_block_en) {
> + id_block = psp_copy_user_blob(params.id_block_uaddr, KVM_SEV_SNP_ID_BLOCK_SIZE);
> + if (IS_ERR(id_block)) {
> + ret = PTR_ERR(id_block);
> + goto e_free;
> + }
> +
> + data->id_block_en = 1;
> + data->id_block_paddr = __sme_pa(id_block);
> + }
> +
> + if (params.auth_key_en) {
> + id_auth = psp_copy_user_blob(params.id_auth_uaddr, KVM_SEV_SNP_ID_AUTH_SIZE);
> + if (IS_ERR(id_auth)) {
> + ret = PTR_ERR(id_auth);
> + goto e_free_id_block;
> + }
> +
> + data->auth_key_en = 1;
> + data->id_auth_paddr = __sme_pa(id_auth);
> + }
> +
> + data->gctx_paddr = __psp_pa(sev->snp_context);
> + ret = sev_issue_cmd(kvm, SEV_CMD_SNP_LAUNCH_FINISH, data, &argp->error);
> +
> + kfree(id_auth);
> +
> +e_free_id_block:
> + kfree(id_block);
> +
> +e_free:
> + kfree(data);
> +
> + return ret;
> +}
> +
> int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_sev_cmd sev_cmd;
> @@ -2191,6 +2291,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> case KVM_SEV_SNP_LAUNCH_UPDATE:
> r = snp_launch_update(kvm, &sev_cmd);
> break;
> + case KVM_SEV_SNP_LAUNCH_FINISH:
> + r = snp_launch_finish(kvm, &sev_cmd);
> + break;
> default:
> r = -EINVAL;
> goto out;
> @@ -2696,11 +2799,27 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
>
> svm = to_svm(vcpu);
>
> + /*
> + * If its an SNP guest, then VMSA was added in the RMP entry as
> + * a guest owned page. Transition the page to hypervisor state
> + * before releasing it back to the system.
> + * Also the page is removed from the kernel direct map, so flush it
> + * later after it is transitioned back to hypervisor state and
> + * restored in the direct map.
> + */
> + if (sev_snp_guest(vcpu->kvm)) {
> + u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
> +
> + if (host_rmp_make_shared(pfn, PG_LEVEL_4K, false))
> + goto skip_vmsa_free;

Why not call host_rmp_make_shared with leak==true? This old VMSA page
is now unusable IIUC.



> + }
> +
> if (vcpu->arch.guest_state_protected)
> sev_flush_encrypted_page(vcpu, svm->sev_es.vmsa);
>
> __free_page(virt_to_page(svm->sev_es.vmsa));
>
> +skip_vmsa_free:
> if (svm->sev_es.ghcb_sa_free)
> kvfree(svm->sev_es.ghcb_sa);
> }
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 9b36b07414ea..5a4662716b6a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1814,6 +1814,7 @@ enum sev_cmd_id {
> KVM_SEV_SNP_INIT,
> KVM_SEV_SNP_LAUNCH_START,
> KVM_SEV_SNP_LAUNCH_UPDATE,
> + KVM_SEV_SNP_LAUNCH_FINISH,
>
> KVM_SEV_NR_MAX,
> };
> @@ -1948,6 +1949,19 @@ struct kvm_sev_snp_launch_update {
> __u8 vmpl1_perms;
> };
>
> +#define KVM_SEV_SNP_ID_BLOCK_SIZE 96
> +#define KVM_SEV_SNP_ID_AUTH_SIZE 4096
> +#define KVM_SEV_SNP_FINISH_DATA_SIZE 32
> +
> +struct kvm_sev_snp_launch_finish {
> + __u64 id_block_uaddr;
> + __u64 id_auth_uaddr;
> + __u8 id_block_en;
> + __u8 auth_key_en;
> + __u8 host_data[KVM_SEV_SNP_FINISH_DATA_SIZE];
> + __u8 pad[6];
> +};
> +
> #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
> #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
> #define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
> --
> 2.25.1
>

2022-07-11 22:42:21

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 28/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command

[AMD Official Use Only - General]

Hello Peter,

>> The KVM_SEV_SNP_LAUNCH_FINISH finalize the cryptographic digest and
>> stores it as the measurement of the guest at launch.
>>
>> While finalizing the launch flow, it also issues the LAUNCH_UPDATE
>> command to encrypt the VMSA pages.

>Given the guest uses the SNP NAE AP boot protocol we were expecting that there would be some option to add vCPUs to the VM but mark them as "pending AP boot creation protocol" state. This would allow the LaunchDigest of a VM doesn't change >just because its vCPU count changes. Would it be possible to add a new add an argument to KVM_SNP_LAUNCH_FINISH to tell it which vCPUs to LAUNCH_UPDATE VMSA pages for or similarly a new argument for KVM_CREATE_VCPU?

But don't we want/need to measure all vCPUs using LAUNCH_UPDATE_VMSA before we issue SNP_LAUNCH_FINISH command ?

If we are going to add vCPUs and mark them as "pending AP boot creation" state then how are we going to do LAUNCH_UPDATE_VMSAs for them after SNP_LAUNCH_FINISH ?

int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd
>> +*argp) {
>> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> + struct sev_data_snp_launch_update data = {};
>> + int i, ret;
>> +
>> + data.gctx_paddr = __psp_pa(sev->snp_context);
>> + data.page_type = SNP_PAGE_TYPE_VMSA;
>> +
>> + for (i = 0; i < kvm->created_vcpus; i++) {
>> + struct vcpu_svm *svm =
>> + to_svm(xa_load(&kvm->vcpu_array, i));

> Why are we iterating over |created_vcpus| rather than using kvm_for_each_vcpu?

Yes, we should be using kvm_for_each_vcpu(); that will also help avoid touching implementation
specific details and hide complexities such as xa_load(), locking requirements, etc.

Additionally, kvm_for_each_vcpu() works on online_vcpus, but I think that is what we should
be considering at LAUNCH_UPDATE_VMSA time, vis-a-vis created_vcpus.

>> + u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
>> +
>> + /* Perform some pre-encryption checks against the VMSA */
>> + ret = sev_es_sync_vmsa(svm);
>> + if (ret)
>> + return ret;

>Do we need to take the 'vcpu->mutex' lock before modifying the vcpu,like we do for SEV-ES in sev_launch_update_vmsa()?

This is using the per-vCPU vcpu_svm structure, but we may need to guard against concurrent KVM vCPU ioctl requests, so yes, it is
safer to take the 'vcpu->mutex' lock here.
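
A minimal sketch of how the loop might look with both suggestions applied (hedged, not the final patch; snp_launch_update_vmsa_one() is a hypothetical helper holding the per-vCPU body):

        unsigned long i;
        struct kvm_vcpu *vcpu;
        int ret;

        kvm_for_each_vcpu(i, vcpu, kvm) {
                /* Guard against concurrent vCPU ioctls, as SEV-ES does in
                 * sev_launch_update_vmsa(). */
                ret = mutex_lock_killable(&vcpu->mutex);
                if (ret)
                        return ret;

                ret = snp_launch_update_vmsa_one(to_svm(vcpu));

                mutex_unlock(&vcpu->mutex);
                if (ret)
                        return ret;
        }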

>> + /*
>> + * If its an SNP guest, then VMSA was added in the RMP entry as
>> + * a guest owned page. Transition the page to hypervisor state
>> + * before releasing it back to the system.
>> + * Also the page is removed from the kernel direct map, so flush it
>> + * later after it is transitioned back to hypervisor state and
>> + * restored in the direct map.
>> + */
>> + if (sev_snp_guest(vcpu->kvm)) {
>> + u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
>> +
>> + if (host_rmp_make_shared(pfn, PG_LEVEL_4K, false))
>> + goto skip_vmsa_free;

>Why not call host_rmp_make_shared with leak==true? This old VMSA page is now unusable IIUC.

Yes, the old VMSA page is now unusable and lost, so it makes sense to call host_rmp_make_shared() with leak==true.
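
So, as a hedged sketch of the agreed change in sev_free_vcpu() (not the final patch), the call would simply pass leak=true so a page that cannot be transitioned back is reported and leaked rather than silently skipped:

        if (host_rmp_make_shared(pfn, PG_LEVEL_4K, true))
                goto skip_vmsa_free;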

Thanks,
Ashish

2022-07-12 12:00:02

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Mon, Jun 20, 2022 at 11:03:43PM +0000, Ashish Kalra wrote:
> +/*
> + * Return 1 if the caller need to retry, 0 if it the address need to be split
> + * in order to resolve the fault.
> + */
> +static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
> + unsigned long address)
> +{
> + int rmp_level, level;
> + pte_t *pte;
> + u64 pfn;
> +
> + pte = lookup_address_in_mm(current->mm, address, &level);

As discussed in [1], the lookup should be done in kvm->mm, along the
lines of host_pfn_mapping_level().

[1] https://lore.kernel.org/kvm/YmwIi3bXr%2F1yhYV%[email protected]/
BR, Jarkko

2022-07-12 12:44:20

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 41/49] KVM: SVM: Add support to handle the RMP nested page fault

On Mon, Jun 20, 2022 at 11:13:03PM +0000, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> When SEV-SNP is enabled in the guest, the hardware places restrictions on
> all memory accesses based on the contents of the RMP table. When hardware
> encounters RMP check failure caused by the guest memory access it raises
> the #NPF. The error code contains additional information on the access
> type. See the APM volume 2 for additional information.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 76 ++++++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/svm/svm.c | 14 +++++---
> 2 files changed, 86 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 4ed90331bca0..7fc0fad87054 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -4009,3 +4009,79 @@ void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
>
> spin_unlock(&sev->psc_lock);
> }
> +
> +void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
> +{
> + int rmp_level, npt_level, rc, assigned;
> + struct kvm *kvm = vcpu->kvm;
> + gfn_t gfn = gpa_to_gfn(gpa);
> + bool need_psc = false;
> + enum psc_op psc_op;
> + kvm_pfn_t pfn;
> + bool private;
> +
> + write_lock(&kvm->mmu_lock);
> +
> + if (unlikely(!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level)))

This function does not exist. Should it be kvm_mmu_get_tdp_page?

BR, Jarkko

2022-07-12 12:47:17

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 41/49] KVM: SVM: Add support to handle the RMP nested page fault

On Tue, Jul 12, 2022 at 03:34:00PM +0300, Jarkko Sakkinen wrote:
> On Mon, Jun 20, 2022 at 11:13:03PM +0000, Ashish Kalra wrote:
> > From: Brijesh Singh <[email protected]>
> >
> > When SEV-SNP is enabled in the guest, the hardware places restrictions on
> > all memory accesses based on the contents of the RMP table. When hardware
> > encounters RMP check failure caused by the guest memory access it raises
> > the #NPF. The error code contains additional information on the access
> > type. See the APM volume 2 for additional information.
> >
> > Signed-off-by: Brijesh Singh <[email protected]>
> > ---
> > arch/x86/kvm/svm/sev.c | 76 ++++++++++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/svm/svm.c | 14 +++++---
> > 2 files changed, 86 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > index 4ed90331bca0..7fc0fad87054 100644
> > --- a/arch/x86/kvm/svm/sev.c
> > +++ b/arch/x86/kvm/svm/sev.c
> > @@ -4009,3 +4009,79 @@ void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> >
> > spin_unlock(&sev->psc_lock);
> > }
> > +
> > +void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
> > +{
> > + int rmp_level, npt_level, rc, assigned;
> > + struct kvm *kvm = vcpu->kvm;
> > + gfn_t gfn = gpa_to_gfn(gpa);
> > + bool need_psc = false;
> > + enum psc_op psc_op;
> > + kvm_pfn_t pfn;
> > + bool private;
> > +
> > + write_lock(&kvm->mmu_lock);
> > +
> > + if (unlikely(!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level)))
>
> This function does not exist. Should it be kvm_mmu_get_tdp_page?

Ugh, ignore that.

This the actual issue:

$ git grep kvm_mmu_get_tdp_walk
arch/x86/kvm/mmu/mmu.c:bool kvm_mmu_get_tdp_walk(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t *pfn, int *level)
arch/x86/kvm/mmu/mmu.c:EXPORT_SYMBOL_GPL(kvm_mmu_get_tdp_walk);
arch/x86/kvm/svm/sev.c: rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);

It's not declared in any header.

BR, Jarkko

2022-07-12 12:51:22

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 41/49] KVM: SVM: Add support to handle the RMP nested page fault

On Tue, Jul 12, 2022 at 03:45:13PM +0300, Jarkko Sakkinen wrote:
> On Tue, Jul 12, 2022 at 03:34:00PM +0300, Jarkko Sakkinen wrote:
> > On Mon, Jun 20, 2022 at 11:13:03PM +0000, Ashish Kalra wrote:
> > > From: Brijesh Singh <[email protected]>
> > >
> > > When SEV-SNP is enabled in the guest, the hardware places restrictions on
> > > all memory accesses based on the contents of the RMP table. When hardware
> > > encounters RMP check failure caused by the guest memory access it raises
> > > the #NPF. The error code contains additional information on the access
> > > type. See the APM volume 2 for additional information.
> > >
> > > Signed-off-by: Brijesh Singh <[email protected]>
> > > ---
> > > arch/x86/kvm/svm/sev.c | 76 ++++++++++++++++++++++++++++++++++++++++++
> > > arch/x86/kvm/svm/svm.c | 14 +++++---
> > > 2 files changed, 86 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > > index 4ed90331bca0..7fc0fad87054 100644
> > > --- a/arch/x86/kvm/svm/sev.c
> > > +++ b/arch/x86/kvm/svm/sev.c
> > > @@ -4009,3 +4009,79 @@ void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> > >
> > > spin_unlock(&sev->psc_lock);
> > > }
> > > +
> > > +void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
> > > +{
> > > + int rmp_level, npt_level, rc, assigned;
> > > + struct kvm *kvm = vcpu->kvm;
> > > + gfn_t gfn = gpa_to_gfn(gpa);
> > > + bool need_psc = false;
> > > + enum psc_op psc_op;
> > > + kvm_pfn_t pfn;
> > > + bool private;
> > > +
> > > + write_lock(&kvm->mmu_lock);
> > > +
> > > + if (unlikely(!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level)))
> >
> > This function does not exist. Should it be kvm_mmu_get_tdp_page?
>
> Ugh, ignore that.
>
> This the actual issue:
>
> $ git grep kvm_mmu_get_tdp_walk
> arch/x86/kvm/mmu/mmu.c:bool kvm_mmu_get_tdp_walk(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t *pfn, int *level)
> arch/x86/kvm/mmu/mmu.c:EXPORT_SYMBOL_GPL(kvm_mmu_get_tdp_walk);
> arch/x86/kvm/svm/sev.c: rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
>
> It's not declared in any header.

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 0e1f4d92b89b..33267f619e61 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -164,6 +164,8 @@ static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
u32 error_code, int max_level);

+bool kvm_mmu_get_tdp_walk(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t *pfn, int *level);
+
/*
* Check if a given access (described through the I/D, W/R and U/S bits of a
* page fault error code pfec) causes a permission fault with the given PTE


BTW, kvm_mmu_map_tdp_page() ought to be in single line since it's less than
100 characters.

BR, Jarkko

2022-07-12 14:35:29

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

[AMD Official Use Only - General]

>> +static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
>> + unsigned long address)
>> +{
>> + int rmp_level, level;
>> + pte_t *pte;
>> + u64 pfn;
>> +
>> + pte = lookup_address_in_mm(current->mm, address, &level);

>As discussed in [1], the lookup should be done in kvm->mm, along the lines of host_pfn_mapping_level().

With lookup_address_in_mm() now removed in 5.19, this is now using lookup_address_in_pgd(), though still on a non init-mm. As mentioned
in [1], it makes sense not to use lookup_address_in_pgd() as it does not play nicely with userspace mappings, e.g. it doesn't disable IRQs to block TLB shootdowns and doesn't use READ_ONCE()
to ensure an upper-level entry isn't converted to a huge page between checking the PAGE_SIZE bit and grabbing the address of the next level down.

But is KVM going to provide its own variant of lookup_address_in_pgd() that is safe for use with user addresses, i.e., a generic version of lookup_address() on kvm->mm, or do we need to
duplicate the page table walking code of host_pfn_mapping_level()?

Thanks,
Ashish

>[1] https://lore.kernel.org/kvm/YmwIi3bXr%2F1yhYV%[email protected]/

2022-07-12 14:46:04

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 28/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command

On Mon, Jul 11, 2022 at 4:41 PM Kalra, Ashish <[email protected]> wrote:
>
> [AMD Official Use Only - General]
>
> Hello Peter,
>
> >> The KVM_SEV_SNP_LAUNCH_FINISH finalize the cryptographic digest and
> >> stores it as the measurement of the guest at launch.
> >>
> >> While finalizing the launch flow, it also issues the LAUNCH_UPDATE
> >> command to encrypt the VMSA pages.
>
> >Given the guest uses the SNP NAE AP boot protocol we were expecting that there would be some option to add vCPUs to the VM but mark them as "pending AP boot creation protocol" state. This would allow the LaunchDigest of a VM doesn't change >just because its vCPU count changes. Would it be possible to add a new add an argument to KVM_SNP_LAUNCH_FINISH to tell it which vCPUs to LAUNCH_UPDATE VMSA pages for or similarly a new argument for KVM_CREATE_VCPU?
>
> But don't we want/need to measure all vCPUs using LAUNCH_UPDATE_VMSA before we issue SNP_LAUNCH_FINISH command ?
>
> If we are going to add vCPUs and mark them as "pending AP boot creation" state then how are we going to do LAUNCH_UPDATE_VMSAs for them after SNP_LAUNCH_FINISH ?

If I understand correctly we don't need or even want the APs to be
LAUNCH_UPDATE_VMSA'd. LAUNCH_UPDATEing all the VMSAs causes VMs with
different numbers of vCPUs to have different launch digests. It's my
understanding the SNP AP Creation protocol was meant to solve this so that
VMs with different vCPU counts have the same launch digest.

Looking at patch "[Part2,v6,44/49] KVM: SVM: Support SEV-SNP AP
Creation NAE event" and section "4.1.9 SNP AP Creation" of the GHCB
spec. There is no need to mark the LAUNCH_UPDATE the AP's VMSA or mark
the vCPUs runnable. Instead we can do that only for the BSP. Then in
the guest UEFI the BSP can: create new VMSAs from guest pages,
RMPADJUST them into the RMP state VMSA, then use the SNP AP Creation
NAE to get the hypervisor to mark them runnable. I believe this is all
setup in the UEFI patch:
https://www.mail-archive.com/[email protected]/msg38460.html.
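
For reference, a hedged sketch of the guest-side sequence described above (simplified, loosely following the names used by the GHCB spec and the Linux guest SNP support; build_ap_vmsa() is a hypothetical helper and GHCB protocol/usage setup is omitted):

        /* 1. The BSP prepares a guest page holding the AP's initial VMSA. */
        vmsa = build_ap_vmsa(apic_id);                 /* hypothetical helper */

        /* 2. RMPADJUST the page into the VMSA state. */
        ret = rmpadjust((unsigned long)vmsa, RMP_PG_SIZE_4K, RMPADJUST_VMSA_PAGE_BIT);
        if (ret)
                return ret;

        /* 3. Ask the hypervisor to mark the vCPU runnable via the
         *    SNP AP Creation NAE event.
         */
        ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_AP_CREATION);
        ghcb_set_sw_exit_info_1(ghcb, ((u64)apic_id << 32) | SVM_VMGEXIT_AP_CREATE);
        ghcb_set_sw_exit_info_2(ghcb, __pa(vmsa));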

2022-07-12 15:01:04

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Tue, Jul 12, 2022 at 02:29:18PM +0000, Kalra, Ashish wrote:
> [AMD Official Use Only - General]
>
> >> +static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
> >> + unsigned long address)
> >> +{
> >> + int rmp_level, level;
> >> + pte_t *pte;
> >> + u64 pfn;
> >> +
> >> + pte = lookup_address_in_mm(current->mm, address, &level);
>
> >As discussed in [1], the lookup should be done in kvm->mm, along the lines of host_pfn_mapping_level().
>
> With lookup_address_in_mm() now removed in 5.19, this is now using
> lookup_address_in_pgd() though still using non init-mm, and as mentioned
> here in [1], it makes sense to not use lookup_address_in_pgd() as it does
> not play nice with userspace mappings, e.g. doesn't disable IRQs to block
> TLB shootdowns and doesn't use READ_ONCE() to ensure an upper level entry
> isn't converted to a huge page between checking the PAGE_SIZE bit and
> grabbing the address of the next level down.
>
> But is KVM going to provide its own variant of lookup_address_in_pgd()
> that is safe for use with user addresses, i.e., a generic version of
> lookup_address() on kvm->mm or we need to duplicate page table walking
> code of host_pfn_mapping_level() ?

It's probably open coded for the sole reason that there is only one
call site, i.e. there has not been a rational reason to have a helper
function.

Helpers are usually created only on an as-needed basis, and since the need
comes from this patch set, it should include a patch which simply
encapsulates it into a helper.
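
A hedged sketch of what such a helper could look like (hypothetical name; it follows the host_pfn_mapping_level() pattern of walking with IRQs disabled and READ_ONCE() at each level, which is exactly what lookup_address_in_pgd() does not do):

        static int rmp_user_mapping_level(struct mm_struct *mm, unsigned long hva)
        {
                int level = PG_LEVEL_4K;
                unsigned long flags;
                pgd_t pgd;
                p4d_t p4d;
                pud_t pud;
                pmd_t pmd;

                /* Block TLB shootdowns while walking the user page tables. */
                local_irq_save(flags);

                pgd = READ_ONCE(*pgd_offset(mm, hva));
                if (pgd_none(pgd))
                        goto out;

                p4d = READ_ONCE(*p4d_offset(&pgd, hva));
                if (!p4d_present(p4d))
                        goto out;

                pud = READ_ONCE(*pud_offset(&p4d, hva));
                if (!pud_present(pud))
                        goto out;
                if (pud_large(pud)) {
                        level = PG_LEVEL_1G;
                        goto out;
                }

                pmd = READ_ONCE(*pmd_offset(&pud, hva));
                if (pmd_present(pmd) && pmd_large(pmd))
                        level = PG_LEVEL_2M;
        out:
                local_irq_restore(flags);
                return level;
        }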

>
> Thanks,
> Ashish

BR, Jarkko

2022-07-12 15:26:40

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 28/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command

[AMD Official Use Only - General]

Hello Peter,

>> >Given the guest uses the SNP NAE AP boot protocol we were expecting that there would be some option to add vCPUs to the VM but mark them as "pending AP boot creation protocol" state. This would allow the LaunchDigest of a VM doesn't change >just because its vCPU count changes. Would it be possible to add a new add an argument to KVM_SNP_LAUNCH_FINISH to tell it which vCPUs to LAUNCH_UPDATE VMSA pages for or similarly a new argument for KVM_CREATE_VCPU?
>>
>> But don't we want/need to measure all vCPUs using LAUNCH_UPDATE_VMSA before we issue SNP_LAUNCH_FINISH command ?
>>
>> If we are going to add vCPUs and mark them as "pending AP boot creation" state then how are we going to do LAUNCH_UPDATE_VMSAs for them after SNP_LAUNCH_FINISH ?

>If I understand correctly we don't need or even want the APs to be LAUNCH_UPDATE_VMSA'd. LAUNCH_UPDATEing all the VMSAs causes VMs with different numbers of vCPUs to have different launch digests. Its my understanding the SNP AP Creation protocol was to solve this so that VMs with different vcpu counts have the same launch digest.

>Looking at patch "[Part2,v6,44/49] KVM: SVM: Support SEV-SNP AP Creation NAE event" and section "4.1.9 SNP AP Creation" of the GHCB spec. There is no need to mark the LAUNCH_UPDATE the AP's VMSA or mark the vCPUs runnable. Instead we can do that only for the BSP. Then in the guest UEFI the BSP can: create new VMSAs from guest pages, RMPADJUST them into the RMP state VMSA, then use the SNP AP Creation NAE to get the hypervisor to mark them runnable. I believe this is all setup in the UEFI patch:
>https://www.mail-archive.com/[email protected]/msg38460.html.

Yes, I discussed the same with Tom, and this will be supported going forward: only the BSP will need to go through LAUNCH_UPDATE_VMSA, and at runtime the guest can dynamically create more APs using the SNP AP Creation NAE event.

Now, coming back to the original question: why do we need a separate vCPU count argument for SNP_LAUNCH_FINISH? Won't the statically created vCPUs in kvm->created_vcpus/online_vcpus be sufficient for that? Any dynamically created
vCPUs won't be part of the initial measurement or LaunchDigest of the VM, right?

Thanks,
Ashish

2022-07-12 15:33:24

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 41/49] KVM: SVM: Add support to handle the RMP nested page fault

[AMD Official Use Only - General]

Yes, this is fixed in the 5.19 rebase.

Thanks,
Ashish

-----Original Message-----
From: Jarkko Sakkinen <[email protected]>
Sent: Tuesday, July 12, 2022 7:49 AM
To: Kalra, Ashish <[email protected]>
Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Lendacky, Thomas <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Roth, Michael <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Subject: Re: [PATCH Part2 v6 41/49] KVM: SVM: Add support to handle the RMP nested page fault

On Tue, Jul 12, 2022 at 03:45:13PM +0300, Jarkko Sakkinen wrote:
> On Tue, Jul 12, 2022 at 03:34:00PM +0300, Jarkko Sakkinen wrote:
> > On Mon, Jun 20, 2022 at 11:13:03PM +0000, Ashish Kalra wrote:
> > > From: Brijesh Singh <[email protected]>
> > >
> > > When SEV-SNP is enabled in the guest, the hardware places
> > > restrictions on all memory accesses based on the contents of the
> > > RMP table. When hardware encounters RMP check failure caused by
> > > the guest memory access it raises the #NPF. The error code
> > > contains additional information on the access type. See the APM volume 2 for additional information.
> > >
> > > Signed-off-by: Brijesh Singh <[email protected]>
> > > ---
> > > arch/x86/kvm/svm/sev.c | 76
> > > ++++++++++++++++++++++++++++++++++++++++++
> > > arch/x86/kvm/svm/svm.c | 14 +++++---
> > > 2 files changed, 86 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index
> > > 4ed90331bca0..7fc0fad87054 100644
> > > --- a/arch/x86/kvm/svm/sev.c
> > > +++ b/arch/x86/kvm/svm/sev.c
> > > @@ -4009,3 +4009,79 @@ void sev_post_unmap_gfn(struct kvm *kvm,
> > > gfn_t gfn, kvm_pfn_t pfn)
> > >
> > > spin_unlock(&sev->psc_lock);
> > > }
> > > +
> > > +void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64
> > > +error_code) {
> > > + int rmp_level, npt_level, rc, assigned;
> > > + struct kvm *kvm = vcpu->kvm;
> > > + gfn_t gfn = gpa_to_gfn(gpa);
> > > + bool need_psc = false;
> > > + enum psc_op psc_op;
> > > + kvm_pfn_t pfn;
> > > + bool private;
> > > +
> > > + write_lock(&kvm->mmu_lock);
> > > +
> > > + if (unlikely(!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn,
> > > +&npt_level)))
> >
> > This function does not exist. Should it be kvm_mmu_get_tdp_page?
>
> Ugh, ignore that.
>
> This the actual issue:
>
> $ git grep kvm_mmu_get_tdp_walk
> arch/x86/kvm/mmu/mmu.c:bool kvm_mmu_get_tdp_walk(struct kvm_vcpu
> *vcpu, gpa_t gpa, kvm_pfn_t *pfn, int *level) arch/x86/kvm/mmu/mmu.c:EXPORT_SYMBOL_GPL(kvm_mmu_get_tdp_walk);
> arch/x86/kvm/svm/sev.c: rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
>
> It's not declared in any header.

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 0e1f4d92b89b..33267f619e61 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -164,6 +164,8 @@ static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
 kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
 			       u32 error_code, int max_level);

+bool kvm_mmu_get_tdp_walk(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t *pfn, int *level);
+
/*
* Check if a given access (described through the I/D, W/R and U/S bits of a
* page fault error code pfec) causes a permission fault with the given PTE


BTW, kvm_mmu_map_tdp_page() ought to be in single line since it's less than
100 characters.

BR, Jarkko

2022-07-12 16:05:35

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 28/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command

On Tue, Jul 12, 2022 at 9:22 AM Kalra, Ashish <[email protected]> wrote:
>
> [AMD Official Use Only - General]
>
> Hello Peter,
>
> >> >Given the guest uses the SNP NAE AP boot protocol we were expecting that there would be some option to add vCPUs to the VM but mark them as "pending AP boot creation protocol" state. This would allow the LaunchDigest of a VM doesn't change >just because its vCPU count changes. Would it be possible to add a new add an argument to KVM_SNP_LAUNCH_FINISH to tell it which vCPUs to LAUNCH_UPDATE VMSA pages for or similarly a new argument for KVM_CREATE_VCPU?
> >>
> >> But don't we want/need to measure all vCPUs using LAUNCH_UPDATE_VMSA before we issue SNP_LAUNCH_FINISH command ?
> >>
> >> If we are going to add vCPUs and mark them as "pending AP boot creation" state then how are we going to do LAUNCH_UPDATE_VMSAs for them after SNP_LAUNCH_FINISH ?
>
> >If I understand correctly we don't need or even want the APs to be LAUNCH_UPDATE_VMSA'd. LAUNCH_UPDATEing all the VMSAs causes VMs with different numbers of vCPUs to have different launch digests. Its my understanding the SNP AP >Creation protocol was to solve this so that VMs with different vcpu counts have the same launch digest.
>
> >Looking at patch "[Part2,v6,44/49] KVM: SVM: Support SEV-SNP AP Creation NAE event" and section "4.1.9 SNP AP Creation" of the GHCB spec. There is no need to mark the LAUNCH_UPDATE the AP's VMSA or mark the vCPUs runnable. Instead we >can do that only for the BSP. Then in the guest UEFI the BSP can: create new VMSAs from guest pages, RMPADJUST them into the RMP state VMSA, then use the SNP AP Creation NAE to get the hypervisor to mark them runnable. I believe this is all >setup in the UEFI patch:
> >https://www.mail-archive.com/[email protected]/msg38460.html.
>
> Yes, I discussed the same with Tom, and this will be supported going forward, only the BSP will need to go through the LAUNCH_UPDATE_VMSA and at runtime the guest can dynamically create more APs using the SNP AP Creation NAE event.
>
> Now, coming back to the original question, why do we need a separate vCPU count argument for SNP_LAUNCH_FINISH, won't the statically created vCPUs in kvm->created_vcpus/online_vcpus be sufficient for that, any dynamically created
> vCPU's won't be part of the initial measurement or LaunchDigest of the VM, right ?

Are you suggesting that QEMU will KVM_CREATE_VCPU the BSP, then
LAUNCH_FINISH, then KVM_CREATE_VCPU all the APs so that their VMSAs are
not LAUNCH_UPDATEd? If so, it seems annoying to have to create vCPUs
at different times to get their VMSAs into different states. That's
why I was suggesting some other mechanism so we can continue to
KVM_CREATE_VCPU all the vCPUs at the same time.

>
> Thanks,
> Ashish

2022-07-12 16:46:57

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 29/49] KVM: X86: Keep the NPT and RMP page level in sync

s/X86/x86/

On Mon, Jun 20, 2022 at 11:08:57PM +0000, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> When running an SEV-SNP VM, the sPA used to index the RMP entry is
> obtained through the NPT translation (gva->gpa->spa). The NPT page
> level is checked against the page level programmed in the RMP entry.
> If the page level does not match, then it will cause a nested page
> fault with the RMP bit set to indicate the RMP violation.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 5 ++++
> arch/x86/kvm/svm/sev.c | 46 ++++++++++++++++++++++++++++++
> arch/x86/kvm/svm/svm.c | 1 +
> arch/x86/kvm/svm/svm.h | 1 +
> 6 files changed, 55 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index a66292dae698..e0068e702692 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -129,6 +129,7 @@ KVM_X86_OP(complete_emulated_msr)
> KVM_X86_OP(vcpu_deliver_sipi_vector)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> KVM_X86_OP(alloc_apic_backing_page)
> +KVM_X86_OP_OPTIONAL(rmp_page_level_adjust)
>
> #undef KVM_X86_OP
> #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0205e2944067..2748c69609e3 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1514,6 +1514,7 @@ struct kvm_x86_ops {
> unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
>
> void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
> + void (*rmp_page_level_adjust)(struct kvm *kvm, kvm_pfn_t pfn, int *level);
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c623019929a7..997318ecebd1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -43,6 +43,7 @@
> #include <linux/hash.h>
> #include <linux/kern_levels.h>
> #include <linux/kthread.h>
> +#include <linux/sev.h>
>
> #include <asm/page.h>
> #include <asm/memtype.h>
> @@ -2824,6 +2825,10 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> if (unlikely(!pte))
> return PG_LEVEL_4K;
>
> + /* Adjust the page level based on the SEV-SNP RMP page level. */
> + if (kvm_x86_ops.rmp_page_level_adjust)
> + static_call(kvm_x86_rmp_page_level_adjust)(kvm, pfn, &level);
> +
> return level;
> }
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index a5b90469683f..91d3d24e60d2 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3597,3 +3597,49 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
>
> return pfn_to_page(pfn);
> }
> +
> +static bool is_pfn_range_shared(kvm_pfn_t start, kvm_pfn_t end)
> +{
> + int level;
> +
> + while (end > start) {
> + if (snp_lookup_rmpentry(start, &level) != 0)
> + return false;
> + start++;
> + }
> +
> + return true;
> +}
> +
> +void sev_rmp_page_level_adjust(struct kvm *kvm, kvm_pfn_t pfn, int *level)

It would not do harm to document this, given that it is not a static
function.
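
A hedged sketch of the kind of comment being asked for (wording is illustrative only):

        /**
         * sev_rmp_page_level_adjust() - constrain the NPT level by the RMP level
         * @kvm:   the guest
         * @pfn:   host pfn that is about to be mapped into the NPT
         * @level: in/out NPT page level, lowered if the RMP entry is smaller
         *
         * Keep the NPT mapping size consistent with the RMP entry so that the
         * hardware RMP page-size check does not raise a nested page fault.
         */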

> +{
> + int rmp_level, assigned;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> + return;
> +
> + assigned = snp_lookup_rmpentry(pfn, &rmp_level);
> + if (unlikely(assigned < 0))
> + return;
> +
> + if (!assigned) {
> + /*
> + * If all the pages are shared then no need to keep the RMP
> + * and NPT in sync.
> + */
> + pfn = pfn & ~(PTRS_PER_PMD - 1);
> + if (is_pfn_range_shared(pfn, pfn + PTRS_PER_PMD))
> + return;
> + }
> +
> + /*
> + * The hardware installs 2MB TLB entries to access to 1GB pages,
> + * therefore allow NPT to use 1GB pages when pfn was added as 2MB
> + * in the RMP table.
> + */
> + if (rmp_level == PG_LEVEL_2M && (*level == PG_LEVEL_1G))
> + return;
> +
> + /* Adjust the level to keep the NPT and RMP in sync */
> + *level = min_t(size_t, *level, rmp_level);
> +}
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index b4bd64f94d3a..18e2cd4d9559 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4734,6 +4734,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> .vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
>
> .alloc_apic_backing_page = svm_alloc_apic_backing_page,
> + .rmp_page_level_adjust = sev_rmp_page_level_adjust,
> };
>
> /*
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 71c011af098e..7782312a1cda 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -673,6 +673,7 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
> void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa);
> void sev_es_unmap_ghcb(struct vcpu_svm *svm);
> struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
> +void sev_rmp_page_level_adjust(struct kvm *kvm, kvm_pfn_t pfn, int *level);
>
> /* vmenter.S */
>
> --
> 2.25.1
>


BR, Jarkko

2022-07-12 17:46:33

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 28/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command

On 7/12/22 09:45, Peter Gonda wrote:
> On Mon, Jul 11, 2022 at 4:41 PM Kalra, Ashish <[email protected]> wrote:
>>
>> [AMD Official Use Only - General]
>>
>> Hello Peter,
>>
>>>> The KVM_SEV_SNP_LAUNCH_FINISH finalize the cryptographic digest and
>>>> stores it as the measurement of the guest at launch.
>>>>
>>>> While finalizing the launch flow, it also issues the LAUNCH_UPDATE
>>>> command to encrypt the VMSA pages.
>>
>>> Given the guest uses the SNP NAE AP boot protocol we were expecting that there would be some option to add vCPUs to the VM but mark them as "pending AP boot creation protocol" state. This would allow the LaunchDigest of a VM doesn't change >just because its vCPU count changes. Would it be possible to add a new add an argument to KVM_SNP_LAUNCH_FINISH to tell it which vCPUs to LAUNCH_UPDATE VMSA pages for or similarly a new argument for KVM_CREATE_VCPU?
>>
>> But don't we want/need to measure all vCPUs using LAUNCH_UPDATE_VMSA before we issue SNP_LAUNCH_FINISH command ?
>>
>> If we are going to add vCPUs and mark them as "pending AP boot creation" state then how are we going to do LAUNCH_UPDATE_VMSAs for them after SNP_LAUNCH_FINISH ?
>
> If I understand correctly we don't need or even want the APs to be
> LAUNCH_UPDATE_VMSA'd. LAUNCH_UPDATEing all the VMSAs causes VMs with
> different numbers of vCPUs to have different launch digests. Its my
> understanding the SNP AP Creation protocol was to solve this so that
> VMs with different vcpu counts have the same launch digest.
>
> Looking at patch "[Part2,v6,44/49] KVM: SVM: Support SEV-SNP AP
> Creation NAE event" and section "4.1.9 SNP AP Creation" of the GHCB
> spec. There is no need to mark the LAUNCH_UPDATE the AP's VMSA or mark
> the vCPUs runnable. Instead we can do that only for the BSP. Then in
> the guest UEFI the BSP can: create new VMSAs from guest pages,
> RMPADJUST them into the RMP state VMSA, then use the SNP AP Creation
> NAE to get the hypervisor to mark them runnable. I believe this is all
> setup in the UEFI patch:
> https://www.mail-archive.com/[email protected]/msg38460.html.

Not quite... there isn't a way to (easily) retrieve the APIC IDs for all
of the vCPUs, which are required in order to use the AP Create event.

For this version of SNP, all of the vCPUs are measured and started by OVMF
in the same way as SEV-ES. However, once the vCPUs have run, we now have
the APIC ID associated with each vCPU and the AP Create event can be used
going forward.

The SVSM support will introduce a new NAE event to the GHCB spec to
retrieve all of the APIC IDs from the hypervisor. With that, you would
then only be required to perform a LAUNCH_UPDATE_VMSA against the BSP.

Thanks,
Tom

2022-07-13 15:02:06

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 28/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command

On Tue, Jul 12, 2022 at 11:40 AM Tom Lendacky <[email protected]> wrote:
>
> On 7/12/22 09:45, Peter Gonda wrote:
> > On Mon, Jul 11, 2022 at 4:41 PM Kalra, Ashish <[email protected]> wrote:
> >>
> >> [AMD Official Use Only - General]
> >>
> >> Hello Peter,
> >>
> >>>> The KVM_SEV_SNP_LAUNCH_FINISH finalize the cryptographic digest and
> >>>> stores it as the measurement of the guest at launch.
> >>>>
> >>>> While finalizing the launch flow, it also issues the LAUNCH_UPDATE
> >>>> command to encrypt the VMSA pages.
> >>
> >>> Given the guest uses the SNP NAE AP boot protocol we were expecting that there would be some option to add vCPUs to the VM but mark them as "pending AP boot creation protocol" state. This would allow the LaunchDigest of a VM doesn't change >just because its vCPU count changes. Would it be possible to add a new add an argument to KVM_SNP_LAUNCH_FINISH to tell it which vCPUs to LAUNCH_UPDATE VMSA pages for or similarly a new argument for KVM_CREATE_VCPU?
> >>
> >> But don't we want/need to measure all vCPUs using LAUNCH_UPDATE_VMSA before we issue SNP_LAUNCH_FINISH command ?
> >>
> >> If we are going to add vCPUs and mark them as "pending AP boot creation" state then how are we going to do LAUNCH_UPDATE_VMSAs for them after SNP_LAUNCH_FINISH ?
> >
> > If I understand correctly we don't need or even want the APs to be
> > LAUNCH_UPDATE_VMSA'd. LAUNCH_UPDATEing all the VMSAs causes VMs with
> > different numbers of vCPUs to have different launch digests. Its my
> > understanding the SNP AP Creation protocol was to solve this so that
> > VMs with different vcpu counts have the same launch digest.
> >
> > Looking at patch "[Part2,v6,44/49] KVM: SVM: Support SEV-SNP AP
> > Creation NAE event" and section "4.1.9 SNP AP Creation" of the GHCB
> > spec. There is no need to mark the LAUNCH_UPDATE the AP's VMSA or mark
> > the vCPUs runnable. Instead we can do that only for the BSP. Then in
> > the guest UEFI the BSP can: create new VMSAs from guest pages,
> > RMPADJUST them into the RMP state VMSA, then use the SNP AP Creation
> > NAE to get the hypervisor to mark them runnable. I believe this is all
> > setup in the UEFI patch:
> > https://www.mail-archive.com/[email protected]/msg38460.html.
>
> Not quite... there isn't a way to (easily) retrieve the APIC IDs for all
> of the vCPUs, which are required in order to use the AP Create event.
>
> For this version of SNP, all of the vCPUs are measured and started by OVMF
> in the same way as SEV-ES. However, once the vCPUs have run, we now have
> the APIC ID associated with each vCPU and the AP Create event can be used
> going forward.
>
> The SVSM support will introduce a new NAE event to the GHCB spec to
> retrieve all of the APIC IDs from the hypervisor. With that, then you
> would be able be required to perform a LAUNCH_UPDATE_VMSA against the BSP.

Thank you Tom, I missed that we needed to run the APs to set up their
APIC IDs for OVMF. Is there any reason we need to wait for the SVSM to
do what you describe? Couldn't the OVMF use an NAE to get all the APIC
IDs?

>
> Thanks,
> Tom
>

2022-07-17 10:08:13

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 03/49] x86/sev: Add the host SEV-SNP initialization support

On Mon, Jun 20, 2022 at 11:02:01PM +0000, Ashish Kalra wrote:
> +/*
> + * The first 16KB from the RMP_BASE is used by the processor for the
> + * bookkeeping, the range need to be added during the RMP entry lookup.

needs

> +static int __snp_enable(unsigned int cpu)
> +{
> + u64 val;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> + return 0;
> +
> + rdmsrl(MSR_AMD64_SYSCFG, val);
> +
> + val |= MSR_AMD64_SYSCFG_SNP_EN;
> + val |= MSR_AMD64_SYSCFG_SNP_VMPL_EN;
> +
> + wrmsrl(MSR_AMD64_SYSCFG, val);
> +
> + return 0;
> +}
> +
> +static __init void snp_enable(void *arg)
> +{
> + __snp_enable(smp_processor_id());
> +}

Get rid of that silly wrapper - you're not even using that @cpu argument.

> +static bool get_rmptable_info(u64 *start, u64 *len)
> +{
> + u64 calc_rmp_sz, rmp_sz, rmp_base, rmp_end, nr_pages;
> +
> + rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
> + rdmsrl(MSR_AMD64_RMP_END, rmp_end);
> +
> + if (!rmp_base || !rmp_end) {
> + pr_info("Memory for the RMP table has not been reserved by BIOS\n");

pr_err

> + return false;
> + }
> +
> + rmp_sz = rmp_end - rmp_base + 1;
> +
> + /*
> + * Calculate the amount the memory that must be reserved by the BIOS to
> + * address the full system RAM. The reserved memory should also cover the

"... address the whole RAM."

> + * RMP table itself.
> + *
> + * See PPR Family 19h Model 01h, Revision B1 section 2.1.4.2 for more
> + * information on memory requirement.

That section number will change over time - if you want to refer to some
section just use its title so that people can at least grep for the
relevant text.

> + */
> + nr_pages = totalram_pages();
> + calc_rmp_sz = (((rmp_sz >> PAGE_SHIFT) + nr_pages) << 4) + RMPTABLE_CPU_BOOKKEEPING_SZ;

use totalram_pages() directly and get rid of nr_pages.

> +
> + if (calc_rmp_sz > rmp_sz) {
> + pr_info("Memory reserved for the RMP table does not cover full system RAM (expected 0x%llx got 0x%llx)\n",
> + calc_rmp_sz, rmp_sz);

pr_err

> + return false;
> + }
> +
> + *start = rmp_base;
> + *len = rmp_sz;
> +
> + pr_info("RMP table physical address 0x%016llx - 0x%016llx\n", rmp_base, rmp_end);

"RMP table physical address range: ...[0x.. - 0x..]"

> +
> + return true;
> +}
> +
> +static __init int __snp_rmptable_init(void)

s/int/bool/

> +{
> + u64 rmp_base, sz;
> + void *start;
> + u64 val;
> +
> + if (!get_rmptable_info(&rmp_base, &sz))
> + return 1;
> +
> + start = memremap(rmp_base, sz, MEMREMAP_WB);
> + if (!start) {
> + pr_err("Failed to map RMP table 0x%llx+0x%llx\n", rmp_base, sz);
^^^^^^

either write the size in decimal or do a normal interval.

> + return 1;
> + }
> +
> + /*
> + * Check if SEV-SNP is already enabled, this can happen if we are coming from

Who is "we"?

Pls get rid of all "we" in the comments and use passive formulations.

> + * kexec boot.
> + */
> + rdmsrl(MSR_AMD64_SYSCFG, val);
> + if (val & MSR_AMD64_SYSCFG_SNP_EN)
> + goto skip_enable;
> +
> + /* Initialize the RMP table to zero */
> + memset(start, 0, sz);

Do I understand it correctly that in the kexec case the second, kexec-ed
kernel is reusing the previous kernel's RMP table so it should not be
cleared?

> +
> + /* Flush the caches to ensure that data is written before SNP is enabled. */
> + wbinvd_on_all_cpus();
> +
> + /* Enable SNP on all CPUs. */
> + on_each_cpu(snp_enable, NULL, 1);
> +
> +skip_enable:
> + rmptable_start = (unsigned long)start;
> + rmptable_end = rmptable_start + sz;
> +
> + return 0;
> +}
> +
> +static int __init snp_rmptable_init(void)
> +{
> + if (!boot_cpu_has(X86_FEATURE_SEV_SNP))

cpu_feature_enabled

> + return 0;
> +
> + if (!iommu_sev_snp_supported())
> + goto nosnp;
> +
> + if (__snp_rmptable_init())
> + goto nosnp;
> +
> + cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/rmptable_init:online", __snp_enable, NULL);
> +
> + return 0;
> +
> +nosnp:
> + setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
> + return 1;
> +}
> +
> +/*
> + * This must be called after the PCI subsystem. This is because before enabling
> + * the SNP feature we need to ensure that IOMMU supports the SEV-SNP feature.
> + * The iommu_sev_snp_support() is used for checking the feature, and it is
> + * available after subsys_initcall().

I'd much more appreciate here a short formulation explaining why is
IOMMU needed for SNP rather than the obvious.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-07-19 04:00:42

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 03/49] x86/sev: Add the host SEV-SNP initialization support

[AMD Official Use Only - General]

Hello Boris,

>> + * See PPR Family 19h Model 01h, Revision B1 section 2.1.4.2 for more
>> + * information on memory requirement.

>That section number will change over time - if you want to refer to some section just use its title so that people can at least grep for the relevant text.

This will all go into sev.c, instead of the header file, as this is non-architectural and per-processor and the structure won't be exposed to the rest
of the kernel. The above PPR reference and potentially in future an architectural method of reading the RMP table entries will be moved into it.

>> + */
>> + nr_pages = totalram_pages();
>> + calc_rmp_sz = (((rmp_sz >> PAGE_SHIFT) + nr_pages) << 4) +
>> +RMPTABLE_CPU_BOOKKEEPING_SZ;

>use totalram_pages() directly and get rid of nr_pages.
Ok.

>> + * kexec boot.
>> + */
>> + rdmsrl(MSR_AMD64_SYSCFG, val);
>> + if (val & MSR_AMD64_SYSCFG_SNP_EN)
>> + goto skip_enable;
>> +
>> + /* Initialize the RMP table to zero */
>> + memset(start, 0, sz);

>Do I understand it correctly that in the kexec case the second, kexec-ed kernel is reusing the previous kernel's RMP table so it should not be cleared?
I believe that with kexec and after issuing the shutdown command, the RMP table needs to be fully initialized, so we should be re-initializing the RMP
table to zero here.

>>
>> +
>> +static int __init snp_rmptable_init(void) {
>> + if (!boot_cpu_has(X86_FEATURE_SEV_SNP))

>cpu_feature_enabled
Ok.

>> + return 0;
>> +
>> + if (!iommu_sev_snp_supported())
>> + goto nosnp;
>> +
>> + if (__snp_rmptable_init())
>> + goto nosnp;
>> +
>> + cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/rmptable_init:online",
>> +__snp_enable, NULL);
>> +
>> + return 0;
>> +
>> +nosnp:
>> + setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
>> + return 1;
>> +}
>> +
>> +/*
>> + * This must be called after the PCI subsystem. This is because
>> +before enabling
>> + * the SNP feature we need to ensure that IOMMU supports the SEV-SNP feature.
>> + * The iommu_sev_snp_support() is used for checking the feature, and
>> +it is
>> + * available after subsys_initcall().

>I'd much more appreciate here a short formulation explaining why is IOMMU needed for SNP rather than the obvious.

Yes, the IOMMU is enforced for SNP to ensure that the HV cannot program DMA directly into guest private memory. In the case of SNP,
the IOMMU makes sure that the page(s) used for DMA are HV-owned.

Thanks,
Ashish

2022-07-19 08:41:28

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 03/49] x86/sev: Add the host SEV-SNP initialization support

On Tue, Jul 19, 2022 at 03:56:25AM +0000, Kalra, Ashish wrote:
> > That section number will change over time - if you want to refer to
> > some section just use its title so that people can at least grep for
> > the relevant text.
>
> This will all go into sev.c, instead of the header file, as this is
> non-architectural and per-processor and the structure won't be exposed
> to the rest of the kernel. The above PPR reference and potentially in
> future an architectural method of reading the RMP table entries will
> be moved into it.

I fail to see how this addresses my comment... All I'm saying is, the
"section 2.1.4.2" number will change so don't quote it in the text but
quote the section *name* instead.

> I believe that with kexec and after issuing the shutdown command,
> the RMP table needs to be fully initialized, so we should be
> re-initializing the RMP table to zero here.

And yet you're doing:

/*
* Check if SEV-SNP is already enabled, this can happen if we are coming from
* kexec boot.
*/
rdmsrl(MSR_AMD64_SYSCFG, val);
if (val & MSR_AMD64_SYSCFG_SNP_EN)
goto skip_enable; <-------- skip zeroing


So which is it?

> Yes, IOMMU is enforced for SNP to ensure that HV cannot program DMA
> directly into guest private memory. In case of SNP, the IOMMU makes
> sure that the page(s) used for DMA are HV owned.

Yes, now put that in the comment above the

fs_initcall(snp_rmptable_init);

line.
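
Something along these lines, as a hedged sketch of the resulting comment (wording illustrative, based on the explanation above):

        /*
         * This must be called after the PCI subsystem because SNP relies on
         * the IOMMU: with SNP enabled, the IOMMU ensures that pages used for
         * DMA are hypervisor-owned, so the hypervisor cannot program DMA
         * directly into guest private memory. iommu_sev_snp_supported() is
         * only usable after subsys_initcall().
         */
        fs_initcall(snp_rmptable_init);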

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-07-19 11:46:37

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 03/49] x86/sev: Add the host SEV-SNP initialization support

[AMD Official Use Only - General]

Hello Boris,

>> > That section number will change over time - if you want to refer to
>> > some section just use its title so that people can at least grep for
>> > the relevant text.
>>
>> This will all go into sev.c, instead of the header file, as this is
>> non-architectural and per-processor and the structure won't be exposed
>> to the rest of the kernel. The above PPR reference and potentially in
>> future an architectural method of reading the RMP table entries will
>> be moved into it.

>I fail to see how this addresses my comment... All I'm saying is, the "section 2.1.4.2" number will change so don't quote it in the text but quote the section *name* instead.

Yes, I agree with your comments; all I am saying is that these comments will move into sev.c instead of the header file.

>> I believe that with kexec and after issuing the shutdown command, the
>> RMP table needs to be fully initialized, so we should be
>> re-initializing the RMP table to zero here.

>And yet you're doing:

> /*
> * Check if SEV-SNP is already enabled, this can happen if we are coming from
> * kexec boot.
> */
> rdmsrl(MSR_AMD64_SYSCFG, val);
> if (val & MSR_AMD64_SYSCFG_SNP_EN)
> goto skip_enable; <-------- skip zeroing

>So which is it?

Again, what I meant is that this will be fixed so that the RMP table is also reset for the kexec boot case.

>> Yes, IOMMU is enforced for SNP to ensure that HV cannot program DMA
>> directly into guest private memory. In case of SNP, the IOMMU makes
>> sure that the page(s) used for DMA are HV owned.

>>Yes, now put that in the comment above the

> fs_initcall(snp_rmptable_init);

>line.

Yes.

Thanks,
Ashish

2022-07-21 11:41:43

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 04/49] x86/sev: set SYSCFG.MFMD

On Mon, Jun 20, 2022 at 11:02:18PM +0000, Ashish Kalra wrote:
> Subject: [PATCH Part2 v6 04/49] x86/sev: set SYSCFG.MFMD

That subject title needs to be made human readable.

> From: Brijesh Singh <[email protected]>
>
> SEV-SNP FW >= 1.51 requires that SYSCFG.MFMD must be set.

Because?

Also, commit message needs to be human-readable and not pseudocode.

> @@ -2325,6 +2346,9 @@ static __init int __snp_rmptable_init(void)
> /* Flush the caches to ensure that data is written before SNP is enabled. */
> wbinvd_on_all_cpus();
>
> + /* MFDM must be enabled on all the CPUs prior to enabling SNP. */
> + on_each_cpu(mfdm_enable, NULL, 1);
> +
> /* Enable SNP on all CPUs. */
> on_each_cpu(snp_enable, NULL, 1);

No, not two IPI generating function calls - one and do everything in it.
I.e., what Marc said.
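
E.g. a minimal sketch, assuming the "MFDM before SNP" requirement is
per-CPU so both writes can be done from one callback (the combined helper
name below is made up):

static void snp_enable_with_mfdm(void *arg)
{
	/* SYSCFG.MFDM must be set before SYSCFG.SNP_EN on this CPU. */
	mfdm_enable(arg);
	snp_enable(arg);
}

and then a single on_each_cpu(snp_enable_with_mfdm, NULL, 1) replaces the
two cross-calls.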

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-07-22 11:42:16

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On Thu, Jun 23, 2022 at 10:43:40PM +0000, Kalra, Ashish wrote:
> Yes, that's a nice way to hide it from the rest of the kernel which
> does not require access to this structure anyway, in essence, it
> becomes a private structure.

So this whole discussion whether there should be a model check or not
in case a new RMP format gets added in the future is moot - when a new
model format comes along, *then* the distinction should be done and
added in code - not earlier.

This is nothing else but normal CPU enablement work - it should be done
when it is really needed.

Because the opposite can happen: you can add a model check which
excludes future model X, future model X comes along but does *not*
change the RMP format and then you're going to have to relax that model
check again to fix SNP on the new model X.

So pls add the model checks only when really needed.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-07-22 11:48:13

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On Mon, Jun 20, 2022 at 11:02:33PM +0000, Ashish Kalra wrote:
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 25c7feb367f6..59e7ec6b0326 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -65,6 +65,8 @@
> * bookkeeping, the range need to be added during the RMP entry lookup.
> */
> #define RMPTABLE_CPU_BOOKKEEPING_SZ 0x4000
> +#define RMPENTRY_SHIFT 8
> +#define rmptable_page_offset(x) (RMPTABLE_CPU_BOOKKEEPING_SZ + (((unsigned long)x) >> RMPENTRY_SHIFT))
>
> /* For early boot hypervisor communication in SEV-ES enabled guests */
> static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
> @@ -2386,3 +2388,44 @@ static int __init snp_rmptable_init(void)
> * available after subsys_initcall().
> */
> fs_initcall(snp_rmptable_init);
> +
> +static struct rmpentry *__snp_lookup_rmpentry(u64 pfn, int *level)
> +{
> + unsigned long vaddr, paddr = pfn << PAGE_SHIFT;
> + struct rmpentry *entry, *large_entry;
> +
> + if (!pfn_valid(pfn))
> + return ERR_PTR(-EINVAL);
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> + return ERR_PTR(-ENXIO);

That test should happen first.

> + vaddr = rmptable_start + rmptable_page_offset(paddr);

Wait, what does that macro do?

It takes the physical address and gives the offset from the beginning of
the RMP table in VA space?

So why don't you do

entry = rmptable_entry(paddr)

instead which simply gives you directly the entry in the RMP table with
which you can work further?

Instead of this macro doing half the work and then callers having to add
the RMP start address and cast.

And make it small function so that you can have typechecking too, while
at it.
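
IOW, a sketch of what I mean - names are from your patch, and folding the
bounds check in is my addition so callers get either a valid pointer or an
error:

static struct rmpentry *rmptable_entry(unsigned long paddr)
{
	unsigned long vaddr = rmptable_start + RMPTABLE_CPU_BOOKKEEPING_SZ +
			      (paddr >> RMPENTRY_SHIFT);

	if (unlikely(vaddr > rmptable_end))
		return ERR_PTR(-ENXIO);

	return (struct rmpentry *)vaddr;
}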

> + if (unlikely(vaddr > rmptable_end))
> + return ERR_PTR(-ENXIO);
> +
> + entry = (struct rmpentry *)vaddr;
> +
> + /* Read a large RMP entry to get the correct page level used in RMP entry. */
> + vaddr = rmptable_start + rmptable_page_offset(paddr & PMD_MASK);
> + large_entry = (struct rmpentry *)vaddr;
> + *level = RMP_TO_X86_PG_LEVEL(rmpentry_pagesize(large_entry));
> +
> + return entry;
> +}
> +
> +/*
> + * Return 1 if the RMP entry is assigned, 0 if it exists but is not assigned,
> + * and -errno if there is no corresponding RMP entry.
> + */
> +int snp_lookup_rmpentry(u64 pfn, int *level)
> +{
> + struct rmpentry *e;
> +
> + e = __snp_lookup_rmpentry(pfn, level);
> + if (IS_ERR(e))
> + return PTR_ERR(e);
> +
> + return !!rmpentry_assigned(e);
> +}
> +EXPORT_SYMBOL_GPL(snp_lookup_rmpentry);
> diff --git a/include/linux/sev.h b/include/linux/sev.h
> new file mode 100644
> index 000000000000..1a68842789e1
> --- /dev/null
> +++ b/include/linux/sev.h

Why is this header in the linux/ namespace and not in arch/x86/ ?

All that stuff in here doesn't have any meaning outside of x86...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-07-22 19:09:23

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On Fri, Jul 22, 2022, Borislav Petkov wrote:
> On Thu, Jun 23, 2022 at 10:43:40PM +0000, Kalra, Ashish wrote:
> > Yes, that's a nice way to hide it from the rest of the kernel which
> > does not require access to this structure anyway, in essence, it
> > becomes a private structure.
>
> So this whole discussion whether there should be a model check or not
> in case a new RMP format gets added in the future is moot - when a new
> model format comes along, *then* the distinction should be done and
> added in code - not earlier.

I disagree. Running an old kernel on new hardware with a different RMP layout
should refuse to use SNP, not read/write garbage and likely corrupt the RMP and/or
host memory.

And IMO, hiding the non-architectural RMP format in SNP-specific code so that we
don't have to churn a bunch of call sites that don't _need_ access to the raw RMP
format is a good idea regardless of whether we want to be optimistic or pessimistic
about future formats.

> This is nothing else but normal CPU enablement work - it should be done
> when it is really needed.
>
> Because the opposite can happen: you can add a model check which
> excludes future model X, future model X comes along but does *not*
> change the RMP format and then you're going to have to relax that model
> check again to fix SNP on the new model X.
>
> So pls add the model checks only when really needed.
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

2022-07-22 19:30:08

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On Fri, Jul 22, 2022 at 07:04:23PM +0000, Sean Christopherson wrote:
> I disagree. Running an old kernel on new hardware with a different RMP layout
> should refuse to use SNP, not read/write garbage and likely corrupt the RMP and/or
> host memory.

See my example below.

> And IMO, hiding the non-architectural RMP format in SNP-specific code so that we
> don't have to churn a bunch of call sites that don't _need_ access to the raw RMP
> format is a good idea regardless of whether we want to be optimistic or pessimistic
> about future formats.

I don't think I ever objected to that.

> > This is nothing else but normal CPU enablement work - it should be done
> > when it is really needed.
> >

<--- this here.

> > Because the opposite can happen: you can add a model check which
> > excludes future model X, future model X comes along but does *not*
> > change the RMP format and then you're going to have to relax that model
> > check again to fix SNP on the new model X.

So constantly adding new models to a list which support a certain
version of the RMP format doesn't scale either.

If you corrupt the RMP because your kernel is old, you'll crash and burn
very visibly so that you'll be forced to have to look for an updated
kernel regardless.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-07-22 19:39:00

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

Btw,

what could work is to spec only a *version* field somewhere in the HW or
FW which says which version the RMP header has.

Then, OS would check that field and if it doesn't support that certain
version, it'll bail.
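
I.e., roughly - a completely hypothetical sketch, no such field exists
today and where it would be reported is exactly the open question:

/* RMP entry layouts this kernel knows how to parse. */
#define RMP_FMT_MAX_SUPPORTED	1

static bool snp_rmp_format_supported(unsigned int fw_rmp_fmt)
{
	/* Refuse to enable SNP if the RMP entry layout is unknown. */
	return fw_rmp_fmt <= RMP_FMT_MAX_SUPPORTED;
}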

I'd need to talk to folks first, though, what the whole story is behind
not spec-ing the RMP format...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-07-22 22:18:08

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On Fri, Jul 22, 2022, Borislav Petkov wrote:
> On Fri, Jul 22, 2022 at 07:04:23PM +0000, Sean Christopherson wrote:
> > I disagree. Running an old kernel on new hardware with a different RMP layout
> > should refuse to use SNP, not read/write garbage and likely corrupt the RMP and/or
> > host memory.
>
> See my example below.
>
> > And IMO, hiding the non-architectural RMP format in SNP-specific code so that we
> > don't have to churn a bunch of call sites that don't _need_ access to the raw RMP
> > format is a good idea regardless of whether we want to be optimistic or pessimistic
> > about future formats.
>
> I don't think I ever objected to that.

Yar, just wanted to make sure we're all on the same page, I wasn't entirely
sure what was getting nacked :-)

> > > This is nothing else but normal CPU enablement work - it should be done
> > > when it is really needed.
> > >
>
> <--- this here.
>
> > > Because the opposite can happen: you can add a model check which
> > > excludes future model X, future model X comes along but does *not*
> > > change the RMP format and then you're going to have to relax that model
> > > check again to fix SNP on the new model X.
>
> So constantly adding new models to a list which support a certain
> version of the RMP format doesn't scale either.

Yeah, but either we get AMD to give us an architectural layout or we'll have to
eat that cost at some point in the future.

> If you corrupt the RMP because your kernel is old, you'll crash and burn
> very visibly so that you'll be forced to have to look for an updated
> kernel regardless.

Heh, you're definitely more optimistic than me. I can just see something truly
ridiculous happening like moving the page size bit and then getting weird behavior
only when KVM happens to need the page size for some edge case.

Anyways, it's not a sticking point, and I certainly am not volunteering to
maintain the FMS list...

2022-07-22 22:35:29

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On Fri, Jul 22, 2022 at 10:16:07PM +0000, Sean Christopherson wrote:
> Yar, just wanted to make sure we're all on the same page, I wasn't entirely
> sure what was getting nacked :-)

Not nacked - we're all just talking here. :-)

> Heh, you're definitely more optimistic than me. I can just see something truly
> ridiculous happening like moving the page size bit and then getting weird behavior
> only when KVM happens to need the page size for some edge case.
>
> Anyways, it's not a sticking point, and I certainly am not volunteering to
> maintain the FMS list...

Yeah, no need for it to be a sticking point because a pretty reliable
birdie just told me that we're worrying for nothing and it all will
solve itself.

:-)

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-07-24 17:36:50

by Dov Murik

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

Hi Ashish,

On 21/06/2022 2:02, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> The RMPUPDATE instruction writes a new RMP entry in the RMP Table. The
> hypervisor will use the instruction to add pages to the RMP table. See
> APM3 for details on the instruction operations.
>
> The PSMASH instruction expands a 2MB RMP entry into a corresponding set of
> contiguous 4KB-Page RMP entries. The hypervisor will use this instruction
> to adjust the RMP entry without invalidating the previous RMP entry.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/include/asm/sev.h | 11 ++++++
> arch/x86/kernel/sev.c | 72 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 83 insertions(+)
>
> diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
> index cb16f0e5b585..6ab872311544 100644
> --- a/arch/x86/include/asm/sev.h
> +++ b/arch/x86/include/asm/sev.h
> @@ -85,7 +85,9 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
>
> /* RMP page size */
> #define RMP_PG_SIZE_4K 0
> +#define RMP_PG_SIZE_2M 1
> #define RMP_TO_X86_PG_LEVEL(level) (((level) == RMP_PG_SIZE_4K) ? PG_LEVEL_4K : PG_LEVEL_2M)
> +#define X86_TO_RMP_PG_LEVEL(level) (((level) == PG_LEVEL_4K) ? RMP_PG_SIZE_4K : RMP_PG_SIZE_2M)
>
> /*
> * The RMP entry format is not architectural. The format is defined in PPR
> @@ -126,6 +128,15 @@ struct snp_guest_platform_data {
> u64 secrets_gpa;
> };
>
> +struct rmpupdate {
> + u64 gpa;
> + u8 assigned;
> + u8 pagesize;
> + u8 immutable;
> + u8 rsvd;
> + u32 asid;
> +} __packed;
> +
> #ifdef CONFIG_AMD_MEM_ENCRYPT
> extern struct static_key_false sev_es_enable_key;
> extern void __sev_es_ist_enter(struct pt_regs *regs);
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 59e7ec6b0326..f6c64a722e94 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -2429,3 +2429,75 @@ int snp_lookup_rmpentry(u64 pfn, int *level)
> return !!rmpentry_assigned(e);
> }
> EXPORT_SYMBOL_GPL(snp_lookup_rmpentry);
> +
> +int psmash(u64 pfn)
> +{
> + unsigned long paddr = pfn << PAGE_SHIFT;
> + int ret;
> +
> + if (!pfn_valid(pfn))
> + return -EINVAL;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> + return -ENXIO;
> +
> + /* Binutils version 2.36 supports the PSMASH mnemonic. */
> + asm volatile(".byte 0xF3, 0x0F, 0x01, 0xFF"
> + : "=a"(ret)
> + : "a"(paddr)
> + : "memory", "cc");
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(psmash);
> +
> +static int rmpupdate(u64 pfn, struct rmpupdate *val)
> +{
> + unsigned long paddr = pfn << PAGE_SHIFT;
> + int ret;
> +
> + if (!pfn_valid(pfn))
> + return -EINVAL;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> + return -ENXIO;
> +
> + /* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
> + asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
> + : "=a"(ret)
> + : "a"(paddr), "c"((unsigned long)val)
> + : "memory", "cc");
> + return ret;
> +}
> +
> +int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid, bool immutable)
> +{
> + struct rmpupdate val;
> +
> + if (!pfn_valid(pfn))
> + return -EINVAL;
> +

Should we add more checks on the arguments?

1. asid must be > 0
2. gpa must be aligned according to 'level'
3. gpa must be below the maximal address for the guest

"Note that the guest physical address space is limited according to
CPUID Fn80000008_EAX and thus the GPAs used by the firmware in
measurement calculation are equally limited. Hypervisors should not
attempt to map pages outside of this limit."
(-SNP ABI spec page 86, section 8.17 SNP_LAUNCH_UPDATE)


But note that in patch 28 of this series we have:

+ /* Transition the VMSA page to a firmware state. */
+ ret = rmp_make_private(pfn, -1, PG_LEVEL_4K, sev->asid, true);

That (u64)(-1) value for the gpa argument violates conditions 2 and 3
from my list above.

And indeed when calculating measurements we see that the GPA value
for the VMSA pages is 0x0000FFFF_FFFFF000, and not (u64)(-1). [1] [2]

Instead of checks, we can mask the gpa argument so that rmpupdate will
get the correct value. Not sure which approach is preferable.


[1] https://github.com/IBM/sev-snp-measure/blob/90f6e59831d20e44d03d5ee19388f624fca87291/sevsnpmeasure/gctx.py#L40
[2] https://github.com/slp/snp-digest-rs/blob/0e5a787e99069944467151101ae4db474793d657/src/main.rs#L86


-Dov


> + memset(&val, 0, sizeof(val));
> + val.assigned = 1;
> + val.asid = asid;
> + val.immutable = immutable;
> + val.gpa = gpa;
> + val.pagesize = X86_TO_RMP_PG_LEVEL(level);
> +
> + return rmpupdate(pfn, &val);
> +}
> +EXPORT_SYMBOL_GPL(rmp_make_private);
> +
> +int rmp_make_shared(u64 pfn, enum pg_level level)
> +{
> + struct rmpupdate val;
> +
> + if (!pfn_valid(pfn))
> + return -EINVAL;
> +
> + memset(&val, 0, sizeof(val));
> + val.pagesize = X86_TO_RMP_PG_LEVEL(level);
> +
> + return rmpupdate(pfn, &val);
> +}
> +EXPORT_SYMBOL_GPL(rmp_make_shared);

2022-07-25 11:21:20

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 33/49] KVM: x86: Update page-fault trace to log full 64-bit error code

On 6/21/22 01:10, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> The #NPT error code is a 64-bit value but the trace prints only the
> lower 32-bits. Some of the fault error code (e.g PFERR_GUEST_FINAL_MASK)
> are available in the upper 32-bits.
>
> Cc: <[email protected]>

Why stable?

> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/kvm/trace.h | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index e3a24b8f04be..9b9bc5468103 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -383,12 +383,12 @@ TRACE_EVENT(kvm_inj_exception,
> * Tracepoint for page fault.
> */
> TRACE_EVENT(kvm_page_fault,
> - TP_PROTO(unsigned long fault_address, unsigned int error_code),
> + TP_PROTO(unsigned long fault_address, u64 error_code),
> TP_ARGS(fault_address, error_code),
>
> TP_STRUCT__entry(
> __field( unsigned long, fault_address )
> - __field( unsigned int, error_code )
> + __field( u64, error_code )
> ),
>
> TP_fast_assign(
> @@ -396,7 +396,7 @@ TRACE_EVENT(kvm_page_fault,
> __entry->error_code = error_code;
> ),
>
> - TP_printk("address %lx error_code %x",
> + TP_printk("address %lx error_code %llx",
> __entry->fault_address, __entry->error_code)
> );
>

2022-07-25 13:25:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

On Tue, Jun 28, 2022 at 05:57:41PM +0000, Kalra, Ashish wrote:
> Yes, I will be adding a check for CPU family/model as following :

Why if the PPR is already kinda spelling the already architectural
pieces of the RMP entry?

"In order to assist software" it says.

So you call the specified ones by their name and the rest is __rsvd.

No need for model checks at all.

Right?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-07-25 14:33:43

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

On Mon, Jun 20, 2022 at 11:02:33PM +0000, Ashish Kalra wrote:
> +/*
> + * The RMP entry format is not architectural. The format is defined in PPR
> + * Family 19h Model 01h, Rev B1 processor.
> + */
> +struct __packed rmpentry {

That __packed goes...

> + union {
> + struct {
> + u64 assigned : 1,
> + pagesize : 1,
> + immutable : 1,
> + rsvd1 : 9,
> + gpa : 39,
> + asid : 10,
> + vmsa : 1,
> + validated : 1,
> + rsvd2 : 1;
> + } info;
> + u64 low;
> + };
> + u64 high;
> +};

... here, at the end.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-07-25 14:37:35

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

On Mon, Jun 20, 2022 at 11:02:52PM +0000, Ashish Kalra wrote:
> diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
> index cb16f0e5b585..6ab872311544 100644
> --- a/arch/x86/include/asm/sev.h
> +++ b/arch/x86/include/asm/sev.h
> @@ -85,7 +85,9 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
>
> /* RMP page size */
> #define RMP_PG_SIZE_4K 0
> +#define RMP_PG_SIZE_2M 1
> #define RMP_TO_X86_PG_LEVEL(level) (((level) == RMP_PG_SIZE_4K) ? PG_LEVEL_4K : PG_LEVEL_2M)
> +#define X86_TO_RMP_PG_LEVEL(level) (((level) == PG_LEVEL_4K) ? RMP_PG_SIZE_4K : RMP_PG_SIZE_2M)
>
> /*
> * The RMP entry format is not architectural. The format is defined in PPR
> @@ -126,6 +128,15 @@ struct snp_guest_platform_data {
> u64 secrets_gpa;
> };
>
> +struct rmpupdate {

Why is there a struct rmpupdate *and* a struct rmpentry?!

One should be enough.

> + u64 gpa;
> + u8 assigned;
> + u8 pagesize;
> + u8 immutable;
> + u8 rsvd;
> + u32 asid;
> +} __packed;
> +

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-07-27 18:32:28

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 07/49] x86/sev: Invalid pages from direct map when adding it to RMP table

On Mon, Jun 20, 2022 at 11:03:07PM +0000, Ashish Kalra wrote:

> Subject: x86/sev: Invalid pages from direct map when adding it to RMP table

"...: Invalidate pages from the direct map when adding them to the RMP table"

> +static int restore_direct_map(u64 pfn, int npages)
> +{
> + int i, ret = 0;
> +
> + for (i = 0; i < npages; i++) {
> + ret = set_direct_map_default_noflush(pfn_to_page(pfn + i));

set_memory_p() ?

> + if (ret)
> + goto cleanup;
> + }
> +
> +cleanup:
> + WARN(ret > 0, "Failed to restore direct map for pfn 0x%llx\n", pfn + i);

Warn for each pfn?!

That'll flood dmesg mightily.

> + return ret;
> +}
> +
> +static int invalid_direct_map(unsigned long pfn, int npages)
> +{
> + int i, ret = 0;
> +
> + for (i = 0; i < npages; i++) {
> + ret = set_direct_map_invalid_noflush(pfn_to_page(pfn + i));

As above, set_memory_np() doesn't work here instead of looping over each
page?

> @@ -2462,11 +2494,38 @@ static int rmpupdate(u64 pfn, struct rmpupdate *val)
> if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> return -ENXIO;
>
> + level = RMP_TO_X86_PG_LEVEL(val->pagesize);
> + npages = page_level_size(level) / PAGE_SIZE;
> +
> + /*
> + * If page is getting assigned in the RMP table then unmap it from the
> + * direct map.
> + */
> + if (val->assigned) {
> + if (invalid_direct_map(pfn, npages)) {
> + pr_err("Failed to unmap pfn 0x%llx pages %d from direct_map\n",

"Failed to unmap %d pages at pfn 0x... from the direct map\n"

> + pfn, npages);
> + return -EFAULT;
> + }
> + }
> +
> /* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
> asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
> : "=a"(ret)
> : "a"(paddr), "c"((unsigned long)val)
> : "memory", "cc");
> +
> + /*
> + * Restore the direct map after the page is removed from the RMP table.
> + */
> + if (!ret && !val->assigned) {
> + if (restore_direct_map(pfn, npages)) {
> + pr_err("Failed to map pfn 0x%llx pages %d in direct_map\n",

"Failed to map %d pages at pfn 0x... into the direct map\n"

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-01 21:17:43

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 04/49] x86/sev: set SYSCFG.MFMD

[AMD Official Use Only - General]

Hello Boris,

>> Subject: [PATCH Part2 v6 04/49] x86/sev: set SYSCFG.MFMD

>That subject title needs to be made human readable.
Ok.

>> SEV-SNP FW >= 1.51 requires that SYSCFG.MFMD must be set.

>Because?
This is a FW requirement.

>Also, commit message needs to be human-readable and not pseudocode.

>> @@ -2325,6 +2346,9 @@ static __init int __snp_rmptable_init(void)
>> /* Flush the caches to ensure that data is written before SNP is enabled. */
>> wbinvd_on_all_cpus();
>>
>> + /* MFDM must be enabled on all the CPUs prior to enabling SNP. */
>> + on_each_cpu(mfdm_enable, NULL, 1);
>> +
>> /* Enable SNP on all CPUs. */
>> on_each_cpu(snp_enable, NULL, 1);

>No, not two IPI generating function calls - one and do everything in it.
>I.e., what Marc said.

Ok got that.

Thanks,
Ashish

2022-08-01 21:51:15

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

[AMD Official Use Only - General]

>> I disagree. Running an old kernel on new hardware with a different
>> RMP layout should refuse to use SNP, not read/write garbage and likely
>> corrupt the RMP and/or host memory.

>See my example below.

> And IMO, hiding the non-architectural RMP format in SNP-specific code
> so that we don't have to churn a bunch of call sites that don't _need_
> access to the raw RMP format is a good idea regardless of whether we
> want to be optimistic or pessimistic about future formats.

>I don't think I ever objected to that.

I agree with what Sean is recommending to do.

As I mentioned earlier, in the long term and with respect to future platforms, we are going to add architectural support
to read RMP table entries, so this structure will exist only for older platform support.

Thanks,
Ashish

2022-08-01 21:56:44

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

[AMD Official Use Only - General]

As I mentioned before, in the future we will have architectural support to read RMP table entries. We will first check for
the availability of this feature and always use it if it is supported and enabled, and only fall back to raw RMP table access
if this architectural support is not available.

Thanks,
Ashish

-----Original Message-----
From: Borislav Petkov <[email protected]>
Sent: Friday, July 22, 2022 2:38 PM
To: Sean Christopherson <[email protected]>
Cc: Kalra, Ashish <[email protected]>; Dave Hansen <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Lendacky, Thomas <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Roth, Michael <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Subject: Re: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

Btw,

what could work is to spec only a *version* field somewhere in the HW or FW which says which version the RMP header has.

Then, OS would check that field and if it doesn't support that certain version, it'll bail.

I'd need to talk to folks first, though, what the whole story is behind not spec-ing the RMP format...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-01 22:10:17

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 05/49] x86/sev: Add RMP entry lookup helpers

[AMD Official Use Only - General]

Hello Boris,

>> + * The RMP entry format is not architectural. The format is defined in PPR
>> + * Family 19h Model 01h, Rev B1 processor.
>> + */
>> +struct __packed rmpentry {

>That __packed goes...

>> + union {
>> + struct {
>> + u64 assigned : 1,
>> + pagesize : 1,
>> + immutable : 1,
>> + rsvd1 : 9,
>> + gpa : 39,
>> + asid : 10,
>> + vmsa : 1,
>> + validated : 1,
>> + rsvd2 : 1;
>> + } info;
>> + u64 low;
>> + };
>> + u64 high;
>> +};

>... here, at the end.

Yes, will fix that.

Thanks,
Ashish

2022-08-01 22:35:55

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

[AMD Official Use Only - General]

Hello Boris,

>> +struct rmpupdate {

>Why is there a struct rmpupdate *and* a struct rmpentry?!

The struct rmpentry is the raw layout of the RMP table entry while struct rmpupdate is the structure
expected by the rmpupdate instruction for programming the RMP table entries.

Arguably, we can program a struct rmpupdate internally from a struct rmpentry.

But we will still need struct rmpupdate for issuing the rmpupdate instruction, so it is probably cleaner
to keep it this way, as it only has two main callers - rmp_make_private() and rmp_make_shared().

Also, due to the non-architectural nature of struct rmpentry, the above functions may need to be modified
if struct rmpentry changes, while struct rmpupdate remains consistent and unchanged.

>One should be enough.

>> + u64 gpa;
>> + u8 assigned;
>> + u8 pagesize;
>> + u8 immutable;
>> + u8 rsvd;
>> + u32 asid;
>> +} __packed;
>> +

Thanks,
Ashish

2022-08-01 23:35:26

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

[AMD Official Use Only - General]

Hello Boris,

>> Yes, I will be adding a check for CPU family/model as following :

>Why if the PPR is already kinda spelling the already architectural pieces of the RMP entry?

>"In order to assist software" it says.

The PPR specifies select portions of the RMP entry format for a specific core/platform.

Therefore, the complete struct rmpentry definition is non-architectural.

As per PPR, software should not rely on any field definitions not specified
in this table and the format of an RMP entry may change in future processors.

>So you call the specified ones by their name and the rest is __rsvd.

>No need for model checks at all.

>Right?

But we can't use this struct on a core/platform which has a different layout, so aren't
the model checks required ?

Thanks,
Ashish

2022-08-02 00:02:09

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 07/49] x86/sev: Invalid pages from direct map when adding it to RMP table

[AMD Official Use Only - General]

Hello Boris,

>> Subject: x86/sev: Invalid pages from direct map when adding it to RMP
>> table

>"...: Invalidate pages from the direct map when adding them to the RMP table"
Ok

>> +static int restore_direct_map(u64 pfn, int npages) {
>> + int i, ret = 0;
>> +
>> + for (i = 0; i < npages; i++) {
>> + ret = set_direct_map_default_noflush(pfn_to_page(pfn + i));

>set_memory_p() ?

You mean set_memory_present()?

Is there an issue with continuing to use set_direct_map_default_noflush()? It is easier to correlate with
this function and its functionality of restoring the page in the kernel direct map.

> + if (ret)
> + goto cleanup;
> + }
> +
> +cleanup:
> + WARN(ret > 0, "Failed to restore direct map for pfn 0x%llx\n", pfn +
> +i);

>Warn for each pfn?!

>That'll flood dmesg mightily.

> + return ret;
> +}
> +
> +static int invalid_direct_map(unsigned long pfn, int npages) {
> + int i, ret = 0;
> +
> + for (i = 0; i < npages; i++) {
> + ret = set_direct_map_invalid_noflush(pfn_to_page(pfn + i));

>As above, set_memory_np() doesn't work here instead of looping over each page?

Yes, set_memory_np() looks more efficient to use instead of looping over each page.

But again, calling set_direct_map_invalid_noflush() is easier to understand from the
calling function's point of view, as it correlates to the functionality of invalidating the
page in the kernel direct map?
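
For comparison, the batched variant would be roughly the following - just a
sketch, assuming the pfn range maps to one contiguous direct-map virtual range:

static int invalid_direct_map(unsigned long pfn, int npages)
{
	unsigned long vaddr = (unsigned long)page_address(pfn_to_page(pfn));

	/* Clear _PAGE_PRESENT for the whole range in one call. */
	return set_memory_np(vaddr, npages);
}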

>> + if (val->assigned) {
>> + if (invalid_direct_map(pfn, npages)) {
>. + pr_err("Failed to unmap pfn 0x%llx pages %d from direct_map\n",

>"Failed to unmap %d pages at pfn 0x... from the direct map\n"
Ok.

>> + if (!ret && !val->assigned) {
>> + if (restore_direct_map(pfn, npages)) {
>> + pr_err("Failed to map pfn 0x%llx pages %d in direct_map\n",

>"Failed to map %d pages at pfn 0x... into the direct map\n"
Ok.

Thanks,
Ashish

2022-08-02 04:58:01

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

[AMD Official Use Only - General]

Hello Dov,

>> +int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid,
>> +bool immutable) {
>> + struct rmpupdate val;
>> +
>> + if (!pfn_valid(pfn))
>> + return -EINVAL;
>> +

>Should we add more checks on the arguments?

>1. asid must be > 0
>2. gpa must be aligned according to 'level'
>3. gpa must be below the maximal address for the guest

Ok, yes it surely makes sense to add more checks on the arguments.

>"Note that the guest physical address space is limited according to CPUID Fn80000008_EAX and thus the GPAs used by the firmware in measurement calculation are equally limited. Hypervisors should not attempt to map pages outside of this limit."
>(-SNP ABI spec page 86, section 8.17 SNP_LAUNCH_UPDATE)


>But note that in patch 28 of this series we have:

>+ /* Transition the VMSA page to a firmware state. */
>+ ret = rmp_make_private(pfn, -1, PG_LEVEL_4K, sev->asid, true);

>That (u64)(-1) value for the gpa argument violates conditions 2 and 3 from my list above.

>And indeed when calculating measurements we see that the GPA value for the VMSA pages is 0x0000FFFF_FFFFF000, and not (u64)(-1). [1] [2]

>Instead of checks, we can mask the gpa argument so that rmpupdate will get the correct value. Not sure which approach is preferable.

Well, the firmware is anyway masking the gpa argument as you observe in the launch digest, so probably we should do the same here too.
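
E.g. roughly along these lines - a sketch only, the 48-bit guest physical
width below is an assumption, the real limit comes from CPUID Fn8000_0008:

static u64 rmp_mask_gpa(u64 gpa)
{
	/* Keep only the page-aligned guest physical address bits. */
	return gpa & GENMASK_ULL(47, PAGE_SHIFT);
}

which would turn the (u64)-1 used for the VMSA page into 0x0000FFFF_FFFFF000,
matching what the firmware uses in the measurement.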

Thanks,
Ashish

2022-08-02 10:59:02

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 13/49] crypto:ccp: Provide APIs to issue SEV-SNP commands

On Tue, Jun 21, 2022 at 03:43:13PM -0600, Peter Gonda wrote:
> (
>
> On Mon, Jun 20, 2022 at 5:05 PM Ashish Kalra <[email protected]> wrote:
> >
> > From: Brijesh Singh <[email protected]>
> >
> > Provide the APIs for the hypervisor to manage an SEV-SNP guest. The
> > commands for SEV-SNP is defined in the SEV-SNP firmware specification.
> >
> > Signed-off-by: Brijesh Singh <[email protected]>
> > ---
> > drivers/crypto/ccp/sev-dev.c | 24 ++++++++++++
> > include/linux/psp-sev.h | 73 ++++++++++++++++++++++++++++++++++++
> > 2 files changed, 97 insertions(+)
> >
> > diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> > index f1173221d0b9..35d76333e120 100644
> > --- a/drivers/crypto/ccp/sev-dev.c
> > +++ b/drivers/crypto/ccp/sev-dev.c
> > @@ -1205,6 +1205,30 @@ int sev_guest_df_flush(int *error)
> > }
> > EXPORT_SYMBOL_GPL(sev_guest_df_flush);
> >
> > +int snp_guest_decommission(struct sev_data_snp_decommission *data, int *error)
> > +{
> > + return sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, data, error);
> > +}
> > +EXPORT_SYMBOL_GPL(snp_guest_decommission);
> > +
> > +int snp_guest_df_flush(int *error)
> > +{
> > + return sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, error);
> > +}
> > +EXPORT_SYMBOL_GPL(snp_guest_df_flush);

Nit: undocumented exported functions. Both need kdoc.

>
> Why not instead change sev_guest_df_flush() to be SNP aware? That way
> callers get the right behavior without having to know if SNP is
> enabled or not.
>
> int sev_guest_df_flush(int *error)
> {
> if (!psp_master || !psp_master->sev_data)
> return -EINVAL;
>
> if (psp_master->sev_data->snp_inited)
> return sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, error);
>
> return sev_do_cmd(SEV_CMD_DF_FLUSH, NULL, error);
> }

Because it serves no purpose to fuse them into one, and is only more
obfuscated (and also undocumented).

Two exported symbols can be traced also separately with ftrace/kprobes.

Degrading transparency is not a great idea IMHO.

BR, Jarkko


2022-08-02 11:25:11

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Mon, Jun 20, 2022 at 11:05:01PM +0000, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> The behavior and requirement for the SEV-legacy command is altered when
> the SNP firmware is in the INIT state. See SEV-SNP firmware specification
> for more details.
>
> Allocate the Trusted Memory Region (TMR) as a 2mb sized/aligned region
> when SNP is enabled to satify new requirements for the SNP. Continue
> allocating a 1mb region for !SNP configuration.
>
> While at it, provide API that can be used by others to allocate a page
> that can be used by the firmware. The immediate user for this API will
> be the KVM driver. The KVM driver to need to allocate a firmware context
> page during the guest creation. The context page need to be updated
> by the firmware. See the SEV-SNP specification for further details.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> drivers/crypto/ccp/sev-dev.c | 173 +++++++++++++++++++++++++++++++++--
> include/linux/psp-sev.h | 11 +++
> 2 files changed, 178 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index 35d76333e120..0dbd99f29b25 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -79,6 +79,14 @@ static void *sev_es_tmr;
> #define NV_LENGTH (32 * 1024)
> static void *sev_init_ex_buffer;
>
> +/* When SEV-SNP is enabled the TMR needs to be 2MB aligned and 2MB size. */
> +#define SEV_SNP_ES_TMR_SIZE (2 * 1024 * 1024)
> +
> +static size_t sev_es_tmr_size = SEV_ES_TMR_SIZE;
> +
> +static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret);
> +static int sev_do_cmd(int cmd, void *data, int *psp_ret);
> +
> static inline bool sev_version_greater_or_equal(u8 maj, u8 min)
> {
> struct sev_device *sev = psp_master->sev_data;
> @@ -177,11 +185,161 @@ static int sev_cmd_buffer_len(int cmd)
> return 0;
> }
>
> +static void snp_leak_pages(unsigned long pfn, unsigned int npages)
> +{
> + WARN(1, "psc failed, pfn 0x%lx pages %d (leaking)\n", pfn, npages);
> + while (npages--) {
> + memory_failure(pfn, 0);
> + dump_rmpentry(pfn);
> + pfn++;
> + }
> +}
> +
> +static int snp_reclaim_pages(unsigned long pfn, unsigned int npages, bool locked)
> +{
> + struct sev_data_snp_page_reclaim data;
> + int ret, err, i, n = 0;
> +
> + for (i = 0; i < npages; i++) {
> + memset(&data, 0, sizeof(data));
> + data.paddr = pfn << PAGE_SHIFT;
> +
> + if (locked)
> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
> + else
> + ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
> + if (ret)
> + goto cleanup;
> +
> + ret = rmp_make_shared(pfn, PG_LEVEL_4K);
> + if (ret)
> + goto cleanup;
> +
> + pfn++;
> + n++;
> + }
> +
> + return 0;
> +
> +cleanup:
> + /*
> + * If failed to reclaim the page then page is no longer safe to
> + * be released, leak it.
> + */
> + snp_leak_pages(pfn, npages - n);
> + return ret;
> +}
> +
> +static inline int rmp_make_firmware(unsigned long pfn, int level)
> +{
> + return rmp_make_private(pfn, 0, level, 0, true);
> +}
> +
> +static int snp_set_rmp_state(unsigned long paddr, unsigned int npages, bool to_fw, bool locked,
> + bool need_reclaim)
> +{
> + unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT; /* Cbit maybe set in the paddr */
> + int rc, n = 0, i;
> +
> + for (i = 0; i < npages; i++) {
> + if (to_fw)
> + rc = rmp_make_firmware(pfn, PG_LEVEL_4K);
> + else
> + rc = need_reclaim ? snp_reclaim_pages(pfn, 1, locked) :
> + rmp_make_shared(pfn, PG_LEVEL_4K);
> + if (rc)
> + goto cleanup;
> +
> + pfn++;
> + n++;
> + }
> +
> + return 0;
> +
> +cleanup:
> + /* Try unrolling the firmware state changes */
> + if (to_fw) {
> + /*
> + * Reclaim the pages which were already changed to the
> + * firmware state.
> + */
> + snp_reclaim_pages(paddr >> PAGE_SHIFT, n, locked);
> +
> + return rc;
> + }
> +
> + /*
> + * If failed to change the page state to shared, then its not safe
> + * to release the page back to the system, leak it.
> + */
> + snp_leak_pages(pfn, npages - n);
> +
> + return rc;
> +}
> +
> +static struct page *__snp_alloc_firmware_pages(gfp_t gfp_mask, int order, bool locked)
> +{
> + unsigned long npages = 1ul << order, paddr;
> + struct sev_device *sev;
> + struct page *page;
> +
> + if (!psp_master || !psp_master->sev_data)
> + return NULL;
> +
> + page = alloc_pages(gfp_mask, order);
> + if (!page)
> + return NULL;
> +
> + /* If SEV-SNP is initialized then add the page in RMP table. */
> + sev = psp_master->sev_data;
> + if (!sev->snp_inited)
> + return page;
> +
> + paddr = __pa((unsigned long)page_address(page));
> + if (snp_set_rmp_state(paddr, npages, true, locked, false))
> + return NULL;
> +
> + return page;
> +}
> +
> +void *snp_alloc_firmware_page(gfp_t gfp_mask)
> +{
> + struct page *page;
> +
> + page = __snp_alloc_firmware_pages(gfp_mask, 0, false);

Could be just

struct page *page = __snp_alloc_firmware_pages(gfp_mask, 0, false);

> +
> + return page ? page_address(page) : NULL;
> +}
> +EXPORT_SYMBOL_GPL(snp_alloc_firmware_page);

Undocumented API

Why don't you just export __snp_alloc_firmware_pages() and declare these
trivial wrappers as "static inline" inside psp-sev.h?
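
I.e. roughly this - a sketch, essentially the existing wrapper moved into
the header once the __ variant is exported:

/* in include/linux/psp-sev.h */
static inline void *snp_alloc_firmware_page(gfp_t gfp_mask)
{
	struct page *page = __snp_alloc_firmware_pages(gfp_mask, 0, false);

	return page ? page_address(page) : NULL;
}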

> +
> +static void __snp_free_firmware_pages(struct page *page, int order, bool locked)
> +{
> + unsigned long paddr, npages = 1ul << order;
> +
> + if (!page)
> + return;

Silently ignored NULL pointer.

> +
> + paddr = __pa((unsigned long)page_address(page));
> + if (snp_set_rmp_state(paddr, npages, false, locked, true))
> + return;
> +
> + __free_pages(page, order);
> +}
> +
> +void snp_free_firmware_page(void *addr)
> +{
> + if (!addr)
> + return;

Why silently ignore a NULL pointer? At minimum, pr_warn() would be
appropriate.

> +
> + __snp_free_firmware_pages(virt_to_page(addr), 0, false);
> +}
> +EXPORT_SYMBOL(snp_free_firmware_page);

Ditto, same comments as for allocation part.

> +
> static void *sev_fw_alloc(unsigned long len)
> {
> struct page *page;
>
> - page = alloc_pages(GFP_KERNEL, get_order(len));
> + page = __snp_alloc_firmware_pages(GFP_KERNEL, get_order(len), false);
> if (!page)
> return NULL;
>
> @@ -393,7 +551,7 @@ static int __sev_init_locked(int *error)
> data.tmr_address = __pa(sev_es_tmr);
>
> data.flags |= SEV_INIT_FLAGS_SEV_ES;
> - data.tmr_len = SEV_ES_TMR_SIZE;
> + data.tmr_len = sev_es_tmr_size;
> }
>
> return __sev_do_cmd_locked(SEV_CMD_INIT, &data, error);
> @@ -421,7 +579,7 @@ static int __sev_init_ex_locked(int *error)
> data.tmr_address = __pa(sev_es_tmr);
>
> data.flags |= SEV_INIT_FLAGS_SEV_ES;
> - data.tmr_len = SEV_ES_TMR_SIZE;
> + data.tmr_len = sev_es_tmr_size;
> }
>
> return __sev_do_cmd_locked(SEV_CMD_INIT_EX, &data, error);
> @@ -818,6 +976,8 @@ static int __sev_snp_init_locked(int *error)
> sev->snp_inited = true;
> dev_dbg(sev->dev, "SEV-SNP firmware initialized\n");
>
> + sev_es_tmr_size = SEV_SNP_ES_TMR_SIZE;
> +
> return rc;
> }
>
> @@ -1341,8 +1501,9 @@ static void sev_firmware_shutdown(struct sev_device *sev)
> /* The TMR area was encrypted, flush it from the cache */
> wbinvd_on_all_cpus();
>
> - free_pages((unsigned long)sev_es_tmr,
> - get_order(SEV_ES_TMR_SIZE));
> + __snp_free_firmware_pages(virt_to_page(sev_es_tmr),
> + get_order(sev_es_tmr_size),
> + false);
> sev_es_tmr = NULL;
> }
>
> @@ -1430,7 +1591,7 @@ void sev_pci_init(void)
> }
>
> /* Obtain the TMR memory area for SEV-ES use */
> - sev_es_tmr = sev_fw_alloc(SEV_ES_TMR_SIZE);
> + sev_es_tmr = sev_fw_alloc(sev_es_tmr_size);
> if (!sev_es_tmr)
> dev_warn(sev->dev,
> "SEV: TMR allocation failed, SEV-ES support unavailable\n");
> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
> index 9f921d221b75..a3bb792bb842 100644
> --- a/include/linux/psp-sev.h
> +++ b/include/linux/psp-sev.h
> @@ -12,6 +12,8 @@
> #ifndef __PSP_SEV_H__
> #define __PSP_SEV_H__
>
> +#include <linux/sev.h>
> +
> #include <uapi/linux/psp-sev.h>
>
> #ifdef CONFIG_X86
> @@ -940,6 +942,8 @@ int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error);
> int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error);
>
> void *psp_copy_user_blob(u64 uaddr, u32 len);
> +void *snp_alloc_firmware_page(gfp_t mask);
> +void snp_free_firmware_page(void *addr);
>
> #else /* !CONFIG_CRYPTO_DEV_SP_PSP */
>
> @@ -981,6 +985,13 @@ static inline int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *erro
> return -ENODEV;
> }
>
> +static inline void *snp_alloc_firmware_page(gfp_t mask)
> +{
> + return NULL;
> +}
> +
> +static inline void snp_free_firmware_page(void *addr) { }
> +
> #endif /* CONFIG_CRYPTO_DEV_SP_PSP */
>
> #endif /* __PSP_SEV_H__ */
> --
> 2.25.1
>

BR, Jarkko










2022-08-02 12:19:02

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Tue, Jun 21, 2022 at 08:17:15PM +0000, Kalra, Ashish wrote:
> [Public]
>
> Hello Peter,
>
> >> +static int snp_reclaim_pages(unsigned long pfn, unsigned int npages,
> >> +bool locked) {
> >> + struct sev_data_snp_page_reclaim data;
> >> + int ret, err, i, n = 0;
> >> +
> >> + for (i = 0; i < npages; i++) {
>
> >What about setting |n| here too, also the other increments.
>
> >for (i = 0, n = 0; i < npages; i++, n++, pfn++)
>
> Yes that is simpler.
>
> >> + memset(&data, 0, sizeof(data));
> >> + data.paddr = pfn << PAGE_SHIFT;
> >> +
> >> + if (locked)
> >> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
> >> + else
> >> + ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM,
> >> + &data, &err);
>
> > Can we change `sev_cmd_mutex` to some sort of nesting lock type? That could clean up this if (locked) code.
>
> > +static inline int rmp_make_firmware(unsigned long pfn, int level) {
> > + return rmp_make_private(pfn, 0, level, 0, true); }
> > +
> > +static int snp_set_rmp_state(unsigned long paddr, unsigned int npages, bool to_fw, bool locked,
> > + bool need_reclaim)
>
> >This function can do a lot and when I read the call sites it's hard to see what it's doing since we have a combination of arguments which tell us what behavior is happening, some of which are not valid (ex: to_fw == true and need_reclaim == true is an invalid argument combination).
>
> to_fw is used to make a firmware page and need_reclaim is for freeing the firmware page, so they are going to be mutually exclusive.
>
> I can actually connect it quite logically with the callers:
> snp_alloc_firmware_pages will call with to_fw = true and need_reclaim = false
> and snp_free_firmware_pages will do the opposite, to_fw = false and need_reclaim = true.
>
> That seems straightforward to look at.
>
> >Also this for loop over |npages| is duplicated from snp_reclaim_pages(). One improvement here is that on the current
> >snp_reclaim_pages(), if we fail to reclaim a page we assume we cannot reclaim the next pages; this may cause us to snp_leak_pages() more pages than we actually need to.
>
> Yes that is true.
>
> >What about something like this?
>
> >static snp_leak_page(u64 pfn, enum pg_level level) {
> > memory_failure(pfn, 0);
> > dump_rmpentry(pfn);
> >}
>
> >static int snp_reclaim_page(u64 pfn, enum pg_level level) {
> > int ret;
> > struct sev_data_snp_page_reclaim data;
>
> > ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
> > if (ret)
> > goto cleanup;
>
> > ret = rmp_make_shared(pfn, level);
> > if (ret)
> > goto cleanup;
>
> > return 0;
>
> >cleanup:
> > snp_leak_page(pfn, level)
> >}
>
> >typedef int (*rmp_state_change_func) (u64 pfn, enum pg_level level);
>
> >static int snp_set_rmp_state(unsigned long paddr, unsigned int npages, rmp_state_change_func state_change, rmp_state_change_func cleanup) {
> > struct sev_data_snp_page_reclaim data;
> > int ret, err, i, n = 0;
>
> > for (i = 0, n = 0; i < npages; i++, n++, pfn++) {
> > ret = state_change(pfn, PG_LEVEL_4K)
> > if (ret)
> > goto cleanup;
> > }
>
> > return 0;
>
> > cleanup:
> > for (; i>= 0; i--, n--, pfn--) {
> > cleanup(pfn, PG_LEVEL_4K);
> > }
>
> > return ret;
> >}
>
> >Then inside of __snp_alloc_firmware_pages():
>
> >snp_set_rmp_state(paddr, npages, rmp_make_firmware, snp_reclaim_page);
>
> >And inside of __snp_free_firmware_pages():
>
> >snp_set_rmp_state(paddr, npages, snp_reclaim_page, snp_leak_page);
>
> >Just a suggestion feel free to ignore. The readability comment could be addressed much less invasively by just making separate functions for each valid combination of arguments here. Like snp_set_rmp_fw_state(), snp_set_rmp_shared_state(),
> >snp_set_rmp_release_state() or something.
>
> >> +static struct page *__snp_alloc_firmware_pages(gfp_t gfp_mask, int
> >> +order, bool locked) {
> >> + unsigned long npages = 1ul << order, paddr;
> >> + struct sev_device *sev;
> >> + struct page *page;
> >> +
> >> + if (!psp_master || !psp_master->sev_data)
> >> + return NULL;
> >> +
> >> + page = alloc_pages(gfp_mask, order);
> >> + if (!page)
> >> + return NULL;
> >> +
> >> + /* If SEV-SNP is initialized then add the page in RMP table. */
> >> + sev = psp_master->sev_data;
> >> + if (!sev->snp_inited)
> >> + return page;
> >> +
> >> + paddr = __pa((unsigned long)page_address(page));
> >> + if (snp_set_rmp_state(paddr, npages, true, locked, false))
> >> + return NULL;
>
> >So what about the case where snp_set_rmp_state() fails but we were able to reclaim all the pages? Should we be able to signal that to callers so that we could free |page| here? But given this is an error path already maybe we can optimize this in a follow-up series.
>
> Yes, we should actually tie in to snp_reclaim_pages() success or failure here in the case we were able to successfully unroll some or all of the firmware state change.
>
> > +
> > + return page;
> > +}
> > +
> > +void *snp_alloc_firmware_page(gfp_t gfp_mask) {
> > + struct page *page;
> > +
> > + page = __snp_alloc_firmware_pages(gfp_mask, 0, false);
> > +
> > + return page ? page_address(page) : NULL; }
> > +EXPORT_SYMBOL_GPL(snp_alloc_firmware_page);
> > +
> > +static void __snp_free_firmware_pages(struct page *page, int order,
> > +bool locked) {
> > + unsigned long paddr, npages = 1ul << order;
> > +
> > + if (!page)
> > + return;
> > +
> > + paddr = __pa((unsigned long)page_address(page));
> > + if (snp_set_rmp_state(paddr, npages, false, locked, true))
> > + return;
>
> > Here we may be able to free some of |page| depending on where inside of snp_set_rmp_state() we failed. But again, given this is an error path already, maybe we can optimize this in a follow-up series.
>
> Yes, we probably should be able to free some of the page(s) depending on how many page(s) got reclaimed in snp_set_rmp_state().
> But these reclamation failures may not be very common, and any failure is indicative of a bigger issue. It is likely that when a single page reclamation error occurs it will happen with all the subsequent
> pages as well, so it is better to follow a simple recovery procedure than to handle a more complex recovery where one chunk of pages is reclaimed and another chunk is not.

Silently ignoring it is still a bad idea. I.e., at minimum it would
make sense to print a warning to klog.

>
> Thanks,
> Ashish

BR, Jarkko

2022-08-02 12:45:31

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 18/49] crypto: ccp: Provide APIs to query extended attestation report

I'd rephrase "Provide in-kernel API..." (e.g. not uapi).

On Mon, Jun 20, 2022 at 11:06:06PM +0000, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> Version 2 of the GHCB specification defines VMGEXIT that is used to get
> the extended attestation report. The extended attestation report includes
> the certificate blobs provided through the SNP_SET_EXT_CONFIG.
>
> The snp_guest_ext_guest_request() will be used by the hypervisor to get
> the extended attestation report. See the GHCB specification for more
> details.

What is "the hypersivor"? Could it be replaced with e.g. KVM for
clarity?

>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> drivers/crypto/ccp/sev-dev.c | 43 ++++++++++++++++++++++++++++++++++++
> include/linux/psp-sev.h | 24 ++++++++++++++++++++
> 2 files changed, 67 insertions(+)
>
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index 97b479d5aa86..f6306b820b86 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -25,6 +25,7 @@
> #include <linux/fs.h>
>
> #include <asm/smp.h>
> +#include <asm/sev.h>
>
> #include "psp-dev.h"
> #include "sev-dev.h"
> @@ -1857,6 +1858,48 @@ int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
> }
> EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt);
>
> +int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
> + unsigned long vaddr, unsigned long *npages, unsigned long *fw_err)
> +{
> + unsigned long expected_npages;
> + struct sev_device *sev;
> + int rc;
> +
> + if (!psp_master || !psp_master->sev_data)
> + return -ENODEV;
> +
> + sev = psp_master->sev_data;
> +
> + if (!sev->snp_inited)
> + return -EINVAL;
> +
> + /*
> + * Check if there is enough space to copy the certificate chain. Otherwise
> + * return ERROR code defined in the GHCB specification.
> + */
> + expected_npages = sev->snp_certs_len >> PAGE_SHIFT;
> + if (*npages < expected_npages) {
> + *npages = expected_npages;
> + *fw_err = SNP_GUEST_REQ_INVALID_LEN;
> + return -EINVAL;
> + }
> +
> + rc = sev_do_cmd(SEV_CMD_SNP_GUEST_REQUEST, data, (int *)&fw_err);
> + if (rc)
> + return rc;
> +
> + /* Copy the certificate blob */
> + if (sev->snp_certs_data) {
> + *npages = expected_npages;
> + memcpy((void *)vaddr, sev->snp_certs_data, *npages << PAGE_SHIFT);
> + } else {
> + *npages = 0;
> + }
> +
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_ext_guest_request);

Undocumented export.

> +
> static void sev_exit(struct kref *ref)
> {
> misc_deregister(&misc_dev->misc);
> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
> index a3bb792bb842..cd37ccd1fa1f 100644
> --- a/include/linux/psp-sev.h
> +++ b/include/linux/psp-sev.h
> @@ -945,6 +945,23 @@ void *psp_copy_user_blob(u64 uaddr, u32 len);
> void *snp_alloc_firmware_page(gfp_t mask);
> void snp_free_firmware_page(void *addr);
>
> +/**
> + * snp_guest_ext_guest_request - perform the SNP extended guest request command
> + * defined in the GHCB specification.
> + *
> + * @data: the input guest request structure
> + * @vaddr: address where the certificate blob need to be copied.
> + * @npages: number of pages for the certificate blob.
> + * If the specified page count is less than the certificate blob size, then the
> + * required page count is returned with error code defined in the GHCB spec.
> + * If the specified page count is more than the certificate blob size, then
> + * page count is updated to reflect the amount of valid data copied in the
> + * vaddr.
> + */

This kdoc is misplaced: it should be in sev-dev.c, right before the
implementation. Also it does not say anything about return value, and
still the return type is "int".
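
E.g. its tail could gain something like this (sketch):

 * Return: 0 on success, negative errno on failure. If the supplied
 *	@npages is too small for the certificate blob, @npages is updated
 *	to the required page count and *@fw_err is set to
 *	SNP_GUEST_REQ_INVALID_LEN.

right above the definition in sev-dev.c.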

> +int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
> + unsigned long vaddr, unsigned long *npages,
> + unsigned long *error);
> +
> #else /* !CONFIG_CRYPTO_DEV_SP_PSP */
>
> static inline int
> @@ -992,6 +1009,13 @@ static inline void *snp_alloc_firmware_page(gfp_t mask)
>
> static inline void snp_free_firmware_page(void *addr) { }
>
> +static inline int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
> + unsigned long vaddr, unsigned long *n,
> + unsigned long *error)
> +{
> + return -ENODEV;
> +}
> +
> #endif /* CONFIG_CRYPTO_DEV_SP_PSP */
>
> #endif /* __PSP_SEV_H__ */
> --
> 2.25.1
>

BR, Jarkko

2022-08-02 12:51:20

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 26/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command

On Mon, Jun 20, 2022 at 11:08:05PM +0000, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> The KVM_SEV_SNP_LAUNCH_UPDATE command can be used to insert data into the
> guest's memory. The data is encrypted with the cryptographic context
> created with the KVM_SEV_SNP_LAUNCH_START.
>
> In addition to the inserting data, it can insert a two special pages
> into the guests memory: the secrets page and the CPUID page.
>
> While terminating the guest, reclaim the guest pages added in the RMP
> table. If the reclaim fails, then the page is no longer safe to be
> released back to the system and leak them.

From this paragraph I get the picture that the reclaimer fails "all the
time", and that this is totally normal and legit behaviour. Is that the case?

Whenever a failure is mentioned, the stimuli/conditions that can trigger it
must be described.

>
> For more information see the SEV-SNP specification.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> .../virt/kvm/x86/amd-memory-encryption.rst | 29 +++
> arch/x86/kvm/svm/sev.c | 187 ++++++++++++++++++
> include/uapi/linux/kvm.h | 19 ++
> 3 files changed, 235 insertions(+)
>
> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> index 878711f2dca6..62abd5c1f72b 100644
> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> @@ -486,6 +486,35 @@ Returns: 0 on success, -negative on error
>
> See the SEV-SNP specification for further detail on the launch input.
>
> +20. KVM_SNP_LAUNCH_UPDATE
> +-------------------------
> +
> +The KVM_SNP_LAUNCH_UPDATE is used for encrypting a memory region. It also
> +calculates a measurement of the memory contents. The measurement is a signature
> +of the memory contents that can be sent to the guest owner as an attestation
> +that the memory was encrypted correctly by the firmware.
> +
> +Parameters (in): struct kvm_snp_launch_update
> +
> +Returns: 0 on success, -negative on error
> +
> +::
> +
> + struct kvm_sev_snp_launch_update {
> + __u64 start_gfn; /* Guest page number to start from. */
> + __u64 uaddr; /* userspace address need to be encrypted */
> + __u32 len; /* length of memory region */
> + __u8 imi_page; /* 1 if memory is part of the IMI */
> + __u8 page_type; /* page type */
> + __u8 vmpl3_perms; /* VMPL3 permission mask */
> + __u8 vmpl2_perms; /* VMPL2 permission mask */
> + __u8 vmpl1_perms; /* VMPL1 permission mask */
> + };
> +
> +See the SEV-SNP spec for further details on how to build the VMPL permission
> +mask and page type.
> +
> +
> References
> ==========
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 41b83aa6b5f4..b5f0707d7ed6 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -18,6 +18,7 @@
> #include <linux/processor.h>
> #include <linux/trace_events.h>
> #include <linux/hugetlb.h>
> +#include <linux/sev.h>
>
> #include <asm/pkru.h>
> #include <asm/trapnr.h>
> @@ -233,6 +234,49 @@ static void sev_decommission(unsigned int handle)
> sev_guest_decommission(&decommission, NULL);
> }
>
> +static inline void snp_leak_pages(u64 pfn, enum pg_level level)
> +{
> + unsigned int npages = page_level_size(level) >> PAGE_SHIFT;
> +
> + WARN(1, "psc failed pfn 0x%llx pages %d (leaking)\n", pfn, npages);
> +
> + while (npages) {
> + memory_failure(pfn, 0);
> + dump_rmpentry(pfn);
> + npages--;
> + pfn++;
> + }
> +}
> +
> +static int snp_page_reclaim(u64 pfn)
> +{
> + struct sev_data_snp_page_reclaim data = {0};
> + int err, rc;
> +
> + data.paddr = __sme_set(pfn << PAGE_SHIFT);
> + rc = snp_guest_page_reclaim(&data, &err);
> + if (rc) {
> + /*
> + * If the reclaim failed, then page is no longer safe
> + * to use.
> + */
> + snp_leak_pages(pfn, PG_LEVEL_4K);
> + }
> +
> + return rc;
> +}
> +
> +static int host_rmp_make_shared(u64 pfn, enum pg_level level, bool leak)
> +{
> + int rc;
> +
> + rc = rmp_make_shared(pfn, level);
> + if (rc && leak)
> + snp_leak_pages(pfn, level);
> +
> + return rc;
> +}
> +
> static void sev_unbind_asid(struct kvm *kvm, unsigned int handle)
> {
> struct sev_data_deactivate deactivate;
> @@ -1902,6 +1946,123 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
> return rc;
> }
>
> +static bool is_hva_registered(struct kvm *kvm, hva_t hva, size_t len)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct list_head *head = &sev->regions_list;
> + struct enc_region *i;
> +
> + lockdep_assert_held(&kvm->lock);
> +
> + list_for_each_entry(i, head, list) {
> + u64 start = i->uaddr;
> + u64 end = start + i->size;
> +
> + if (start <= hva && end >= (hva + len))
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_launch_update data = {0};
> + struct kvm_sev_snp_launch_update params;
> + unsigned long npages, pfn, n = 0;
> + int *error = &argp->error;
> + struct page **inpages;
> + int ret, i, level;
> + u64 gfn;
> +
> + if (!sev_snp_guest(kvm))
> + return -ENOTTY;
> +
> + if (!sev->snp_context)
> + return -EINVAL;
> +
> + if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
> + return -EFAULT;
> +
> + /* Verify that the specified address range is registered. */
> + if (!is_hva_registered(kvm, params.uaddr, params.len))
> + return -EINVAL;
> +
> + /*
> + * The userspace memory is already locked so technically we don't
> + * need to lock it again. Later part of the function needs to know
> + * pfn so call the sev_pin_memory() so that we can get the list of
> + * pages to iterate through.
> + */
> + inpages = sev_pin_memory(kvm, params.uaddr, params.len, &npages, 1);
> + if (!inpages)
> + return -ENOMEM;
> +
> + /*
> + * Verify that all the pages are marked shared in the RMP table before
> + * going further. This is avoid the cases where the userspace may try
> + * updating the same page twice.
> + */
> + for (i = 0; i < npages; i++) {
> + if (snp_lookup_rmpentry(page_to_pfn(inpages[i]), &level) != 0) {
> + sev_unpin_memory(kvm, inpages, npages);
> + return -EFAULT;
> + }
> + }
> +
> + gfn = params.start_gfn;
> + level = PG_LEVEL_4K;
> + data.gctx_paddr = __psp_pa(sev->snp_context);
> +
> + for (i = 0; i < npages; i++) {
> + pfn = page_to_pfn(inpages[i]);
> +
> + ret = rmp_make_private(pfn, gfn << PAGE_SHIFT, level, sev_get_asid(kvm), true);
> + if (ret) {
> + ret = -EFAULT;
> + goto e_unpin;
> + }
> +
> + n++;
> + data.address = __sme_page_pa(inpages[i]);
> + data.page_size = X86_TO_RMP_PG_LEVEL(level);
> + data.page_type = params.page_type;
> + data.vmpl3_perms = params.vmpl3_perms;
> + data.vmpl2_perms = params.vmpl2_perms;
> + data.vmpl1_perms = params.vmpl1_perms;
> + ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, &data, error);
> + if (ret) {
> + /*
> + * If the command failed then need to reclaim the page.
> + */
> + snp_page_reclaim(pfn);
> + goto e_unpin;
> + }
> +
> + gfn++;
> + }
> +
> +e_unpin:
> + /* Content of memory is updated, mark pages dirty */
> + for (i = 0; i < n; i++) {
> + set_page_dirty_lock(inpages[i]);
> + mark_page_accessed(inpages[i]);
> +
> + /*
> + * If its an error, then update RMP entry to change page ownership
> + * to the hypervisor.
> + */
> + if (ret)
> + host_rmp_make_shared(pfn, level, true);
> + }
> +
> + /* Unlock the user pages */
> + sev_unpin_memory(kvm, inpages, npages);
> +
> + return ret;
> +}
> +
> int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_sev_cmd sev_cmd;
> @@ -1995,6 +2156,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> case KVM_SEV_SNP_LAUNCH_START:
> r = snp_launch_start(kvm, &sev_cmd);
> break;
> + case KVM_SEV_SNP_LAUNCH_UPDATE:
> + r = snp_launch_update(kvm, &sev_cmd);
> + break;
> default:
> r = -EINVAL;
> goto out;
> @@ -2113,6 +2277,29 @@ find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
> static void __unregister_enc_region_locked(struct kvm *kvm,
> struct enc_region *region)
> {
> + unsigned long i, pfn;
> + int level;
> +
> + /*
> + * The guest memory pages are assigned in the RMP table. Unassign it
> + * before releasing the memory.
> + */
> + if (sev_snp_guest(kvm)) {
> + for (i = 0; i < region->npages; i++) {
> + pfn = page_to_pfn(region->pages[i]);
> +
> + if (!snp_lookup_rmpentry(pfn, &level))
> + continue;
> +
> + cond_resched();
> +
> + if (level > PG_LEVEL_4K)
> + pfn &= ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
> +
> + host_rmp_make_shared(pfn, level, true);
> + }
> + }
> +
> sev_unpin_memory(kvm, region->pages, region->npages);
> list_del(&region->list);
> kfree(region);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 0cb119d66ae5..9b36b07414ea 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1813,6 +1813,7 @@ enum sev_cmd_id {
> /* SNP specific commands */
> KVM_SEV_SNP_INIT,
> KVM_SEV_SNP_LAUNCH_START,
> + KVM_SEV_SNP_LAUNCH_UPDATE,
>
> KVM_SEV_NR_MAX,
> };
> @@ -1929,6 +1930,24 @@ struct kvm_sev_snp_launch_start {
> __u8 pad[6];
> };
>
> +#define KVM_SEV_SNP_PAGE_TYPE_NORMAL 0x1
> +#define KVM_SEV_SNP_PAGE_TYPE_VMSA 0x2
> +#define KVM_SEV_SNP_PAGE_TYPE_ZERO 0x3
> +#define KVM_SEV_SNP_PAGE_TYPE_UNMEASURED 0x4
> +#define KVM_SEV_SNP_PAGE_TYPE_SECRETS 0x5
> +#define KVM_SEV_SNP_PAGE_TYPE_CPUID 0x6
> +
> +struct kvm_sev_snp_launch_update {
> + __u64 start_gfn;
> + __u64 uaddr;
> + __u32 len;
> + __u8 imi_page;
> + __u8 page_type;
> + __u8 vmpl3_perms;
> + __u8 vmpl2_perms;
> + __u8 vmpl1_perms;
> +};
> +
> #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
> #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
> #define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
> --
> 2.25.1
>

BR, Jarkko

2022-08-02 13:23:13

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 24/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command

On Mon, Jun 20, 2022 at 11:07:35PM +0000, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
> The command initializes a cryptographic digest context used to construct
> the measurement of the guest. If the guest is expected to be migrated,
> the command also binds a migration agent (MA) to the guest.
>
> For more information see the SEV-SNP specification.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> .../virt/kvm/x86/amd-memory-encryption.rst | 24 ++++
> arch/x86/kvm/svm/sev.c | 115 +++++++++++++++++-
> arch/x86/kvm/svm/svm.h | 1 +
> include/uapi/linux/kvm.h | 10 ++
> 4 files changed, 147 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> index 903023f524af..878711f2dca6 100644
> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> @@ -462,6 +462,30 @@ The flags bitmap is defined as::
> If the specified flags is not supported then return -EOPNOTSUPP, and the supported
> flags are returned.
>
> +19. KVM_SNP_LAUNCH_START
> +------------------------
> +
> +The KVM_SNP_LAUNCH_START command is used for creating the memory encryption
> +context for the SEV-SNP guest. To create the encryption context, user must
> +provide a guest policy, migration agent (if any) and guest OS visible
> +workarounds value as defined SEV-SNP specification.
> +
> +Parameters (in): struct kvm_snp_launch_start
> +
> +Returns: 0 on success, -negative on error
> +
> +::
> +
> + struct kvm_sev_snp_launch_start {
> + __u64 policy; /* Guest policy to use. */
> + __u64 ma_uaddr; /* userspace address of migration agent */
> + __u8 ma_en; /* 1 if the migtation agent is enabled */
> + __u8 imi_en; /* set IMI to 1. */
> + __u8 gosvw[16]; /* guest OS visible workarounds */
> + };
> +
> +See the SEV-SNP specification for further detail on the launch input.
> +
> References
> ==========
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 813bda7f7b55..9e6fc7a94ed7 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -21,6 +21,7 @@
> #include <asm/pkru.h>
> #include <asm/trapnr.h>
> #include <asm/fpu/xcr.h>
> +#include <asm/sev.h>
>
> #include "x86.h"
> #include "svm.h"
> @@ -73,6 +74,8 @@ static unsigned int nr_asids;
> static unsigned long *sev_asid_bitmap;
> static unsigned long *sev_reclaim_asid_bitmap;
>
> +static int snp_decommission_context(struct kvm *kvm);
> +
> struct enc_region {
> struct list_head list;
> unsigned long npages;
> @@ -98,12 +101,17 @@ static int sev_flush_asids(int min_asid, int max_asid)
> down_write(&sev_deactivate_lock);
>
> wbinvd_on_all_cpus();
> - ret = sev_guest_df_flush(&error);
> +
> + if (sev_snp_enabled)
> + ret = snp_guest_df_flush(&error);
> + else
> + ret = sev_guest_df_flush(&error);
>
> up_write(&sev_deactivate_lock);
>
> if (ret)
> - pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error);
> + pr_err("SEV%s: DF_FLUSH failed, ret=%d, error=%#x\n",
> + sev_snp_enabled ? "-SNP" : "", ret, error);
>
> return ret;
> }
> @@ -1825,6 +1833,74 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
> return ret;
> }
>
> +static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct sev_data_snp_gctx_create data = {};
> + void *context;
> + int rc;
> +
> + /* Allocate memory for context page */

Nit: this comment has very little value, if any. It's just stating
the obvious.

Instead, I'd add a description for the function:

/*
* Allocate and initialize a digest for the guest measurement.
*/
static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)

This would be much more helpful to get a grasp on "what I'm looking at".

> + context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
> + if (!context)
> + return NULL;
> +
> + data.gctx_paddr = __psp_pa(context);
> + rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
> + if (rc) {
> + snp_free_firmware_page(context);
> + return NULL;
> + }
> +
> + return context;
> +}
> +
> +static int snp_bind_asid(struct kvm *kvm, int *error)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_activate data = {0};
> +
> + data.gctx_paddr = __psp_pa(sev->snp_context);
> + data.asid = sev_get_asid(kvm);
> + return sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
> +}
> +
> +static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_launch_start start = {0};
> + struct kvm_sev_snp_launch_start params;
> + int rc;
> +
> + if (!sev_snp_guest(kvm))
> + return -ENOTTY;
> +
> + if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
> + return -EFAULT;
> +
> + sev->snp_context = snp_context_create(kvm, argp);
> + if (!sev->snp_context)
> + return -ENOTTY;
> +
> + start.gctx_paddr = __psp_pa(sev->snp_context);
> + start.policy = params.policy;
> + memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw));
> + rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_START, &start, &argp->error);
> + if (rc)
> + goto e_free_context;
> +
> + sev->fd = argp->sev_fd;
> + rc = snp_bind_asid(kvm, &argp->error);
> + if (rc)
> + goto e_free_context;
> +
> + return 0;
> +
> +e_free_context:
> + snp_decommission_context(kvm);
> +
> + return rc;
> +}
> +
> int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_sev_cmd sev_cmd;
> @@ -1915,6 +1991,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> case KVM_SEV_RECEIVE_FINISH:
> r = sev_receive_finish(kvm, &sev_cmd);
> break;
> + case KVM_SEV_SNP_LAUNCH_START:
> + r = snp_launch_start(kvm, &sev_cmd);
> + break;
> default:
> r = -EINVAL;
> goto out;
> @@ -2106,6 +2185,28 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
> return ret;
> }
>
> +static int snp_decommission_context(struct kvm *kvm)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_decommission data = {};
> + int ret;
> +
> + /* If context is not created then do nothing */
> + if (!sev->snp_context)
> + return 0;
> +
> + data.gctx_paddr = __sme_pa(sev->snp_context);
> + ret = snp_guest_decommission(&data, NULL);
> + if (WARN_ONCE(ret, "failed to release guest context"))
> + return ret;
> +
> + /* free the context page now */
> + snp_free_firmware_page(sev->snp_context);
> + sev->snp_context = NULL;
> +
> + return 0;
> +}
> +
> void sev_vm_destroy(struct kvm *kvm)
> {
> struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> @@ -2147,7 +2248,15 @@ void sev_vm_destroy(struct kvm *kvm)
> }
> }
>
> - sev_unbind_asid(kvm, sev->handle);
> + if (sev_snp_guest(kvm)) {
> + if (snp_decommission_context(kvm)) {
> + WARN_ONCE(1, "Failed to free SNP guest context, leaking asid!\n");
> + return;
> + }
> + } else {
> + sev_unbind_asid(kvm, sev->handle);
> + }
> +
> sev_asid_free(sev);
> }
>
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 2f45589ee596..71c011af098e 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -91,6 +91,7 @@ struct kvm_sev_info {
> struct misc_cg *misc_cg; /* For misc cgroup accounting */
> atomic_t migration_in_progress;
> u64 snp_init_flags;
> + void *snp_context; /* SNP guest context page */
> };
>
> struct kvm_svm {
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 0f912cefc544..0cb119d66ae5 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1812,6 +1812,7 @@ enum sev_cmd_id {
>
> /* SNP specific commands */
> KVM_SEV_SNP_INIT,
> + KVM_SEV_SNP_LAUNCH_START,
>
> KVM_SEV_NR_MAX,
> };
> @@ -1919,6 +1920,15 @@ struct kvm_snp_init {
> __u64 flags;
> };
>
> +struct kvm_sev_snp_launch_start {
> + __u64 policy;
> + __u64 ma_uaddr;
> + __u8 ma_en;
> + __u8 imi_en;
> + __u8 gosvw[16];
> + __u8 pad[6];
> +};
> +
> #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
> #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
> #define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
> --
> 2.25.1
>

BR, Jarkko

2022-08-02 13:31:42

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 28/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command

On Mon, Jun 20, 2022 at 11:08:38PM +0000, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> The KVM_SEV_SNP_LAUNCH_FINISH finalize the cryptographic digest and stores
> it as the measurement of the guest at launch.
>
> While finalizing the launch flow, it also issues the LAUNCH_UPDATE command
> to encrypt the VMSA pages.

Nit: for completeness' sake it would be nice to conclude in this paragraph
whether LAUNCH_UPDATE is usable after LAUNCH_FINISH.

>
> If its an SNP guest, then VMSA was added in the RMP entry as
> a guest owned page and also removed from the kernel direct map
> so flush it later after it is transitioned back to hypervisor
> state and restored in the direct map.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> ---
> .../virt/kvm/x86/amd-memory-encryption.rst | 22 ++++
> arch/x86/kvm/svm/sev.c | 119 ++++++++++++++++++
> include/uapi/linux/kvm.h | 14 +++
> 3 files changed, 155 insertions(+)
>
> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> index 62abd5c1f72b..750162cff87b 100644
> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> @@ -514,6 +514,28 @@ Returns: 0 on success, -negative on error
> See the SEV-SNP spec for further details on how to build the VMPL permission
> mask and page type.
>
> +21. KVM_SNP_LAUNCH_FINISH
> +-------------------------
> +
> +After completion of the SNP guest launch flow, the KVM_SNP_LAUNCH_FINISH command can be
> +issued to make the guest ready for the execution.

Some remark about LAUNCH_UPDATE post-LAUNCH_FINISH would be nice.

> +
> +Parameters (in): struct kvm_sev_snp_launch_finish
> +
> +Returns: 0 on success, -negative on error
> +
> +::
> +
> + struct kvm_sev_snp_launch_finish {
> + __u64 id_block_uaddr;
> + __u64 id_auth_uaddr;
> + __u8 id_block_en;
> + __u8 auth_key_en;
> + __u8 host_data[32];
> + };
> +
> +
> +See SEV-SNP specification for further details on launch finish input parameters.
>
> References
> ==========
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index a9461d352eda..a5b90469683f 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2095,6 +2095,106 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> return ret;
> }
>
> +static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_launch_update data = {};
> + int i, ret;
> +
> + data.gctx_paddr = __psp_pa(sev->snp_context);
> + data.page_type = SNP_PAGE_TYPE_VMSA;
> +
> + for (i = 0; i < kvm->created_vcpus; i++) {
> + struct vcpu_svm *svm = to_svm(xa_load(&kvm->vcpu_array, i));
> + u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
> +
> + /* Perform some pre-encryption checks against the VMSA */
> + ret = sev_es_sync_vmsa(svm);
> + if (ret)
> + return ret;
> +
> + /* Transition the VMSA page to a firmware state. */
> + ret = rmp_make_private(pfn, -1, PG_LEVEL_4K, sev->asid, true);
> + if (ret)
> + return ret;
> +
> + /* Issue the SNP command to encrypt the VMSA */
> + data.address = __sme_pa(svm->sev_es.vmsa);
> + ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
> + &data, &argp->error);
> + if (ret) {
> + snp_page_reclaim(pfn);
> + return ret;
> + }
> +
> + svm->vcpu.arch.guest_state_protected = true;
> + }
> +
> + return 0;
> +}
> +
> +static int snp_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_launch_finish *data;
> + void *id_block = NULL, *id_auth = NULL;
> + struct kvm_sev_snp_launch_finish params;

Nit: "params" should be the 2nd declaration (reverse
christmas tree order).

> + int ret;
> +
> + if (!sev_snp_guest(kvm))
> + return -ENOTTY;
> +
> + if (!sev->snp_context)
> + return -EINVAL;
> +
> + if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
> + return -EFAULT;
> +
> + /* Measure all vCPUs using LAUNCH_UPDATE before we finalize the launch flow. */
> + ret = snp_launch_update_vmsa(kvm, argp);
> + if (ret)
> + return ret;
> +
> + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
> + if (!data)
> + return -ENOMEM;
> +
> + if (params.id_block_en) {
> + id_block = psp_copy_user_blob(params.id_block_uaddr, KVM_SEV_SNP_ID_BLOCK_SIZE);
> + if (IS_ERR(id_block)) {
> + ret = PTR_ERR(id_block);
> + goto e_free;
> + }
> +
> + data->id_block_en = 1;
> + data->id_block_paddr = __sme_pa(id_block);
> + }
> +
> + if (params.auth_key_en) {
> + id_auth = psp_copy_user_blob(params.id_auth_uaddr, KVM_SEV_SNP_ID_AUTH_SIZE);
> + if (IS_ERR(id_auth)) {
> + ret = PTR_ERR(id_auth);
> + goto e_free_id_block;
> + }
> +
> + data->auth_key_en = 1;
> + data->id_auth_paddr = __sme_pa(id_auth);
> + }
> +
> + data->gctx_paddr = __psp_pa(sev->snp_context);
> + ret = sev_issue_cmd(kvm, SEV_CMD_SNP_LAUNCH_FINISH, data, &argp->error);
> +
> + kfree(id_auth);
> +
> +e_free_id_block:
> + kfree(id_block);
> +
> +e_free:
> + kfree(data);
> +
> + return ret;
> +}
> +
> int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_sev_cmd sev_cmd;
> @@ -2191,6 +2291,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> case KVM_SEV_SNP_LAUNCH_UPDATE:
> r = snp_launch_update(kvm, &sev_cmd);
> break;
> + case KVM_SEV_SNP_LAUNCH_FINISH:
> + r = snp_launch_finish(kvm, &sev_cmd);
> + break;
> default:
> r = -EINVAL;
> goto out;
> @@ -2696,11 +2799,27 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
>
> svm = to_svm(vcpu);
>
> + /*
> + * If its an SNP guest, then VMSA was added in the RMP entry as
> + * a guest owned page. Transition the page to hypervisor state
> + * before releasing it back to the system.
> + * Also the page is removed from the kernel direct map, so flush it
> + * later after it is transitioned back to hypervisor state and
> + * restored in the direct map.
> + */
> + if (sev_snp_guest(vcpu->kvm)) {
> + u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
> +
> + if (host_rmp_make_shared(pfn, PG_LEVEL_4K, false))
> + goto skip_vmsa_free;
> + }
> +
> if (vcpu->arch.guest_state_protected)
> sev_flush_encrypted_page(vcpu, svm->sev_es.vmsa);
>
> __free_page(virt_to_page(svm->sev_es.vmsa));
>
> +skip_vmsa_free:
> if (svm->sev_es.ghcb_sa_free)
> kvfree(svm->sev_es.ghcb_sa);
> }
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 9b36b07414ea..5a4662716b6a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1814,6 +1814,7 @@ enum sev_cmd_id {
> KVM_SEV_SNP_INIT,
> KVM_SEV_SNP_LAUNCH_START,
> KVM_SEV_SNP_LAUNCH_UPDATE,
> + KVM_SEV_SNP_LAUNCH_FINISH,
>
> KVM_SEV_NR_MAX,
> };
> @@ -1948,6 +1949,19 @@ struct kvm_sev_snp_launch_update {
> __u8 vmpl1_perms;
> };
>
> +#define KVM_SEV_SNP_ID_BLOCK_SIZE 96
> +#define KVM_SEV_SNP_ID_AUTH_SIZE 4096
> +#define KVM_SEV_SNP_FINISH_DATA_SIZE 32
> +
> +struct kvm_sev_snp_launch_finish {
> + __u64 id_block_uaddr;
> + __u64 id_auth_uaddr;
> + __u8 id_block_en;
> + __u8 auth_key_en;
> + __u8 host_data[KVM_SEV_SNP_FINISH_DATA_SIZE];
> + __u8 pad[6];
> +};
> +
> #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
> #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
> #define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
> --
> 2.25.1
>

BR, Jarkko

2022-08-02 14:20:54

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

On Mon, Aug 01, 2022 at 11:32:21PM +0000, Kalra, Ashish wrote:
> But we can't use this struct on a core/platform which has a different
> layout, so aren't the model checks required ?

That would be a problem only if the already specified fields move or get
resized.

If their offset and size don't change, you're good.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-03 20:28:39

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 06/49] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction

On Mon, Aug 01, 2022 at 10:31:26PM +0000, Kalra, Ashish wrote:
> The struct rmpentry is the raw layout of the RMP table entry
> while struct rmpupdate is the structure expected by the rmpupdate
> instruction for programming the RMP table entries.
>
> Arguably, we can program a struct rmpupdate internally from a struct
> rmpentry.
>
> But we will still need struct rmpupdate for issuing the rmpupdate
> instruction, so it is probably cleaner to keep it this way, as it only
> has two main callers - rmp_make_private() and rmp_make_shared().

Ok, but then call it struct rmp_state. The APM says in the RMPUPDATE
blurb:

"The RCX register provides the effective address of a 16-byte data
structure which contains the new RMP state."

so the function signature should be:

static int rmpupdate(u64 pfn, struct rmp_state *new)

and this is basically the description of that. It can't get any more
user-friendly than this.
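
For reference, a minimal sketch of what the renamed structure and signature
could look like (field layout per the 16-byte RMPUPDATE input described in
the APM; treat the details as illustrative, not as the final patch):

struct rmp_state {
        u64 gpa;
        u8  assigned;
        u8  pagesize;
        u8  immutable;
        u8  rsvd;
        u32 asid;
} __packed;

static int rmpupdate(u64 pfn, struct rmp_state *new);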

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-04 10:59:50

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 27/49] KVM: SVM: Mark the private vma unmerable for SEV-SNP guests

On 6/21/22 01:08, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> When SEV-SNP is enabled, the guest private pages are added in the RMP
> table; while adding the pages, the rmp_make_private() unmaps the pages
> from the direct map. If KSM attempts to access those unmapped pages then
> it will trigger #PF (page-not-present).
>
> Encrypted guest pages cannot be shared between the process, so an
> userspace should not mark the region mergeable but to be safe, mark the
> process vma unmerable before adding the pages in the RMP table.
>
> Signed-off-by: Brijesh Singh <[email protected]>

Note this doesn't really mark the vma unmergeable, rather it unmarks it as
mergeable, and unmerges any already merged pages.
Which seems like a good idea. Is snp_launch_update() the only place that
needs it or can private pages be added elsewhere too?

However, AFAICS nothing stops userspace from doing another
madvise(MADV_MERGEABLE) afterwards, so we should somehow make sure that KSM
will still be prevented, as we should protect the kernel even from a buggy
userspace. So either we stop it with a flag at the vma level (see ksm_madvise()
for which flags currently stop it), or at the page level - currently only
PageAnon() pages are handled. The vma level is probably easier/cheaper.
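
For reference, a minimal sketch of the unmerge step itself (the helper name
is hypothetical; the patch presumably does the equivalent via ksm_madvise()).
Note this only covers the one-shot unmerge, not the vma- or page-level guard
discussed above:

#include <linux/ksm.h>
#include <linux/mman.h>

/* Clear VM_MERGEABLE and unmerge any pages KSM has already merged. */
static int snp_mark_vma_unmergeable(struct vm_area_struct *vma)
{
        return ksm_madvise(vma, vma->vm_start, vma->vm_end,
                           MADV_UNMERGEABLE, &vma->vm_flags);
}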

It's also possible that this will solve itself with the switch to UPM as
those vma's or pages might be incompatible with ksm naturally (didn't check
closely), and then this patch can be just dropped. But we should double-check.


2022-08-04 12:13:49

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 07/49] x86/sev: Invalid pages from direct map when adding it to RMP table

On Mon, Aug 01, 2022 at 11:57:09PM +0000, Kalra, Ashish wrote:
> You mean set_memory_present() ?

Right, that.

We have set_memory_np() but set_memory_present(). Talk about
consistency... ;-\

> But again, calling set_direct_map_invalid_noflush() is easier to
> understand from the calling function's point of view as it correlates
> to the functionality of invalidating the page from kernel direct map ?

You mean, we prefer easy to understand to performance?

set_direct_map_invalid_noflush() means crap to me. I have to go look it
up - set memory P or NP is much clearer.

The patch which added those things you consider easier to understand is:

commit d253ca0c3865a8d9a8c01143cf20425e0be4d0ce
Author: Rick Edgecombe <[email protected]>
Date: Thu Apr 25 17:11:34 2019 -0700

x86/mm/cpa: Add set_direct_map_*() functions

Add two new functions set_direct_map_default_noflush() and
set_direct_map_invalid_noflush() for setting the direct map alias for the
page to its default valid permissions and to an invalid state that cannot
be cached in a TLB, respectively. These functions do not flush the TLB.

I don't see how this fits with your use case...

Also, your helpers are called restore_direct_map and
invalidate_direct_map. That's already explaining what this is supposed
to do.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-08 13:20:41

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 08/49] x86/traps: Define RMP violation #PF error code

On Mon, Jun 20, 2022 at 11:03:27PM +0000, Ashish Kalra wrote:
> @@ -12,15 +14,17 @@
> * bit 4 == 1: fault was an instruction fetch
> * bit 5 == 1: protection keys block access
> * bit 15 == 1: SGX MMU page-fault
> + * bit 31 == 1: fault was due to RMP violation
> */
> enum x86_pf_error_code {
> - X86_PF_PROT = 1 << 0,
> - X86_PF_WRITE = 1 << 1,
> - X86_PF_USER = 1 << 2,
> - X86_PF_RSVD = 1 << 3,
> - X86_PF_INSTR = 1 << 4,
> - X86_PF_PK = 1 << 5,
> - X86_PF_SGX = 1 << 15,
> + X86_PF_PROT = BIT_ULL(0),
> + X86_PF_WRITE = BIT_ULL(1),
> + X86_PF_USER = BIT_ULL(2),
> + X86_PF_RSVD = BIT_ULL(3),
> + X86_PF_INSTR = BIT_ULL(4),
> + X86_PF_PK = BIT_ULL(5),
> + X86_PF_SGX = BIT_ULL(15),
> + X86_PF_RMP = BIT_ULL(31),

Yeah, I remember dhansen asked for those to use the BIT() macro, but the
_ULL is overkill. Those PF flags all fit within 32 bits, i.e., in an
unsigned int.

But we don't have a BIT_UI() so I guess the next best thing - BIT() -
which uses UL internally, should be good enough.

So pls use BIT() here - not BIT_ULL().
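
Applied to the hunk above, that would look roughly like this:

enum x86_pf_error_code {
        X86_PF_PROT     = BIT(0),
        X86_PF_WRITE    = BIT(1),
        X86_PF_USER     = BIT(2),
        X86_PF_RSVD     = BIT(3),
        X86_PF_INSTR    = BIT(4),
        X86_PF_PK       = BIT(5),
        X86_PF_SGX      = BIT(15),
        X86_PF_RMP      = BIT(31),
};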

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-08 19:39:25

by Dionna Amalie Glaze

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 17/49] crypto: ccp: Add the SNP_{SET,GET}_EXT_CONFIG command

To preface, I don't want to delay this patch set, only have the
conversation at the most appropriate place.

>
> > The SEV-SNP firmware provides the SNP_CONFIG command used to set the
> > system-wide configuration value for SNP guests. The information includes
> > the TCB version string to be reported in guest attestation reports.
>

The system-wide aspect of this makes me wonder if we can also have a
VM instance-specific extension. This is important for the use case
that we may see secure boot variables included in the launch
measurement, making offline signing of the UEFI image impossible. We
can't sign the cross-product of all UEFI builds and every user's EFI
variables. We'd like to include an instance-specific certificate that
specifies the platform-endorsed golden measurement of the UEFI.

An alternative that doesn't require a change to the kernel is to just
make this certificate fetchable from a FAMILY_ID-keyed, predetermined
URL prefix + IMAGE_ID + '.crt', but this requires a download (and
continuous hosting) to do something as routine as collecting an
attestation report. It's up to the upstream community to determine if
that is an acceptable cost to keep the complexity of a certificate
table merge operation out of the kernel.

The SNP API specification gives an interpretation to the data blob
here as a table of GUID/offset pairs followed by data blobs that
presumably are at the appropriate offsets into the data pages. The
spec allows for the host to add any number of GUID/offset pairs it
wants, with 3 specific GUIDs recommended for the AMD PSP certificate
chain.
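
Roughly, each entry in that table looks like this (per the GHCB spec's
certificate table layout; sketch only):

struct cert_table_entry {
        u8  guid[16];   /* GUID identifying this certificate */
        u32 offset;     /* offset of the cert blob from the start of the table */
        u32 length;     /* length of the cert blob in bytes */
} __packed;

The table is terminated by an all-zero entry, and the certificate data itself
follows at the given offsets.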

The snp_guest_ext_guest_request function in ccp is what passes back
the certificate data that was previously stored, so I'm wondering if
it can take an extra (pointer,len) pair of VM instance certificate
data to merge with the host certificate data before returning to the
guest. The new required length is the sum total of both the header
certs and instance certs. The operation to copy the data is no longer
a memcpy but a header merge that tracks the offset shifts caused by a
larger header and other certificates in the remaining data pages.

I can propose my own patch on top of this v6 patch set that adds a KVM
ioctl like KVM_{GET,SET}_INSTANCE_SNP_EXT_CONFIG and then pass along
the stored certificate blob in the request call. I'd prefer to have
the design agreed upon upfront though.

--
-Dionna Glaze, PhD (she/her)

2022-08-08 21:34:39

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 17/49] crypto: ccp: Add the SNP_{SET,GET}_EXT_CONFIG command

On 8/8/22 14:27, Dionna Amalie Glaze wrote:
> To preface, I don't want to delay this patch set, only have the
> conversation at the most appropriate place.
>
>>
>>> The SEV-SNP firmware provides the SNP_CONFIG command used to set the
>>> system-wide configuration value for SNP guests. The information includes
>>> the TCB version string to be reported in guest attestation reports.
>>
>
> The system-wide aspect of this makes me wonder if we can also have a
> VM instance-specific extension. This is important for the use case
> that we may see secure boot variables included in the launch
> measurement, making offline signing of the UEFI image impossible. We
> can't sign the cross-product of all UEFI builds and every user's EFI
> variables. We'd like to include an instance-specific certificate that
> specifies the platform-endorsed golden measurement of the UEFI.
>
> An alternative that doesn't require a change to the kernel is to just
> make this certificate fetchable from a FAMILY_ID-keyed, predetermined
> URL prefix + IMAGE_ID + '.crt', but this requires a download (and
> continuous hosting) to do something as routine as collecting an
> attestation report. It's up to the upstream community to determine if
> that is an acceptable cost to keep the complexity of a certificate
> table merge operation out of the kernel.
>
> The SNP API specification gives an interpretation to the data blob

That's the GHCB specification, not the SNP API.

> here as a table of GUID/offset pairs followed by data blobs that
> presumably are at the appropriate offsets into the data pages. The
> spec allows for the host to add any number of GUID/offset pairs it
> wants, with 3 specific GUIDs recommended for the AMD PSP certificate
> chain.
>
> The snp_guest_ext_guest_request function in ccp is what passes back
> the certificate data that was previously stored, so I'm wondering if
> it can take an extra (pointer,len) pair of VM instance certificate
> data to merge with the host certificate data before returning to the
> guest. The new required length is the sum total of both the header
> certs and instance certs. The operation to copy the data is no longer
> a memcpy but a header merge that tracks the offset shifts caused by a
> larger header and other certificates in the remaining data pages.
>
> I can propose my own patch on top of this v6 patch set that adds a KVM
> ioctl like KVM_{GET,SET}_INSTANCE_SNP_EXT_CONFIG and then pass along

Would it be a burden to supply all the certificates, both system and per-VM,
in this KVM call? On the SNP Extended Guest Request, the hypervisor could
just check if there is a per-VM blob and return that, or else return the
system-wide blob (if present).

Thanks,
Tom


> the stored certificate blob in the request call. I'd prefer to have
> the design agreed upon upfront though.
>

2022-08-08 23:27:59

by Dionna Amalie Glaze

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 17/49] crypto: ccp: Add the SNP_{SET,GET}_EXT_CONFIG command

> Would it be burden to supply all the certificates, both system and per-VM,
> in this KVM call? On the SNP Extended Guest Request, the hypervisor could
> just check if there is a per-VM blob and return that or else return the
> system-wide blob (if present).
>

I think that's fine by me. We can use SNP_GET_EXT_CONFIG, merge in
user space, and create an instance override with a KVM ioctl without
touching ccp.

--
-Dionna Glaze, PhD (she/her)

2022-08-09 14:01:47

by Sabin Rapan

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 26/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command

> +static bool is_hva_registered(struct kvm *kvm, hva_t hva, size_t len)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct list_head *head = &sev->regions_list;
> + struct enc_region *i;
> +
> + lockdep_assert_held(&kvm->lock);
> +
> + list_for_each_entry(i, head, list) {
> + u64 start = i->uaddr;
> + u64 end = start + i->size;
> +
> + if (start <= hva && end >= (hva + len))
> + return true;
> + }
> +
> + return false;
> +}

Since KVM_MEMORY_ENCRYPT_REG_REGION should be called for every memory region the user gives to kvm,
is the regions_list any different from kvm's memslots?

> +
> +static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_launch_update data = {0};
> + struct kvm_sev_snp_launch_update params;
> + unsigned long npages, pfn, n = 0;
> + int *error = &argp->error;
> + struct page **inpages;
> + int ret, i, level;
> + u64 gfn;
> +
> + if (!sev_snp_guest(kvm))
> + return -ENOTTY;
> +
> + if (!sev->snp_context)
> + return -EINVAL;
> +
> + if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
> + return -EFAULT;
> +
> + /* Verify that the specified address range is registered. */
> + if (!is_hva_registered(kvm, params.uaddr, params.len))
> + return -EINVAL;
> +
> + /*
> + * The userspace memory is already locked so technically we don't
> + * need to lock it again. Later part of the function needs to know
> + * pfn so call the sev_pin_memory() so that we can get the list of
> + * pages to iterate through.
> + */
> + inpages = sev_pin_memory(kvm, params.uaddr, params.len, &npages, 1);
> + if (!inpages)
> + return -ENOMEM;

sev_pin_memory will call pin_user_pages() which fails for PFNMAP vmas that you
would get if you use memory allocated from an IO driver.
Using gfn_to_pfn instead will make this work with vmas backed by pages or raw
pfn mappings.
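
A rough sketch of that alternative (illustrative only; looping over the whole
range and the release/dirty handling are omitted):

        kvm_pfn_t pfn;

        /* Translate one guest frame; works for page-backed and PFNMAP vmas. */
        pfn = gfn_to_pfn(kvm, gfn);
        if (is_error_noslot_pfn(pfn))
                return -EFAULT;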





2022-08-09 17:02:41

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Mon, Jun 20, 2022 at 11:03:43PM +0000, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> When SEV-SNP is enabled globally, a write from the host goes through the

globally?

Can SNP even be enabled any other way?

I see the APM talks about it being enabled globally, I guess this means
the RMP represents *all* system memory?

> @@ -1209,6 +1210,60 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> }
> NOKPROBE_SYMBOL(do_kern_addr_fault);
>
> +static inline size_t pages_per_hpage(int level)
> +{
> + return page_level_size(level) / PAGE_SIZE;
> +}
> +
> +/*
> + * Return 1 if the caller need to retry, 0 if it the address need to be split
> + * in order to resolve the fault.
> + */

Magic numbers.

Pls do instead:

enum rmp_pf_ret {
RMP_PF_SPLIT = 0,
RMP_PF_RETRY = 1,
};

and use those instead.

> +static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
> + unsigned long address)
> +{
> + int rmp_level, level;
> + pte_t *pte;
> + u64 pfn;
> +
> + pte = lookup_address_in_mm(current->mm, address, &level);
> +
> + /*
> + * It can happen if there was a race between an unmap event and
> + * the RMP fault delivery.
> + */

You need to elaborate more here: an RMP fault can happen and then the
page can get unmapped? What is the exact scenario here?

> + if (!pte || !pte_present(*pte))
> + return 1;
> +
> + pfn = pte_pfn(*pte);
> +
> + /* If its large page then calculte the fault pfn */
> + if (level > PG_LEVEL_4K) {
> + unsigned long mask;
> +
> + mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
> + pfn |= (address >> PAGE_SHIFT) & mask;

Oh boy, this is unnecessarily complicated. Isn't this

pfn |= pud_index(address);

or
pfn |= pmd_index(address);

depending on the level?

I think it is but it needs more explaining.

In any case, those are two static masks exactly and they don't need to
be computed for each #PF.

> diff --git a/mm/memory.c b/mm/memory.c
> index 7274f2b52bca..c2187ffcbb8e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4945,6 +4945,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> return 0;
> }
>
> +static int handle_split_page_fault(struct vm_fault *vmf)
> +{
> + if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
> + return VM_FAULT_SIGBUS;

Yah, this looks weird: generic code implies that page splitting after a
#PF makes sense only when SEV is present and none otherwise.

Why?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-10 04:02:44

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address


Hello Boris,

>> When SEV-SNP is enabled globally, a write from the host goes through
>> the

>globally?

>Can SNP be even enabled any other way?

>I see the APM talks about it being enabled globally, I guess this means the RMP represents *all* system memory?

Actually, the SNP feature can be enabled globally, but SNP is activated on a per-VM basis.

From the APM:
The term SNP-enabled indicates that SEV-SNP is globally enabled in the SYSCFG
MSR. The term SNP-active indicates that SEV-SNP is enabled for a specific VM in the
SEV_FEATURES field of its VMSA.

>> +/*
>> + * Return 1 if the caller need to retry, 0 if it the address need to be split
>> + * in order to resolve the fault.
>> + */

>Magic numbers.

>Pls do instead:

>enum rmp_pf_ret {
> RMP_PF_SPLIT = 0,
> RMP_PF_RETRY = 1,
>};

>and use those instead.
Ok.

>> +static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
>> + unsigned long address)
>> +{
>> + int rmp_level, level;
>> + pte_t *pte;
>> + u64 pfn;
>> +
>> + pte = lookup_address_in_mm(current->mm, address, &level);
>> +
>> + /*
>> + * It can happen if there was a race between an unmap event and
>> + * the RMP fault delivery.
>> + */

>You need to elaborate more here: a RMP fault can happen and then the
>page can get unmapped? What is the exact scenario here?

Yes, the page can get unmapped while the RMP fault is being handled. I
will add more explanation here.

>> + if (!pte || !pte_present(*pte))
>> + return 1;
>> +
>> + pfn = pte_pfn(*pte);
>> +
>> + /* If its large page then calculte the fault pfn */
>> + if (level > PG_LEVEL_4K) {
>> + unsigned long mask;
>> +
>> + mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
>> + pfn |= (address >> PAGE_SHIFT) & mask;

>Oh boy, this is unnecessarily complicated. Isn't this

> pfn |= pud_index(address);

>or
> pfn |= pmd_index(address);

>depending on the level?

Actually, the above computes an index into the RMP table: basically the index of
the 4K page within the hugepage mapping, or in other words the index of the RMP
table entry for the 4K page(s) corresponding to a hugepage.

So pud_index()/pmd_index() can't be used for this.

>I think it is but it needs more explaining.

>In any case, those are two static masks exactly and they don't need to
>be computed for each #PF.

>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7274f2b52bca..c2187ffcbb8e 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4945,6 +4945,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>> return 0;
>> }
>>
>>. +static int handle_split_page_fault(struct vm_fault *vmf)
>> +{
> >+ if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
>> + return VM_FAULT_SIGBUS;

>Yah, this looks weird: generic code implies that page splitting after a
>#PF makes sense only when SEV is present and none otherwise.

It is mainly a wrapper around __split_huge_pmd() for the SNP use case,
where the host hugepage is split to stay in sync with the RMP table.

Thanks,
Ashish

2022-08-10 09:51:08

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Wed, Aug 10, 2022 at 03:59:34AM +0000, Kalra, Ashish wrote:
> Actually SNP feature can be enabled globally, but SNP is activated on a per VM basis.
>
> From the APM:
> The term SNP-enabled indicates that SEV-SNP is globally enabled in the SYSCFG
> MSR. The term SNP-active indicates that SEV-SNP is enabled for a specific VM in the
> SEV_FEATURES field of its VMSA

Aha, and I was wondering whether "globally" meant the RMP needs to cover
all physical memory but I guess that isn't the case:

"RMP-Covered: Checks that the target page is covered by the RMP. A page
is covered by the RMP if its corresponding RMP entry is below RMP_END.
Any page not covered by the RMP is considered a Hypervisor-Owned page."

> >You need to elaborate more here: a RMP fault can happen and then the
> >page can get unmapped? What is the exact scenario here?
>
> Yes, if the page gets unmapped while the RMP fault was being handled,
> will add more explanation here.

So what's the logic here in returning 1, i.e., retry?

Why should a fault for a page that gets unmapped be retried? The fault
in that case should be ignored, IMO. Returning from do_user_addr_fault()
there, without splitting, will have the same effect, but you need to have
a separate return value definition so that it is clear what needs to
happen. And that return value should be != 0 so that the current check
still works.

> Actually, the above computes an index into the RMP table.

What index in the RMP table?

> It is basically an index into the 4K page within the hugepage mapped
> in the RMP table or in other words an index into the RMP table entry
> for 4K page(s) corresponding to a hugepage.

So pte_index(address) and for 1G pages, pmd_index(address).

So no reinventing the wheel if we already have helpers for that.

> It is mainly a wrapper around__split_huge_pmd() for SNP use case where
> the host hugepage is split to be in sync with the RMP table.

I see what it is. And I'm saying this looks wrong. You're enforcing page
splitting to be a valid thing to do only for SEV machines. Why?

Why is

if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
return VM_FAULT_SIGBUS;

there at all?

This is generic code you're touching - not arch/x86/.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-10 22:07:59

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address


Hello Boris,

>> >You need to elaborate more here: a RMP fault can happen and then the
>> >page can get unmapped? What is the exact scenario here?
>>
>> Yes, if the page gets unmapped while the RMP fault was being handled,
>> will add more explanation here.

>So what's the logic here to return 1, i.e., retry?

>Why should a fault for a page that gets unmapped be retried? The fault in that case should be ignored, IMO. It'll have the same effect to return from do_user_addr_fault() there, without splitting but you need to have a separate return value >definition so that it is clear what needs to happen. And that return value should be != 0 so that the current check still works.

if (!pte || !pte_present(*pte))
return 1;

This is more of a sanity check: returning 1 will cause the fault handler to return and ignore the fault for the current #PF.
If the page got unmapped, the fault will not happen again and there will be no retry, so the fault in this case is
simply ignored.
The other case where 1 is returned is an RMP table lookup failure; in that case the faulting process is terminated,
which resolves the fault.

>> Actually, the above computes an index into the RMP table.

>What index in the RMP table?

>> It is basically an index into the 4K page within the hugepage mapped
>> in the RMP table or in other words an index into the RMP table entry
>> for 4K page(s) corresponding to a hugepage.

>So pte_index(address) and for 1G pages, pmd_index(address).

>So no reinventing the wheel if we already have helpers for that.

Yes, that makes sense; pte_index(address) is exactly what is
required for 2M hugepages.

Will use pte_index() for 2M pages and pmd_index() for 1G pages.
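
Roughly, equivalent to the original mask computation above (sketch only; note
that for 1G mappings pmd_index() yields a 2M-chunk index, so it is scaled to
4K units here):

        pfn = pte_pfn(*pte);

        if (level == PG_LEVEL_2M)
                pfn |= pte_index(address);                /* 4K index within the 2M page */
        else if (level == PG_LEVEL_1G)
                pfn |= pmd_index(address) * PTRS_PER_PTE; /* 2M index within the 1G page,
                                                             in 4K-page units */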

>> It is mainly a wrapper around__split_huge_pmd() for SNP use case where
>> the host hugepage is split to be in sync with the RMP table.

>I see what it is. And I'm saying this looks wrong. You're enforcing page splitting to be a valid thing to do only for SEV machines. Why?

>Why is

> if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
> return VM_FAULT_SIGBUS;

>there at all?

>This is generic code you're touching - not arch/x86/.

Ok, so you are suggesting that we remove this check and simply keep this function as a wrapper around __split_huge_pmd(),
making it a generic utility function.

Thanks,
Ashish

2022-08-11 14:32:38

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Wed, Aug 10, 2022 at 10:00:57PM +0000, Kalra, Ashish wrote:
> This is more like a sanity check and returning 1 will cause the fault
> handler to return and ignore the fault for current #PF case. If the
> page got unmapped, the fault will not happen again and there will be
> no retry, so the fault in this case is being ignored.

I know what will happen. I'm asking you to make this explicit in the
code because a separate define documents the situation.

One more return value != 0 won't hurt.
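
I.e., something like this (the name is just an example):

enum rmp_pf_ret {
        RMP_PF_SPLIT    = 0,
        RMP_PF_RETRY    = 1,
        RMP_PF_UNMAP    = 2,    /* page was unmapped, nothing to do */
};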

> Ok, so you are suggesting that we remove this check and simply keep
> this function wrapping around __split_huge_pmd(). This becomes a
> generic utility function.

Yes, it is in generic code so it had better be a generic function. That's
why I'm questioning the vendor-specific check there.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-11 15:18:33

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On 6/21/22 01:03, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> When SEV-SNP is enabled globally, a write from the host goes through the
> RMP check. When the host writes to pages, hardware checks the following
> conditions at the end of page walk:
>
> 1. Assigned bit in the RMP table is zero (i.e page is shared).
> 2. If the page table entry that gives the sPA indicates that the target
> page size is a large page, then all RMP entries for the 4KB
> constituting pages of the target must have the assigned bit 0.
> 3. Immutable bit in the RMP table is not zero.
>
> The hardware will raise page fault if one of the above conditions is not
> met. Try resolving the fault instead of taking fault again and again. If
> the host attempts to write to the guest private memory then send the
> SIGBUS signal to kill the process. If the page level between the host and
> RMP entry does not match, then split the address to keep the RMP and host
> page levels in sync.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/mm/fault.c | 66 ++++++++++++++++++++++++++++++++++++++++
> include/linux/mm.h | 3 +-
> include/linux/mm_types.h | 3 ++
> mm/memory.c | 13 ++++++++
> 4 files changed, 84 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index a4c270e99f7f..f5de9673093a 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -19,6 +19,7 @@
> #include <linux/uaccess.h> /* faulthandler_disabled() */
> #include <linux/efi.h> /* efi_crash_gracefully_on_page_fault()*/
> #include <linux/mm_types.h>
> +#include <linux/sev.h> /* snp_lookup_rmpentry() */
>
> #include <asm/cpufeature.h> /* boot_cpu_has, ... */
> #include <asm/traps.h> /* dotraplinkage, ... */
> @@ -1209,6 +1210,60 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> }
> NOKPROBE_SYMBOL(do_kern_addr_fault);
>
> +static inline size_t pages_per_hpage(int level)
> +{
> + return page_level_size(level) / PAGE_SIZE;
> +}
> +
> +/*
> + * Return 1 if the caller need to retry, 0 if it the address need to be split
> + * in order to resolve the fault.
> + */
> +static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
> + unsigned long address)
> +{
> + int rmp_level, level;
> + pte_t *pte;
> + u64 pfn;
> +
> + pte = lookup_address_in_mm(current->mm, address, &level);
> +
> + /*
> + * It can happen if there was a race between an unmap event and
> + * the RMP fault delivery.
> + */
> + if (!pte || !pte_present(*pte))
> + return 1;
> +
> + pfn = pte_pfn(*pte);
> +
> + /* If its large page then calculte the fault pfn */
> + if (level > PG_LEVEL_4K) {
> + unsigned long mask;
> +
> + mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
> + pfn |= (address >> PAGE_SHIFT) & mask;
> + }
> +
> + /*
> + * If its a guest private page, then the fault cannot be resolved.
> + * Send a SIGBUS to terminate the process.
> + */
> + if (snp_lookup_rmpentry(pfn, &rmp_level)) {
> + do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
> + return 1;
> + }
> +
> + /*
> + * The backing page level is higher than the RMP page level, request
> + * to split the page.
> + */
> + if (level > rmp_level)
> + return 0;

I don't see any checks that make sure this is in fact a THP, and not e.g.
hugetlb (which is disallowed only later in patch 25/49), or even something
else unexpected. Blindly calling __split_huge_pmd() in
handle_split_page_fault() on anything that's not a THP will just make it
return without splitting anything, and then this will result in a page fault
loop? Some kind of warning and a SIGBUS would be safer, I think.
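
A sketch of such a guard (illustrative only; the exact set of checks would
need auditing):

static int handle_split_page_fault(struct vm_fault *vmf)
{
        /*
         * Only anonymous THP mappings can be split here; for anything else
         * __split_huge_pmd() is a no-op and the fault would just loop, so
         * warn and kill the access instead.
         */
        if (!vma_is_anonymous(vmf->vma)) {
                WARN_ON_ONCE(1);
                return VM_FAULT_SIGBUS;
        }

        __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
        return 0;
}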

> +
> + return 1;
> +}
> +
> /*
> * Handle faults in the user portion of the address space. Nothing in here
> * should check X86_PF_USER without a specific justification: for almost
> @@ -1306,6 +1361,17 @@ void do_user_addr_fault(struct pt_regs *regs,
> if (error_code & X86_PF_INSTR)
> flags |= FAULT_FLAG_INSTRUCTION;
>
> + /*
> + * If its an RMP violation, try resolving it.
> + */
> + if (error_code & X86_PF_RMP) {
> + if (handle_user_rmp_page_fault(regs, error_code, address))
> + return;
> +
> + /* Ask to split the page */
> + flags |= FAULT_FLAG_PAGE_SPLIT;
> + }
> +
> #ifdef CONFIG_X86_64
> /*
> * Faults in the vsyscall page might need emulation. The
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index de32c0383387..2ccc562d166f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -463,7 +463,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
> { FAULT_FLAG_USER, "USER" }, \
> { FAULT_FLAG_REMOTE, "REMOTE" }, \
> { FAULT_FLAG_INSTRUCTION, "INSTRUCTION" }, \
> - { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
> + { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
> + { FAULT_FLAG_PAGE_SPLIT, "PAGESPLIT" }
>
> /*
> * vm_fault is filled by the pagefault handler and passed to the vma's
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6dfaf271ebf8..aa2d8d48ce3e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -818,6 +818,8 @@ typedef struct {
> * mapped R/O.
> * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
> * We should only access orig_pte if this flag set.
> + * @FAULT_FLAG_PAGE_SPLIT: The fault was due page size mismatch, split the
> + * region to smaller page size and retry.
> *
> * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
> * whether we would allow page faults to retry by specifying these two
> @@ -855,6 +857,7 @@ enum fault_flag {
> FAULT_FLAG_INTERRUPTIBLE = 1 << 9,
> FAULT_FLAG_UNSHARE = 1 << 10,
> FAULT_FLAG_ORIG_PTE_VALID = 1 << 11,
> + FAULT_FLAG_PAGE_SPLIT = 1 << 12,
> };
>
> typedef unsigned int __bitwise zap_flags_t;
> diff --git a/mm/memory.c b/mm/memory.c
> index 7274f2b52bca..c2187ffcbb8e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4945,6 +4945,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> return 0;
> }
>
> +static int handle_split_page_fault(struct vm_fault *vmf)
> +{
> + if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
> + return VM_FAULT_SIGBUS;
> +
> + __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
> + return 0;
> +}
> +
> /*
> * By the time we get here, we already hold the mm semaphore
> *
> @@ -5024,6 +5033,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> pmd_migration_entry_wait(mm, vmf.pmd);
> return 0;
> }
> +
> + if (flags & FAULT_FLAG_PAGE_SPLIT)
> + return handle_split_page_fault(&vmf);
> +
> if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
> if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
> return do_huge_pmd_numa_page(&vmf);

2022-08-16 05:59:44

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 26/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command

[AMD Official Use Only - General]

Hello Sabin,

>> +static bool is_hva_registered(struct kvm *kvm, hva_t hva, size_t len)
>> +{
>> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> + struct list_head *head = &sev->regions_list;
>> + struct enc_region *i;
>> +
>> + lockdep_assert_held(&kvm->lock);
>> +
>> + list_for_each_entry(i, head, list) {
>> + u64 start = i->uaddr;
>> + u64 end = start + i->size;
>> +
>> + if (start <= hva && end >= (hva + len))
>> + return true;
>> + }
>> +
>> + return false;
>> +}

>Since KVM_MEMORY_ENCRYPT_REG_REGION should be called for every memory region the user gives to kvm, is the regions_list any different from kvm's memslots?

Actually, KVM_MEMORY_ENCRYPT_REG_REGION is called, and the regions_list is set up, only for the guest RAM blocks.

>> +
>> +static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd
>> +*argp) {
>> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> + struct sev_data_snp_launch_update data = {0};
>> + struct kvm_sev_snp_launch_update params;
>> + unsigned long npages, pfn, n = 0;
>> + int *error = &argp->error;
>> + struct page **inpages;
>> + int ret, i, level;
>> + u64 gfn;
>> +
>> + if (!sev_snp_guest(kvm))
>> + return -ENOTTY;
>> +
>> + if (!sev->snp_context)
>> + return -EINVAL;
>> +
>> + if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
>> + return -EFAULT;
>> +
>> + /* Verify that the specified address range is registered. */
>> + if (!is_hva_registered(kvm, params.uaddr, params.len))
>> + return -EINVAL;
>> +
>> + /*
>> + * The userspace memory is already locked so technically we don't
>> + * need to lock it again. Later part of the function needs to know
>> + * pfn so call the sev_pin_memory() so that we can get the list of
>> + * pages to iterate through.
>> + */
>> + inpages = sev_pin_memory(kvm, params.uaddr, params.len, &npages, 1);
>> + if (!inpages)
>> + return -ENOMEM;

>sev_pin_memory will call pin_user_pages() which fails for PFNMAP vmas that you would get if you use memory allocated from an IO driver.
>Using gfn_to_pfn instead will make this work with vmas backed by pages or raw pfn mappings.

All the guest memory is allocated via the userspace VMM, so how and where would we get memory allocated from an IO driver?

Thanks,
Ashish

2022-08-18 03:49:52

by Alper Gun

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 39/49] KVM: SVM: Introduce ops for the post gfn map and unmap

On Mon, Jun 20, 2022 at 4:12 PM Ashish Kalra <[email protected]> wrote:
>
> From: Brijesh Singh <[email protected]>
>
> When SEV-SNP is enabled in the guest VM, the guest memory pages can
> either be a private or shared. A write from the hypervisor goes through
> the RMP checks. If hardware sees that hypervisor is attempting to write
> to a guest private page, then it triggers an RMP violation #PF.
>
> To avoid the RMP violation with GHCB pages, added new post_{map,unmap}_gfn
> functions to verify if its safe to map GHCB pages. Uses a spinlock to
> protect against the page state change for existing mapped pages.
>
> Need to add generic post_{map,unmap}_gfn() ops that can be used to verify
> that its safe to map a given guest page in the hypervisor.
>
> This patch will need to be revisited later after consensus is reached on
> how to manage guest private memory as probably UPM private memslots will
> be able to handle this page state change more gracefully.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off by: Ashish Kalra <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 3 ++
> arch/x86/kvm/svm/sev.c | 48 ++++++++++++++++++++++++++++--
> arch/x86/kvm/svm/svm.c | 3 ++
> arch/x86/kvm/svm/svm.h | 11 +++++++
> 5 files changed, 64 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index e0068e702692..2dd2bc0cf4c3 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -130,6 +130,7 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> KVM_X86_OP(alloc_apic_backing_page)
> KVM_X86_OP_OPTIONAL(rmp_page_level_adjust)
> +KVM_X86_OP(update_protected_guest_state)
>
> #undef KVM_X86_OP
> #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 49b217dc8d7e..8abc0e724f5c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1522,7 +1522,10 @@ struct kvm_x86_ops {
> unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
>
> void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
> +
> void (*rmp_page_level_adjust)(struct kvm *kvm, kvm_pfn_t pfn, int *level);
> +
> + int (*update_protected_guest_state)(struct kvm_vcpu *vcpu);
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index cb2d1bbb862b..4ed90331bca0 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -341,6 +341,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
> if (ret)
> goto e_free;
>
> + spin_lock_init(&sev->psc_lock);
> ret = sev_snp_init(&argp->error);
> } else {
> ret = sev_platform_init(&argp->error);
> @@ -2828,19 +2829,28 @@ static inline int svm_map_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map)
> {
> struct vmcb_control_area *control = &svm->vmcb->control;
> u64 gfn = gpa_to_gfn(control->ghcb_gpa);
> + struct kvm_vcpu *vcpu = &svm->vcpu;
>
> - if (kvm_vcpu_map(&svm->vcpu, gfn, map)) {
> + if (kvm_vcpu_map(vcpu, gfn, map)) {
> /* Unable to map GHCB from guest */
> pr_err("error mapping GHCB GFN [%#llx] from guest\n", gfn);
> return -EFAULT;
> }
>
> + if (sev_post_map_gfn(vcpu->kvm, map->gfn, map->pfn)) {
> + kvm_vcpu_unmap(vcpu, map, false);
> + return -EBUSY;
> + }
> +
> return 0;
> }
>
> static inline void svm_unmap_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map)
> {
> - kvm_vcpu_unmap(&svm->vcpu, map, true);
> + struct kvm_vcpu *vcpu = &svm->vcpu;
> +
> + kvm_vcpu_unmap(vcpu, map, true);
> + sev_post_unmap_gfn(vcpu->kvm, map->gfn, map->pfn);
> }
>
> static void dump_ghcb(struct vcpu_svm *svm)
> @@ -3383,6 +3393,8 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,
> return PSC_UNDEF_ERR;
> }
>
> + spin_lock(&sev->psc_lock);
> +
> write_lock(&kvm->mmu_lock);
>
> rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
> @@ -3417,6 +3429,8 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,
>
> write_unlock(&kvm->mmu_lock);
>
> + spin_unlock(&sev->psc_lock);

There is a corner case where psc_lock is not released: if
kvm_mmu_get_tdp_walk() fails, the lock is never dropped and this will cause
a soft lockup.
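
A minimal sketch of one way to close that gap, releasing both locks on the
early-return path quoted from patch 37 (illustration only, not the posted
code):

	rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
	if (!rc) {
		/*
		 * Another vCPU unmapped the page before mmu_lock was taken;
		 * drop *both* locks before bailing out so the next PSC
		 * attempt does not stall on psc_lock.
		 */
		write_unlock(&kvm->mmu_lock);
		spin_unlock(&sev->psc_lock);
		return 0;
	}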

> +
> if (rc) {
> pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n",
> op, gpa, pfn, level, rc);
> @@ -3965,3 +3979,33 @@ void sev_rmp_page_level_adjust(struct kvm *kvm, kvm_pfn_t pfn, int *level)
> /* Adjust the level to keep the NPT and RMP in sync */
> *level = min_t(size_t, *level, rmp_level);
> }
> +
> +int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + int level;
> +
> + if (!sev_snp_guest(kvm))
> + return 0;
> +
> + spin_lock(&sev->psc_lock);
> +
> + /* If pfn is not added as private then fail */
> + if (snp_lookup_rmpentry(pfn, &level) == 1) {
> + spin_unlock(&sev->psc_lock);
> + pr_err_ratelimited("failed to map private gfn 0x%llx pfn 0x%llx\n", gfn, pfn);
> + return -EBUSY;
> + }
> +
> + return 0;
> +}
> +
> +void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +
> + if (!sev_snp_guest(kvm))
> + return;
> +
> + spin_unlock(&sev->psc_lock);
> +}
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index b24e0171cbf2..1c8e035ba011 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4734,7 +4734,10 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> .vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
>
> .alloc_apic_backing_page = svm_alloc_apic_backing_page,
> +
> .rmp_page_level_adjust = sev_rmp_page_level_adjust,
> +
> + .update_protected_guest_state = sev_snp_update_protected_guest_state,
> };
>
> /*
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 54ff56cb6125..3fd95193ed8d 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -79,19 +79,25 @@ struct kvm_sev_info {
> bool active; /* SEV enabled guest */
> bool es_active; /* SEV-ES enabled guest */
> bool snp_active; /* SEV-SNP enabled guest */
> +
> unsigned int asid; /* ASID used for this guest */
> unsigned int handle; /* SEV firmware handle */
> int fd; /* SEV device fd */
> +
> unsigned long pages_locked; /* Number of pages locked */
> struct list_head regions_list; /* List of registered regions */
> +
> u64 ap_jump_table; /* SEV-ES AP Jump Table address */
> +
> struct kvm *enc_context_owner; /* Owner of copied encryption context */
> struct list_head mirror_vms; /* List of VMs mirroring */
> struct list_head mirror_entry; /* Use as a list entry of mirrors */
> struct misc_cg *misc_cg; /* For misc cgroup accounting */
> atomic_t migration_in_progress;
> +
> u64 snp_init_flags;
> void *snp_context; /* SNP guest context page */
> + spinlock_t psc_lock;
> };
>
> struct kvm_svm {
> @@ -702,6 +708,11 @@ void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa);
> void sev_es_unmap_ghcb(struct vcpu_svm *svm);
> struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
> void sev_rmp_page_level_adjust(struct kvm *kvm, kvm_pfn_t pfn, int *level);
> +int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn);
> +void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn);
> +void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
> +void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
> +int sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu);
>
> /* vmenter.S */
>
> --
> 2.25.1
>

2022-08-19 17:42:03

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 37/49] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT

> +
> +static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op, gpa_t gpa,
> + int level)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info;
> + struct kvm *kvm = vcpu->kvm;
> + int rc, npt_level;
> + kvm_pfn_t pfn;
> + gpa_t gpa_end;
> +
> + gpa_end = gpa + page_level_size(level);
> +
> + while (gpa < gpa_end) {
> + /*
> + * If the gpa is not present in the NPT then build the NPT.
> + */
> + rc = snp_check_and_build_npt(vcpu, gpa, level);
> + if (rc)
> + return -EINVAL;
> +
> + if (op == SNP_PAGE_STATE_PRIVATE) {
> + hva_t hva;
> +
> + if (snp_gpa_to_hva(kvm, gpa, &hva))
> + return -EINVAL;
> +
> + /*
> + * Verify that the hva range is registered. This enforcement is
> + * required to avoid the cases where a page is marked private
> + * in the RMP table but never gets cleanup during the VM
> + * termination path.
> + */
> + mutex_lock(&kvm->lock);
> + rc = is_hva_registered(kvm, hva, page_level_size(level));
> + mutex_unlock(&kvm->lock);
> + if (!rc)
> + return -EINVAL;
> +
> + /*
> + * Mark the userspace range unmerable before adding the pages
> + * in the RMP table.
> + */
> + mmap_write_lock(kvm->mm);
> + rc = snp_mark_unmergable(kvm, hva, page_level_size(level));
> + mmap_write_unlock(kvm->mm);
> + if (rc)
> + return -EINVAL;
> + }
> +
> + write_lock(&kvm->mmu_lock);
> +
> + rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
> + if (!rc) {
> + /*
> + * This may happen if another vCPU unmapped the page
> + * before we acquire the lock. Retry the PSC.
> + */
> + write_unlock(&kvm->mmu_lock);
> + return 0;
> + }

I think we want to return -EAGAIN or similar if we want the caller to
retry, right? I think returning 0 here hides the error.
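
For illustration, a sketch of that suggestion, assuming the callers of
__snp_handle_page_state_change() translate -EAGAIN back into a PSC retry
code for the guest:

		rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
		if (!rc) {
			/* The mapping raced away; let the caller retry the PSC. */
			write_unlock(&kvm->mmu_lock);
			return -EAGAIN;
		}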

> +
> + /*
> + * Adjust the level so that we don't go higher than the backing
> + * page level.
> + */
> + level = min_t(size_t, level, npt_level);
> +
> + trace_kvm_snp_psc(vcpu->vcpu_id, pfn, gpa, op, level);
> +
> + switch (op) {
> + case SNP_PAGE_STATE_SHARED:
> + rc = snp_make_page_shared(kvm, gpa, pfn, level);
> + break;
> + case SNP_PAGE_STATE_PRIVATE:
> + rc = rmp_make_private(pfn, gpa, level, sev->asid, false);
> + break;
> + default:
> + rc = -EINVAL;
> + break;
> + }
> +
> + write_unlock(&kvm->mmu_lock);
> +
> + if (rc) {
> + pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n",
> + op, gpa, pfn, level, rc);
> + return rc;
> + }
> +
> + gpa = gpa + page_level_size(level);
> + }
> +
> + return 0;
> +}
> +

2022-08-23 16:57:03

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 10/49] x86/fault: Add support to dump RMP entry on fault

On Mon, Jun 20, 2022 at 11:03:58PM +0000, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> When SEV-SNP is enabled globally, a write from the host goes through the
> RMP check. If the hardware encounters the check failure, then it raises
> the #PF (with RMP set). Dump the RMP entry at the faulting pfn to help
> the debug.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/include/asm/sev.h | 7 +++++++
> arch/x86/kernel/sev.c | 43 ++++++++++++++++++++++++++++++++++++++
> arch/x86/mm/fault.c | 17 +++++++++++----
> include/linux/sev.h | 2 ++
> 4 files changed, 65 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
> index 6ab872311544..c0c4df817159 100644
> --- a/arch/x86/include/asm/sev.h
> +++ b/arch/x86/include/asm/sev.h
> @@ -113,6 +113,11 @@ struct __packed rmpentry {
>
> #define rmpentry_assigned(x) ((x)->info.assigned)
> #define rmpentry_pagesize(x) ((x)->info.pagesize)
> +#define rmpentry_vmsa(x) ((x)->info.vmsa)
> +#define rmpentry_asid(x) ((x)->info.asid)
> +#define rmpentry_validated(x) ((x)->info.validated)
> +#define rmpentry_gpa(x) ((unsigned long)(x)->info.gpa)
> +#define rmpentry_immutable(x) ((x)->info.immutable)

If you're going to do that, use inline functions pls so that it checks
the argument at least.

Also, add such functions only when they're called multiple times - no
need to add one for every field if you're going to access that field
only once in the whole kernel.
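
For instance, a sketch of the suggested style for the fields that are used
more than once, based on the struct rmpentry layout in the hunk above:

static inline bool rmpentry_assigned(const struct rmpentry *e)
{
	return e->info.assigned;
}

static inline int rmpentry_pagesize(const struct rmpentry *e)
{
	return e->info.pagesize;
}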

>
> #define RMPADJUST_VMSA_PAGE_BIT BIT(16)
>
> @@ -205,6 +210,7 @@ void snp_set_wakeup_secondary_cpu(void);
> bool snp_init(struct boot_params *bp);
> void snp_abort(void);
> int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
> +void dump_rmpentry(u64 pfn);
> #else
> static inline void sev_es_ist_enter(struct pt_regs *regs) { }
> static inline void sev_es_ist_exit(void) { }
> @@ -229,6 +235,7 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
> {
> return -ENOTTY;
> }
> +static inline void dump_rmpentry(u64 pfn) {}
> #endif
>
> #endif
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 734cddd837f5..6640a639fffc 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -2414,6 +2414,49 @@ static struct rmpentry *__snp_lookup_rmpentry(u64 pfn, int *level)
> return entry;
> }
>
> +void dump_rmpentry(u64 pfn)

External function - it better belong to a namespace:

sev_dump_rmpentry()

> +{
> + unsigned long pfn_end;
> + struct rmpentry *e;
> + int level;
> +
> + e = __snp_lookup_rmpentry(pfn, &level);
> + if (!e) {
> + pr_alert("failed to read RMP entry pfn 0x%llx\n", pfn);

Why alert?

Dumping stuff is either pr_debug or pr_info...

> + return;
> + }
> +
> + if (rmpentry_assigned(e)) {
> + pr_alert("RMPEntry paddr 0x%llx [assigned=%d immutable=%d pagesize=%d gpa=0x%lx"
> + " asid=%d vmsa=%d validated=%d]\n", pfn << PAGE_SHIFT,
> + rmpentry_assigned(e), rmpentry_immutable(e), rmpentry_pagesize(e),
> + rmpentry_gpa(e), rmpentry_asid(e), rmpentry_vmsa(e),
> + rmpentry_validated(e));
> + return;
> + }
> +
> + /*
> + * If the RMP entry at the faulting pfn was not assigned, then we do not

Who's "we"?

> + * know what caused the RMP violation. To get some useful debug information,
> + * let iterate through the entire 2MB region, and dump the RMP entries if
> + * one of the bit in the RMP entry is set.
> + */
> + pfn = pfn & ~(PTRS_PER_PMD - 1);
> + pfn_end = pfn + PTRS_PER_PMD;
> +
> + while (pfn < pfn_end) {
> + e = __snp_lookup_rmpentry(pfn, &level);
> + if (!e)
> + return;
> +
> + if (e->low || e->high)

This is going to confuse people because they're going to miss a zero
entry. Just dump the whole thing.

...

> @@ -579,7 +588,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
> show_ldttss(&gdt, "TR", tr);
> }
>
> - dump_pagetable(address);
> + dump_pagetable(address, error_code & X86_PF_RMP);

Eww.

I'd prefer to see

pfn = dump_pagetable(address);

if (error_code & X86_PF_RMP)
sev_dump_rmpentry(pfn);

instead of passing around this SEV-specific arg in generic x86 fault code.

The change to return the pfn from dump_pagetable() should be a pre-patch ofc.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-09-01 20:57:47

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

[AMD Official Use Only - General]

Hello Boris,

>> It is basically an index into the 4K page within the hugepage mapped
>> in the RMP table or in other words an index into the RMP table entry
>> for 4K page(s) corresponding to a hugepage.

>So pte_index(address) and for 1G pages, pmd_index(address).

>So no reinventing the wheel if we already have helpers for that.

>Yes that makes sense and pte_index(address) is exactly what is required for 2M hugepages.

>Will use pte_index() for 2M pages and pmd_index() for 1G pages.

Had a relook into this.

As I mentioned earlier, this is computing an index into a 4K page within a hugepage mapping;
therefore, although pte_index() works for 2M pages, pmd_index() will not work for 1G pages.

We basically need to do:
pfn |= (address >> PAGE_SHIFT) & mask;

where mask is the (number of 4K pages per hugepage) - 1

So this still needs the original code but with a fix for the mask computation as follows:

static inline size_t pages_per_hpage(int level)
{
	return page_level_size(level) / PAGE_SIZE;
}

static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
unsigned long address)
{
...
pfn = pte_pfn(*pte);

/* If its large page then calculte the fault pfn */
if (level > PG_LEVEL_4K) {
+ /*
+ * index into the 4K page within the hugepage mapping
+ * in the RMP table
+ */
unsigned long mask;

- mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
+ mask = pages_per_hpage(level) - 1;
pfn |= (address >> PAGE_SHIFT) & mask;


Thanks,
Ashish

2022-09-02 07:03:44

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Thu, Sep 01, 2022 at 08:32:35PM +0000, Kalra, Ashish wrote:
> As I mentioned earlier, this is computing an index into a 4K page
> within a hugepage mapping, therefore, though pte_index() works for 2M
> pages, but pmd_index() will not work for 1G pages.

Why not? What exactly do you need to get here?

So the way I understand it is, you want to map the faulting address to a
RMP entry. And that is either the 2M PMD entry when the page is a 1G one
and the 4K PTE entry when the page is a 2M one?

Why doesn't pmd_index() work?

Also, why isn't the lookup function's signature:

int snp_lookup_rmpentry(unsigned long address, int *level)

and all that logic to do the conversion to a PFN also not in it?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-09-02 15:56:00

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

[AMD Official Use Only - General]

Hello Boris,

>> As I mentioned earlier, this is computing an index into a 4K page
>> within a hugepage mapping, therefore, though pte_index() works for 2M
>> pages, but pmd_index() will not work for 1G pages.

>Why not? What exactly do you need to get here?

>So the way I understand it is, you want to map the faulting address to a RMP entry. And that is either the 2M PMD entry when the page is a 1G one and the 4K PTE entry when the page is a 2M one?

>Why doesn't pmd_index() work?

Yes we want to map the faulting address to a RMP entry, but hugepage entries in RMP table are basically subpage 4K entries. So it is a 4K entry when the page is a 2M one
and also a 4K entry when the page is a 1G one.

That's why the computation to get a 4K page index within a 2M/1G hugepage mapping is required.

>Also, why isn't the lookup function's signature:

>int snp_lookup_rmpentry(unsigned long address, int *level)

>and all that logic to do the conversion to a PFN also not in it?

Thanks,
Ashish

2022-09-03 04:28:54

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Fri, Sep 02, 2022 at 03:33:20PM +0000, Kalra, Ashish wrote:
> Yes we want to map the faulting address to a RMP entry, but hugepage
> entries in RMP table are basically subpage 4K entries. So it is a 4K
> entry when the page is a 2M one and also a 4K entry when the page is a
> 1G one.

Wait, what?!

APM v2 section "15.36.11 Large Page Management" and PSMASH are then for
what exactly?

> That's why the computation to get a 4K page index within a 2M/1G
> hugepage mapping is required.

What if a guest RMP-faults on a 2M page and there's a corresponding 2M
RMP entry? What do you need the 4K entry then for?

Hell, __snp_lookup_rmpentry() even tries to return the proper page
level...

/me looks in disbelief in your direction...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-09-03 05:53:00

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

[AMD Official Use Only - General]

Hello Boris,

>> Yes we want to map the faulting address to a RMP entry, but hugepage
>> entries in RMP table are basically subpage 4K entries. So it is a 4K
>> entry when the page is a 2M one and also a 4K entry when the page is a
>> 1G one.

>Wait, what?!

>APM v2 section "15.36.11 Large Page Management" and PSMASH are then for what exactly?

This is exactly what PSMASH is for: in case the guest PVALIDATEs a 4K page covered by a 2MB RMP entry,
the HV will need to PSMASH the 2MB RMP entry into the corresponding 4K RMP entries during #VMEXIT(NPF).

What I meant above is that 4K RMP table entries need to be available in case the 2MB RMP entry needs to be
smashed.

>> That's why the computation to get a 4K page index within a 2M/1G
>> hugepage mapping is required.

>What if a guest RMP-faults on a 2M page and there's a corresponding 2M RMP entry? What do you need the 4K entry then for?

There is no fault here, if guest pvalidates a 2M page that is backed by a 2MB RMP entry.
We need the 4K entries in case the guest pvalidates a 4K page that is mapped by a 2MB RMP entry.

>Hell, __snp_lookup_rmpentry() even tries to return the proper page level...

>/me looks in disbelief in your direction...

Thanks,
Ashish

2022-09-03 07:16:07

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

[AMD Official Use Only - General]

So essentially we want to map the faulting address to a RMP entry, considering the fact that a 2M host hugepage can be mapped as
4K RMP table entries and 1G host hugepage can be mapped as 2M RMP table entries.

Hence, this mask computation is done as:
mask = pages_per_hpage(level) - pages_per_hpage(level -1);

and the final faulting pfn is computed as:
pfn |= (address >> PAGE_SHIFT) & mask;

Thanks,
Ashish

2022-09-03 08:33:31

by Borislav Petkov

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On September 3, 2022 6:57:51 AM UTC, "Kalra, Ashish" <[email protected]> wrote:
>[AMD Official Use Only - General]
>
>So essentially we want to map the faulting address to a RMP entry, considering the fact that a 2M host hugepage can be mapped as
>4K RMP table entries and 1G host hugepage can be mapped as 2M RMP table entries.

So something's seriously confusing or missing here because if you fault on a 2M host page and the underlying RMP entries are 4K then you can use pte_index().

If the host page is 1G and the underlying RMP entries are 2M, pmd_index() should work here too.

But this piecemeal back'n'forth doesn't seem to resolve this so I'd like to ask you pls to sit down, take your time and give a detailed example of the two possible cases and what the difference is between pte_/pmd_index and your way. Feel free to add actual debug output and paste it here.

Thanks.

--
Sent from a small device: formatting sux and brevity is inevitable.

2022-09-03 17:31:43

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

[AMD Official Use Only - General]

Hello Boris,

>>So essentially we want to map the faulting address to a RMP entry,
>>considering the fact that a 2M host hugepage can be mapped as 4K RMP table entries and 1G host hugepage can be mapped as 2M RMP table entries.

>So something's seriously confusing or missing here because if you fault on a 2M host page and the underlying RMP entries are 4K then you can use pte_index().

>If the host page is 1G and the underlying RMP entries are 2M, pmd_index() should work here too.

>But this piecemeal back'n'forth doesn't seem to resolve this so I'd like to ask you pls to sit down, take your time and give a detailed example of the two possible cases and what the difference is between pte_/pmd_index and your way. Feel free to >add actual debug output and paste it here.

There is 1 64-bit RMP entry for every physical 4k page of DRAM, so essentially every 4K page of DRAM is represented by a RMP entry.

So even if host page is 1G and underlying (smashed/split) RMP entries are 2M, the RMP table entry has to be indexed to a 4K entry
corresponding to that.

If it was simply a 2M entry in the RMP table, then pmd_index() will work correctly.

Considering the following example:

address = 0x40200000;
level = PG_LEVEL_1G;
pfn = 0x40000;
pfn |= pmd_index(address);
This will give the RMP table index as 0x40001.
And it will work if the RMP table entry was simply a 2MB entry, but we need to map this further to its corresponding 4K entry.

With the same example as above:
level = PG_LEVEL_1G;
mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
pfn |= (address >> PAGE_SHIFT) & mask;
This will give the RMP table index as 0x40200.
Which is the correct RMP table entry for a 2MB smashed/split 1G page mapped further to its corresponding 4K entry.

Hopefully this clarifies why pmd_index() can't be used here.
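
For what it's worth, the two computations from the example above can be
checked with a trivial user-space program; the sizes and pmd_index() are
open-coded here, so treat this purely as an illustration of the arithmetic:

#include <stdio.h>

int main(void)
{
	unsigned long address  = 0x40200000UL;	/* faulting address         */
	unsigned long pfn      = 0x40000UL;	/* pfn from the 1G mapping  */
	unsigned long pages_1g = 1UL << 18;	/* 4K pages per 1G page     */
	unsigned long pages_2m = 1UL << 9;	/* 4K pages per 2M page     */

	/* pmd_index() variant: indexes a 2M-sized entry, not a 4K one */
	printf("pmd_index: 0x%lx\n", pfn | ((address >> 21) & 511));

	/* mask variant from the patch: 4K index of the 2M-aligned page */
	printf("mask 0x%lx: 0x%lx\n", pages_1g - pages_2m,
	       pfn | ((address >> 12) & (pages_1g - pages_2m)));

	return 0;
}

which prints 0x40001 and 0x40200 respectively.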

Thanks,
Ashish



2022-09-04 06:38:23

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Sat, Sep 03, 2022 at 05:30:28PM +0000, Kalra, Ashish wrote:
> There is 1 64-bit RMP entry for every physical 4k page of DRAM, so
> essentially every 4K page of DRAM is represented by a RMP entry.

Before we get to the rest - this sounds wrong to me. My APM has:

"PSMASH Page Smash

Expands a 2MB-page RMP entry into a corresponding set of contiguous
4KB-page RMP entries. The 2MB page’s system physical address is
specified in the RAX register. The new entries inherit the attributes
of the original entry. Upon completion, a return code is stored in EAX.
rFLAGS bits OF, ZF, AF, PF and SF are set based on this return code..."

So there *are* 2M entries in the RMP table.

> So even if host page is 1G and underlying (smashed/split) RMP
> entries are 2M, the RMP table entry has to be indexed to a 4K entry
> corresponding to that.

So if there are 2M entries in the RMP table, how is that indexing with
4K entries supposed to work?

Hell, even PSMASH pseudocode shows how you go and write all those 512 4K
entries using the 2M entry as a template. So *before* you have smashed
that 2M entry, it *is* an *actual* 2M entry.

So if you fault on a page which is backed by that 2M RMP entry, you will
get that 2M RMP entry.

> If it was simply a 2M entry in the RMP table, then pmd_index() will
> work correctly.

Judging by the above text, it *can* *be* a 2M RMP entry!

By reading your example you're trying to tell me that a RMP #PF will
always need to work on 4K entries. Which would then need for a 2M entry
as above to be PSMASHed in order to get the 4K thing. But that would be
silly - RMP PFs will this way gradually break all 2M pages and degrade
performance for no real reason.

So this still looks real wrong to me.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-09-06 02:34:25

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On 6/20/22 16:03, Ashish Kalra wrote:
>
> When SEV-SNP is enabled globally, a write from the host goes through the
> RMP check. When the host writes to pages, hardware checks the following
> conditions at the end of page walk:
>
> 1. Assigned bit in the RMP table is zero (i.e page is shared).
> 2. If the page table entry that gives the sPA indicates that the target
> page size is a large page, then all RMP entries for the 4KB
> constituting pages of the target must have the assigned bit 0.
> 3. Immutable bit in the RMP table is not zero.
>
> The hardware will raise page fault if one of the above conditions is not
> met. Try resolving the fault instead of taking fault again and again. If
> the host attempts to write to the guest private memory then send the
> SIGBUS signal to kill the process. If the page level between the host and
> RMP entry does not match, then split the address to keep the RMP and host
> page levels in sync.

When you're working on this changelog for Borislav, I'd like to make one
other suggestion: Please write it more logically and _less_ about what
the hardware is doing. We don't need the internal details of what the
hardware is doing in the changelog. Mentioning whether an RMP bit is 0
or 1 is kinda silly unless it matters to the code.

For instance, what does the immutable bit have to do with all of this?
There's no specific handling for it. There are really only faults that
you can handle and faults that you can't.

There's also some major missing context here about how it guarantees
that pages that can't be handled *CAN* be split. I think it has to do
with disallowing hugetlbfs which implies that the only pages that might
need splitting are THP's.

> + /*
> + * If its an RMP violation, try resolving it.
> + */
> + if (error_code & X86_PF_RMP) {
> + if (handle_user_rmp_page_fault(regs, error_code, address))
> + return;
> +
> + /* Ask to split the page */
> + flags |= FAULT_FLAG_PAGE_SPLIT;
> + }

This also needs some chatter about why any failure to handle the fault
automatically means splitting a page.

2022-09-06 10:31:41

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Tue, Aug 09, 2022 at 06:55:43PM +0200, Borislav Petkov wrote:
> On Mon, Jun 20, 2022 at 11:03:43PM +0000, Ashish Kalra wrote:
> > + pfn = pte_pfn(*pte);
> > +
> > + /* If its large page then calculte the fault pfn */
> > + if (level > PG_LEVEL_4K) {
> > + unsigned long mask;
> > +
> > + mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
> > + pfn |= (address >> PAGE_SHIFT) & mask;
>
> Oh boy, this is unnecessarily complicated. Isn't this
>
> pfn |= pud_index(address);
>
> or
> pfn |= pmd_index(address);

I played with this a bit and ended up with

pfn = pte_pfn(*pte) | PFN_DOWN(address & page_level_mask(level - 1));

Unless I got something terribly wrong, this should do the
same (see the attached patch) as the existing calculations.

BR, Jarkko


Attachments:
0001-x86-fault-Simplify-PFN-calculation-in-handle_user_rm.patch (1.76 kB)

2022-09-06 10:42:05

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Tue, Sep 06, 2022 at 01:25:10PM +0300, Jarkko Sakkinen wrote:
> On Tue, Aug 09, 2022 at 06:55:43PM +0200, Borislav Petkov wrote:
> > On Mon, Jun 20, 2022 at 11:03:43PM +0000, Ashish Kalra wrote:
> > > + pfn = pte_pfn(*pte);
> > > +
> > > + /* If its large page then calculte the fault pfn */
> > > + if (level > PG_LEVEL_4K) {
> > > + unsigned long mask;
> > > +
> > > + mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
> > > + pfn |= (address >> PAGE_SHIFT) & mask;
> >
> > Oh boy, this is unnecessarily complicated. Isn't this
> >
> > pfn |= pud_index(address);
> >
> > or
> > pfn |= pmd_index(address);
>
> I played with this a bit and ended up with
>
> pfn = pte_pfn(*pte) | PFN_DOWN(address & page_level_mask(level - 1));
>
> Unless I got something terribly wrong, this should do the
> same (see the attached patch) as the existing calculations.

IMHO a better name for this function would be do_user_rmp_addr_fault() as
it is more consistent with the existing function names.

BR, Jarkko

2022-09-06 14:57:48

by Marc Orr

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Tue, Sep 6, 2022 at 3:25 AM Jarkko Sakkinen <[email protected]> wrote:
>
> On Tue, Aug 09, 2022 at 06:55:43PM +0200, Borislav Petkov wrote:
> > On Mon, Jun 20, 2022 at 11:03:43PM +0000, Ashish Kalra wrote:
> > > + pfn = pte_pfn(*pte);
> > > +
> > > + /* If its large page then calculte the fault pfn */
> > > + if (level > PG_LEVEL_4K) {
> > > + unsigned long mask;
> > > +
> > > + mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
> > > + pfn |= (address >> PAGE_SHIFT) & mask;
> >
> > Oh boy, this is unnecessarily complicated. Isn't this
> >
> > pfn |= pud_index(address);
> >
> > or
> > pfn |= pmd_index(address);
>
> I played with this a bit and ended up with
>
> pfn = pte_pfn(*pte) | PFN_DOWN(address & page_level_mask(level - 1));
>
> Unless I got something terribly wrong, this should do the
> same (see the attached patch) as the existing calculations.

Actually, I don't think they're the same. I think Jarkko's version is
correct. Specifically:
- For level = PG_LEVEL_2M they're the same.
- For level = PG_LEVEL_1G:
The current code calculates a garbage mask:
mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
translates to:
>>> hex(262144 - 512)
'0x3fe00'

But I believe Jarkko's version calculates the correct mask (below),
incorporating all 18 offset bits into the 1G page.
>>> hex(262144 -1)
'0x3ffff'

2022-09-06 15:18:47

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

[AMD Official Use Only - General]

Hello Boris,

>> There is 1 64-bit RMP entry for every physical 4k page of DRAM, so
>> essentially every 4K page of DRAM is represented by a RMP entry.

>Before we get to the rest - this sounds wrong to me. My APM has:

>"PSMASH Page Smash

>Expands a 2MB-page RMP entry into a corresponding set of contiguous 4KB-page RMP entries. The 2MB page's system physical address is specified in the RAX register. The new entries inherit the attributes of the original entry. Upon completion, a >return code is stored in EAX.
>rFLAGS bits OF, ZF, AF, PF and SF are set based on this return code..."

>So there *are* 2M entries in the RMP table.

> So even if host page is 1G and underlying (smashed/split) RMP entries
> are 2M, the RMP table entry has to be indexed to a 4K entry
> corresponding to that.

>So if there are 2M entries in the RMP table, how does that indexing with 4K entries is supposed to work?

>Hell, even PSMASH pseudocode shows how you go and write all those 512 4K entries using the 2M entry as a template. So *before* you have smashed that 2M entry, it *is* an *actual* 2M entry.

>So if you fault on a page which is backed by that 2M RMP entry, you will get that 2M RMP entry.

> If it was simply a 2M entry in the RMP table, then pmd_index() will
> work correctly.

>Judging by the above text, it *can* *be* a 2M RMP entry!

>By reading your example you're trying to tell me that a RMP #PF will always need to work on 4K entries. Which would then need for a 2M entry as above to be PSMASHed in order to get the 4K thing. But that would be silly - RMP PFs will this way >gradually break all 2M pages and degrage performance for no real reason.

>So this still looks real wrong to me.

Please note that RMP table entries have only two page-size indicators, 4K and 2M, so an entry covers a max physical address range of 2MB.
In all cases there is one RMP entry per 4K page, and the index into the RMP table is basically address / PAGE_SIZE; that does
not change for hugepages. Therefore we need to capture the address bits (from the faulting address) so that we index into the
4K entry in the RMP table.

An important point to note here is that the RMPUPDATE instruction sets the Assigned bit for all the sub-page entries of
a hugepage mapping in the RMP table, so we will get the correct "assigned" page information when we index into the 4K entry
in the RMP table. Additionally, __snp_lookup_rmpentry() reads the 2MB-aligned entry in the RMP table to get the correct page size.
(as below)

static struct rmpentry *__snp_lookup_rmpentry(u64 pfn, int *level)
{
..
/* Read a large RMP entry to get the correct page level used in RMP entry. */
large_entry = rmptable_entry(paddr & PMD_MASK);
*level = RMP_TO_X86_PG_LEVEL(rmpentry_pagesize(large_entry));
..

Therefore, the 2M entry and its subpages in the RMP table will always exist because of the RMPUPDATE instruction even
without smashing/splitting of the hugepage, so we really don't need the 2MB entry to be PSMASHed in order to get the 4K thing.

Thanks,
Ashish

2022-09-06 15:21:46

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

[AMD Official Use Only - General]

>> On Tue, Aug 09, 2022 at 06:55:43PM +0200, Borislav Petkov wrote:
>> > On Mon, Jun 20, 2022 at 11:03:43PM +0000, Ashish Kalra wrote:
>> > > + pfn = pte_pfn(*pte);
>> > > +
>> > > + /* If its large page then calculte the fault pfn */
>> > > + if (level > PG_LEVEL_4K) {
>> > > + unsigned long mask;
>> > > +
>> > > + mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
>> > > + pfn |= (address >> PAGE_SHIFT) & mask;
>> >
>> > Oh boy, this is unnecessarily complicated. Isn't this
>> >
>> > pfn |= pud_index(address);
>> >
>> > or
>> > pfn |= pmd_index(address);
>>
>> I played with this a bit and ended up with
>>
>> pfn = pte_pfn(*pte) | PFN_DOWN(address & page_level_mask(level
>> - 1));
>>
>> Unless I got something terribly wrong, this should do the same (see
>> the attached patch) as the existing calculations.

>Actually, I don't think they're the same. I think Jarkko's version is correct. Specifically:
>- For level = PG_LEVEL_2M they're the same.
>- For level = PG_LEVEL_1G:
>The current code calculates a garbage mask:
>mask = pages_per_hpage(level) - pages_per_hpage(level - 1); translates to:
>>> hex(262144 - 512)
>'0x3fe00'

No actually this is not a garbage mask, as I explained in earlier responses we need to capture the address bits
to get to the correct 4K index into the RMP table.
Therefore, for level = PG_LEVEL_1G:
mask = pages_per_hpage(level) - pages_per_hpage(level - 1) => 0x3fe00 (which is the correct mask).

>But I believe Jarkko's version calculates the correct mask (below), incorporating all 18 offset bits into the 1G page.
>>> hex(262144 -1)
>'0x3ffff'

We can get this simply by doing (page_per_hpage(level)-1), but as I mentioned above this is not what we need.

Thanks,
Ashish

2022-09-06 15:54:58

by Michael Roth

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Tue, Sep 06, 2022 at 09:17:15AM -0500, Kalra, Ashish wrote:
> [AMD Official Use Only - General]
>
> >> On Tue, Aug 09, 2022 at 06:55:43PM +0200, Borislav Petkov wrote:
> >> > On Mon, Jun 20, 2022 at 11:03:43PM +0000, Ashish Kalra wrote:
> >> > > + pfn = pte_pfn(*pte);
> >> > > +
> >> > > + /* If its large page then calculte the fault pfn */
> >> > > + if (level > PG_LEVEL_4K) {
> >> > > + unsigned long mask;
> >> > > +
> >> > > + mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
> >> > > + pfn |= (address >> PAGE_SHIFT) & mask;
> >> >
> >> > Oh boy, this is unnecessarily complicated. Isn't this
> >> >
> >> > pfn |= pud_index(address);
> >> >
> >> > or
> >> > pfn |= pmd_index(address);
> >>
> >> I played with this a bit and ended up with
> >>
> >> pfn = pte_pfn(*pte) | PFN_DOWN(address & page_level_mask(level
> >> - 1));
> >>
> >> Unless I got something terribly wrong, this should do the same (see
> >> the attached patch) as the existing calculations.
>
> >Actually, I don't think they're the same. I think Jarkko's version is correct. Specifically:
> >- For level = PG_LEVEL_2M they're the same.
> >- For level = PG_LEVEL_1G:
> >The current code calculates a garbage mask:
> >mask = pages_per_hpage(level) - pages_per_hpage(level - 1); translates to:
> >>> hex(262144 - 512)
> >'0x3fe00'
>
> No actually this is not a garbage mask, as I explained in earlier responses we need to capture the address bits
> to get to the correct 4K index into the RMP table.
> Therefore, for level = PG_LEVEL_1G:
> mask = pages_per_hpage(level) - pages_per_hpage(level - 1) => 0x3fe00 (which is the correct mask).

That's the correct mask to grab the 2M-aligned address bits, e.g:

pfn_mask = 3fe00h = 11 1111 1110 0000 0000b

So the last 9 bits are ignored, i.e. the result is always a PFN that is a
multiple of 512 (2^9), and the upper bits come from the 1GB PTE entry.

But there is an open question of whether we actually want to index using
2M-aligned or specific 4K-aligned PFN indicated by the faulting address.

>
> >But I believe Jarkko's version calculates the correct mask (below), incorporating all 18 offset bits into the 1G page.
> >>> hex(262144 -1)
> >'0x3ffff'
>
> We can get this simply by doing (page_per_hpage(level)-1), but as I mentioned above this is not what we need.

If we actually want the 4K page, I think we would want to use the 0x3ffff
mask as Marc suggested to get to the specific 4K RMP entry, which I don't
think the current code is trying to do. But maybe that *should* be what we
should be doing.

Based on your earlier explanation, if we index into the RMP table using
2M-aligned address, we might find that the entry does not have the
page-size bit set (maybe it was PSMASH'd for some reason). If that's the
case, we'd then have to calculate the index for the specific RMP entry
for the specific 4K address that caused the fault, and then check that
instead.

If however we simply index directly in the 4K RMP entry from the start,
snp_lookup_rmpentry() should still tell us whether the page is private
or not, because RMPUPDATE/PSMASH are both documented to also update the
assigned bits for each 4K RMP entry even if you're using a 2M RMP entry
and setting the page-size bit to cover the whole 2M range.

Additionally, snp_lookup_rmpentry() already has logic to also go check
the 2M-aligned RMP entry to provide an indication of what level it is
mapped at in the RMP table, so we can still use that to determine if the
host mapping needs to be split or not.

One thing that could use some confirmation is what happens if you do an
RMPUPDATE for a 2MB RMP entry, and then go back and try to RMPUPDATE a
sub-page and change the assigned bit so it's not consistent with 2MB RMP
entry. I would assume that would fail the instruction, but we should
confirm that before relying on this logic.

-Mike

>
> Thanks,
> Ashish

2022-09-06 16:26:00

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Tue, Sep 06, 2022 at 02:17:15PM +0000, Kalra, Ashish wrote:
> [AMD Official Use Only - General]
>
> >> On Tue, Aug 09, 2022 at 06:55:43PM +0200, Borislav Petkov wrote:
> >> > On Mon, Jun 20, 2022 at 11:03:43PM +0000, Ashish Kalra wrote:
> >> > > + pfn = pte_pfn(*pte);
> >> > > +
> >> > > + /* If its large page then calculte the fault pfn */
> >> > > + if (level > PG_LEVEL_4K) {
> >> > > + unsigned long mask;
> >> > > +
> >> > > + mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
> >> > > + pfn |= (address >> PAGE_SHIFT) & mask;
> >> >
> >> > Oh boy, this is unnecessarily complicated. Isn't this
> >> >
> >> > pfn |= pud_index(address);
> >> >
> >> > or
> >> > pfn |= pmd_index(address);
> >>
> >> I played with this a bit and ended up with
> >>
> >> pfn = pte_pfn(*pte) | PFN_DOWN(address & page_level_mask(level
> >> - 1));
> >>
> >> Unless I got something terribly wrong, this should do the same (see
> >> the attached patch) as the existing calculations.
>
> >Actually, I don't think they're the same. I think Jarkko's version is correct. Specifically:
> >- For level = PG_LEVEL_2M they're the same.
> >- For level = PG_LEVEL_1G:
> >The current code calculates a garbage mask:
> >mask = pages_per_hpage(level) - pages_per_hpage(level - 1); translates to:
> >>> hex(262144 - 512)
> >'0x3fe00'
>
> No actually this is not a garbage mask, as I explained in earlier responses we need to capture the address bits
> to get to the correct 4K index into the RMP table.
> Therefore, for level = PG_LEVEL_1G:
> mask = pages_per_hpage(level) - pages_per_hpage(level - 1) => 0x3fe00 (which is the correct mask).
>
> >But I believe Jarkko's version calculates the correct mask (below), incorporating all 18 offset bits into the 1G page.
> >>> hex(262144 -1)
> >'0x3ffff'
>
> We can get this simply by doing (page_per_hpage(level)-1), but as I mentioned above this is not what we need.

I think you're correct, so I'll retry:

(address / PAGE_SIZE) & (pages_per_hpage(level) - pages_per_hpage(level - 1)) =

(address / PAGE_SIZE) & ((page_level_size(level) / PAGE_SIZE) - (page_level_size(level - 1) / PAGE_SIZE)) =

[ factor out 1 / PAGE_SIZE ]

(address & (page_level_size(level) - page_level_size(level - 1))) / PAGE_SIZE =

[ Substitute with PFN_DOWN() ]

PFN_DOWN(address & (page_level_size(level) - page_level_size(level - 1)))

So you can just:

pfn = pte_pfn(*pte) | PFN_DOWN(address & (page_level_size(level) - page_level_size(level - 1)));

Which is IMHO way better still what it is now because no branching
and no ad-hoc helpers (the current is essentially just page_level_size
wrapper).

BR, Jarkko

2022-09-06 17:01:00

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

[AMD Official Use Only - General]

>> >Actually, I don't think they're the same. I think Jarkko's version is correct. Specifically:
>> >- For level = PG_LEVEL_2M they're the same.
>> >- For level = PG_LEVEL_1G:
>> >The current code calculates a garbage mask:
>> >mask = pages_per_hpage(level) - pages_per_hpage(level - 1); translates to:
>> >>> hex(262144 - 512)
>> >'0x3fe00'
>>
>> No actually this is not a garbage mask, as I explained in earlier
>> responses we need to capture the address bits to get to the correct 4K index into the RMP table.
>> Therefore, for level = PG_LEVEL_1G:
>> mask = pages_per_hpage(level) - pages_per_hpage(level - 1) => 0x3fe00 (which is the correct mask).

>That's the correct mask to grab the 2M-aligned address bits, e.g:

> pfn_mask = 3fe00h = 11 1111 1110 0000 0000b

> So the last 9 bits are ignored, e.g. anything PFNs that are multiples
> of 512 (2^9), and the upper bits comes from the 1GB PTE entry.

> But there is an open question of whether we actually want to index using 2M-aligned or specific 4K-aligned PFN indicated by the faulting address.

>>
>> >But I believe Jarkko's version calculates the correct mask (below), incorporating all 18 offset bits into the 1G page.
>> >>> hex(262144 -1)
>> >'0x3ffff'
>>
>> We can get this simply by doing (page_per_hpage(level)-1), but as I mentioned above this is not what we need.

>If we actually want the 4K page, I think we would want to use the 0x3ffff mask as Marc suggested to get to the specific 4K RMP entry, which I don't think the current code is trying to do. But maybe that *should* be what we should be doing.

Ok, I agree to get to the specific 4K RMP entry.

>Based on your earlier explanation, if we index into the RMP table using 2M-aligned address, we might find that the entry does not have the page-size bit set (maybe it was PSMASH'd for some reason).

I believe that PSMASH does update the 2M-aligned RMP table entry to the smashed page size.
It sets the page size of all the intermediate/smashed entries to 4K and changes the page size of the base (2M-aligned) RMP table entry to 4K.

>If that's the cause we'd then have to calculate the index for the specific RMP entry for the specific 4K address that caused the fault, and then check that instead.

>If however we simply index directly in the 4K RMP entry from the start,
>snp_lookup_rmpentry() should still tell us whether the page is private or not, because RMPUPDATE/PSMASH are both documented to also update the assigned bits for each 4K RMP entry even if you're using a 2M RMP entry and setting the page-size >bit to cover the whole 2M range.

I think it does make sense to index directly into the 4K RMP entry, as we should be indexing into the most granular entry in the RMP table, and that will have the page "assigned" information as both RMPUPDATE/PSMASH would update
the assigned bits for each 4K RMP entry even if we are using a 2MB RMP entry (this is an important point to note).

>Additionally, snp_lookup_rmpentry() already has logic to also go check the 2M-aligned RMP entry to provide an indication of what level it is mapped at in the RMP table, so we can still use that to determine if the host mapping needs to be split or >not.

Yes.

>One thing that could use some confirmation is what happens if you do an RMPUPDATE for a 2MB RMP entry, and then go back and try to RMPUPDATE a sub-page and change the assigned bit so it's not consistent with 2MB RMP entry. I would assume >that would fail the instruction, but we should confirm that before relying on this logic.

I agree.

Thanks,
Ashish

2022-09-07 05:23:50

by Marc Orr

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

> >> >But I believe Jarkko's version calculates the correct mask (below), incorporating all 18 offset bits into the 1G page.
> >> >>> hex(262144 -1)
> >> >'0x3ffff'
> >>
> >> We can get this simply by doing (page_per_hpage(level)-1), but as I mentioned above this is not what we need.
>
> >If we actually want the 4K page, I think we would want to use the 0x3ffff mask as Marc suggested to get to the specific 4K RMP entry, which I don't think the current code is trying to do. But maybe that *should* be what we should be doing.
>
> Ok, I agree to get to the specific 4K RMP entry.

Thanks, Michael, for a thorough and complete reply! I have to admit,
there was some nuance I missed in my earlier reply. But after reading
through what you wrote, I agree, always going to the 4k-entry to get
the "assigned" bit and also leveraging the implementation of
snp_lookup_rmpentry() to lookup the size bit in the 2M-aligned entry
seems like an elegant way to code this up. Assuming this suggestion
becomes the consensus, we might consider a comment in the source code
to capture this discussion. Otherwise, I think I'll forget all of this
the next time I'm reading this code :-). Something like:

/*
* The guest-assigned bit is always propagated to the paddr's respective 4k RMP
* entry -- regardless of the actual RMP page size. In contrast, the RMP page
* size, handled in snp_lookup_rmpentry(), is indicated by the 2M-aligned RMP
* entry.
*/

2022-09-08 07:57:12

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Tue, Sep 06, 2022 at 06:44:23PM +0300, Jarkko Sakkinen wrote:
> On Tue, Sep 06, 2022 at 02:17:15PM +0000, Kalra, Ashish wrote:
> > [AMD Official Use Only - General]
> >
> > >> On Tue, Aug 09, 2022 at 06:55:43PM +0200, Borislav Petkov wrote:
> > >> > On Mon, Jun 20, 2022 at 11:03:43PM +0000, Ashish Kalra wrote:
> > >> > > + pfn = pte_pfn(*pte);
> > >> > > +
> > >> > > + /* If its large page then calculte the fault pfn */
> > >> > > + if (level > PG_LEVEL_4K) {
> > >> > > + unsigned long mask;
> > >> > > +
> > >> > > + mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
> > >> > > + pfn |= (address >> PAGE_SHIFT) & mask;
> > >> >
> > >> > Oh boy, this is unnecessarily complicated. Isn't this
> > >> >
> > >> > pfn |= pud_index(address);
> > >> >
> > >> > or
> > >> > pfn |= pmd_index(address);
> > >>
> > >> I played with this a bit and ended up with
> > >>
> > >> pfn = pte_pfn(*pte) | PFN_DOWN(address & page_level_mask(level
> > >> - 1));
> > >>
> > >> Unless I got something terribly wrong, this should do the same (see
> > >> the attached patch) as the existing calculations.
> >
> > >Actually, I don't think they're the same. I think Jarkko's version is correct. Specifically:
> > >- For level = PG_LEVEL_2M they're the same.
> > >- For level = PG_LEVEL_1G:
> > >The current code calculates a garbage mask:
> > >mask = pages_per_hpage(level) - pages_per_hpage(level - 1); translates to:
> > >>> hex(262144 - 512)
> > >'0x3fe00'
> >
> > No actually this is not a garbage mask, as I explained in earlier responses we need to capture the address bits
> > to get to the correct 4K index into the RMP table.
> > Therefore, for level = PG_LEVEL_1G:
> > mask = pages_per_hpage(level) - pages_per_hpage(level - 1) => 0x3fe00 (which is the correct mask).
> >
> > >But I believe Jarkko's version calculates the correct mask (below), incorporating all 18 offset bits into the 1G page.
> > >>> hex(262144 -1)
> > >'0x3ffff'
> >
> > We can get this simply by doing (page_per_hpage(level)-1), but as I mentioned above this is not what we need.
>
> I think you're correct, so I'll retry:
>
> (address / PAGE_SIZE) & (pages_per_hpage(level) - pages_per_hpage(level - 1)) =
>
> (address / PAGE_SIZE) & ((page_level_size(level) / PAGE_SIZE) - (page_level_size(level - 1) / PAGE_SIZE)) =
>
> [ factor out 1 / PAGE_SIZE ]
>
> (address & (page_level_size(level) - page_level_size(level - 1))) / PAGE_SIZE =
>
> [ Substitute with PFN_DOWN() ]
>
> PFN_DOWN(address & (page_level_size(level) - page_level_size(level - 1)))
>
> So you can just:
>
> pfn = pte_pfn(*pte) | PFN_DOWN(address & (page_level_size(level) - page_level_size(level - 1)));
>
> Which is IMHO way better still what it is now because no branching
> and no ad-hoc helpers (the current is essentially just page_level_size
> wrapper).

I created a small test program:

$ cat test.c
#include <stdio.h>
int main(void)
{
	unsigned long arr[] = {0x8, 0x1000, 0x200000, 0x40000000, 0x8000000000};
	int i;

	for (i = 1; i < sizeof(arr)/sizeof(unsigned long); i++) {
		printf("%048b\n", arr[i] - arr[i - 1]);
		printf("%048b\n", (arr[i] - 1) ^ (arr[i - 1] - 1));
	}
}

kultaheltta in linux on  host-snp-v7 [?]
$ gcc -o test test.c

kultaheltta in linux on  host-snp-v7 [?]
$ ./test
000000000000000000000000000000000000111111111000
000000000000000000000000000000000000111111111000
000000000000000000000000000111111111000000000000
000000000000000000000000000111111111000000000000
000000000000000000111111111000000000000000000000
000000000000000000111111111000000000000000000000
000000000000000011000000000000000000000000000000
000000000000000011000000000000000000000000000000

So the operation could be described as:

pfn = PFN_DOWN(address & (~page_level_mask(level) ^ ~page_level_mask(level - 1)));

Which IMHO already documents itself quite well: index
with the granularity of PGD by removing bits used for
PGD's below it.

BR, Jarkko

2022-09-08 08:06:56

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 09/49] x86/fault: Add support to handle the RMP fault for user address

On Thu, Sep 08, 2022 at 10:46:51AM +0300, Jarkko Sakkinen wrote:
> On Tue, Sep 06, 2022 at 06:44:23PM +0300, Jarkko Sakkinen wrote:
> > On Tue, Sep 06, 2022 at 02:17:15PM +0000, Kalra, Ashish wrote:
> > > [AMD Official Use Only - General]
> > >
> > > >> On Tue, Aug 09, 2022 at 06:55:43PM +0200, Borislav Petkov wrote:
> > > >> > On Mon, Jun 20, 2022 at 11:03:43PM +0000, Ashish Kalra wrote:
> > > >> > > + pfn = pte_pfn(*pte);
> > > >> > > +
> > > >> > > + /* If its large page then calculte the fault pfn */
> > > >> > > + if (level > PG_LEVEL_4K) {
> > > >> > > + unsigned long mask;
> > > >> > > +
> > > >> > > + mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
> > > >> > > + pfn |= (address >> PAGE_SHIFT) & mask;
> > > >> >
> > > >> > Oh boy, this is unnecessarily complicated. Isn't this
> > > >> >
> > > >> > pfn |= pud_index(address);
> > > >> >
> > > >> > or
> > > >> > pfn |= pmd_index(address);
> > > >>
> > > >> I played with this a bit and ended up with
> > > >>
> > > >> pfn = pte_pfn(*pte) | PFN_DOWN(address & page_level_mask(level
> > > >> - 1));
> > > >>
> > > >> Unless I got something terribly wrong, this should do the same (see
> > > >> the attached patch) as the existing calculations.
> > >
> > > >Actually, I don't think they're the same. I think Jarkko's version is correct. Specifically:
> > > >- For level = PG_LEVEL_2M they're the same.
> > > >- For level = PG_LEVEL_1G:
> > > >The current code calculates a garbage mask:
> > > >mask = pages_per_hpage(level) - pages_per_hpage(level - 1); translates to:
> > > >>> hex(262144 - 512)
> > > >'0x3fe00'
> > >
> > > No actually this is not a garbage mask, as I explained in earlier responses we need to capture the address bits
> > > to get to the correct 4K index into the RMP table.
> > > Therefore, for level = PG_LEVEL_1G:
> > > mask = pages_per_hpage(level) - pages_per_hpage(level - 1) => 0x3fe00 (which is the correct mask).
> > >
> > > >But I believe Jarkko's version calculates the correct mask (below), incorporating all 18 offset bits into the 1G page.
> > > >>> hex(262144 -1)
> > > >'0x3ffff'
> > >
> > > We can get this simply by doing (page_per_hpage(level)-1), but as I mentioned above this is not what we need.
> >
> > I think you're correct, so I'll retry:
> >
> > (address / PAGE_SIZE) & (pages_per_hpage(level) - pages_per_hpage(level - 1)) =
> >
> > (address / PAGE_SIZE) & ((page_level_size(level) / PAGE_SIZE) - (page_level_size(level - 1) / PAGE_SIZE)) =
> >
> > [ factor out 1 / PAGE_SIZE ]
> >
> > (address & (page_level_size(level) - page_level_size(level - 1))) / PAGE_SIZE =
> >
> > [ Substitute with PFN_DOWN() ]
> >
> > PFN_DOWN(address & (page_level_size(level) - page_level_size(level - 1)))
> >
> > So you can just:
> >
> > pfn = pte_pfn(*pte) | PFN_DOWN(address & (page_level_size(level) - page_level_size(level - 1)));
> >
> > Which is IMHO way better still what it is now because no branching
> > and no ad-hoc helpers (the current is essentially just page_level_size
> > wrapper).
>
> I created a small test program:
>
> $ cat test.c
> #include <stdio.h>
> int main(void)
> {
> unsigned long arr[] = {0x8, 0x1000, 0x200000, 0x40000000, 0x8000000000};
> int i;
>
> for (i = 1; i < sizeof(arr)/sizeof(unsigned long); i++) {
> printf("%048b\n", arr[i] - arr[i - 1]);
> printf("%048b\n", (arr[i] - 1) ^ (arr[i - 1] - 1));
> }
> }
>
> kultaheltta in linux on  host-snp-v7 [?]
> $ gcc -o test test.c
>
> kultaheltta in linux on  host-snp-v7 [?]
> $ ./test
> 000000000000000000000000000000000000111111111000
> 000000000000000000000000000000000000111111111000
> 000000000000000000000000000111111111000000000000
> 000000000000000000000000000111111111000000000000
> 000000000000000000111111111000000000000000000000
> 000000000000000000111111111000000000000000000000
> 000000000000000011000000000000000000000000000000
> 000000000000000011000000000000000000000000000000
>
> So the operation could be described as:
>
> pfn = PFN_DOWN(address & (~page_level_mask(level) ^ ~page_level_mask(level - 1)));
>
> Which IMHO already documents itself quite well: index
> with the granularity of PGD by removing bits used for
> PGD's below it.

I mean:

pfn = pte_pfn(*pte) | PFN_DOWN(address & (~page_level_mask(level) ^ ~page_level_mask(level - 1)));

Note that PG_LEVEL_4K check is unnecessary as the result
will be zero after PFN_DOWN().
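
For reference, a throwaway user-space check (the page_level_size()/page_level_mask() stand-ins below are simplified approximations, not the kernel's definitions) that the two mask expressions agree for the 2M and 1G levels:

#include <stdio.h>

#define PAGE_SHIFT	12
#define PFN_DOWN(x)	((x) >> PAGE_SHIFT)

/* Simplified stand-ins: level 1 = 4K, level 2 = 2M, level 3 = 1G. */
static unsigned long page_level_size(int level)
{
	return 1UL << (PAGE_SHIFT + 9 * (level - 1));
}

static unsigned long page_level_mask(int level)
{
	return ~(page_level_size(level) - 1);
}

int main(void)
{
	unsigned long address = 0x7f1234567abcUL;
	int level;

	for (level = 2; level <= 3; level++) {
		unsigned long a = PFN_DOWN(address &
			(page_level_size(level) - page_level_size(level - 1)));
		unsigned long b = PFN_DOWN(address &
			(~page_level_mask(level) ^ ~page_level_mask(level - 1)));

		printf("level %d: %#lx %#lx -> %s\n", level, a, b,
		       a == b ? "equal" : "DIFFER");
	}
	return 0;
}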

BR, Jarkko

2022-09-08 15:13:04

by Harald Hoyer

[permalink] [raw]
Subject: [[PATCH for v6]] KVM: SEV: fix snp_launch_finish

The `params.auth_key_en` indicator does _not_ specify whether an
ID_AUTH struct should be sent or not, but whether the ID_AUTH struct
contains an author key or not. The firmware always expects an ID_AUTH block.

Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Harald Hoyer <[email protected]>
---
arch/x86/kvm/svm/sev.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 85357dc4d231..5cf4be6a33ba 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2242,17 +2242,18 @@ static int snp_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)

data->id_block_en = 1;
data->id_block_paddr = __sme_pa(id_block);
- }

- if (params.auth_key_en) {
id_auth = psp_copy_user_blob(params.id_auth_uaddr, KVM_SEV_SNP_ID_AUTH_SIZE);
if (IS_ERR(id_auth)) {
ret = PTR_ERR(id_auth);
goto e_free_id_block;
}

- data->auth_key_en = 1;
data->id_auth_paddr = __sme_pa(id_auth);
+
+ if (params.auth_key_en) {
+ data->auth_key_en = 1;
+ }
}

data->gctx_paddr = __psp_pa(sev->snp_context);
--
2.37.1

2022-09-08 15:18:43

by Sean Christopherson

[permalink] [raw]
Subject: Re: [[PATCH for v6]] KVM: SEV: fix snp_launch_finish

On Thu, Sep 08, 2022, Harald Hoyer wrote:
> The `params.auth_key_en` indicator does _not_ specify, whether an
> ID_AUTH struct should be sent or not, but, wheter the ID_AUTH struct
> contains an author key or not. The firmware always expects an ID_AUTH block.
>
> Link: https://lore.kernel.org/all/[email protected]/

Please provide feedback by directly responding to whatever patch/email is buggy.
Or if that's too complicated for some reason (unlikely in this case), provide the
fixup patch to the author *off-list*.

> Signed-off-by: Harald Hoyer <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 85357dc4d231..5cf4be6a33ba 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2242,17 +2242,18 @@ static int snp_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
>
> data->id_block_en = 1;
> data->id_block_paddr = __sme_pa(id_block);
> - }
>
> - if (params.auth_key_en) {
> id_auth = psp_copy_user_blob(params.id_auth_uaddr, KVM_SEV_SNP_ID_AUTH_SIZE);
> if (IS_ERR(id_auth)) {
> ret = PTR_ERR(id_auth);
> goto e_free_id_block;
> }
>
> - data->auth_key_en = 1;
> data->id_auth_paddr = __sme_pa(id_auth);
> +
> + if (params.auth_key_en) {

While I'm here though... Single line if-statements don't need curly braces.

> + data->auth_key_en = 1;
> + }
> }
>
> data->gctx_paddr = __psp_pa(sev->snp_context);
> --
> 2.37.1
>

2022-09-08 20:39:31

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [[PATCH for v6]] KVM: SEV: fix snp_launch_finish

On Thu, Sep 08, 2022 at 03:11:30PM +0000, Sean Christopherson wrote:
> On Thu, Sep 08, 2022, Harald Hoyer wrote:
> > The `params.auth_key_en` indicator does _not_ specify, whether an
> > ID_AUTH struct should be sent or not, but, wheter the ID_AUTH struct
> > contains an author key or not. The firmware always expects an ID_AUTH block.
> >
> > Link: https://lore.kernel.org/all/[email protected]/
>
> Please provide feedback by directly responding to whatever patch/email is buggy.
> Or if that's too complicated for some reason (unlikely in this case), provide the
> fixup patch to the author *off-list*.

I'd guess that'd be:

https://lore.kernel.org/all/6a513cf79bf71c479dbd72165faf1d804d77b3af.1655761627.git.ashish.kalra@amd.com/

BR, Jarkko

2022-09-09 10:29:56

by Harald Hoyer

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 28/49] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command

Replying inline to the patch (and not with an in-reply-to patch, as noted by Sean Christopherson).

Am 21.06.22 um 01:08 schrieb Ashish Kalra:
> From: Brijesh Singh <[email protected]>
>
> The KVM_SEV_SNP_LAUNCH_FINISH finalize the cryptographic digest and stores
> it as the measurement of the guest at launch.
>
> While finalizing the launch flow, it also issues the LAUNCH_UPDATE command
> to encrypt the VMSA pages.
>
> If its an SNP guest, then VMSA was added in the RMP entry as
> a guest owned page and also removed from the kernel direct map
> so flush it later after it is transitioned back to hypervisor
> state and restored in the direct map.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> Signed-off-by: Ashish Kalra <[email protected]>
> ---
> .../virt/kvm/x86/amd-memory-encryption.rst | 22 ++++
> arch/x86/kvm/svm/sev.c | 119 ++++++++++++++++++
> include/uapi/linux/kvm.h | 14 +++
> 3 files changed, 155 insertions(+)
>
> diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> index 62abd5c1f72b..750162cff87b 100644
> --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
> @@ -514,6 +514,28 @@ Returns: 0 on success, -negative on error
> See the SEV-SNP spec for further details on how to build the VMPL permission
> mask and page type.
>
> +21. KVM_SNP_LAUNCH_FINISH
> +-------------------------
> +
> +After completion of the SNP guest launch flow, the KVM_SNP_LAUNCH_FINISH command can be
> +issued to make the guest ready for the execution.
> +
> +Parameters (in): struct kvm_sev_snp_launch_finish
> +
> +Returns: 0 on success, -negative on error
> +
> +::
> +
> + struct kvm_sev_snp_launch_finish {
> + __u64 id_block_uaddr;
> + __u64 id_auth_uaddr;
> + __u8 id_block_en;
> + __u8 auth_key_en;
> + __u8 host_data[32];
> + };
> +
> +
> +See SEV-SNP specification for further details on launch finish input parameters.
>
> References
> ==========
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index a9461d352eda..a5b90469683f 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2095,6 +2095,106 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> return ret;
> }
>
> +static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_launch_update data = {};
> + int i, ret;
> +
> + data.gctx_paddr = __psp_pa(sev->snp_context);
> + data.page_type = SNP_PAGE_TYPE_VMSA;
> +
> + for (i = 0; i < kvm->created_vcpus; i++) {
> + struct vcpu_svm *svm = to_svm(xa_load(&kvm->vcpu_array, i));
> + u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
> +
> + /* Perform some pre-encryption checks against the VMSA */
> + ret = sev_es_sync_vmsa(svm);
> + if (ret)
> + return ret;
> +
> + /* Transition the VMSA page to a firmware state. */
> + ret = rmp_make_private(pfn, -1, PG_LEVEL_4K, sev->asid, true);
> + if (ret)
> + return ret;
> +
> + /* Issue the SNP command to encrypt the VMSA */
> + data.address = __sme_pa(svm->sev_es.vmsa);
> + ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
> + &data, &argp->error);
> + if (ret) {
> + snp_page_reclaim(pfn);
> + return ret;
> + }
> +
> + svm->vcpu.arch.guest_state_protected = true;
> + }
> +
> + return 0;
> +}
> +
> +static int snp_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + struct sev_data_snp_launch_finish *data;
> + void *id_block = NULL, *id_auth = NULL;
> + struct kvm_sev_snp_launch_finish params;
> + int ret;
> +
> + if (!sev_snp_guest(kvm))
> + return -ENOTTY;
> +
> + if (!sev->snp_context)
> + return -EINVAL;
> +
> + if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
> + return -EFAULT;
> +
> + /* Measure all vCPUs using LAUNCH_UPDATE before we finalize the launch flow. */
> + ret = snp_launch_update_vmsa(kvm, argp);

This poses a real problem for those who want to precalculate the digest beforehand and sign their TEE without loading the TEE:
1. We don't know the contents of the VMSA, nor the hash of it.
2. Who guarantees that future kernels have the same VMSA contents?

I would propose at least one additional ioctl parameter specifying the final VMSA for the SNP_PAGE_TYPE_VMSA snp_launch_update_vmsa (a rough sketch follows below).
This parameter could specify to use:
- the current VMSA
- or a VMSA resembling the CPU state on reset, where the contents are guaranteed to never change and have a defined digest
- or a user provided VMSA
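
A rough sketch of such a selector (hypothetical names and values, not something from the posted series) could look like:

/*
 * Hypothetical extension of struct kvm_sev_snp_launch_finish: let
 * userspace choose which VMSA contents get measured by LAUNCH_UPDATE.
 */
enum kvm_sev_snp_vmsa_source {
	KVM_SEV_SNP_VMSA_CURRENT = 0,	/* measure the live, kernel-built VMSA */
	KVM_SEV_SNP_VMSA_RESET   = 1,	/* measure a fixed architectural reset state */
	KVM_SEV_SNP_VMSA_USER    = 2,	/* measure a user-supplied VMSA blob */
};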

> + if (ret)
> + return ret;
> +
> + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
> + if (!data)
> + return -ENOMEM;
> +
> + if (params.id_block_en) {
> + id_block = psp_copy_user_blob(params.id_block_uaddr, KVM_SEV_SNP_ID_BLOCK_SIZE);
> + if (IS_ERR(id_block)) {
> + ret = PTR_ERR(id_block);
> + goto e_free;
> + }
> +
> + data->id_block_en = 1;
> + data->id_block_paddr = __sme_pa(id_block);
> + }
> +
> + if (params.auth_key_en) {

The `params.auth_key_en` indicator does _not_ specify whether an ID_AUTH struct should be sent or not,
but whether the ID_AUTH struct contains an author key or not. The firmware always expects an ID_AUTH block.

So, please move the upper `if` to enclose only `data->auth_key_en = 1;`, or use my patch sent in-reply to this mail yesterday.

> + id_auth = psp_copy_user_blob(params.id_auth_uaddr, KVM_SEV_SNP_ID_AUTH_SIZE);
> + if (IS_ERR(id_auth)) {
> + ret = PTR_ERR(id_auth);
> + goto e_free_id_block;
> + }
> +
> + data->auth_key_en = 1;
> + data->id_auth_paddr = __sme_pa(id_auth);
> + }
> +
> + data->gctx_paddr = __psp_pa(sev->snp_context);
> + ret = sev_issue_cmd(kvm, SEV_CMD_SNP_LAUNCH_FINISH, data, &argp->error);
> +
> + kfree(id_auth);
> +
> +e_free_id_block:
> + kfree(id_block);
> +
> +e_free:
> + kfree(data);
> +
> + return ret;
> +}
> +
> int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_sev_cmd sev_cmd;
> @@ -2191,6 +2291,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> case KVM_SEV_SNP_LAUNCH_UPDATE:
> r = snp_launch_update(kvm, &sev_cmd);
> break;
> + case KVM_SEV_SNP_LAUNCH_FINISH:
> + r = snp_launch_finish(kvm, &sev_cmd);
> + break;
> default:
> r = -EINVAL;
> goto out;
> @@ -2696,11 +2799,27 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
>
> svm = to_svm(vcpu);
>
> + /*
> + * If its an SNP guest, then VMSA was added in the RMP entry as
> + * a guest owned page. Transition the page to hypervisor state
> + * before releasing it back to the system.
> + * Also the page is removed from the kernel direct map, so flush it
> + * later after it is transitioned back to hypervisor state and
> + * restored in the direct map.
> + */
> + if (sev_snp_guest(vcpu->kvm)) {
> + u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
> +
> + if (host_rmp_make_shared(pfn, PG_LEVEL_4K, false))
> + goto skip_vmsa_free;
> + }
> +
> if (vcpu->arch.guest_state_protected)
> sev_flush_encrypted_page(vcpu, svm->sev_es.vmsa);
>
> __free_page(virt_to_page(svm->sev_es.vmsa));
>
> +skip_vmsa_free:
> if (svm->sev_es.ghcb_sa_free)
> kvfree(svm->sev_es.ghcb_sa);
> }
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 9b36b07414ea..5a4662716b6a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1814,6 +1814,7 @@ enum sev_cmd_id {
> KVM_SEV_SNP_INIT,
> KVM_SEV_SNP_LAUNCH_START,
> KVM_SEV_SNP_LAUNCH_UPDATE,
> + KVM_SEV_SNP_LAUNCH_FINISH,
>
> KVM_SEV_NR_MAX,
> };
> @@ -1948,6 +1949,19 @@ struct kvm_sev_snp_launch_update {
> __u8 vmpl1_perms;
> };
>
> +#define KVM_SEV_SNP_ID_BLOCK_SIZE 96
> +#define KVM_SEV_SNP_ID_AUTH_SIZE 4096
> +#define KVM_SEV_SNP_FINISH_DATA_SIZE 32
> +
> +struct kvm_sev_snp_launch_finish {
> + __u64 id_block_uaddr;
> + __u64 id_auth_uaddr;
> + __u8 id_block_en;
> + __u8 auth_key_en;
> + __u8 host_data[KVM_SEV_SNP_FINISH_DATA_SIZE];
> + __u8 pad[6];
> +};
> +
> #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
> #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
> #define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)

2022-09-19 18:10:46

by Alper Gun

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 37/49] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT

On Fri, Aug 19, 2022 at 9:54 AM Peter Gonda <[email protected]> wrote:
>
> > +
> > +static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op, gpa_t gpa,
> > + int level)
> > +{
> > + struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info;
> > + struct kvm *kvm = vcpu->kvm;
> > + int rc, npt_level;
> > + kvm_pfn_t pfn;
> > + gpa_t gpa_end;
> > +
> > + gpa_end = gpa + page_level_size(level);
> > +
> > + while (gpa < gpa_end) {
> > + /*
> > + * If the gpa is not present in the NPT then build the NPT.
> > + */
> > + rc = snp_check_and_build_npt(vcpu, gpa, level);
> > + if (rc)
> > + return -EINVAL;
> > +
> > + if (op == SNP_PAGE_STATE_PRIVATE) {
> > + hva_t hva;
> > +
> > + if (snp_gpa_to_hva(kvm, gpa, &hva))
> > + return -EINVAL;
> > +
> > + /*
> > + * Verify that the hva range is registered. This enforcement is
> > + * required to avoid the cases where a page is marked private
> > + * in the RMP table but never gets cleanup during the VM
> > + * termination path.
> > + */
> > + mutex_lock(&kvm->lock);
> > + rc = is_hva_registered(kvm, hva, page_level_size(level));
> > + mutex_unlock(&kvm->lock);
> > + if (!rc)
> > + return -EINVAL;
> > +
> > + /*
> > + * Mark the userspace range unmerable before adding the pages
> > + * in the RMP table.
> > + */
> > + mmap_write_lock(kvm->mm);
> > + rc = snp_mark_unmergable(kvm, hva, page_level_size(level));
> > + mmap_write_unlock(kvm->mm);
> > + if (rc)
> > + return -EINVAL;
> > + }
> > +
> > + write_lock(&kvm->mmu_lock);
> > +
> > + rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
> > + if (!rc) {
> > + /*
> > + * This may happen if another vCPU unmapped the page
> > + * before we acquire the lock. Retry the PSC.
> > + */
> > + write_unlock(&kvm->mmu_lock);
> > + return 0;
> > + }
>
> I think we want to return -EAGAIN or similar if we want the caller to
> retry, right? I think returning 0 here hides the error.
>

The problem here is that the caller (the Linux guest kernel) doesn't
retry if the PSC fails. The current implementation in the guest kernel
is that if a page state change request fails, it terminates the VM with
the GHCB_TERM_PSC reason.
Returning 0 here is not a good option because it will fail the PSC
silently and will probably cause a nested RMP fault later. Returning
an error also terminates the guest immediately with the current guest
implementation. I think the best approach here is adding retry logic
to this function. Retrying without returning an error should make it
work, because snp_check_and_build_npt will be called again and the
second attempt should succeed.

> > +
> > + /*
> > + * Adjust the level so that we don't go higher than the backing
> > + * page level.
> > + */
> > + level = min_t(size_t, level, npt_level);
> > +
> > + trace_kvm_snp_psc(vcpu->vcpu_id, pfn, gpa, op, level);
> > +
> > + switch (op) {
> > + case SNP_PAGE_STATE_SHARED:
> > + rc = snp_make_page_shared(kvm, gpa, pfn, level);
> > + break;
> > + case SNP_PAGE_STATE_PRIVATE:
> > + rc = rmp_make_private(pfn, gpa, level, sev->asid, false);
> > + break;
> > + default:
> > + rc = -EINVAL;
> > + break;
> > + }
> > +
> > + write_unlock(&kvm->mmu_lock);
> > +
> > + if (rc) {
> > + pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n",
> > + op, gpa, pfn, level, rc);
> > + return rc;
> > + }
> > +
> > + gpa = gpa + page_level_size(level);
> > + }
> > +
> > + return 0;
> > +}
> > +

2022-09-19 21:40:41

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 37/49] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT

On 9/19/22 12:53, Alper Gun wrote:
> On Fri, Aug 19, 2022 at 9:54 AM Peter Gonda <[email protected]> wrote:
>>
>>> +
>>> +static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op, gpa_t gpa,
>>> + int level)
>>> +{
>>> + struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info;
>>> + struct kvm *kvm = vcpu->kvm;
>>> + int rc, npt_level;
>>> + kvm_pfn_t pfn;
>>> + gpa_t gpa_end;
>>> +
>>> + gpa_end = gpa + page_level_size(level);
>>> +
>>> + while (gpa < gpa_end) {
>>> + /*
>>> + * If the gpa is not present in the NPT then build the NPT.
>>> + */
>>> + rc = snp_check_and_build_npt(vcpu, gpa, level);
>>> + if (rc)
>>> + return -EINVAL;
>>> +
>>> + if (op == SNP_PAGE_STATE_PRIVATE) {
>>> + hva_t hva;
>>> +
>>> + if (snp_gpa_to_hva(kvm, gpa, &hva))
>>> + return -EINVAL;
>>> +
>>> + /*
>>> + * Verify that the hva range is registered. This enforcement is
>>> + * required to avoid the cases where a page is marked private
>>> + * in the RMP table but never gets cleanup during the VM
>>> + * termination path.
>>> + */
>>> + mutex_lock(&kvm->lock);
>>> + rc = is_hva_registered(kvm, hva, page_level_size(level));
>>> + mutex_unlock(&kvm->lock);
>>> + if (!rc)
>>> + return -EINVAL;
>>> +
>>> + /*
>>> + * Mark the userspace range unmerable before adding the pages
>>> + * in the RMP table.
>>> + */
>>> + mmap_write_lock(kvm->mm);
>>> + rc = snp_mark_unmergable(kvm, hva, page_level_size(level));
>>> + mmap_write_unlock(kvm->mm);
>>> + if (rc)
>>> + return -EINVAL;
>>> + }
>>> +
>>> + write_lock(&kvm->mmu_lock);
>>> +
>>> + rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
>>> + if (!rc) {
>>> + /*
>>> + * This may happen if another vCPU unmapped the page
>>> + * before we acquire the lock. Retry the PSC.
>>> + */
>>> + write_unlock(&kvm->mmu_lock);
>>> + return 0;
>>> + }
>>
>> I think we want to return -EAGAIN or similar if we want the caller to
>> retry, right? I think returning 0 here hides the error.
>>
>
> The problem here is that the caller(linux guest kernel) doesn't retry
> if PSC fails. The current implementation in the guest kernel is that
> if a page state change request fails, it terminates the VM with
> GHCB_TERM_PSC reason.
> Returning 0 here is not a good option because it will fail the PSC
> silently and will probably cause a nested RMP fault later. Returning

Returning 0 here is ok because the PSC current index into the PSC
structure will not be updated and the guest will then retry (see the loop
in vmgexit_psc() in arch/x86/kernel/sev.c).

Thanks,
Tom

> an error also terminates the guest immediately with current guest
> implementation. I think the best approach here is adding a retry logic
> to this function. Retrying without returning an error should help it
> work because snp_check_and_build_npt will be called again and in the
> second attempt this should work.
>
>>> +
>>> + /*
>>> + * Adjust the level so that we don't go higher than the backing
>>> + * page level.
>>> + */
>>> + level = min_t(size_t, level, npt_level);
>>> +
>>> + trace_kvm_snp_psc(vcpu->vcpu_id, pfn, gpa, op, level);
>>> +
>>> + switch (op) {
>>> + case SNP_PAGE_STATE_SHARED:
>>> + rc = snp_make_page_shared(kvm, gpa, pfn, level);
>>> + break;
>>> + case SNP_PAGE_STATE_PRIVATE:
>>> + rc = rmp_make_private(pfn, gpa, level, sev->asid, false);
>>> + break;
>>> + default:
>>> + rc = -EINVAL;
>>> + break;
>>> + }
>>> +
>>> + write_unlock(&kvm->mmu_lock);
>>> +
>>> + if (rc) {
>>> + pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n",
>>> + op, gpa, pfn, level, rc);
>>> + return rc;
>>> + }
>>> +
>>> + gpa = gpa + page_level_size(level);
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +

2022-09-19 22:03:26

by Alper Gun

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 37/49] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT

On Mon, Sep 19, 2022 at 2:38 PM Tom Lendacky <[email protected]> wrote:
>
> On 9/19/22 12:53, Alper Gun wrote:
> > On Fri, Aug 19, 2022 at 9:54 AM Peter Gonda <[email protected]> wrote:
> >>
> >>> +
> >>> +static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op, gpa_t gpa,
> >>> + int level)
> >>> +{
> >>> + struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info;
> >>> + struct kvm *kvm = vcpu->kvm;
> >>> + int rc, npt_level;
> >>> + kvm_pfn_t pfn;
> >>> + gpa_t gpa_end;
> >>> +
> >>> + gpa_end = gpa + page_level_size(level);
> >>> +
> >>> + while (gpa < gpa_end) {
> >>> + /*
> >>> + * If the gpa is not present in the NPT then build the NPT.
> >>> + */
> >>> + rc = snp_check_and_build_npt(vcpu, gpa, level);
> >>> + if (rc)
> >>> + return -EINVAL;
> >>> +
> >>> + if (op == SNP_PAGE_STATE_PRIVATE) {
> >>> + hva_t hva;
> >>> +
> >>> + if (snp_gpa_to_hva(kvm, gpa, &hva))
> >>> + return -EINVAL;
> >>> +
> >>> + /*
> >>> + * Verify that the hva range is registered. This enforcement is
> >>> + * required to avoid the cases where a page is marked private
> >>> + * in the RMP table but never gets cleanup during the VM
> >>> + * termination path.
> >>> + */
> >>> + mutex_lock(&kvm->lock);
> >>> + rc = is_hva_registered(kvm, hva, page_level_size(level));
> >>> + mutex_unlock(&kvm->lock);
> >>> + if (!rc)
> >>> + return -EINVAL;
> >>> +
> >>> + /*
> >>> + * Mark the userspace range unmerable before adding the pages
> >>> + * in the RMP table.
> >>> + */
> >>> + mmap_write_lock(kvm->mm);
> >>> + rc = snp_mark_unmergable(kvm, hva, page_level_size(level));
> >>> + mmap_write_unlock(kvm->mm);
> >>> + if (rc)
> >>> + return -EINVAL;
> >>> + }
> >>> +
> >>> + write_lock(&kvm->mmu_lock);
> >>> +
> >>> + rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
> >>> + if (!rc) {
> >>> + /*
> >>> + * This may happen if another vCPU unmapped the page
> >>> + * before we acquire the lock. Retry the PSC.
> >>> + */
> >>> + write_unlock(&kvm->mmu_lock);
> >>> + return 0;
> >>> + }
> >>
> >> I think we want to return -EAGAIN or similar if we want the caller to
> >> retry, right? I think returning 0 here hides the error.
> >>
> >
> > The problem here is that the caller(linux guest kernel) doesn't retry
> > if PSC fails. The current implementation in the guest kernel is that
> > if a page state change request fails, it terminates the VM with
> > GHCB_TERM_PSC reason.
> > Returning 0 here is not a good option because it will fail the PSC
> > silently and will probably cause a nested RMP fault later. Returning
>
> Returning 0 here is ok because the PSC current index into the PSC
> structure will not be updated and the guest will then retry (see the loop
> in vmgexit_psc() in arch/x86/kernel/sev.c).
>
> Thanks,
> Tom

But the host code updates the index. It doesn't leave the loop because
rc is 0. The guest will think that it is successful.
rc = __snp_handle_page_state_change(vcpu, op, gpa, level);
if (rc)
goto out;

Also the page state change request with MSR is not retried. It
terminates the VM if the MSR request fails.

>
> > an error also terminates the guest immediately with current guest
> > implementation. I think the best approach here is adding a retry logic
> > to this function. Retrying without returning an error should help it
> > work because snp_check_and_build_npt will be called again and in the
> > second attempt this should work.
> >
> >>> +
> >>> + /*
> >>> + * Adjust the level so that we don't go higher than the backing
> >>> + * page level.
> >>> + */
> >>> + level = min_t(size_t, level, npt_level);
> >>> +
> >>> + trace_kvm_snp_psc(vcpu->vcpu_id, pfn, gpa, op, level);
> >>> +
> >>> + switch (op) {
> >>> + case SNP_PAGE_STATE_SHARED:
> >>> + rc = snp_make_page_shared(kvm, gpa, pfn, level);
> >>> + break;
> >>> + case SNP_PAGE_STATE_PRIVATE:
> >>> + rc = rmp_make_private(pfn, gpa, level, sev->asid, false);
> >>> + break;
> >>> + default:
> >>> + rc = -EINVAL;
> >>> + break;
> >>> + }
> >>> +
> >>> + write_unlock(&kvm->mmu_lock);
> >>> +
> >>> + if (rc) {
> >>> + pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n",
> >>> + op, gpa, pfn, level, rc);
> >>> + return rc;
> >>> + }
> >>> +
> >>> + gpa = gpa + page_level_size(level);
> >>> + }
> >>> +
> >>> + return 0;
> >>> +}
> >>> +

2022-09-19 22:25:22

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 37/49] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT

On 9/19/22 17:02, Alper Gun wrote:
> On Mon, Sep 19, 2022 at 2:38 PM Tom Lendacky <[email protected]> wrote:
>>
>> On 9/19/22 12:53, Alper Gun wrote:
>>> On Fri, Aug 19, 2022 at 9:54 AM Peter Gonda <[email protected]> wrote:
>>>>
>>>>> +
>>>>> +static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op, gpa_t gpa,
>>>>> + int level)
>>>>> +{
>>>>> + struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info;
>>>>> + struct kvm *kvm = vcpu->kvm;
>>>>> + int rc, npt_level;
>>>>> + kvm_pfn_t pfn;
>>>>> + gpa_t gpa_end;
>>>>> +
>>>>> + gpa_end = gpa + page_level_size(level);
>>>>> +
>>>>> + while (gpa < gpa_end) {
>>>>> + /*
>>>>> + * If the gpa is not present in the NPT then build the NPT.
>>>>> + */
>>>>> + rc = snp_check_and_build_npt(vcpu, gpa, level);
>>>>> + if (rc)
>>>>> + return -EINVAL;
>>>>> +
>>>>> + if (op == SNP_PAGE_STATE_PRIVATE) {
>>>>> + hva_t hva;
>>>>> +
>>>>> + if (snp_gpa_to_hva(kvm, gpa, &hva))
>>>>> + return -EINVAL;
>>>>> +
>>>>> + /*
>>>>> + * Verify that the hva range is registered. This enforcement is
>>>>> + * required to avoid the cases where a page is marked private
>>>>> + * in the RMP table but never gets cleanup during the VM
>>>>> + * termination path.
>>>>> + */
>>>>> + mutex_lock(&kvm->lock);
>>>>> + rc = is_hva_registered(kvm, hva, page_level_size(level));
>>>>> + mutex_unlock(&kvm->lock);
>>>>> + if (!rc)
>>>>> + return -EINVAL;
>>>>> +
>>>>> + /*
>>>>> + * Mark the userspace range unmerable before adding the pages
>>>>> + * in the RMP table.
>>>>> + */
>>>>> + mmap_write_lock(kvm->mm);
>>>>> + rc = snp_mark_unmergable(kvm, hva, page_level_size(level));
>>>>> + mmap_write_unlock(kvm->mm);
>>>>> + if (rc)
>>>>> + return -EINVAL;
>>>>> + }
>>>>> +
>>>>> + write_lock(&kvm->mmu_lock);
>>>>> +
>>>>> + rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
>>>>> + if (!rc) {
>>>>> + /*
>>>>> + * This may happen if another vCPU unmapped the page
>>>>> + * before we acquire the lock. Retry the PSC.
>>>>> + */
>>>>> + write_unlock(&kvm->mmu_lock);
>>>>> + return 0;
>>>>> + }
>>>>
>>>> I think we want to return -EAGAIN or similar if we want the caller to
>>>> retry, right? I think returning 0 here hides the error.
>>>>
>>>
>>> The problem here is that the caller(linux guest kernel) doesn't retry
>>> if PSC fails. The current implementation in the guest kernel is that
>>> if a page state change request fails, it terminates the VM with
>>> GHCB_TERM_PSC reason.
>>> Returning 0 here is not a good option because it will fail the PSC
>>> silently and will probably cause a nested RMP fault later. Returning
>>
>> Returning 0 here is ok because the PSC current index into the PSC
>> structure will not be updated and the guest will then retry (see the loop
>> in vmgexit_psc() in arch/x86/kernel/sev.c).
>>
>> Thanks,
>> Tom
>
> But the host code updates the index. It doesn't leave the loop because
> rc is 0. The guest will think that it is successful.
> rc = __snp_handle_page_state_change(vcpu, op, gpa, level);
> if (rc)
> goto out;
>
> Also the page state change request with MSR is not retried. It
> terminates the VM if the MSR request fails.

Ah, right. I see what you mean. It should probably return a -EAGAIN
instead of 0 and then the if (rc) check should be modified to specifically
look for -EAGAIN and goto out after setting rc to 0.

But that does leave the MSR protocol open to the problem that you mention,
so, yes, retry logic in snp_handle_page_state_change() for a -EAGAIN seems
reasonable.
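
In code form, the caller-side handling described above might look roughly like this (a sketch only; the variable and label names are assumed from the quoted hunk, not taken from the actual series):

	rc = __snp_handle_page_state_change(vcpu, op, gpa, level);
	if (rc == -EAGAIN) {
		/*
		 * The mapping raced away; don't advance the PSC index so
		 * the guest's vmgexit_psc() loop retries this entry.
		 */
		rc = 0;
		goto out;
	}
	if (rc)
		goto out;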

Thanks,
Tom

>
>>
>>> an error also terminates the guest immediately with current guest
>>> implementation. I think the best approach here is adding a retry logic
>>> to this function. Retrying without returning an error should help it
>>> work because snp_check_and_build_npt will be called again and in the
>>> second attempt this should work.
>>>
>>>>> +
>>>>> + /*
>>>>> + * Adjust the level so that we don't go higher than the backing
>>>>> + * page level.
>>>>> + */
>>>>> + level = min_t(size_t, level, npt_level);
>>>>> +
>>>>> + trace_kvm_snp_psc(vcpu->vcpu_id, pfn, gpa, op, level);
>>>>> +
>>>>> + switch (op) {
>>>>> + case SNP_PAGE_STATE_SHARED:
>>>>> + rc = snp_make_page_shared(kvm, gpa, pfn, level);
>>>>> + break;
>>>>> + case SNP_PAGE_STATE_PRIVATE:
>>>>> + rc = rmp_make_private(pfn, gpa, level, sev->asid, false);
>>>>> + break;
>>>>> + default:
>>>>> + rc = -EINVAL;
>>>>> + break;
>>>>> + }
>>>>> +
>>>>> + write_unlock(&kvm->mmu_lock);
>>>>> +
>>>>> + if (rc) {
>>>>> + pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n",
>>>>> + op, gpa, pfn, level, rc);
>>>>> + return rc;
>>>>> + }
>>>>> +
>>>>> + gpa = gpa + page_level_size(level);
>>>>> + }
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +

2022-09-19 23:48:06

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 37/49] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT


On 9/19/22 22:18, Tom Lendacky wrote:
> On 9/19/22 17:02, Alper Gun wrote:
>> On Mon, Sep 19, 2022 at 2:38 PM Tom Lendacky
>> <[email protected]> wrote:
>>>
>>> On 9/19/22 12:53, Alper Gun wrote:
>>>> On Fri, Aug 19, 2022 at 9:54 AM Peter Gonda <[email protected]> wrote:
>>>>>
>>>>>> +
>>>>>> +static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu,
>>>>>> enum psc_op op, gpa_t gpa,
>>>>>> +                                         int level)
>>>>>> +{
>>>>>> +       struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info;
>>>>>> +       struct kvm *kvm = vcpu->kvm;
>>>>>> +       int rc, npt_level;
>>>>>> +       kvm_pfn_t pfn;
>>>>>> +       gpa_t gpa_end;
>>>>>> +
>>>>>> +       gpa_end = gpa + page_level_size(level);
>>>>>> +
>>>>>> +       while (gpa < gpa_end) {
>>>>>> +               /*
>>>>>> +                * If the gpa is not present in the NPT then
>>>>>> build the NPT.
>>>>>> +                */
>>>>>> +               rc = snp_check_and_build_npt(vcpu, gpa, level);
>>>>>> +               if (rc)
>>>>>> +                       return -EINVAL;
>>>>>> +
>>>>>> +               if (op == SNP_PAGE_STATE_PRIVATE) {
>>>>>> +                       hva_t hva;
>>>>>> +
>>>>>> +                       if (snp_gpa_to_hva(kvm, gpa, &hva))
>>>>>> +                               return -EINVAL;
>>>>>> +
>>>>>> +                       /*
>>>>>> +                        * Verify that the hva range is
>>>>>> registered. This enforcement is
>>>>>> +                        * required to avoid the cases where a
>>>>>> page is marked private
>>>>>> +                        * in the RMP table but never gets
>>>>>> cleanup during the VM
>>>>>> +                        * termination path.
>>>>>> +                        */
>>>>>> +                       mutex_lock(&kvm->lock);
>>>>>> +                       rc = is_hva_registered(kvm, hva,
>>>>>> page_level_size(level));
>>>>>> +                       mutex_unlock(&kvm->lock);
>>>>>> +                       if (!rc)
>>>>>> +                               return -EINVAL;
>>>>>> +
>>>>>> +                       /*
>>>>>> +                        * Mark the userspace range unmerable
>>>>>> before adding the pages
>>>>>> +                        * in the RMP table.
>>>>>> +                        */
>>>>>> +                       mmap_write_lock(kvm->mm);
>>>>>> +                       rc = snp_mark_unmergable(kvm, hva,
>>>>>> page_level_size(level));
>>>>>> +                       mmap_write_unlock(kvm->mm);
>>>>>> +                       if (rc)
>>>>>> +                               return -EINVAL;
>>>>>> +               }
>>>>>> +
>>>>>> +               write_lock(&kvm->mmu_lock);
>>>>>> +
>>>>>> +               rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn,
>>>>>> &npt_level);
>>>>>> +               if (!rc) {
>>>>>> +                       /*
>>>>>> +                        * This may happen if another vCPU
>>>>>> unmapped the page
>>>>>> +                        * before we acquire the lock. Retry the
>>>>>> PSC.
>>>>>> +                        */
>>>>>> + write_unlock(&kvm->mmu_lock);
>>>>>> +                       return 0;
>>>>>> +               }
>>>>>
>>>>> I think we want to return -EAGAIN or similar if we want the caller to
>>>>> retry, right? I think returning 0 here hides the error.
>>>>>
>>>>
>>>> The problem here is that the caller(linux guest kernel) doesn't retry
>>>> if PSC fails. The current implementation in the guest kernel is that
>>>> if a page state change request fails, it terminates the VM with
>>>> GHCB_TERM_PSC reason.
>>>> Returning 0 here is not a good option because it will fail the PSC
>>>> silently and will probably cause a nested RMP fault later. Returning
>>>
>>> Returning 0 here is ok because the PSC current index into the PSC
>>> structure will not be updated and the guest will then retry (see the
>>> loop
>>> in vmgexit_psc() in arch/x86/kernel/sev.c).
>>>
>>> Thanks,
>>> Tom
>>
>> But the host code updates the index. It doesn't leave the loop because
>> rc is 0. The guest will think that it is successful.
>> rc = __snp_handle_page_state_change(vcpu, op, gpa, level);
>> if (rc)
>> goto out;
>>
>> Also the page state change request with MSR is not retried. It
>> terminates the VM if the MSR request fails.
>
> Ah, right. I see what you mean. It should probably return a -EAGAIN
> instead of 0 and then the if (rc) check should be modified to
> specifically look for -EAGAIN and goto out after setting rc to 0.
>
> But that does leave the MSR protocol open to the problem that you
> mention, so, yes, retry logic in snp_handle_page_state_change() for a
> -EAGAIN seems reasonable.
>
> Thanks,
> Tom

I believe it makes more sense to add the retry logic within
__snp_handle_page_state_change() itself, as that will make it work for
both the GHCB MSR protocol and the GHCB VMGEXIT requests.
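
Something along these lines, perhaps (a sketch only; PSC_MAX_RETRIES, the retries variable and the exact lock/retry shuffle are assumptions, not code from the series), inside __snp_handle_page_state_change():

#define PSC_MAX_RETRIES	3	/* hypothetical bound */

	int retries;

	for (retries = 0; retries < PSC_MAX_RETRIES; retries++) {
		write_lock(&kvm->mmu_lock);
		rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
		if (rc)
			break;		/* walk succeeded, carry on under mmu_lock */

		/*
		 * Another vCPU unmapped the page before we took the lock;
		 * drop it, rebuild the NPT mapping and try again.
		 */
		write_unlock(&kvm->mmu_lock);
		if (snp_check_and_build_npt(vcpu, gpa, level))
			return -EINVAL;
	}

	if (!rc)
		return -EAGAIN;		/* still unmapped after all retries */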

Thanks, Ashish

>
>>
>>>
>>>> an error also terminates the guest immediately with current guest
>>>> implementation. I think the best approach here is adding a retry logic
>>>> to this function. Retrying without returning an error should help it
>>>> work because snp_check_and_build_npt will be called again and in the
>>>> second attempt this should work.
>>>>
>>>>>> +
>>>>>> +               /*
>>>>>> +                * Adjust the level so that we don't go higher
>>>>>> than the backing
>>>>>> +                * page level.
>>>>>> +                */
>>>>>> +               level = min_t(size_t, level, npt_level);
>>>>>> +
>>>>>> +               trace_kvm_snp_psc(vcpu->vcpu_id, pfn, gpa, op,
>>>>>> level);
>>>>>> +
>>>>>> +               switch (op) {
>>>>>> +               case SNP_PAGE_STATE_SHARED:
>>>>>> +                       rc = snp_make_page_shared(kvm, gpa, pfn,
>>>>>> level);
>>>>>> +                       break;
>>>>>> +               case SNP_PAGE_STATE_PRIVATE:
>>>>>> +                       rc = rmp_make_private(pfn, gpa, level,
>>>>>> sev->asid, false);
>>>>>> +                       break;
>>>>>> +               default:
>>>>>> +                       rc = -EINVAL;
>>>>>> +                       break;
>>>>>> +               }
>>>>>> +
>>>>>> +               write_unlock(&kvm->mmu_lock);
>>>>>> +
>>>>>> +               if (rc) {
>>>>>> +                       pr_err_ratelimited("Error op %d gpa %llx
>>>>>> pfn %llx level %d rc %d\n",
>>>>>> +                                          op, gpa, pfn, level, rc);
>>>>>> +                       return rc;
>>>>>> +               }
>>>>>> +
>>>>>> +               gpa = gpa + page_level_size(level);
>>>>>> +       }
>>>>>> +
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +

2022-09-20 13:11:07

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 11/49] crypto:ccp: Define the SEV-SNP commands

On Mon, Jun 20, 2022 at 11:04:14PM +0000, Ashish Kalra wrote:
> +/**
> + * struct sev_data_snp_platform_status_buf - SNP_PLATFORM_STATUS command params
> + *
> + * @address: physical address where the status should be copied
> + */
> +struct sev_data_snp_platform_status_buf {
> + u64 status_paddr; /* In */
> +} __packed;
> +
> +/**
> + * struct sev_data_snp_download_firmware - SNP_DOWNLOAD_FIRMWARE command params
> + *
> + * @address: physical address of firmware image
> + * @len: len of the firmware image
> + */
> +struct sev_data_snp_download_firmware {
> + u64 address; /* In */
> + u32 len; /* In */
> +} __packed;
> +
> +/**
> + * struct sev_data_snp_gctx_create - SNP_GCTX_CREATE command params
> + *
> + * @gctx_paddr: system physical address of the page donated to firmware by
> + * the hypervisor to contain the guest context.
> + */
> +struct sev_data_snp_gctx_create {
> + u64 gctx_paddr; /* In */
> +} __packed;

So some of those structs have the same layout. Let's unify them pls.
I.e., for

sev_data_send_finish, sev_data_send_cancel, sev_data_receive_finish

you do

struct sev_data_tx {
	u32 handle; /* In */
} __packed;

For sev_data_snp_platform_status_buf, sev_data_snp_gctx_create,
sev_data_snp_decommission and all those others who are a single u64, you
use a single

struct sev_data_addr {
	u64 gctx_paddr; /* In */
} __packed;

so that we don't have gazillion structs all of different names but a lot
of them identical in content.


...

> +/**
> + * struct sev_data_snp_launch_finish - SNP_LAUNCH_FINISH command params
> + *
> + * @gctx_addr: system pphysical address of guest context page
^^^^^^^^^

physical

> + */
> +struct sev_data_snp_launch_finish {
> + u64 gctx_paddr;
> + u64 id_block_paddr;
> + u64 id_auth_paddr;
> + u8 id_block_en:1;
> + u8 auth_key_en:1;
> + u64 rsvd:62;
> + u8 host_data[32];
> +} __packed;
> +
> +/**
> + * struct sev_data_snp_guest_status - SNP_GUEST_STATUS command params
> + *
> + * @gctx_paddr: system physical address of guest context page
> + * @address: system physical address of guest status page
> + */
> +struct sev_data_snp_guest_status {
> + u64 gctx_paddr;
> + u64 address;
> +} __packed;
> +
> +/**
> + * struct sev_data_snp_page_reclaim - SNP_PAGE_RECLAIM command params
> + *
> + * @paddr: system physical address of page to be claimed. The BIT0 indicate
> + * the page size. 0h indicates 4 kB and 1h indicates 2 MB page.
> + */
> +struct sev_data_snp_page_reclaim {
> + u64 paddr;
> +} __packed;
> +
> +/**
> + * struct sev_data_snp_page_unsmash - SNP_PAGE_UNMASH command params
> + *
> + * @paddr: system physical address of page to be unmashed. The BIT0 indicate

Is "BIT0" the 0th bit in the address? This needs to be spelled out
explicitly.

also, s/unmash/unsmash/gi

Also, FW SPEC says bits 11:0 are MBZ. So I'm guessing bit 0 is being
cleared before sending it to sw. I guess I'll see that later.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-09-20 13:52:53

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 11/49] crypto:ccp: Define the SEV-SNP commands

[AMD Official Use Only - General]

Hello Boris,

>> +/**
>> + * struct sev_data_snp_platform_status_buf - SNP_PLATFORM_STATUS
>> +command params
>> + *
>> + * @address: physical address where the status should be copied */
>> +struct sev_data_snp_platform_status_buf {
>> + u64 status_paddr; /* In */
>> +} __packed;
>> +
>> +/**
>> + * struct sev_data_snp_download_firmware - SNP_DOWNLOAD_FIRMWARE
>> +command params
>> + *
>> + * @address: physical address of firmware image
>> + * @len: len of the firmware image
>> + */
>> +struct sev_data_snp_download_firmware {
>> + u64 address; /* In */
>> + u32 len; /* In */
>> +} __packed;
>> +
>> +/**
>> + * struct sev_data_snp_gctx_create - SNP_GCTX_CREATE command params
>> + *
>> + * @gctx_paddr: system physical address of the page donated to firmware by
>> + * the hypervisor to contain the guest context.
>> + */
>> +struct sev_data_snp_gctx_create {
>> + u64 gctx_paddr; /* In */
>> +} __packed;

>So some of those structs have the same layout. Let's unify them pls.
>I.e., for

>sev_data_send_finish, sev_data_send_cancel, sev_data_receive_finish

>you do

>struct sev_data_tx {
> u32 handle; /* In */
>} __packed;

>For sev_data_snp_platform_status_buf, sev_data_snp_gctx_create, sev_data_snp_decommission and all those others who are a single u64, you use a single

>struct sev_data_addr {
> u64 gctx_paddr; /* In */
>} __packed;

>so that we don't have gazillion structs all of different names but a lot of them identical in content.

These are structure definitions as per SNP Firmware API specifications, and they match the SNP Firmware commands and required arguments.

As an example below:

8.12 SNP_DECOMMISSION
This command destroys a guest context. After this command successfully completes, the guest
will no longer be runnable.
8.12.1 Parameters
Table 55. Layout of the CMDBUF_SNP_DECOMMISSION Structure
GCTX_PADDR    Bits 63:12 of the sPA of the guest's context page

Isn't it better to have 1:1 mapping between specification and structure definitions here ?
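
If helpful, one possible middle ground (purely illustrative, hypothetical names) would keep the spec's per-command names visible at the call sites while sharing a single layout:

/* Shared layout for every SNP command whose buffer is a single sPA. */
struct sev_data_snp_addr {
	u64 gctx_paddr;			/* In */
} __packed;

/* Spec-visible aliases, so call sites still read like the firmware spec. */
#define sev_data_snp_gctx_create	sev_data_snp_addr
#define sev_data_snp_decommission	sev_data_snp_addr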

Thanks,
Ashish

2022-09-20 14:09:40

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 11/49] crypto:ccp: Define the SEV-SNP commands

On Tue, Sep 20, 2022 at 01:46:25PM +0000, Kalra, Ashish wrote:
> These are structure definitions as per SNP Firmware API
> specifications, and they match the SNP Firmware commands and required
> arguments.

Yes, I have the spec.

> Isn't it better to have 1:1 mapping between specification and
> structure definitions here ?

Why would it be better if you can have a single struct serving multiple
purposes and thus less code to stare at and deal with?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-09-26 16:43:09

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 37/49] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT

On Mon, Sep 19, 2022 at 5:47 PM Ashish Kalra <[email protected]> wrote:
>
>
> On 9/19/22 22:18, Tom Lendacky wrote:
> > On 9/19/22 17:02, Alper Gun wrote:
> >> On Mon, Sep 19, 2022 at 2:38 PM Tom Lendacky
> >> <[email protected]> wrote:
> >>>
> >>> On 9/19/22 12:53, Alper Gun wrote:
> >>>> On Fri, Aug 19, 2022 at 9:54 AM Peter Gonda <[email protected]> wrote:
> >>>>>
> >>>>>> +
> >>>>>> +static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu,
> >>>>>> enum psc_op op, gpa_t gpa,
> >>>>>> + int level)
> >>>>>> +{
> >>>>>> + struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info;
> >>>>>> + struct kvm *kvm = vcpu->kvm;
> >>>>>> + int rc, npt_level;
> >>>>>> + kvm_pfn_t pfn;
> >>>>>> + gpa_t gpa_end;
> >>>>>> +
> >>>>>> + gpa_end = gpa + page_level_size(level);
> >>>>>> +
> >>>>>> + while (gpa < gpa_end) {
> >>>>>> + /*
> >>>>>> + * If the gpa is not present in the NPT then
> >>>>>> build the NPT.
> >>>>>> + */
> >>>>>> + rc = snp_check_and_build_npt(vcpu, gpa, level);
> >>>>>> + if (rc)
> >>>>>> + return -EINVAL;
> >>>>>> +
> >>>>>> + if (op == SNP_PAGE_STATE_PRIVATE) {
> >>>>>> + hva_t hva;
> >>>>>> +
> >>>>>> + if (snp_gpa_to_hva(kvm, gpa, &hva))
> >>>>>> + return -EINVAL;
> >>>>>> +
> >>>>>> + /*
> >>>>>> + * Verify that the hva range is
> >>>>>> registered. This enforcement is
> >>>>>> + * required to avoid the cases where a
> >>>>>> page is marked private
> >>>>>> + * in the RMP table but never gets
> >>>>>> cleanup during the VM
> >>>>>> + * termination path.
> >>>>>> + */
> >>>>>> + mutex_lock(&kvm->lock);
> >>>>>> + rc = is_hva_registered(kvm, hva,
> >>>>>> page_level_size(level));
> >>>>>> + mutex_unlock(&kvm->lock);
> >>>>>> + if (!rc)
> >>>>>> + return -EINVAL;
> >>>>>> +
> >>>>>> + /*
> >>>>>> + * Mark the userspace range unmerable
> >>>>>> before adding the pages
> >>>>>> + * in the RMP table.
> >>>>>> + */
> >>>>>> + mmap_write_lock(kvm->mm);
> >>>>>> + rc = snp_mark_unmergable(kvm, hva,
> >>>>>> page_level_size(level));
> >>>>>> + mmap_write_unlock(kvm->mm);
> >>>>>> + if (rc)
> >>>>>> + return -EINVAL;
> >>>>>> + }
> >>>>>> +
> >>>>>> + write_lock(&kvm->mmu_lock);
> >>>>>> +
> >>>>>> + rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn,
> >>>>>> &npt_level);
> >>>>>> + if (!rc) {
> >>>>>> + /*
> >>>>>> + * This may happen if another vCPU
> >>>>>> unmapped the page
> >>>>>> + * before we acquire the lock. Retry the
> >>>>>> PSC.
> >>>>>> + */
> >>>>>> + write_unlock(&kvm->mmu_lock);
> >>>>>> + return 0;
> >>>>>> + }
> >>>>>
> >>>>> I think we want to return -EAGAIN or similar if we want the caller to
> >>>>> retry, right? I think returning 0 here hides the error.
> >>>>>
> >>>>
> >>>> The problem here is that the caller(linux guest kernel) doesn't retry
> >>>> if PSC fails. The current implementation in the guest kernel is that
> >>>> if a page state change request fails, it terminates the VM with
> >>>> GHCB_TERM_PSC reason.
> >>>> Returning 0 here is not a good option because it will fail the PSC
> >>>> silently and will probably cause a nested RMP fault later. Returning
> >>>
> >>> Returning 0 here is ok because the PSC current index into the PSC
> >>> structure will not be updated and the guest will then retry (see the
> >>> loop
> >>> in vmgexit_psc() in arch/x86/kernel/sev.c).
> >>>
> >>> Thanks,
> >>> Tom
> >>
> >> But the host code updates the index. It doesn't leave the loop because
> >> rc is 0. The guest will think that it is successful.
> >> rc = __snp_handle_page_state_change(vcpu, op, gpa, level);
> >> if (rc)
> >> goto out;
> >>
> >> Also the page state change request with MSR is not retried. It
> >> terminates the VM if the MSR request fails.
> >
> > Ah, right. I see what you mean. It should probably return a -EAGAIN
> > instead of 0 and then the if (rc) check should be modified to
> > specifically look for -EAGAIN and goto out after setting rc to 0.
> >
> > But that does leave the MSR protocol open to the problem that you
> > mention, so, yes, retry logic in snp_handle_page_state_change() for a
> > -EAGAIN seems reasonable.
> >
> > Thanks,
> > Tom
>
> I believe it makes more sense to add the retry logic within
> __snp_handle_page_state_change() itself, as that will make it work for
> both the GHCB MSR protocol and the GHCB VMGEXIT requests.

You are suggesting we just retry 'kvm_mmu_get_tdp_walk' inside of
__snp_handle_page_state_change()? That should work, but how many times
do we retry? If we return -EAGAIN or an error, we can leave that up to
the caller.

>
> Thanks, Ashish
>
> >
> >>
> >>>
> >>>> an error also terminates the guest immediately with current guest
> >>>> implementation. I think the best approach here is adding a retry logic
> >>>> to this function. Retrying without returning an error should help it
> >>>> work because snp_check_and_build_npt will be called again and in the
> >>>> second attempt this should work.
> >>>>
> >>>>>> +
> >>>>>> + /*
> >>>>>> + * Adjust the level so that we don't go higher
> >>>>>> than the backing
> >>>>>> + * page level.
> >>>>>> + */
> >>>>>> + level = min_t(size_t, level, npt_level);
> >>>>>> +
> >>>>>> + trace_kvm_snp_psc(vcpu->vcpu_id, pfn, gpa, op,
> >>>>>> level);
> >>>>>> +
> >>>>>> + switch (op) {
> >>>>>> + case SNP_PAGE_STATE_SHARED:
> >>>>>> + rc = snp_make_page_shared(kvm, gpa, pfn,
> >>>>>> level);
> >>>>>> + break;
> >>>>>> + case SNP_PAGE_STATE_PRIVATE:
> >>>>>> + rc = rmp_make_private(pfn, gpa, level,
> >>>>>> sev->asid, false);
> >>>>>> + break;
> >>>>>> + default:
> >>>>>> + rc = -EINVAL;
> >>>>>> + break;
> >>>>>> + }
> >>>>>> +
> >>>>>> + write_unlock(&kvm->mmu_lock);
> >>>>>> +
> >>>>>> + if (rc) {
> >>>>>> + pr_err_ratelimited("Error op %d gpa %llx
> >>>>>> pfn %llx level %d rc %d\n",
> >>>>>> + op, gpa, pfn, level, rc);
> >>>>>> + return rc;
> >>>>>> + }
> >>>>>> +
> >>>>>> + gpa = gpa + page_level_size(level);
> >>>>>> + }
> >>>>>> +
> >>>>>> + return 0;
> >>>>>> +}
> >>>>>> +

2022-10-01 20:21:00

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 13/49] crypto:ccp: Provide APIs to issue SEV-SNP commands

On Mon, Jun 20, 2022 at 11:04:45PM +0000, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> Provide the APIs for the hypervisor to manage an SEV-SNP guest. The
> commands for SEV-SNP is defined in the SEV-SNP firmware specification.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> drivers/crypto/ccp/sev-dev.c | 24 ++++++++++++
> include/linux/psp-sev.h | 73 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 97 insertions(+)
>
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index f1173221d0b9..35d76333e120 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -1205,6 +1205,30 @@ int sev_guest_df_flush(int *error)
> }
> EXPORT_SYMBOL_GPL(sev_guest_df_flush);
>
> +int snp_guest_decommission(struct sev_data_snp_decommission *data, int *error)
> +{
> + return sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, data, error);
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_decommission);
> +
> +int snp_guest_df_flush(int *error)
> +{
> + return sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, error);
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_df_flush);
> +
> +int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error)
> +{
> + return sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, data, error);
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_page_reclaim);
> +
> +int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
> +{
> + return sev_do_cmd(SEV_CMD_SNP_DBG_DECRYPT, data, error);
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt);

So this mindless repetition is getting annoying. I see ~70 SEV commands.
Adding ~70 functions which parrot all the same call to sev_do_cmd() is
just insane.

I think you should simply export sev_do_cmd() and call it instead.
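
IOW, a call site would then simply do something like this directly
(assuming sev_do_cmd() gets exported and a declaration in psp-sev.h):

	/* e.g. instead of snp_guest_decommission(&data, &error): */
	ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, &error);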

Yes, when it turns out that a command and the preparation needed to
issue it start repeating pretty often, you could add a helper. But
adding those silly wrappers doesn't bring anything besides confusion and
code bloat.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-10-03 14:46:34

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 13/49] crypto:ccp: Provide APIs to issue SEV-SNP commands

[AMD Official Use Only - General]

Hello Boris,

>> +int snp_guest_decommission(struct sev_data_snp_decommission *data, int *error)
>> +{
>> +	return sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, data, error);
>> +}
>> +EXPORT_SYMBOL_GPL(snp_guest_decommission);
>> +
>> +int snp_guest_df_flush(int *error)
>> +{
>> +	return sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, error);
>> +}
>> +EXPORT_SYMBOL_GPL(snp_guest_df_flush);
>> +
>> +int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error)
>> +{
>> +	return sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, data, error);
>> +}
>> +EXPORT_SYMBOL_GPL(snp_guest_page_reclaim);
>> +
>> +int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
>> +{
>> +	return sev_do_cmd(SEV_CMD_SNP_DBG_DECRYPT, data, error);
>> +}
>> +EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt);

>So this mindless repetition is getting annoying. I see ~70 SEV commands.
>Adding ~70 functions which parrot all the same call to sev_do_cmd() is just insane.

>I think you should simply export sev_do_cmd() and call it instead.

>Yes, when it turns out that a command and the preparation to issue it before it starts repeating pretty often, you could do a helper. But adding those silly wrappers doesn't bring anything besides confusion and code bloat.

There are actually only 8 functions in total which simply call sev_do_cmd(); all the other functions calling sev_do_cmd() have a lot of other specific functionality in them.

Those are the earlier ones - sev_platform_status(), sev_guest_deactivate(), sev_guest_decommission() and sev_guest_df_flush().

And the 4 functions added in this patch - snp_guest_decommission(), snp_guest_df_flush(), snp_guest_page_reclaim(), and snp_guest_dbg_decrypt().

Thanks,
Ashish

2022-10-03 16:21:47

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 13/49] crypto:ccp: Provide APIs to issue SEV-SNP commands

On Mon, Oct 03, 2022 at 02:38:41PM +0000, Kalra, Ashish wrote:
> There are actually only 8 functions

Only 8?

Lemme ask it differently then: what is the point of the wrappers at all?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-10-03 17:16:51

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 13/49] crypto:ccp: Provide APIs to issue SEV-SNP commands

[AMD Official Use Only - General]

>> There are actually only 8 functions

>Only 8?

>Lemme ask it differently then: what is the point of the wrappers at all?

They are basically providing the APIs for the hypervisor to manage a SNP guest.

And this is the original commit for the API to manage the SEV guest:

commit 200664d5237f3f8cd2a2f9f5c5dea08502336bd1
Author: Brijesh Singh <[email protected]>
Date: Mon Dec 4 10:57:28 2017 -0600

crypto: ccp: Add Secure Encrypted Virtualization (SEV) command support

AMD's new Secure Encrypted Virtualization (SEV) feature allows the
memory contents of virtual machines to be transparently encrypted with a
key unique to the VM. The programming and management of the encryption
keys are handled by the AMD Secure Processor (AMD-SP) which exposes the
commands for these tasks. The complete spec is available at:

http://support.amd.com/TechDocs/55766_SEV-KM%20API_Specification.pdf

Extend the AMD-SP driver to provide the following support:

- an in-kernel API to communicate with the SEV firmware. The API can be
used by the hypervisor to create encryption context for a SEV guest.

- a userspace IOCTL to manage the platform certificates.

Cc: Paolo Bonzini <[email protected]>
Cc: "Radim Kr?m??" <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Gary Hook <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Improvements-by: Borislav Petkov <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>


Thanks,
Ashish

2022-10-03 17:50:00

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 13/49] crypto:ccp: Provide APIs to issue SEV-SNP commands

On Mon, Oct 03, 2022 at 05:11:05PM +0000, Kalra, Ashish wrote:
> They are basically providing the APIs for the hypervisor to manage a
> SNP guest.

Yes, I know. But that is not my question. Lemme try again.

My previous comment was:

"I think you should simply export sev_do_cmd() and call it instead."

In this case, the API is a single function - sev_do_cmd() - which the
hypervisor calls.

So my question still stands: why is it better to have silly wrappers
of sev_do_cmd() instead of having the hypervisor call sev_do_cmd()
directly?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-10-03 18:04:07

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 13/49] crypto:ccp: Provide APIs to issue SEV-SNP commands

On Mon, Oct 3, 2022 at 11:45 AM Borislav Petkov <[email protected]> wrote:
>
> On Mon, Oct 03, 2022 at 05:11:05PM +0000, Kalra, Ashish wrote:
> > They are basically providing the APIs for the hypervisor to manage a
> > SNP guest.
>
> Yes, I know. But that is not my question. Lemme try again.
>
> My previous comment was:
>
> "I think you should simply export sev_do_cmd() and call it instead."
>
> In this case, the API is a single function - sev_do_cmd() - which the
> hypervisor calls.
>
> So my question still stands: why is it better to have silly wrappers
> of sev_do_cmd() instead of having the hypervisor call sev_do_cmd()
> directly?

We already have sev_issue_cmd_external_user() exported right?

Another option could be to make these wrappers more helpful and less
silly. For example callers need to know the PSP command format right
now, see sev_guest_decommission().

int sev_guest_decommission(struct sev_data_decommission *data, int *error)

Instead of taking @data this function could just take inputs to create
sev_data_decommission:

int sev_guest_decommission(u32 handle, int *error)
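
As a rough sketch of that shape (assuming struct sev_data_decommission
only carries the guest handle, as in the existing SEV support):

int sev_guest_decommission(u32 handle, int *error)
{
	struct sev_data_decommission data = {};

	data.handle = handle;

	return sev_do_cmd(SEV_CMD_DECOMMISSION, &data, error);
}
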
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

2022-10-03 18:23:55

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 13/49] crypto:ccp: Provide APIs to issue SEV-SNP commands

On Mon, Oct 03, 2022 at 12:01:19PM -0600, Peter Gonda wrote:
> We already have sev_issue_cmd_external_user() exported right?
>
> Another option could be to make these wrappers more helpful and less
> silly.

For example.

My point is, whenever something needs to issue a SEV* fw command,
something adds a silly wrapper and this will become unwieldy pretty
quickly.

If a wrapper is not a dumb one but it actually does preparatory work
before issuing the command so that the caller's life is made easy, then
yes, by all means.

If it is only plain forwarding a call to sev_do_cmd(), then I question
the whole point of the exercise...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-10-03 18:52:12

by Ashish Kalra

[permalink] [raw]
Subject: RE: [PATCH Part2 v6 13/49] crypto:ccp: Provide APIs to issue SEV-SNP commands

[AMD Official Use Only - General]

>> We already have sev_issue_cmd_external_user() exported right?
>>
>> Another option could be to make these wrappers more helpful and less
>> silly.

>For example.

>My point is, whenever something needs to issue a SEV* fw command, something adds a silly wrapper and this will become unwieldy pretty quickly.

>If a wrapper is not a dumb one but it actually does preparatory work before issuing the command so that the caller's life is made easy, then yes, by all means.

>If it is only plain forwarding a call to sev_do_cmd(), then I question the whole point of the exercise...

Well, these all were added as APIs to serve as an abstraction for SEV/SNP guests, and probably it is nice to have an abstracted interface, but I have no issues
with replacing these simply with calls to sev_do_cmd().

Thanks,
Ashish

2022-10-03 19:03:33

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 13/49] crypto:ccp: Provide APIs to issue SEV-SNP commands

On Mon, Oct 03, 2022 at 06:43:08PM +0000, Kalra, Ashish wrote:
> probably it is nice to have an abstracted interface,

Why is it "probably nice" to have an abstracted interface?

Is the hypervisor allowed to issue only a subset of the commands?

Do you want to control the arguments the hypervisor is supposed to send
down to the firmware?

There must be a reason why one would do an abstracted interface. Not
just because and probably.

Because from where I'm standing this looks like adding a bunch of random
wrappers without any logic to it.

So, if you wanna have an interface, you should think this through and
design it properly and explain why it is there and how it is supposed to
be used.

Don't get me wrong - a properly designed interface to control what the
HV issues to the firmware is not a bad idea. But it needs to be properly
designed.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-10-10 22:06:35

by Alper Gun

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 41/49] KVM: SVM: Add support to handle the RMP nested page fault

On Mon, Jun 20, 2022 at 4:13 PM Ashish Kalra <[email protected]> wrote:
>
> From: Brijesh Singh <[email protected]>
>
> When SEV-SNP is enabled in the guest, the hardware places restrictions on
> all memory accesses based on the contents of the RMP table. When hardware
> encounters RMP check failure caused by the guest memory access it raises
> the #NPF. The error code contains additional information on the access
> type. See the APM volume 2 for additional information.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 76 ++++++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/svm/svm.c | 14 +++++---
> 2 files changed, 86 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 4ed90331bca0..7fc0fad87054 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -4009,3 +4009,79 @@ void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
>
> spin_unlock(&sev->psc_lock);
> }
> +
> +void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
> +{
> + int rmp_level, npt_level, rc, assigned;
> + struct kvm *kvm = vcpu->kvm;
> + gfn_t gfn = gpa_to_gfn(gpa);
> + bool need_psc = false;
> + enum psc_op psc_op;
> + kvm_pfn_t pfn;
> + bool private;
> +
> + write_lock(&kvm->mmu_lock);
> +
> + if (unlikely(!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level)))
> + goto unlock;
> +
> + assigned = snp_lookup_rmpentry(pfn, &rmp_level);
> + if (unlikely(assigned < 0))
> + goto unlock;
> +
> + private = !!(error_code & PFERR_GUEST_ENC_MASK);
> +
> + /*
> + * If the fault was due to size mismatch, or NPT and RMP page level's
> + * are not in sync, then use PSMASH to split the RMP entry into 4K.
> + */
> + if ((error_code & PFERR_GUEST_SIZEM_MASK) ||
> + (npt_level == PG_LEVEL_4K && rmp_level == PG_LEVEL_2M && private)) {
> + rc = snp_rmptable_psmash(kvm, pfn);


Regarding this case:
RMP level is 4K
Page table level is 2M

Does this also cause a page fault with size mismatch? If so, we
shouldn't try psmash because the rmp entry is already 4K.

I see these errors in our tests and I think it may be happening
because rmp size is already 4K.

[ 1848.752952] psmash failed, gpa 0x191560000 pfn 0x536cd60 rc 7
[ 2922.879635] psmash failed, gpa 0x102830000 pfn 0x37c8230 rc 7
[ 3010.983090] psmash failed, gpa 0x104220000 pfn 0x6cf1e20 rc 7
[ 3170.792050] psmash failed, gpa 0x108a80000 pfn 0x20e0080 rc 7
[ 3345.955147] psmash failed, gpa 0x11b480000 pfn 0x1545e480 rc 7

Shouldn't we use AND instead of OR in the if statement?

if ((error_code & PFERR_GUEST_SIZEM_MASK) && ...

> + if (rc)
> + pr_err_ratelimited("psmash failed, gpa 0x%llx pfn 0x%llx rc %d\n",
> + gpa, pfn, rc);
> + goto out;
> + }
> +
> + /*
> + * If it's a private access, and the page is not assigned in the
> + * RMP table, create a new private RMP entry. This can happen if
> + * guest did not use the PSC VMGEXIT to transition the page state
> + * before the access.
> + */
> + if (!assigned && private) {
> + need_psc = 1;
> + psc_op = SNP_PAGE_STATE_PRIVATE;
> + goto out;
> + }
> +
> + /*
> + * If it's a shared access, but the page is private in the RMP table
> + * then make the page shared in the RMP table. This can happen if
> + * the guest did not use the PSC VMGEXIT to transition the page
> + * state before the access.
> + */
> + if (assigned && !private) {
> + need_psc = 1;
> + psc_op = SNP_PAGE_STATE_SHARED;
> + }
> +
> +out:
> + write_unlock(&kvm->mmu_lock);
> +
> + if (need_psc)
> + rc = __snp_handle_page_state_change(vcpu, psc_op, gpa, PG_LEVEL_4K);
> +
> + /*
> + * The fault handler has updated the RMP pagesize, zap the existing
> + * rmaps for large entry ranges so that nested page table gets rebuilt
> + * with the updated RMP pagesize.
> + */
> + gfn = gpa_to_gfn(gpa) & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
> + kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
> + return;
> +
> +unlock:
> + write_unlock(&kvm->mmu_lock);
> +}
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 1c8e035ba011..7742bc986afc 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1866,15 +1866,21 @@ static int pf_interception(struct kvm_vcpu *vcpu)
> static int npf_interception(struct kvm_vcpu *vcpu)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
> + int rc;
>
> u64 fault_address = svm->vmcb->control.exit_info_2;
> u64 error_code = svm->vmcb->control.exit_info_1;
>
> trace_kvm_page_fault(fault_address, error_code);
> - return kvm_mmu_page_fault(vcpu, fault_address, error_code,
> - static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
> - svm->vmcb->control.insn_bytes : NULL,
> - svm->vmcb->control.insn_len);
> + rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
> + static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
> + svm->vmcb->control.insn_bytes : NULL,
> + svm->vmcb->control.insn_len);
> +
> + if (error_code & PFERR_GUEST_RMP_MASK)
> + handle_rmp_page_fault(vcpu, fault_address, error_code);
> +
> + return rc;
> }
>
> static int db_interception(struct kvm_vcpu *vcpu)
> --
> 2.25.1
>
>

2022-10-12 20:25:08

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 37/49] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT

On 9/26/2022 10:19 AM, Peter Gonda wrote:
> On Mon, Sep 19, 2022 at 5:47 PM Ashish Kalra <[email protected]> wrote:
>>
>>
>> On 9/19/22 22:18, Tom Lendacky wrote:
>>> On 9/19/22 17:02, Alper Gun wrote:
>>>> On Mon, Sep 19, 2022 at 2:38 PM Tom Lendacky
>>>> <[email protected]> wrote:
>>>>>
>>>>> On 9/19/22 12:53, Alper Gun wrote:
>>>>>> On Fri, Aug 19, 2022 at 9:54 AM Peter Gonda <[email protected]> wrote:
>>>>>>>
>>>>>>>> +
>>>>>>>> +static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu,
>>>>>>>> enum psc_op op, gpa_t gpa,
>>>>>>>> + int level)
>>>>>>>> +{
>>>>>>>> + struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info;
>>>>>>>> + struct kvm *kvm = vcpu->kvm;
>>>>>>>> + int rc, npt_level;
>>>>>>>> + kvm_pfn_t pfn;
>>>>>>>> + gpa_t gpa_end;
>>>>>>>> +
>>>>>>>> + gpa_end = gpa + page_level_size(level);
>>>>>>>> +
>>>>>>>> + while (gpa < gpa_end) {
>>>>>>>> + /*
>>>>>>>> + * If the gpa is not present in the NPT then
>>>>>>>> build the NPT.
>>>>>>>> + */
>>>>>>>> + rc = snp_check_and_build_npt(vcpu, gpa, level);
>>>>>>>> + if (rc)
>>>>>>>> + return -EINVAL;
>>>>>>>> +
>>>>>>>> + if (op == SNP_PAGE_STATE_PRIVATE) {
>>>>>>>> + hva_t hva;
>>>>>>>> +
>>>>>>>> + if (snp_gpa_to_hva(kvm, gpa, &hva))
>>>>>>>> + return -EINVAL;
>>>>>>>> +
>>>>>>>> + /*
>>>>>>>> + * Verify that the hva range is
>>>>>>>> registered. This enforcement is
>>>>>>>> + * required to avoid the cases where a
>>>>>>>> page is marked private
>>>>>>>> + * in the RMP table but never gets
>>>>>>>> cleanup during the VM
>>>>>>>> + * termination path.
>>>>>>>> + */
>>>>>>>> + mutex_lock(&kvm->lock);
>>>>>>>> + rc = is_hva_registered(kvm, hva,
>>>>>>>> page_level_size(level));
>>>>>>>> + mutex_unlock(&kvm->lock);
>>>>>>>> + if (!rc)
>>>>>>>> + return -EINVAL;
>>>>>>>> +
>>>>>>>> + /*
>>>>>>>> + * Mark the userspace range unmerable
>>>>>>>> before adding the pages
>>>>>>>> + * in the RMP table.
>>>>>>>> + */
>>>>>>>> + mmap_write_lock(kvm->mm);
>>>>>>>> + rc = snp_mark_unmergable(kvm, hva,
>>>>>>>> page_level_size(level));
>>>>>>>> + mmap_write_unlock(kvm->mm);
>>>>>>>> + if (rc)
>>>>>>>> + return -EINVAL;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + write_lock(&kvm->mmu_lock);
>>>>>>>> +
>>>>>>>> + rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn,
>>>>>>>> &npt_level);
>>>>>>>> + if (!rc) {
>>>>>>>> + /*
>>>>>>>> + * This may happen if another vCPU
>>>>>>>> unmapped the page
>>>>>>>> + * before we acquire the lock. Retry the
>>>>>>>> PSC.
>>>>>>>> + */
>>>>>>>> + write_unlock(&kvm->mmu_lock);
>>>>>>>> + return 0;
>>>>>>>> + }
>>>>>>>
>>>>>>> I think we want to return -EAGAIN or similar if we want the caller to
>>>>>>> retry, right? I think returning 0 here hides the error.
>>>>>>>
>>>>>>
>>>>>> The problem here is that the caller(linux guest kernel) doesn't retry
>>>>>> if PSC fails. The current implementation in the guest kernel is that
>>>>>> if a page state change request fails, it terminates the VM with
>>>>>> GHCB_TERM_PSC reason.
>>>>>> Returning 0 here is not a good option because it will fail the PSC
>>>>>> silently and will probably cause a nested RMP fault later. Returning
>>>>>
>>>>> Returning 0 here is ok because the PSC current index into the PSC
>>>>> structure will not be updated and the guest will then retry (see the
>>>>> loop
>>>>> in vmgexit_psc() in arch/x86/kernel/sev.c).
>>>>>
>>>>> Thanks,
>>>>> Tom
>>>>
>>>> But the host code updates the index. It doesn't leave the loop because
>>>> rc is 0. The guest will think that it is successful.
>>>> rc = __snp_handle_page_state_change(vcpu, op, gpa, level);
>>>> if (rc)
>>>> goto out;
>>>>
>>>> Also the page state change request with MSR is not retried. It
>>>> terminates the VM if the MSR request fails.
>>>
>>> Ah, right. I see what you mean. It should probably return a -EAGAIN
>>> instead of 0 and then the if (rc) check should be modified to
>>> specifically look for -EAGAIN and goto out after setting rc to 0.
>>>
>>> But that does leave the MSR protocol open to the problem that you
>>> mention, so, yes, retry logic in snp_handle_page_state_change() for a
>>> -EAGAIN seems reasonable.
>>>
>>> Thanks,
>>> Tom
>>
>> I believe it makes more sense to add the retry logic within
>> __snp_handle_page_state_change() itself, as that will make it work for
>> both the GHCB MSR protocol and the GHCB VMGEXIT requests.
>
> You are suggesting we just retry 'kvm_mmu_get_tdp_walk' inside of
> __snp_handle_page_state_change()? That should work but how many times
> do we retry? If we return EAGAIN or error we can leave it up to the
> caller
>

Ok, we return -EAGAIN here and then let the caller in
snp_handle_page_state_change() or sev_handle_vmgexit_msr_protocol()
(in case of GHCB MSR protocol) do the retries.
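
Something like this, just to illustrate the shape of it (the bound and
the define below are arbitrary, not from the posted patches):

#define SNP_PSC_MAX_RETRIES	3	/* purely illustrative */

	for (retries = 0; retries < SNP_PSC_MAX_RETRIES; retries++) {
		rc = __snp_handle_page_state_change(vcpu, op, gpa, level);
		if (rc != -EAGAIN)
			break;
	}
	if (rc)
		goto out;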

But the question still remains: how many retry attempts should the
caller make?

Thanks,
Ashish

>>>
>>>>
>>>>>
>>>>>> an error also terminates the guest immediately with current guest
>>>>>> implementation. I think the best approach here is adding a retry logic
>>>>>> to this function. Retrying without returning an error should help it
>>>>>> work because snp_check_and_build_npt will be called again and in the
>>>>>> second attempt this should work.
>>>>>>
>>>>>>>> +
>>>>>>>> + /*
>>>>>>>> + * Adjust the level so that we don't go higher
>>>>>>>> than the backing
>>>>>>>> + * page level.
>>>>>>>> + */
>>>>>>>> + level = min_t(size_t, level, npt_level);
>>>>>>>> +
>>>>>>>> + trace_kvm_snp_psc(vcpu->vcpu_id, pfn, gpa, op,
>>>>>>>> level);
>>>>>>>> +
>>>>>>>> + switch (op) {
>>>>>>>> + case SNP_PAGE_STATE_SHARED:
>>>>>>>> + rc = snp_make_page_shared(kvm, gpa, pfn,
>>>>>>>> level);
>>>>>>>> + break;
>>>>>>>> + case SNP_PAGE_STATE_PRIVATE:
>>>>>>>> + rc = rmp_make_private(pfn, gpa, level,
>>>>>>>> sev->asid, false);
>>>>>>>> + break;
>>>>>>>> + default:
>>>>>>>> + rc = -EINVAL;
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + write_unlock(&kvm->mmu_lock);
>>>>>>>> +
>>>>>>>> + if (rc) {
>>>>>>>> + pr_err_ratelimited("Error op %d gpa %llx
>>>>>>>> pfn %llx level %d rc %d\n",
>>>>>>>> + op, gpa, pfn, level, rc);
>>>>>>>> + return rc;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + gpa = gpa + page_level_size(level);
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + return 0;
>>>>>>>> +}
>>>>>>>> +

2022-10-12 22:58:27

by Alper Gun

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 41/49] KVM: SVM: Add support to handle the RMP nested page fault

On Mon, Oct 10, 2022 at 7:32 PM Kalra, Ashish <[email protected]> wrote:
>
> Hello Alper,
>
> On 10/10/2022 5:03 PM, Alper Gun wrote:
> > On Mon, Jun 20, 2022 at 4:13 PM Ashish Kalra <[email protected]> wrote:
> >>
> >> From: Brijesh Singh <[email protected]>
> >>
> >> When SEV-SNP is enabled in the guest, the hardware places restrictions on
> >> all memory accesses based on the contents of the RMP table. When hardware
> >> encounters RMP check failure caused by the guest memory access it raises
> >> the #NPF. The error code contains additional information on the access
> >> type. See the APM volume 2 for additional information.
> >>
> >> Signed-off-by: Brijesh Singh <[email protected]>
> >> ---
> >> arch/x86/kvm/svm/sev.c | 76 ++++++++++++++++++++++++++++++++++++++++++
> >> arch/x86/kvm/svm/svm.c | 14 +++++---
> >> 2 files changed, 86 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> >> index 4ed90331bca0..7fc0fad87054 100644
> >> --- a/arch/x86/kvm/svm/sev.c
> >> +++ b/arch/x86/kvm/svm/sev.c
> >> @@ -4009,3 +4009,79 @@ void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> >>
> >> spin_unlock(&sev->psc_lock);
> >> }
> >> +
> >> +void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
> >> +{
> >> + int rmp_level, npt_level, rc, assigned;
> >> + struct kvm *kvm = vcpu->kvm;
> >> + gfn_t gfn = gpa_to_gfn(gpa);
> >> + bool need_psc = false;
> >> + enum psc_op psc_op;
> >> + kvm_pfn_t pfn;
> >> + bool private;
> >> +
> >> + write_lock(&kvm->mmu_lock);
> >> +
> >> + if (unlikely(!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level)))
> >> + goto unlock;
> >> +
> >> + assigned = snp_lookup_rmpentry(pfn, &rmp_level);
> >> + if (unlikely(assigned < 0))
> >> + goto unlock;
> >> +
> >> + private = !!(error_code & PFERR_GUEST_ENC_MASK);
> >> +
> >> + /*
> >> + * If the fault was due to size mismatch, or NPT and RMP page level's
> >> + * are not in sync, then use PSMASH to split the RMP entry into 4K.
> >> + */
> >> + if ((error_code & PFERR_GUEST_SIZEM_MASK) ||
> >> + (npt_level == PG_LEVEL_4K && rmp_level == PG_LEVEL_2M && private)) {
> >> + rc = snp_rmptable_psmash(kvm, pfn);
> >
> >
> > Regarding this case:
> > RMP level is 4K
> > Page table level is 2M
> >
> > Does this also cause a page fault with size mismatch? If so, we
> > shouldn't try psmash because the rmp entry is already 4K.
> >
> > I see these errors in our tests and I think it may be happening
> > because rmp size is already 4K.
> >
> > [ 1848.752952] psmash failed, gpa 0x191560000 pfn 0x536cd60 rc 7
> > [ 2922.879635] psmash failed, gpa 0x102830000 pfn 0x37c8230 rc 7
> > [ 3010.983090] psmash failed, gpa 0x104220000 pfn 0x6cf1e20 rc 7
> > [ 3170.792050] psmash failed, gpa 0x108a80000 pfn 0x20e0080 rc 7
> > [ 3345.955147] psmash failed, gpa 0x11b480000 pfn 0x1545e480 rc 7
> >
> > Shouldn't we use AND instead of OR in the if statement?
> >
>
> I believe we can't do this, looking at the typical usage case below:
>
> [ 37.243969] #VMEXIT (NPF) - SIZEM, err 0xc80000005 npt_level 2,
> rmp_level 2, private 1
> [ 37.243973] trying psmash gpa 0x7f790000 pfn 0x1f5d90
>
> This is typically the case with #VMEXIT(NPF) with SIZEM error code, when
> the guest tries to do PVALIDATE on 4K GHCB pages, in this case both the
> RMP table and NPT will be optimally setup to 2M hugepage as can be seen.
>
> Is it possible to investigate in more depth when this case is being
> observed:

Yes, I added more logs and I can see that these errors happen when the
RMP level is 4K and the NPT level is 2M.
psmash fails as expected. I think it is just a log and there is no real
issue, but the best thing is to not try psmash if the RMP level is
already 4K.
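
For illustration, the shape of that check could be something like this
(sketch only, not a tested change):

	/* Only attempt to split when the RMP entry is actually 2M */
	if (rmp_level == PG_LEVEL_2M &&
	    ((error_code & PFERR_GUEST_SIZEM_MASK) ||
	     (npt_level == PG_LEVEL_4K && private))) {
		rc = snp_rmptable_psmash(kvm, pfn);
		if (rc)
			pr_err_ratelimited("psmash failed, gpa 0x%llx pfn 0x%llx rc %d\n",
					   gpa, pfn, rc);
		goto out;
	}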

> RMP level is 4K
> Page table level is 2M
> We shouldn't try psmash because the rmp entry is already 4K.
>
> Thanks,
> Ashish
>
> > if ((error_code & PFERR_GUEST_SIZEM_MASK) && ...
> >

2022-10-12 23:00:46

by Michael Roth

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 37/49] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT

On Wed, Oct 12, 2022 at 03:15:15PM -0500, Kalra, Ashish wrote:
> On 9/26/2022 10:19 AM, Peter Gonda wrote:
> > On Mon, Sep 19, 2022 at 5:47 PM Ashish Kalra <[email protected]> wrote:
> > >
> > >
> > > On 9/19/22 22:18, Tom Lendacky wrote:
> > > > On 9/19/22 17:02, Alper Gun wrote:
> > > > > On Mon, Sep 19, 2022 at 2:38 PM Tom Lendacky
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On 9/19/22 12:53, Alper Gun wrote:
> > > > > > > On Fri, Aug 19, 2022 at 9:54 AM Peter Gonda <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > +static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu,
> > > > > > > > > enum psc_op op, gpa_t gpa,
> > > > > > > > > + int level)
> > > > > > > > > +{
> > > > > > > > > + struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info;
> > > > > > > > > + struct kvm *kvm = vcpu->kvm;
> > > > > > > > > + int rc, npt_level;
> > > > > > > > > + kvm_pfn_t pfn;
> > > > > > > > > + gpa_t gpa_end;
> > > > > > > > > +
> > > > > > > > > + gpa_end = gpa + page_level_size(level);
> > > > > > > > > +
> > > > > > > > > + while (gpa < gpa_end) {
> > > > > > > > > + /*
> > > > > > > > > + * If the gpa is not present in the NPT then
> > > > > > > > > build the NPT.
> > > > > > > > > + */
> > > > > > > > > + rc = snp_check_and_build_npt(vcpu, gpa, level);
> > > > > > > > > + if (rc)
> > > > > > > > > + return -EINVAL;
> > > > > > > > > +
> > > > > > > > > + if (op == SNP_PAGE_STATE_PRIVATE) {
> > > > > > > > > + hva_t hva;
> > > > > > > > > +
> > > > > > > > > + if (snp_gpa_to_hva(kvm, gpa, &hva))
> > > > > > > > > + return -EINVAL;
> > > > > > > > > +
> > > > > > > > > + /*
> > > > > > > > > + * Verify that the hva range is
> > > > > > > > > registered. This enforcement is
> > > > > > > > > + * required to avoid the cases where a
> > > > > > > > > page is marked private
> > > > > > > > > + * in the RMP table but never gets
> > > > > > > > > cleanup during the VM
> > > > > > > > > + * termination path.
> > > > > > > > > + */
> > > > > > > > > + mutex_lock(&kvm->lock);
> > > > > > > > > + rc = is_hva_registered(kvm, hva,
> > > > > > > > > page_level_size(level));
> > > > > > > > > + mutex_unlock(&kvm->lock);
> > > > > > > > > + if (!rc)
> > > > > > > > > + return -EINVAL;
> > > > > > > > > +
> > > > > > > > > + /*
> > > > > > > > > + * Mark the userspace range unmerable
> > > > > > > > > before adding the pages
> > > > > > > > > + * in the RMP table.
> > > > > > > > > + */
> > > > > > > > > + mmap_write_lock(kvm->mm);
> > > > > > > > > + rc = snp_mark_unmergable(kvm, hva,
> > > > > > > > > page_level_size(level));
> > > > > > > > > + mmap_write_unlock(kvm->mm);
> > > > > > > > > + if (rc)
> > > > > > > > > + return -EINVAL;
> > > > > > > > > + }
> > > > > > > > > +
> > > > > > > > > + write_lock(&kvm->mmu_lock);
> > > > > > > > > +
> > > > > > > > > + rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn,
> > > > > > > > > &npt_level);
> > > > > > > > > + if (!rc) {
> > > > > > > > > + /*
> > > > > > > > > + * This may happen if another vCPU
> > > > > > > > > unmapped the page
> > > > > > > > > + * before we acquire the lock. Retry the
> > > > > > > > > PSC.
> > > > > > > > > + */
> > > > > > > > > + write_unlock(&kvm->mmu_lock);
> > > > > > > > > + return 0;
> > > > > > > > > + }
> > > > > > > >
> > > > > > > > I think we want to return -EAGAIN or similar if we want the caller to
> > > > > > > > retry, right? I think returning 0 here hides the error.
> > > > > > > >
> > > > > > >
> > > > > > > The problem here is that the caller(linux guest kernel) doesn't retry
> > > > > > > if PSC fails. The current implementation in the guest kernel is that
> > > > > > > if a page state change request fails, it terminates the VM with
> > > > > > > GHCB_TERM_PSC reason.
> > > > > > > Returning 0 here is not a good option because it will fail the PSC
> > > > > > > silently and will probably cause a nested RMP fault later. Returning
> > > > > >
> > > > > > Returning 0 here is ok because the PSC current index into the PSC
> > > > > > structure will not be updated and the guest will then retry (see the
> > > > > > loop
> > > > > > in vmgexit_psc() in arch/x86/kernel/sev.c).
> > > > > >
> > > > > > Thanks,
> > > > > > Tom
> > > > >
> > > > > But the host code updates the index. It doesn't leave the loop because
> > > > > rc is 0. The guest will think that it is successful.
> > > > > rc = __snp_handle_page_state_change(vcpu, op, gpa, level);
> > > > > if (rc)
> > > > > goto out;
> > > > >
> > > > > Also the page state change request with MSR is not retried. It
> > > > > terminates the VM if the MSR request fails.
> > > >
> > > > Ah, right. I see what you mean. It should probably return a -EAGAIN
> > > > instead of 0 and then the if (rc) check should be modified to
> > > > specifically look for -EAGAIN and goto out after setting rc to 0.
> > > >
> > > > But that does leave the MSR protocol open to the problem that you
> > > > mention, so, yes, retry logic in snp_handle_page_state_change() for a
> > > > -EAGAIN seems reasonable.
> > > >
> > > > Thanks,
> > > > Tom
> > >
> > > I believe it makes more sense to add the retry logic within
> > > __snp_handle_page_state_change() itself, as that will make it work for
> > > both the GHCB MSR protocol and the GHCB VMGEXIT requests.
> >
> > You are suggesting we just retry 'kvm_mmu_get_tdp_walk' inside of
> > __snp_handle_page_state_change()? That should work but how many times
> > do we retry? If we return EAGAIN or error we can leave it up to the
> > caller
> >
>
> Ok, we return -EAGAIN here and then let the caller in
> snp_handle_page_state_change() or sev_handle_vmgexit_msr_protocol()
> (in case of GHCB MSR protocol) do the retries.
>
> But, the question still remains, how may retry attempts should the caller
> attempt ?

With UPM I don't think we need to deal with this particular case, since we
don't need to walk the NPT to determine the PFN. The PSC will simply get
forward to userspace, and userspace will (generally):

for shared->private:
- deallocate page in shared pool
- allocate page in private pool
- issue KVM_MEM_ENCRYPT_REG_REGION on the GFN to switch it to private
in the KVM xarray and RMP table (and zap current NPT mapping)
- resume guest
- guest faults on GFN and KVM MMU sees that it is private and maps the GFN
to the corresponding PFN in the private pool, which should be reliably
obtainable since it is pinned

for private->shared:
- deallocate page in private pool (which will switch it to shared in
RMP table so it can be safely released back to host)
- allocate page in shared pool
- issue KVM_MEM_ENCRYPT_UNREG_REGION on the GFN to switch it to
shared in the KVM xarray (and zap current NPT mapping)
- resume guest
- guest faults on GFN and KVM MMU sees that it is shared and handles it
just like it would a normal non-SEV guest, so we don't ever need to
acquire the specific PFN backing the HVA since they are implicitly
shared, so no need anymore for kvm_mmu_get_tdp_walk() helper

(also no need for pre-mapping into TDP via kvm_mmu_map_tdp_page() in
this case, but not sure that was needed even without UPM, seems more like
an optimization to avoid a 2nd #NPF. I guess we still have the option with
UPM though if it seems justified, but it would likely happen in
{REG,UNREG}_REGION in that case rather than SNP-specific hooks)

There may be some other edge cases to consider, but I'm not aware of any
sequences that aren't clearly misbehavior on the part of userspace/guest,
in which case terminating at the host/guest level seems reasonable.

-Mike

>
> Thanks,
> Ashish
>
> > > >
> > > > >
> > > > > >
> > > > > > > an error also terminates the guest immediately with current guest
> > > > > > > implementation. I think the best approach here is adding a retry logic
> > > > > > > to this function. Retrying without returning an error should help it
> > > > > > > work because snp_check_and_build_npt will be called again and in the
> > > > > > > second attempt this should work.
> > > > > > >
> > > > > > > > > +
> > > > > > > > > + /*
> > > > > > > > > + * Adjust the level so that we don't go higher
> > > > > > > > > than the backing
> > > > > > > > > + * page level.
> > > > > > > > > + */
> > > > > > > > > + level = min_t(size_t, level, npt_level);
> > > > > > > > > +
> > > > > > > > > + trace_kvm_snp_psc(vcpu->vcpu_id, pfn, gpa, op,
> > > > > > > > > level);
> > > > > > > > > +
> > > > > > > > > + switch (op) {
> > > > > > > > > + case SNP_PAGE_STATE_SHARED:
> > > > > > > > > + rc = snp_make_page_shared(kvm, gpa, pfn,
> > > > > > > > > level);
> > > > > > > > > + break;
> > > > > > > > > + case SNP_PAGE_STATE_PRIVATE:
> > > > > > > > > + rc = rmp_make_private(pfn, gpa, level,
> > > > > > > > > sev->asid, false);
> > > > > > > > > + break;
> > > > > > > > > + default:
> > > > > > > > > + rc = -EINVAL;
> > > > > > > > > + break;
> > > > > > > > > + }
> > > > > > > > > +
> > > > > > > > > + write_unlock(&kvm->mmu_lock);
> > > > > > > > > +
> > > > > > > > > + if (rc) {
> > > > > > > > > + pr_err_ratelimited("Error op %d gpa %llx
> > > > > > > > > pfn %llx level %d rc %d\n",
> > > > > > > > > + op, gpa, pfn, level, rc);
> > > > > > > > > + return rc;
> > > > > > > > > + }
> > > > > > > > > +
> > > > > > > > > + gpa = gpa + page_level_size(level);
> > > > > > > > > + }
> > > > > > > > > +
> > > > > > > > > + return 0;
> > > > > > > > > +}
> > > > > > > > > +

2022-10-13 15:12:39

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 41/49] KVM: SVM: Add support to handle the RMP nested page fault

On 10/12/2022 5:53 PM, Alper Gun wrote:
> On Mon, Oct 10, 2022 at 7:32 PM Kalra, Ashish <[email protected]> wrote:
>>
>> Hello Alper,
>>
>> On 10/10/2022 5:03 PM, Alper Gun wrote:
>>> On Mon, Jun 20, 2022 at 4:13 PM Ashish Kalra <[email protected]> wrote:
>>>>
>>>> From: Brijesh Singh <[email protected]>
>>>>
>>>> When SEV-SNP is enabled in the guest, the hardware places restrictions on
>>>> all memory accesses based on the contents of the RMP table. When hardware
>>>> encounters RMP check failure caused by the guest memory access it raises
>>>> the #NPF. The error code contains additional information on the access
>>>> type. See the APM volume 2 for additional information.
>>>>
>>>> Signed-off-by: Brijesh Singh <[email protected]>
>>>> ---
>>>> arch/x86/kvm/svm/sev.c | 76 ++++++++++++++++++++++++++++++++++++++++++
>>>> arch/x86/kvm/svm/svm.c | 14 +++++---
>>>> 2 files changed, 86 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>>>> index 4ed90331bca0..7fc0fad87054 100644
>>>> --- a/arch/x86/kvm/svm/sev.c
>>>> +++ b/arch/x86/kvm/svm/sev.c
>>>> @@ -4009,3 +4009,79 @@ void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
>>>>
>>>> spin_unlock(&sev->psc_lock);
>>>> }
>>>> +
>>>> +void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
>>>> +{
>>>> + int rmp_level, npt_level, rc, assigned;
>>>> + struct kvm *kvm = vcpu->kvm;
>>>> + gfn_t gfn = gpa_to_gfn(gpa);
>>>> + bool need_psc = false;
>>>> + enum psc_op psc_op;
>>>> + kvm_pfn_t pfn;
>>>> + bool private;
>>>> +
>>>> + write_lock(&kvm->mmu_lock);
>>>> +
>>>> + if (unlikely(!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level)))
>>>> + goto unlock;
>>>> +
>>>> + assigned = snp_lookup_rmpentry(pfn, &rmp_level);
>>>> + if (unlikely(assigned < 0))
>>>> + goto unlock;
>>>> +
>>>> + private = !!(error_code & PFERR_GUEST_ENC_MASK);
>>>> +
>>>> + /*
>>>> + * If the fault was due to size mismatch, or NPT and RMP page level's
>>>> + * are not in sync, then use PSMASH to split the RMP entry into 4K.
>>>> + */
>>>> + if ((error_code & PFERR_GUEST_SIZEM_MASK) ||
>>>> + (npt_level == PG_LEVEL_4K && rmp_level == PG_LEVEL_2M && private)) {
>>>> + rc = snp_rmptable_psmash(kvm, pfn);
>>>
>>>
>>> Regarding this case:
>>> RMP level is 4K
>>> Page table level is 2M
>>>
>>> Does this also cause a page fault with size mismatch? If so, we
>>> shouldn't try psmash because the rmp entry is already 4K.
>>>
>>> I see these errors in our tests and I think it may be happening
>>> because rmp size is already 4K.
>>>
>>> [ 1848.752952] psmash failed, gpa 0x191560000 pfn 0x536cd60 rc 7
>>> [ 2922.879635] psmash failed, gpa 0x102830000 pfn 0x37c8230 rc 7
>>> [ 3010.983090] psmash failed, gpa 0x104220000 pfn 0x6cf1e20 rc 7
>>> [ 3170.792050] psmash failed, gpa 0x108a80000 pfn 0x20e0080 rc 7
>>> [ 3345.955147] psmash failed, gpa 0x11b480000 pfn 0x1545e480 rc 7
>>>
>>> Shouldn't we use AND instead of OR in the if statement?
>>>
>>
>> I believe we can't do this, looking at the typical usage case below:
>>
>> [ 37.243969] #VMEXIT (NPF) - SIZEM, err 0xc80000005 npt_level 2,
>> rmp_level 2, private 1
>> [ 37.243973] trying psmash gpa 0x7f790000 pfn 0x1f5d90
>>
>> This is typically the case with #VMEXIT(NPF) with SIZEM error code, when
>> the guest tries to do PVALIDATE on 4K GHCB pages, in this case both the
>> RMP table and NPT will be optimally setup to 2M hugepage as can be seen.
>>
>> Is it possible to investigate in more depth when this case is being
>> observed:
>
> Yes, I added more logs and I can see that these errors happen when RMP
> level is 4K and NPT level is 2M.
> psmash fails as expected. I think it is just a log, there is no real
> issue but the best is not trying psmash if rmp level is 4K.
>

Now, the SIZEM bit is only set when PVALIDATE or RMPADJUST fails due to
the guest attempting to validate a 4K page that is backed by a 2MB RMP
entry, which is not the case here as the RMP level is 4K.

Also, this does not fall into the second case for the same reason.

A #NPF will happen during a guest page table walk if the RMP checks fail
for a 2M nested page and RMP.SubPage_Count != 0 OR
RMP.PageSize != the nested page table page size, but then that shouldn't
have the SIZEM fault bit set.

This raises concern about some existing race condition; it probably
can race with
snp_handle_page_state_change()->snp_make_page_shared()->snp_rmptable_psmash(),
but that code path seems to be protected from this nested RMP #PF
handler as they both acquire the kvm->mmu_lock.

So, this still needs more investigation.

Can you share what kind of tests you are running to reproduce this
issue?

Thanks,
Ashish

>> RMP level is 4K
>> Page table level is 2M
>> We shouldn't try psmash because the rmp entry is already 4K.
>>
>> Thanks,
>> Ashish
>>
>>> if ((error_code & PFERR_GUEST_SIZEM_MASK) && ...
>>>

2022-10-13 15:20:17

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Mon, Jun 20, 2022 at 11:05:01PM +0000, Ashish Kalra wrote:
> +static void snp_leak_pages(unsigned long pfn, unsigned int npages)

That function name looks wrong.

> +{
> + WARN(1, "psc failed, pfn 0x%lx pages %d (leaking)\n", pfn, npages);
> + while (npages--) {
> + memory_failure(pfn, 0);
^^^^^^^^^^^^^^^^^^^^^^

Why?

* This function is called by the low level machine check code
* of an architecture when it detects hardware memory corruption
* of a page. It tries its best to recover, which includes
* dropping pages, killing processes etc.

I don't think you wanna do that.

It looks like you want to prevent the page from being used again but not
mark it as PG_hwpoison and whatnot. PG_reserved perhaps?

> + dump_rmpentry(pfn);
> + pfn++;
> + }
> +}
> +
> +static int snp_reclaim_pages(unsigned long pfn, unsigned int npages, bool locked)
> +{
> + struct sev_data_snp_page_reclaim data;
> + int ret, err, i, n = 0;
> +
> + for (i = 0; i < npages; i++) {
> + memset(&data, 0, sizeof(data));
> + data.paddr = pfn << PAGE_SHIFT;

Oh wow, this is just silly. A struct for a single u64. Just use a

u64 paddr;

directly. But we had this topic already...

> +
> + if (locked)

Ew, that's never a good design - conditional locking.

> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
> + else
> + ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);

<---- newline here.

> + if (ret)
> + goto cleanup;
> +
> + ret = rmp_make_shared(pfn, PG_LEVEL_4K);
> + if (ret)
> + goto cleanup;
> +
> + pfn++;
> + n++;
> + }
> +
> + return 0;
> +
> +cleanup:
> + /*
> + * If failed to reclaim the page then page is no longer safe to
> + * be released, leak it.
> + */
> + snp_leak_pages(pfn, npages - n);

So this looks real weird: we go and reclaim pages, we hit an error
during reclaiming a page X somewhere in-between and then we go and mark
the *remaining* pages as not to be used?!

Why?

Why not only that *one* page which failed and then we continue with the
rest?!

> + return ret;
> +}
> +
> +static inline int rmp_make_firmware(unsigned long pfn, int level)
> +{
> + return rmp_make_private(pfn, 0, level, 0, true);
> +}

That's a silly wrapper used only once. Just do at the callsite:

/* Mark this page as belonging to firmware */
rc = rmp_make_private(pfn, 0, level, 0, true);

> +
> +static int snp_set_rmp_state(unsigned long paddr, unsigned int npages, bool to_fw, bool locked,
> + bool need_reclaim)

Tangential to the above, this is just nuts with those bool arguments.
Just look at the callsites: do you understand what they do?

snp_set_rmp_state(paddr, npages, true, locked, false);

what does that do? You need to go up to the definition of the function,
count the arguments and see what that "true" arg stands for.

What you should do instead is, have separate helpers which do only one
thing:

rmp_mark_pages_firmware();
rmp_mark_pages_shared();
rmp_mark_pages_...

and then have the *callers* issue snp_reclaim_pages() when needed. So you'd have

rmp_mark_pages_firmware();
rmp_mark_pages_shared()

and __snp_free_firmware_pages() would do

rmp_mark_pages_shared();
snp_reclaim_pages();

and so on.

And then if you need locking, the callers can decide which sev_do_cmd
variant to issue.

And then if you have common code fragments which you can unify into a
bigger helper function, *then* you can do that.

Instead of multiplexing it this way. Which makes it really hard to
follow what the code does.
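
For example, one of those helpers could be as simple as this (just a
sketch, using the names above and PG_LEVEL_4K like the current code):

static int rmp_mark_pages_shared(unsigned long pfn, unsigned int npages)
{
	int rc;

	while (npages--) {
		rc = rmp_make_shared(pfn, PG_LEVEL_4K);
		if (rc)
			return rc;
		pfn++;
	}

	return 0;
}

with the callers then deciding themselves whether a snp_reclaim_pages()
is needed before or after.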


> + unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT; /* Cbit maybe set in the paddr */

No side comments pls.

> + int rc, n = 0, i;
> +
> + for (i = 0; i < npages; i++) {
> + if (to_fw)
> + rc = rmp_make_firmware(pfn, PG_LEVEL_4K);
> + else
> + rc = need_reclaim ? snp_reclaim_pages(pfn, 1, locked) :
> + rmp_make_shared(pfn, PG_LEVEL_4K);
> + if (rc)
> + goto cleanup;
> +
> + pfn++;
> + n++;
> + }
> +
> + return 0;
> +
> +cleanup:
> + /* Try unrolling the firmware state changes */
> + if (to_fw) {
> + /*
> + * Reclaim the pages which were already changed to the
> + * firmware state.
> + */
> + snp_reclaim_pages(paddr >> PAGE_SHIFT, n, locked);
> +
> + return rc;
> + }
> +
> + /*
> + * If failed to change the page state to shared, then its not safe
> + * to release the page back to the system, leak it.
> + */
> + snp_leak_pages(pfn, npages - n);
> +
> + return rc;
> +}

...

> +void snp_free_firmware_page(void *addr)
> +{
> + if (!addr)
> + return;
> +
> + __snp_free_firmware_pages(virt_to_page(addr), 0, false);
> +}
> +EXPORT_SYMBOL(snp_free_firmware_page);

EXPORT_SYMBOL_GPL() ofc.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-10-14 20:06:56

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

Hello Boris,

On 10/13/2022 10:15 AM, Borislav Petkov wrote:
> On Mon, Jun 20, 2022 at 11:05:01PM +0000, Ashish Kalra wrote:
>> +static void snp_leak_pages(unsigned long pfn, unsigned int npages)
>
> That function name looks wrong.
>
>> +{
>> + WARN(1, "psc failed, pfn 0x%lx pages %d (leaking)\n", pfn, npages);
>> + while (npages--) {
>> + memory_failure(pfn, 0);
> ^^^^^^^^^^^^^^^^^^^^^^
>
> Why?

The page was in FW state and we couldn't transition it back to HV/Shared
state; any access to this page can cause an RMP #PF.

>
> * This function is called by the low level machine check code
> * of an architecture when it detects hardware memory corruption
> * of a page. It tries its best to recover, which includes
> * dropping pages, killing processes etc.
>
> I don't think you wanna do that.
>
> It looks like you want to prevent the page from being used again but not
> mark it as PG_hwpoison and whatnot. PG_reserved perhaps?

* PG_reserved is set for special pages. The "struct page" of such a
* page should in general not be touched (e.g. set dirty) except by its
* owner.

If it is "still" accessed/touched then it can cause RMP #PF.
On the other hand,

* PG_hwpoison... Accessing is
* not safe since it may cause another machine check. Don't touch!

That sounds like exactly the state we want these page(s) to be in?

Another possibility is PG_error.

>
>> + dump_rmpentry(pfn);
>> + pfn++;
>> + }
>> +}
>> +
>> +static int snp_reclaim_pages(unsigned long pfn, unsigned int npages, bool locked)
>> +{
>> + struct sev_data_snp_page_reclaim data;
>> + int ret, err, i, n = 0;
>> +
>> + for (i = 0; i < npages; i++) {
>> + memset(&data, 0, sizeof(data));
>> + data.paddr = pfn << PAGE_SHIFT;
>
> Oh wow, this is just silly. A struct for a single u64. Just use a
>
> u64 paddr;
Ok.
>
> directly. But we had this topic already...
>
>> +
>> + if (locked)
>
> Ew, that's never a good design - conditional locking.

There is a previous suggestion to change `sev_cmd_mutex` to some sort of
nesting lock type to clean up this if (locked) code, though AFAIK, there
is no support for nesting lock types.
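
Alternatively, the same locked/unlocked split that
sev_do_cmd()/__sev_do_cmd_locked() already use could be applied here
instead of the bool - a rough sketch (names illustrative, and the
rmp_make_shared() part of the reclaim path omitted):

static int __snp_reclaim_page_locked(unsigned long pfn, int *error)
{
	struct sev_data_snp_page_reclaim data = {};

	data.paddr = pfn << PAGE_SHIFT;

	return __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &data, error);
}

static int snp_reclaim_page(unsigned long pfn, int *error)
{
	struct sev_data_snp_page_reclaim data = {};

	data.paddr = pfn << PAGE_SHIFT;

	return sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, error);
}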

>
>> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
>> + else
>> + ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
>
> <---- newline here.
>
>> + if (ret)
>> + goto cleanup;
>> +
>> + ret = rmp_make_shared(pfn, PG_LEVEL_4K);
>> + if (ret)
>> + goto cleanup;
>> +
>> + pfn++;
>> + n++;
>> + }
>> +
>> + return 0;
>> +
>> +cleanup:
>> + /*
>> + * If failed to reclaim the page then page is no longer safe to
>> + * be released, leak it.
>> + */
>> + snp_leak_pages(pfn, npages - n);
>
> So this looks real weird: we go and reclaim pages, we hit an error
> during reclaiming a page X somewhere in-between and then we go and mark
> the *remaining* pages as not to be used?!
>
> Why?
>
> Why not only that *one* page which failed and then we continue with the
> rest?!

I agree and will change to this approach.

>
>> + return ret;
>> +}
>> +
>> +static inline int rmp_make_firmware(unsigned long pfn, int level)
>> +{
>> + return rmp_make_private(pfn, 0, level, 0, true);
>> +}
>
> That's a silly wrapper used only once. Just do at the callsite:
>
> /* Mark this page as belonging to firmware */
> rc = rmp_make_private(pfn, 0, level, 0, true);
>
Ok.

>> +
>> +static int snp_set_rmp_state(unsigned long paddr, unsigned int npages, bool to_fw, bool locked,
>> + bool need_reclaim)
>
> Tangential to the above, this is just nuts with those bool arguments.
> Just look at the callsites: do you understand what they do?
>
> snp_set_rmp_state(paddr, npages, true, locked, false);
>
> what does that do? You need to go up to the definition of the function,
> count the arguments and see what that "true" arg stands for.

I totally agree, this is simply unreadable.

And this has been mentioned previously too ...
This function can do a lot, and when I read the call sites it's hard to
see what it's doing, since we have a combination of arguments which tell
us what behavior is happening ...

>
> What you should do instead is, have separate helpers which do only one
> thing:
>
> rmp_mark_pages_firmware();
> rmp_mark_pages_shared();
> rmp_mark_pages_...
>
> and then have the *callers* issue snp_reclaim_pages() when needed. So you'd have
>
> rmp_mark_pages_firmware();
> rmp_mark_pages_shared()
>
> and __snp_free_firmware_pages() would do
>
> rmp_mark_pages_shared();
> snp_reclaim_pages();
>
Actually, this only needs to call snp_reclaim_pages().

> and so on.
>
> And then if you need locking, the callers can decide which sev_do_cmd
> variant to issue.
>
> And then if you have common code fragments which you can unify into a
> bigger helper function, *then* you can do that.
>
> Instead of multiplexing it this way. Which makes it really hard to
> follow what the code does.
>

Sure, I will do this cleanup.

>
>> + unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT; /* Cbit maybe set in the paddr */
>
> No side comments pls.
>
>> + int rc, n = 0, i;
>> +
>> + for (i = 0; i < npages; i++) {
>> + if (to_fw)
>> + rc = rmp_make_firmware(pfn, PG_LEVEL_4K);
>> + else
>> + rc = need_reclaim ? snp_reclaim_pages(pfn, 1, locked) :
>> + rmp_make_shared(pfn, PG_LEVEL_4K);
>> + if (rc)
>> + goto cleanup;
>> +
>> + pfn++;
>> + n++;
>> + }
>> +
>> + return 0;
>> +
>> +cleanup:
>> + /* Try unrolling the firmware state changes */
>> + if (to_fw) {
>> + /*
>> + * Reclaim the pages which were already changed to the
>> + * firmware state.
>> + */
>> + snp_reclaim_pages(paddr >> PAGE_SHIFT, n, locked);
>> +
>> + return rc;
>> + }
>> +
>> + /*
>> + * If failed to change the page state to shared, then its not safe
>> + * to release the page back to the system, leak it.
>> + */
>> + snp_leak_pages(pfn, npages - n);
>> +
>> + return rc;
>> +}
>
> ...
>
>> +void snp_free_firmware_page(void *addr)
>> +{
>> + if (!addr)
>> + return;
>> +
>> + __snp_free_firmware_pages(virt_to_page(addr), 0, false);
>> +}
>> +EXPORT_SYMBOL(snp_free_firmware_page);
>
> EXPORT_SYMBOL_GPL() ofc.
>

Thanks,
Ashish

2022-10-14 21:22:53

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 12/49] crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP

Hello Boris,

On 10/1/2022 12:33 PM, Borislav Petkov wrote:
> On Mon, Jun 20, 2022 at 11:04:29PM +0000, Ashish Kalra wrote:
>> +static int __sev_snp_init_locked(int *error)
>> +{
>> + struct psp_device *psp = psp_master;
>> + struct sev_device *sev;
>> + int rc = 0;
>> +
>> + if (!psp || !psp->sev_data)
>> + return -ENODEV;
>> +
>> + sev = psp->sev_data;
>> +
>> + if (sev->snp_inited)
>
> snp_inited? That's silly.
>
> snp_initialized
>
> pls.

Yes.

>
>> + return 0;
>> +
>> + /*
>> + * The SNP_INIT requires the MSR_VM_HSAVE_PA must be set to 0h
>
> /* Clear MSR_VM_HSAVE_PA on all cores before SNP_INIT */
>
>> + * across all cores.
>> + */
>> + on_each_cpu(snp_set_hsave_pa, NULL, 1);
>> +
>> + /* Issue the SNP_INIT firmware command. */
>
> Useless comment.
>
>> + rc = __sev_do_cmd_locked(SEV_CMD_SNP_INIT, NULL, error);
>> + if (rc)
>> + return rc;
>> +
>> + /* Prepare for first SNP guest launch after INIT */
>> + wbinvd_on_all_cpus();
>
> Can you put a wbinvd() in snp_set_hsave_pa() instead and save yourself
> the second IPI?
>
> Or is that order of the commands:
>
> 1. clear MSR IPI
> 2. SNP_INIT
> 3. WBINVD IPI
> 4. ...
>
> mandatory?
>

Yes, we need to do:

wbinvd_on_all_cpus();
SNP_DF_FLUSH

We need to ensure all the caches are clear before launching the first
guest, and this has to be a combination of the WBINVD and SNP_DF_FLUSH
commands.
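
As a rough sketch of that ordering inside __sev_snp_init_locked()
(error handling trimmed, and whether the DF_FLUSH is issued right here
or deferred is a separate question):

	/* Clear MSR_VM_HSAVE_PA on all cores before SNP_INIT */
	on_each_cpu(snp_set_hsave_pa, NULL, 1);

	rc = __sev_do_cmd_locked(SEV_CMD_SNP_INIT, NULL, error);
	if (rc)
		return rc;

	/* Flush all caches before the first guest launch... */
	wbinvd_on_all_cpus();

	/* ...and then have the firmware flush its data fabric state */
	rc = __sev_do_cmd_locked(SEV_CMD_SNP_DF_FLUSH, NULL, error);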

> ...
>
>> +static int __sev_snp_shutdown_locked(int *error)
>> +{
>> + struct sev_device *sev = psp_master->sev_data;
>> + int ret;
>> +
>> + if (!sev->snp_inited)
>> + return 0;
>> +
>> + /* SHUTDOWN requires the DF_FLUSH */
>> + wbinvd_on_all_cpus();
>> + __sev_do_cmd_locked(SEV_CMD_SNP_DF_FLUSH, NULL, NULL);
>
> Why isn't this retval checked?

From the SNP FW ABI specs, for the SNP_SHUTDOWN command:

Firmware checks for every encryption capable ASID that the ASID is not
in use by a guest and a DF_FLUSH is not required. If a DF_FLUSH is
required, the firmware returns DFFLUSH_REQUIRED.

So the SNP_SHUTDOWN command itself will check whether a DF_FLUSH is
still required and, if one was not invoked before that command, will
return an error indicating that a DF_FLUSH is required.

This way, we can cleverly avoid taking the error code path for the
DF_FLUSH command here and instead let the SNP_SHUTDOWN command
failure below indicate whether the DF_FLUSH command failed.

This also ensures that we always invoke the SNP_SHUTDOWN command,
irrespective of an SNP_DF_FLUSH command failure, as SNP_DF_FLUSH may
actually not be required by the SHUTDOWN command.
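
In other words, the intent could be spelled out in a comment, roughly
like this (behavior identical to the posted hunk):

	/*
	 * Intentionally ignore the DF_FLUSH return value here: if a
	 * flush was actually required and did not happen, SNP_SHUTDOWN
	 * below fails with DFFLUSH_REQUIRED, and that is the error we
	 * report.
	 */
	wbinvd_on_all_cpus();
	__sev_do_cmd_locked(SEV_CMD_SNP_DF_FLUSH, NULL, NULL);

	ret = __sev_do_cmd_locked(SEV_CMD_SNP_SHUTDOWN, NULL, error);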

>
>> +
>> + ret = __sev_do_cmd_locked(SEV_CMD_SNP_SHUTDOWN, NULL, error);
>> + if (ret) {
>> + dev_err(sev->dev, "SEV-SNP firmware shutdown failed\n");
>> + return ret;
>> + }
>> +
>> + sev->snp_inited = false;
>> + dev_dbg(sev->dev, "SEV-SNP firmware shutdown\n");
>> +
>> + return ret;
>> +}
>
> ...
>
>> void sev_dev_destroy(struct psp_device *psp)
>> @@ -1287,6 +1385,26 @@ void sev_pci_init(void)
>> }
>> }
>>
>> + /*
>> + * If boot CPU supports the SNP, then first attempt to initialize
>
> s/the SNP/SNP/g
>
>> + * the SNP firmware.
>> + */
>> + if (cpu_feature_enabled(X86_FEATURE_SEV_SNP)) {
>> + if (!sev_version_greater_or_equal(SNP_MIN_API_MAJOR, SNP_MIN_API_MINOR)) {
>> + dev_err(sev->dev, "SEV-SNP support requires firmware version >= %d:%d\n",
>> + SNP_MIN_API_MAJOR, SNP_MIN_API_MINOR);
>> + } else {
>> + rc = sev_snp_init(&error);
>> + if (rc) {
>> + /*
>> + * If we failed to INIT SNP then don't abort the probe.
>
> Who's "we"?
>
>> + * Continue to initialize the legacy SEV firmware.
>> + */
>> + dev_err(sev->dev, "SEV-SNP: failed to INIT error %#x\n", error);
>> + }
>> + }
>> + }
>> +
>> /* Obtain the TMR memory area for SEV-ES use */
>> sev_es_tmr = sev_fw_alloc(SEV_ES_TMR_SIZE);
>> if (!sev_es_tmr)
>
> ...
>
>> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
>> index 01ba9dc46ca3..ef4d42e8c96e 100644
>> --- a/include/linux/psp-sev.h
>> +++ b/include/linux/psp-sev.h
>> @@ -769,6 +769,20 @@ struct sev_data_snp_init_ex {
>> */
>> int sev_platform_init(int *error);
>>
>> +/**
>> + * sev_snp_init - perform SEV SNP_INIT command
>> + *
>> + * @error: SEV command return code
>> + *
>> + * Returns:
>> + * 0 if the SEV successfully processed the command
>> + * -%ENODEV if the SEV device is not available
>> + * -%ENOTSUPP if the SEV does not support SEV
>> + * -%ETIMEDOUT if the SEV command timed out
>> + * -%EIO if the SEV returned a non-zero return code
>
> Something's weird with those args. I think it should be
>
> %-ENODEV
>
> and so on...
>

Yes, of course %-<errno>

Thanks,
Ashish

2022-10-14 21:33:48

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 12/49] crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP

Some more follow up regarding avoiding the second IPI:

>>
>>> +    rc = __sev_do_cmd_locked(SEV_CMD_SNP_INIT, NULL, error);
>>> +    if (rc)
>>> +        return rc;
>>> +
>>> +    /* Prepare for first SNP guest launch after INIT */
>>> +    wbinvd_on_all_cpus();
>>
>> Can you put a wbinvd() in snp_set_hsave_pa() instead and save yourself
>> the second IPI?
>>
>> Or is that order of the commands:
>>
>>     1. clear MSR IPI
>>     2. SNP_INIT
>>     3. WBINVD IPI
>>     4. ...
>>
>> mandatory?
>>
>
> Yes, we need to do:
>
> wbinvd_on_all_cpus();
> SNP_DF_FLUSH
>
> Need to ensure all the caches are clear before launching the first guest
> and this has to be a combination of WBINVD and SNP_DF_FLUSH command.
>

I had related discussions with the HW architect:

SNP firmware will fail ACTIVATE if DFFLUSH isn't called, and DFFLUSH
requires the WBINVD on all cores. By requiring WBINVD on all cores,
we're a) requiring the caches to be flushed, and b) forcing the
hypervisor to exit all guests at least once since SEV/SNP has been
enabled, since the WBINVDs must be done in host mode.

The order is:
VM_HSAVE_PA IPI
SNP_INIT
WBINVD (IPI)
DF_FLUSH

so that means we can't combine the IPIs.
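
As a rough sketch of that ordering in sev_snp_init() (helper names are
the ones discussed in this thread; treat this as illustrative, not the
final code):

	/* 1. Program VM_HSAVE_PA on every core (first IPI) */
	on_each_cpu(snp_set_hsave_pa, NULL, 1);

	/* 2. Initialize the SNP firmware */
	rc = __sev_do_cmd_locked(SEV_CMD_SNP_INIT, NULL, error);
	if (rc)
		return rc;

	/* 3. Flush caches on every core (second, unavoidable IPI) */
	wbinvd_on_all_cpus();

	/* 4. Data fabric flush so the first ACTIVATE can succeed */
	rc = __sev_do_cmd_locked(SEV_CMD_SNP_DF_FLUSH, NULL, error);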

Also, this is not a performance critical path, so should we really be so
concerned about it?

Thanks,
Ashish

2022-10-19 19:00:12

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 12/49] crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP

Another follow up:

>>>   int sev_platform_init(int *error);
>>> +/**
>>> + * sev_snp_init - perform SEV SNP_INIT command
>>> + *
>>> + * @error: SEV command return code
>>> + *
>>> + * Returns:
>>> + * 0 if the SEV successfully processed the command
>>> + * -%ENODEV    if the SEV device is not available
>>> + * -%ENOTSUPP  if the SEV does not support SEV
>>> + * -%ETIMEDOUT if the SEV command timed out
>>> + * -%EIO       if the SEV returned a non-zero return code
>>
>> Something's weird with those args. I think it should be
>>
>>     %-ENODEV
>>
>> and so on...
>>
>
> Yes, off course %-<errno>
>

I see that other drivers are also using the same convention:

include/linux/regset.h:
..
/**
* user_regset_set_fn - type of @set function in &struct user_regset
* @target: thread being examined
* @regset: regset being examined
* @pos: offset into the regset data to access, in bytes
* @count: amount of data to copy, in bytes
* @kbuf: if not %NULL, a kernel-space pointer to copy from
* @ubuf: if @kbuf is %NULL, a user-space pointer to copy from
*
* Store register values. Return %0 on success; -%EIO or -%ENODEV
* are usual failure returns. The @pos and @count values are in
...

include/linux/psp-tee.h:
..
/**
* psp_tee_process_cmd() - Process command in Trusted Execution Environment
* @cmd_id: TEE command ID (&enum tee_cmd_id)
* @buf: Command buffer for TEE processing. On success, is updated
* with the response
* @len: Length of command buffer in bytes
* @status: On success, holds the TEE command execution status
*
* This function submits a command to the Trusted OS for processing in the
* TEE environment and waits for a response or until the command times out.
*
* Returns:
* 0 if TEE successfully processed the command
* -%ENODEV if PSP device not available
* -%EINVAL if invalid input
* -%ETIMEDOUT if TEE command timed out
* -%EBUSY if PSP device is not responsive
*/
...

Thanks,
Ashish

2022-10-21 19:07:22

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 42/49] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event

On 6/20/22 18:13, Ashish Kalra wrote:
> From: Brijesh Singh <[email protected]>
>
> Version 2 of the GHCB specification added support for two SNP Guest
> Request Message NAE events. The events allow an SEV-SNP guest to
> make requests to the SEV-SNP firmware through the hypervisor using the
> SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification.
>
> The SNP_EXT_GUEST_REQUEST is similar to SNP_GUEST_REQUEST with the
> difference of an additional certificate blob that can be passed through
> the SNP_SET_CONFIG ioctl defined in the CCP driver. The CCP driver
> provides snp_guest_ext_guest_request() that is used by the KVM to get
> both the report and certificate data at once.
>
> Signed-off-by: Brijesh Singh <[email protected]>
> ---
> arch/x86/kvm/svm/sev.c | 196 +++++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/svm/svm.h | 2 +
> 2 files changed, 192 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 7fc0fad87054..089af21a4efe 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c

> +static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa)
> +{
> + struct sev_data_snp_guest_request req = {0};
> + struct kvm_vcpu *vcpu = &svm->vcpu;
> + struct kvm *kvm = vcpu->kvm;
> + unsigned long data_npages;
> + struct kvm_sev_info *sev;
> + unsigned long rc, err;
> + u64 data_gpa;
> +
> + if (!sev_snp_guest(vcpu->kvm)) {
> + rc = SEV_RET_INVALID_GUEST;
> + goto e_fail;
> + }
> +
> + sev = &to_kvm_svm(kvm)->sev_info;
> +
> + data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
> + data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
> +
> + if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
> + rc = SEV_RET_INVALID_ADDRESS;
> + goto e_fail;
> + }
> +
> + /* Verify that requested blob will fit in certificate buffer */
> + if ((data_npages << PAGE_SHIFT) > SEV_FW_BLOB_MAX_SIZE) {

Not sure this is a valid check... Isn't it OK if the guest has supplied
more room than is required? If the guest supplies 8 pages and the
hypervisor only needs to copy 1 page of data (or the SEV_FW_BLOB_MAX_SIZE
number of pages) that shouldn't be an error. I think this check can go, right?

Thanks,
Tom

> + rc = SEV_RET_INVALID_PARAM;
> + goto e_fail;
> + }
> +
> + mutex_lock(&sev->guest_req_lock);
> +
> + rc = snp_setup_guest_buf(svm, &req, req_gpa, resp_gpa);
> + if (rc)
> + goto unlock;
> +
> + rc = snp_guest_ext_guest_request(&req, (unsigned long)sev->snp_certs_data,
> + &data_npages, &err);
> + if (rc) {
> + /*
> + * If buffer length is small then return the expected
> + * length in rbx.
> + */
> + if (err == SNP_GUEST_REQ_INVALID_LEN)
> + vcpu->arch.regs[VCPU_REGS_RBX] = data_npages;
> +
> + /* pass the firmware error code */
> + rc = err;
> + goto cleanup;
> + }
> +
> + /* Copy the certificate blob in the guest memory */
> + if (data_npages &&
> + kvm_write_guest(kvm, data_gpa, sev->snp_certs_data, data_npages << PAGE_SHIFT))
> + rc = SEV_RET_INVALID_ADDRESS;
> +
> +cleanup:
> + snp_cleanup_guest_buf(&req, &rc);
> +
> +unlock:
> + mutex_unlock(&sev->guest_req_lock);
> +
> +e_fail:
> + svm_set_ghcb_sw_exit_info_2(vcpu, rc);
> +}
> +

2022-10-21 21:21:13

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 42/49] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event

Hello Tom,

On 10/21/2022 2:06 PM, Tom Lendacky wrote:
> On 6/20/22 18:13, Ashish Kalra wrote:
>> From: Brijesh Singh <[email protected]>
>>
>> Version 2 of the GHCB specification added support for two SNP Guest
>> Request Message NAE events. The events allow an SEV-SNP guest to
>> make requests to the SEV-SNP firmware through the hypervisor using the
>> SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification.
>>
>> The SNP_EXT_GUEST_REQUEST is similar to SNP_GUEST_REQUEST with the
>> difference of an additional certificate blob that can be passed through
>> the SNP_SET_CONFIG ioctl defined in the CCP driver. The CCP driver
>> provides snp_guest_ext_guest_request() that is used by the KVM to get
>> both the report and certificate data at once.
>>
>> Signed-off-by: Brijesh Singh <[email protected]>
>> ---
>>   arch/x86/kvm/svm/sev.c | 196 +++++++++++++++++++++++++++++++++++++++--
>>   arch/x86/kvm/svm/svm.h |   2 +
>>   2 files changed, 192 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 7fc0fad87054..089af21a4efe 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>
>> +static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t
>> req_gpa, gpa_t resp_gpa)
>> +{
>> +    struct sev_data_snp_guest_request req = {0};
>> +    struct kvm_vcpu *vcpu = &svm->vcpu;
>> +    struct kvm *kvm = vcpu->kvm;
>> +    unsigned long data_npages;
>> +    struct kvm_sev_info *sev;
>> +    unsigned long rc, err;
>> +    u64 data_gpa;
>> +
>> +    if (!sev_snp_guest(vcpu->kvm)) {
>> +        rc = SEV_RET_INVALID_GUEST;
>> +        goto e_fail;
>> +    }
>> +
>> +    sev = &to_kvm_svm(kvm)->sev_info;
>> +
>> +    data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
>> +    data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
>> +
>> +    if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
>> +        rc = SEV_RET_INVALID_ADDRESS;
>> +        goto e_fail;
>> +    }
>> +
>> +    /* Verify that requested blob will fit in certificate buffer */
>> +    if ((data_npages << PAGE_SHIFT) > SEV_FW_BLOB_MAX_SIZE) {
>
> Not sure this is a valid check...  Isn't it OK if the guest has supplied
> more room than is required? If the guest supplies 8 pages and the
> hypervisor only needs to copy 1 page of data (or the
> SEV_FW_BLOB_MAX_SIZE number of pages) that shouldn't be an error. I
> think this check can go, right?
>

Agreed.

The check should probably be
if ((data_npages << PAGE_SHIFT) < SEV_FW_BLOB_MAX_SIZE)

and that check already exists in:

snp_guest_ext_guest_request(...)
{
...
...
	/*
	 * Check if there is enough space to copy the certificate chain.
	 * Otherwise return ERROR code defined in the GHCB specification.
	 */
	expected_npages = sev->snp_certs_len >> PAGE_SHIFT;
	if (*npages < expected_npages) {
		*npages = expected_npages;
		*fw_err = SNP_GUEST_REQ_INVALID_LEN;
		return -EINVAL;
	}
...

Thanks,
Ashish

> Thanks,
> Tom
>
>> +        rc = SEV_RET_INVALID_PARAM;
>> +        goto e_fail;
>> +    }
>> +
>> +    mutex_lock(&sev->guest_req_lock);
>> +
>> +    rc = snp_setup_guest_buf(svm, &req, req_gpa, resp_gpa);
>> +    if (rc)
>> +        goto unlock;
>> +
>> +    rc = snp_guest_ext_guest_request(&req, (unsigned
>> long)sev->snp_certs_data,
>> +                     &data_npages, &err);
>> +    if (rc) {
>> +        /*
>> +         * If buffer length is small then return the expected
>> +         * length in rbx.
>> +         */
>> +        if (err == SNP_GUEST_REQ_INVALID_LEN)
>> +            vcpu->arch.regs[VCPU_REGS_RBX] = data_npages;
>> +
>> +        /* pass the firmware error code */
>> +        rc = err;
>> +        goto cleanup;
>> +    }
>> +
>> +    /* Copy the certificate blob in the guest memory */
>> +    if (data_npages &&
>> +        kvm_write_guest(kvm, data_gpa, sev->snp_certs_data,
>> data_npages << PAGE_SHIFT))
>> +        rc = SEV_RET_INVALID_ADDRESS;
>> +
>> +cleanup:
>> +    snp_cleanup_guest_buf(&req, &rc);
>> +
>> +unlock:
>> +    mutex_unlock(&sev->guest_req_lock);
>> +
>> +e_fail:
>> +    svm_set_ghcb_sw_exit_info_2(vcpu, rc);
>> +}
>> +

2022-10-21 21:48:04

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 42/49] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event

On 10/21/22 16:12, Kalra, Ashish wrote:
> Hello Tom,
>
> On 10/21/2022 2:06 PM, Tom Lendacky wrote:
>> On 6/20/22 18:13, Ashish Kalra wrote:
>>> From: Brijesh Singh <[email protected]>
>>>
>>> Version 2 of the GHCB specification added support for two SNP Guest
>>> Request Message NAE events. The events allow an SEV-SNP guest to
>>> make requests to the SEV-SNP firmware through the hypervisor using the
>>> SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification.
>>>
>>> The SNP_EXT_GUEST_REQUEST is similar to SNP_GUEST_REQUEST with the
>>> difference of an additional certificate blob that can be passed through
>>> the SNP_SET_CONFIG ioctl defined in the CCP driver. The CCP driver
>>> provides snp_guest_ext_guest_request() that is used by the KVM to get
>>> both the report and certificate data at once.
>>>
>>> Signed-off-by: Brijesh Singh <[email protected]>
>>> ---
>>>   arch/x86/kvm/svm/sev.c | 196 +++++++++++++++++++++++++++++++++++++++--
>>>   arch/x86/kvm/svm/svm.h |   2 +
>>>   2 files changed, 192 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>>> index 7fc0fad87054..089af21a4efe 100644
>>> --- a/arch/x86/kvm/svm/sev.c
>>> +++ b/arch/x86/kvm/svm/sev.c
>>
>>> +static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t
>>> req_gpa, gpa_t resp_gpa)
>>> +{
>>> +    struct sev_data_snp_guest_request req = {0};
>>> +    struct kvm_vcpu *vcpu = &svm->vcpu;
>>> +    struct kvm *kvm = vcpu->kvm;
>>> +    unsigned long data_npages;
>>> +    struct kvm_sev_info *sev;
>>> +    unsigned long rc, err;
>>> +    u64 data_gpa;
>>> +
>>> +    if (!sev_snp_guest(vcpu->kvm)) {
>>> +        rc = SEV_RET_INVALID_GUEST;
>>> +        goto e_fail;
>>> +    }
>>> +
>>> +    sev = &to_kvm_svm(kvm)->sev_info;
>>> +
>>> +    data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
>>> +    data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
>>> +
>>> +    if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
>>> +        rc = SEV_RET_INVALID_ADDRESS;
>>> +        goto e_fail;
>>> +    }
>>> +
>>> +    /* Verify that requested blob will fit in certificate buffer */
>>> +    if ((data_npages << PAGE_SHIFT) > SEV_FW_BLOB_MAX_SIZE) {
>>
>> Not sure this is a valid check...  Isn't it OK if the guest has supplied
>> more room than is required? If the guest supplies 8 pages and the
>> hypervisor only needs to copy 1 page of data (or the
>> SEV_FW_BLOB_MAX_SIZE number of pages) that shouldn't be an error. I
>> think this check can go, right?
>>
>
> Agreed.
>
> The check should probably be
>  if ((data_npages << PAGE_SHIFT) < SEV_FW_BLOB_MAX_SIZE)

No, the check should just be removed. If the number of pages required to
hold the cert data is only 1, then a data_npages value of 1 is just fine
(see below).

>
> and that check already exists in:
>
> snp_guest_ext_guest_request(...)
> {
> ...
> ...
>	/*
>	 * Check if there is enough space to copy the certificate chain.
>	 * Otherwise return ERROR code defined in the GHCB specification.
>	 */
>	expected_npages = sev->snp_certs_len >> PAGE_SHIFT;
>	if (*npages < expected_npages) {

If expected_npages is 1, then an *npages value of 1 is OK. But if you put
the check in above that you want, you would never get here with an *npages
value of 1.

Thanks,
Tom

>		*npages = expected_npages;
>		*fw_err = SNP_GUEST_REQ_INVALID_LEN;
>		return -EINVAL;
>	}
> ...
>
> Thanks,
> Ashish
>
>> Thanks,
>> Tom
>>
>>> +        rc = SEV_RET_INVALID_PARAM;
>>> +        goto e_fail;
>>> +    }
>>> +
>>> +    mutex_lock(&sev->guest_req_lock);
>>> +
>>> +    rc = snp_setup_guest_buf(svm, &req, req_gpa, resp_gpa);
>>> +    if (rc)
>>> +        goto unlock;
>>> +
>>> +    rc = snp_guest_ext_guest_request(&req, (unsigned
>>> long)sev->snp_certs_data,
>>> +                     &data_npages, &err);
>>> +    if (rc) {
>>> +        /*
>>> +         * If buffer length is small then return the expected
>>> +         * length in rbx.
>>> +         */
>>> +        if (err == SNP_GUEST_REQ_INVALID_LEN)
>>> +            vcpu->arch.regs[VCPU_REGS_RBX] = data_npages;
>>> +
>>> +        /* pass the firmware error code */
>>> +        rc = err;
>>> +        goto cleanup;
>>> +    }
>>> +
>>> +    /* Copy the certificate blob in the guest memory */
>>> +    if (data_npages &&
>>> +        kvm_write_guest(kvm, data_gpa, sev->snp_certs_data,
>>> data_npages << PAGE_SHIFT))
>>> +        rc = SEV_RET_INVALID_ADDRESS;
>>> +
>>> +cleanup:
>>> +    snp_cleanup_guest_buf(&req, &rc);
>>> +
>>> +unlock:
>>> +    mutex_unlock(&sev->guest_req_lock);
>>> +
>>> +e_fail:
>>> +    svm_set_ghcb_sw_exit_info_2(vcpu, rc);
>>> +}
>>> +

2022-10-25 08:42:20

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 12/49] crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP

On Fri, Oct 14, 2022 at 04:09:11PM -0500, Kalra, Ashish wrote:
> Yes, we need to do:
>
> wbinvd_on_all_cpus();
> SNP_DF_FLUSH
>
> Need to ensure all the caches are clear before launching the first guest and
> this has to be a combination of WBINVD and SNP_DF_FLUSH command.

Ok.

> > Why isn't this retval checked?
>
> From the SNP FW ABI specs, for the SNP_SHUTDOWN command:
>
> Firmware checks for every encryption capable ASID that the ASID is not in
> use by a guest and a DF_FLUSH is not required. If a DF_FLUSH is required,
> the firmware returns DFFLUSH_REQUIRED.
>
> Considering that SNP_SHUTDOWN command will check if DF_FLUSH was
> required and if so, and not invoked before that command, returns
> an error indicating that DFFLUSH is required.
>
> This way, we can cleverly avoid taking the error code path for
> DF_FLUSH command here and instead let the SNP_SHUTDOWN command
> failure below indicate if DF_FLUSH command failed.
>
> This also ensures that we always invoke SNP_SHUTDOWN command,
> irrespective of SNP_DF_FLUSH command failure as SNP_DF_FLUSH may
> actually not be required by the SHUTDOWN command.

This all sounds just silly. The proper way to do this is:

retry:
	ret = __sev_do_cmd_locked(SEV_CMD_SNP_SHUTDOWN, NULL, error);
	if (ret == DFFLUSH_REQUIRED) {
		ret = __sev_do_cmd_locked(SEV_CMD_SNP_DF_FLUSH, NULL, NULL);
		if (ret)
			"... DF_FLUSH failed...";

		goto retry;
	}

I'm assuming here the firmware is smart enough to not keep returning
DFFLUSH_REQUIRED constantly and cause an endless loop.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-10-25 09:15:26

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 12/49] crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP

On Wed, Oct 19, 2022 at 01:48:48PM -0500, Kalra, Ashish wrote:
> I see that other drivers are also using the same convention:

It is only convention. Look at the .rst output:

0 if the SEV successfully processed the command
-``ENODEV`` if the SEV device is not available
-``ENOTSUPP`` if the SEV does not support SEV
-``ETIMEDOUT`` if the SEV command timed out
-``EIO`` if the SEV returned a non-zero return code

vs

0 if the SEV successfully processed the command
``-ENODEV`` if the SEV device is not available
``-ENOTSUPP`` if the SEV does not support SEV
``-ETIMEDOUT`` if the SEV command timed out
``-EIO`` if the SEV returned a non-zero return code

so in the html output of this, the minus sign will be displayed either
with text font or with monospaced font as part of the error type.

I wanna say the second is better as the '-' is part of the error code
but won't waste too much time debating this. :)
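
For reference, this is how the kernel-doc source would read with the
second form - only the placement of the '%' changes (wording kept as in
the patch):

/**
 * sev_snp_init - perform SEV SNP_INIT command
 *
 * @error: SEV command return code
 *
 * Returns:
 * 0 if the SEV successfully processed the command
 * %-ENODEV if the SEV device is not available
 * %-ENOTSUPP if the SEV does not support SEV
 * %-ETIMEDOUT if the SEV command timed out
 * %-EIO if the SEV returned a non-zero return code
 */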

Btw

$ ./scripts/kernel-doc include/linux/psp-sev.h

complains a lot. Might wanna fix those up when bored or someone else
who's reading this and feels bored too. :-)

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-10-25 10:41:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Fri, Oct 14, 2022 at 03:00:09PM -0500, Kalra, Ashish wrote:
> If it is "still" accessed/touched then it can cause RMP #PF.
> On the other hand,
>
> * PG_hwpoison... Accessing is
> * not safe since it may cause another machine check. Don't touch!
>
> That sounds exactly the state we want these page(s) to be in ?
>
> Another possibility is PG_error.

Something like this:

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e66f7aa3191d..baffa9c0dc30 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -186,6 +186,7 @@ enum pageflags {
* THP.
*/
PG_has_hwpoisoned = PG_error,
+ PG_offlimits = PG_hwpoison,
#endif

/* non-lru isolated movable page */

and SNP will have to depend on CONFIG_MEMORY_FAILURE.

But I'd let mm folks correct me here on the details.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-10-31 20:13:55

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

Just to add here, writing to any of these pages from the Host
will trigger a RMP #PF which will cause the RMP page fault handler
to send a SIGBUS to the current process, as this page is not owned
by Host.

So calling memory_failure() is proactively doing the same,
marking the page as poisoned and probably also killing the
current process.

Thanks,
Ashish

On 10/25/2022 5:25 AM, Borislav Petkov wrote:
> On Fri, Oct 14, 2022 at 03:00:09PM -0500, Kalra, Ashish wrote:
>> If it is "still" accessed/touched then it can cause RMP #PF.
>> On the other hand,
>>
>> * PG_hwpoison... Accessing is
>> * not safe since it may cause another machine check. Don't touch!
>>
>> That sounds exactly the state we want these page(s) to be in ?
>>
>> Another possibility is PG_error.
>
> Something like this:
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index e66f7aa3191d..baffa9c0dc30 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -186,6 +186,7 @@ enum pageflags {
> * THP.
> */
> PG_has_hwpoisoned = PG_error,
> + PG_offlimits = PG_hwpoison,
> #endif
>
> /* non-lru isolated movable page */
>
> and SNP will have to depend on CONFIG_MEMORY_FAILURE.
>
> But I'd let mm folks correct me here on the details.
>

2022-10-31 21:16:40

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Mon, Oct 31, 2022 at 03:10:16PM -0500, Kalra, Ashish wrote:
> Just to add here, writing to any of these pages from the Host
> will trigger a RMP #PF which will cause the RMP page fault handler
> to send a SIGBUS to the current process, as this page is not owned
> by Host.

And kill the host process?

So this is another "policy" which sounds iffy. If we kill the process,
we should at least say why. Are we doing that currently?

> So calling memory_failure() is proactively doing the same, marking the
> page as poisoned and probably also killing the current process.

But the page is not suffering a memory failure - it cannot be reclaimed
for whatever reason. Btw, how can that reclaim failure ever happen? Any
real scenarios?

Anyway, memory failure just happens to fit what you wanna do but you
can't just reuse that - that's hacky. What is the problem with writing
your own function which does that?

Also, btw, please do not top-post.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-10-31 22:06:16

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

Hello Boris,

On 10/31/2022 4:15 PM, Borislav Petkov wrote:
> On Mon, Oct 31, 2022 at 03:10:16PM -0500, Kalra, Ashish wrote:
>> Just to add here, writing to any of these pages from the Host
>> will trigger a RMP #PF which will cause the RMP page fault handler
>> to send a SIGBUS to the current process, as this page is not owned
>> by Host.
>
> And kill the host process?
>
> So this is another "policy" which sounds iffy. If we kill the process,
> we should at least say why. Are we doing that currently?

Yes, pasted below is the latest host RMP #PF handler, with new and
additional comments added; the comment relevant to this behavior is
included here:

static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
				      unsigned long address)
{
...
...

	/*
	 * If its a guest private page, then the fault cannot be resolved.
	 * Send a SIGBUS to terminate the process.
	 *
	 * As documented in APM vol3 pseudo-code for RMPUPDATE, when the
	 * 2M range is covered by a valid (Assigned=1) 2M entry, the middle
	 * 511 4k entries also have Assigned=1. This means that if there is
	 * an access to a page which happens to lie within an Assigned 2M
	 * entry, the 4k RMP entry will also have Assigned=1. Therefore, the
	 * kernel should see that the page is not a valid page and the fault
	 * cannot be resolved.
	 */
	if (snp_lookup_rmpentry(pfn, &rmp_level)) {
		do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
		return RMP_PF_RETRY;
	}
...
...

I believe we already had an off-list discussion on this; copying
David Kaplan's reply below:

So what I think you want to do is:

1. Compute the pfn for the 4kb page you're trying to access (as your
   code below does)
2. Read that RMP entry -- if it is assigned then kill the process
3. Otherwise, check the level from the host page table. If
   level=PG_LEVEL_4K then somebody else may have already smashed this
   page, so just retry the instruction
4. If level=PG_LEVEL_2M/1G, then the host needs to split their page.

This is the current algorithm being followed by the host RMP #PF handler.
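
For completeness, a rough sketch of how steps 3 and 4 above could look in
the handler (helper and return-code names below are illustrative, not
necessarily the ones used in the actual series):

	pte_t *pte;
	unsigned int level;

	/* Step 3: check how the host currently maps the faulting address */
	pte = lookup_address(address, &level);
	if (!pte || pte_none(*pte))
		return RMP_PF_RETRY;

	if (level == PG_LEVEL_4K)
		return RMP_PF_RETRY;	/* already smashed, just retry */

	/* Step 4: 2M/1G host mapping over a 4K RMP entry - split it */
	if (split_host_mapping(address))	/* hypothetical helper */
		return RMP_PF_SPLIT_FAIL;	/* hypothetical return code */

	return RMP_PF_RETRY;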

>
>> So calling memory_failure() is proactively doing the same, marking the
>> page as poisoned and probably also killing the current process.
>
> But the page is not suffering a memory failure - it cannot be reclaimed
> for whatever reason. Btw, how can that reclaim failure ever happen? Any
> real scenarios?

The scenarios here are either an SNP firmware failure (SNP_PAGE_RECLAIM
command) while transitioning the page back to HV state, or an RMPUPDATE
instruction failure while transitioning the page back to hypervisor/shared
state.

>
> Anyway, memory failure just happens to fit what you wanna do but you
> can't just reuse that - that's hacky. What is the problem with writing
> your own function which does that?
>

Ok.

Will look at adding our own recovery function for the same, but that
will again mark the pages as poisoned, right ?

Still waiting for some/more feedback from mm folks on the same.

Thanks,
Ashish

> Also, btw, please do not top-post.
>
> Thx.
>

2022-11-02 03:18:08

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 07/49] x86/sev: Invalid pages from direct map when adding it to RMP table

Hello Boris,

On 8/4/2022 7:11 AM, Borislav Petkov wrote:
> On Mon, Aug 01, 2022 at 11:57:09PM +0000, Kalra, Ashish wrote:
>> You mean set_memory_present() ?
>
> Right, that.
>
> We have set_memory_np() but set_memory_present(). Talk about
> consistence... ;-\

Following up on this: set_memory_present() is currently a static
interface, so we will need to add a new external API like set_memory_p(),
similar to set_memory_np().

Also, looking at arch/x86/include/asm/set_memory.h:
..
/*
* The set_memory_* API can be used to change various attributes of a
* virtual address range. The attributes include:
* Cacheability : UnCached, WriteCombining, WriteThrough, WriteBack
* Executability : eXecutable, NoteXecutable
* Read/Write : ReadOnly, ReadWrite
* Presence : NotPresent
* Encryption : Encrypted, Decrypted
*
..
int set_memory_np(unsigned long addr, int numpages);
..

So currently there is no interface defined for changing the attribute of
a range to present or restoring the range in the direct map.
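
For illustration, a minimal sketch of what such an external helper could
look like, mirroring set_memory_np() (the change_page_attr_set() call
follows the existing pattern in arch/x86/mm/pat/set_memory.c, but treat
this as a sketch rather than the final patch):

int set_memory_p(unsigned long addr, int numpages)
{
	return change_page_attr_set(&addr, numpages, __pgprot(_PAGE_PRESENT), 0);
}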

Thanks,
Ashish

2022-11-02 11:39:10

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Mon, Oct 31, 2022 at 04:58:38PM -0500, Kalra, Ashish wrote:
> if (snp_lookup_rmpentry(pfn, &rmp_level)) {
> do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
> return RMP_PF_RETRY;

Does this issue some halfway understandable error message why the
process got killed?

> Will look at adding our own recovery function for the same, but that will
> again mark the pages as poisoned, right ?

Well, not poisoned but PG_offlimits or whatever the mm folks agree upon.
Semantically, it'll be handled the same way, ofc.

> Still waiting for some/more feedback from mm folks on the same.

Just send the patch and they'll give it.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-11-14 23:41:51

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

Hello Boris,

On 11/2/2022 6:22 AM, Borislav Petkov wrote:
> On Mon, Oct 31, 2022 at 04:58:38PM -0500, Kalra, Ashish wrote:
>> if (snp_lookup_rmpentry(pfn, &rmp_level)) {
>> do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
>> return RMP_PF_RETRY;
>
> Does this issue some halfway understandable error message why the
> process got killed?
>
>> Will look at adding our own recovery function for the same, but that will
>> again mark the pages as poisoned, right ?
>
> Well, not poisoned but PG_offlimits or whatever the mm folks agree upon.
> Semantically, it'll be handled the same way, ofc.

Added a new PG_offlimits flag and a simple corresponding handler for it.

But there is still added complexity of handling hugepages as part of
reclamation failures (both HugeTLB and transparent hugepages) and that
means calling more static functions in mm/memory_failure.c

There is probably a more appropriate handler in mm/memory-failure.c:

soft_offline_page() - this will mark the page as HWPoisoned and also has
handling for hugepages. And we can avoid adding a new page flag too.

soft_offline_page - Soft offline a page.
Soft offline a page, by migration or invalidation, without killing anything.

So, this looks like a good option to call
soft_offline_page() instead of memory_failure() in case of
failure to transition the page back to HV/shared state via
SNP_RECLAIM_CMD and/or RMPUPDATE instruction.
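
A sketch of how that might look in the reclaim-failure path (the
surrounding condition and variables here are illustrative):

	/*
	 * Reclaim failed: the page cannot safely be returned to the page
	 * allocator, so soft-offline it instead of calling memory_failure().
	 */
	if (snp_reclaim_failed) {
		dev_warn(sev->dev, "failed to reclaim pfn 0x%lx, offlining it\n", pfn);
		soft_offline_page(pfn, 0);
	}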

Thanks,
Ashish

>
>> Still waiting for some/more feedback from mm folks on the same.
>
> Just send the patch and they'll give it.
>
> Thx.
>

2022-11-15 14:29:46

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Mon, Nov 14, 2022 at 05:36:29PM -0600, Kalra, Ashish wrote:
> But there is still added complexity of handling hugepages as part of
> reclamation failures (both HugeTLB and transparent hugepages) and that

Why?

You want to offline pfns of 4K pages. What hugepages?

> means calling more static functions in mm/memory_failure.c
>
> There is probably a more appropriate handler in mm/memory-failure.c:
>
> soft_offline_page() - this will mark the page as HWPoisoned and also has
> handling for hugepages. And we can avoid adding a new page flag too.

So if some other code wants to dump the amount of all hwpoisoned pages,
it'll dump those too.

Don't you see what is wrong with this picture?

And btw, reusing the hwpoison flag

PG_offlimits = PG_hwpoison

like previously suggested doesn't help here either.

IOW, I really don't like this lumping of semantics together. ;-\

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-11-15 15:28:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Tue, Nov 15, 2022 at 04:14:42PM +0100, Vlastimil Babka wrote:
> but maybe we could just put the pages on some leaked lists without
> special page? The only thing that should matter is not to free the
> pages to the page allocator so they would be reused by something else.

As said on IRC, I like this a *lot*. This perfectly represents what
those leaked pages are: leaked, cannot be used and lost. Certainly not
hwpoisoned.

Yeah, that's much better.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-11-15 15:30:27

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

Cc'ing memory failure folks, the beginning of this subthread is here:

https://lore.kernel.org/all/3a51840f6a80c87b39632dc728dbd9b5dd444cd7.1655761627.git.ashish.kalra@amd.com/

On 11/15/22 00:36, Kalra, Ashish wrote:
> Hello Boris,
>
> On 11/2/2022 6:22 AM, Borislav Petkov wrote:
>> On Mon, Oct 31, 2022 at 04:58:38PM -0500, Kalra, Ashish wrote:
>>>       if (snp_lookup_rmpentry(pfn, &rmp_level)) {
>>>              do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
>>>              return RMP_PF_RETRY;
>>
>> Does this issue some halfway understandable error message why the
>> process got killed?
>>
>>> Will look at adding our own recovery function for the same, but that will
>>> again mark the pages as poisoned, right ?
>>
>> Well, not poisoned but PG_offlimits or whatever the mm folks agree upon.
>> Semantically, it'll be handled the same way, ofc.
>
> Added a new PG_offlimits flag and a simple corresponding handler for it.

One thing is, there's not enough page flags to be adding more (except
aliases for existing) for cases that can avoid it, but as Boris says, if
using alias to PG_hwpoison it depends what will become confused with the
actual hwpoison.

> But there is still added complexity of handling hugepages as part of
> reclamation failures (both HugeTLB and transparent hugepages) and that
> means calling more static functions in mm/memory_failure.c
>
> There is probably a more appropriate handler in mm/memory-failure.c:
>
> soft_offline_page() - this will mark the page as HWPoisoned and also has
> handling for hugepages. And we can avoid adding a new page flag too.
>
> soft_offline_page - Soft offline a page.
> Soft offline a page, by migration or invalidation, without killing anything.
>
> So, this looks like a good option to call
> soft_offline_page() instead of memory_failure() in case of
> failure to transition the page back to HV/shared state via SNP_RECLAIM_CMD
> and/or RMPUPDATE instruction.

So it's a bit unclear to me what exact situation we are handling here. The
original patch here seems to me to be just leaking back pages that are
unsafe for further use. soft_offline_page() seems to fit that scenario of a
graceful leak before something is irreparably corrupt and we page fault on it.
But then in the thread you discuss PF handling and killing. So what is the
case here? If we detect this need to call snp_leak_pages(), does it mean:

a) nobody that could page fault at them (the guest?) is running anymore, we
are tearing it down, we just can't reuse the pages further on the host
- seem like soft_offline_page() could work, but maybe we could just put the
pages on some leaked lists without special page? The only thing that should
matter is not to free the pages to the page allocator so they would be
reused by something else.

b) something can stil page fault at them (what?) - AFAIU can't be resolved
without killing something, memory_failure() might limit the damage

> Thanks,
> Ashish
>
>>
>>> Still waiting for some/more feedback from mm folks on the same.
>>
>> Just send the patch and they'll give it.
>>
>> Thx.
>>


2022-11-15 16:28:09

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

And,

as dhansen connected the dots, this should be the exact same protection
scenario as UPM:

https://lore.kernel.org/all/[email protected]

so you should be able to mark them inaccessible the same way and you
won't need any poisoning dance.

And Michael has patches so you probably should talk to him...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-11-15 17:25:29

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

Hello Vlastimil,

On 11/15/2022 9:14 AM, Vlastimil Babka wrote:
> Cc'ing memory failure folks, the beinning of this subthread is here:
>
> https://lore.kernel.org/all/3a51840f6a80c87b39632dc728dbd9b5dd444cd7.1655761627.git.ashish.kalra@amd.com/
>
> On 11/15/22 00:36, Kalra, Ashish wrote:
>> Hello Boris,
>>
>> On 11/2/2022 6:22 AM, Borislav Petkov wrote:
>>> On Mon, Oct 31, 2022 at 04:58:38PM -0500, Kalra, Ashish wrote:
>>>>       if (snp_lookup_rmpentry(pfn, &rmp_level)) {
>>>>              do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
>>>>              return RMP_PF_RETRY;
>>>
>>> Does this issue some halfway understandable error message why the
>>> process got killed?
>>>
>>>> Will look at adding our own recovery function for the same, but that will
>>>> again mark the pages as poisoned, right ?
>>>
>>> Well, not poisoned but PG_offlimits or whatever the mm folks agree upon.
>>> Semantically, it'll be handled the same way, ofc.
>>
>> Added a new PG_offlimits flag and a simple corresponding handler for it.
>
> One thing is, there's not enough page flags to be adding more (except
> aliases for existing) for cases that can avoid it, but as Boris says, if
> using alias to PG_hwpoison it depends what will become confused with the
> actual hwpoison.
>
>> But there is still added complexity of handling hugepages as part of
>> reclamation failures (both HugeTLB and transparent hugepages) and that
>> means calling more static functions in mm/memory_failure.c
>>
>> There is probably a more appropriate handler in mm/memory-failure.c:
>>
>> soft_offline_page() - this will mark the page as HWPoisoned and also has
>> handling for hugepages. And we can avoid adding a new page flag too.
>>
>> soft_offline_page - Soft offline a page.
>> Soft offline a page, by migration or invalidation, without killing anything.
>>
>> So, this looks like a good option to call
>> soft_offline_page() instead of memory_failure() in case of
>> failure to transition the page back to HV/shared state via SNP_RECLAIM_CMD
>> and/or RMPUPDATE instruction.
>
> So it's a bit unclear to me what exact situation we are handling here. The
> original patch here seems to me to be just leaking back pages that are
> unsafe for further use. soft_offline_page() seems to fit that scenario of a
> graceful leak before something is irrepairably corrupt and we page fault on it.
> But then in the thread you discus PF handling and killing. So what is the
> case here? If we detect this need to call snp_leak_pages() does it mean:
>
> a) nobody that could page fault at them (the guest?) is running anymore, we
> are tearing it down, we just can't reuse the pages further on the host

The host can page fault on them, if anything on the host tries to write
to these pages. Host reads will return garbage data.

> - seem like soft_offline_page() could work, but maybe we could just put the
> pages on some leaked lists without special page? The only thing that should
> matter is not to free the pages to the page allocator so they would be
> reused by something else.
>
> b) something can stil page fault at them (what?) - AFAIU can't be resolved
> without killing something, memory_failure() might limit the damage

As i mentioned above, host writes will cause RMP violation page fault.

Thanks,
Ashish

>>
>>>
>>>> Still waiting for some/more feedback from mm folks on the same.
>>>
>>> Just send the patch and they'll give it.
>>>
>>> Thx.
>>>
>

2022-11-15 18:19:30

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled


On 11/15/2022 11:24 AM, Kalra, Ashish wrote:
> Hello Vlastimil,
>
> On 11/15/2022 9:14 AM, Vlastimil Babka wrote:
>> Cc'ing memory failure folks, the beinning of this subthread is here:
>>
>> https://lore.kernel.org/all/3a51840f6a80c87b39632dc728dbd9b5dd444cd7.1655761627.git.ashish.kalra@amd.com/
>>
>>
>> On 11/15/22 00:36, Kalra, Ashish wrote:
>>> Hello Boris,
>>>
>>> On 11/2/2022 6:22 AM, Borislav Petkov wrote:
>>>> On Mon, Oct 31, 2022 at 04:58:38PM -0500, Kalra, Ashish wrote:
>>>>>        if (snp_lookup_rmpentry(pfn, &rmp_level)) {
>>>>>               do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
>>>>>               return RMP_PF_RETRY;
>>>>
>>>> Does this issue some halfway understandable error message why the
>>>> process got killed?
>>>>
>>>>> Will look at adding our own recovery function for the same, but
>>>>> that will
>>>>> again mark the pages as poisoned, right ?
>>>>
>>>> Well, not poisoned but PG_offlimits or whatever the mm folks agree
>>>> upon.
>>>> Semantically, it'll be handled the same way, ofc.
>>>
>>> Added a new PG_offlimits flag and a simple corresponding handler for it.
>>
>> One thing is, there's not enough page flags to be adding more (except
>> aliases for existing) for cases that can avoid it, but as Boris says, if
>> using alias to PG_hwpoison it depends what will become confused with the
>> actual hwpoison.
>>
>>> But there is still added complexity of handling hugepages as part of
>>> reclamation failures (both HugeTLB and transparent hugepages) and that
>>> means calling more static functions in mm/memory_failure.c
>>>
>>> There is probably a more appropriate handler in mm/memory-failure.c:
>>>
>>> soft_offline_page() - this will mark the page as HWPoisoned and also has
>>> handling for hugepages. And we can avoid adding a new page flag too.
>>>
>>> soft_offline_page - Soft offline a page.
>>> Soft offline a page, by migration or invalidation, without killing
>>> anything.
>>>
>>> So, this looks like a good option to call
>>> soft_offline_page() instead of memory_failure() in case of
>>> failure to transition the page back to HV/shared state via
>>> SNP_RECLAIM_CMD
>>> and/or RMPUPDATE instruction.
>>
>> So it's a bit unclear to me what exact situation we are handling here.
>> The
>> original patch here seems to me to be just leaking back pages that are
>> unsafe for further use. soft_offline_page() seems to fit that scenario
>> of a
>> graceful leak before something is irrepairably corrupt and we page
>> fault on it.
>> But then in the thread you discus PF handling and killing. So what is the
>> case here? If we detect this need to call snp_leak_pages() does it mean:
>>
>> a) nobody that could page fault at them (the guest?) is running
>> anymore, we
>> are tearing it down, we just can't reuse the pages further on the host
>
> The host can page fault on them, if anything on the host tries to write
> to these pages. Host reads will return garbage data.
>
>> - seem like soft_offline_page() could work, but maybe we could just
>> put the
>> pages on some leaked lists without special page? The only thing that
>> should
>> matter is not to free the pages to the page allocator so they would be
>> reused by something else.
>>
>> b) something can stil page fault at them (what?) - AFAIU can't be
>> resolved
>> without killing something, memory_failure() might limit the damage
>
> As i mentioned above, host writes will cause RMP violation page fault.
>

And to add here, if it's a guest private page, then the above fault
cannot be resolved, so the faulting process is terminated.

Thanks,
Ashish

>
>>>
>>>>
>>>>> Still waiting for some/more feedback from mm folks on the same.
>>>>
>>>> Just send the patch and they'll give it.
>>>>
>>>> Thx.
>>>>
>>

2022-11-15 22:47:36

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

Hello Boris,

On 11/15/2022 10:27 AM, Borislav Petkov wrote:
> And,
>
> as dhansen connected the dots, this should be the exact same protection
> scenario as UPM:
>
> https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@linux.intel.com
>
> so you should be able to mark them inaccessible the same way and you
> won't need any poisoning dance.

With UPM, the guest pages are all still freed back to the host after
guest shutdown, so it's not clear how this would help with handling of
leaked pages; e.g., the host can still access these pages once the
guest is shut down, and that will cause the RMP violation #PF at that
point.

Additionally, our use case is host-allocated firmware pages as part
of the crypto driver (to be passed to SNP firmware API calls and then
re-transitioned back to host state on return), so these are not guest
private pages in the true sense, and they need to be handled differently
in case there is a failure in reclaiming them.

Can you elaborate on what you have in mind ?

Thanks,
Ashish

>
> And Michael has patches so you probably should talk to him...
>
> Thx.
>

Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Tue, Nov 15, 2022 at 04:14:42PM +0100, Vlastimil Babka wrote:
> Cc'ing memory failure folks, the beinning of this subthread is here:
>
> https://lore.kernel.org/all/3a51840f6a80c87b39632dc728dbd9b5dd444cd7.1655761627.git.ashish.kalra@amd.com/
>
> On 11/15/22 00:36, Kalra, Ashish wrote:
> > Hello Boris,
> >
> > On 11/2/2022 6:22 AM, Borislav Petkov wrote:
> >> On Mon, Oct 31, 2022 at 04:58:38PM -0500, Kalra, Ashish wrote:
> >>>       if (snp_lookup_rmpentry(pfn, &rmp_level)) {
> >>>              do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
> >>>              return RMP_PF_RETRY;
> >>
> >> Does this issue some halfway understandable error message why the
> >> process got killed?
> >>
> >>> Will look at adding our own recovery function for the same, but that will
> >>> again mark the pages as poisoned, right ?
> >>
> >> Well, not poisoned but PG_offlimits or whatever the mm folks agree upon.
> >> Semantically, it'll be handled the same way, ofc.
> >
> > Added a new PG_offlimits flag and a simple corresponding handler for it.
>
> One thing is, there's not enough page flags to be adding more (except
> aliases for existing) for cases that can avoid it, but as Boris says, if
> using alias to PG_hwpoison it depends what will become confused with the
> actual hwpoison.

I agree with this. Just defining PG_offlimits as an alias of PG_hwpoison
could break current hwpoison workloads. So if you finally decide to go
forward in this direction, you may as well have some indicator to
distinguish the new kind of leaked pages from hwpoisoned pages.

I don't remember the exact thread, but I've read a similar suggestion of
using memory_failure() to make pages inaccessible in a non-memory-error
use case. I feel that it could be possible to generalize memory_failure()
into general-purpose page offlining (by renaming it to hard_offline_page()
and making memory_failure() one of its users).

Thanks,
Naoya Horiguchi

2022-11-16 09:22:54

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On 11/15/22 19:15, Kalra, Ashish wrote:
>
> On 11/15/2022 11:24 AM, Kalra, Ashish wrote:
>> Hello Vlastimil,
>>
>> On 11/15/2022 9:14 AM, Vlastimil Babka wrote:
>>> Cc'ing memory failure folks, the beinning of this subthread is here:
>>>
>>> https://lore.kernel.org/all/3a51840f6a80c87b39632dc728dbd9b5dd444cd7.1655761627.git.ashish.kalra@amd.com/
>>>
>>> On 11/15/22 00:36, Kalra, Ashish wrote:
>>>> Hello Boris,
>>>>
>>>> On 11/2/2022 6:22 AM, Borislav Petkov wrote:
>>>>> On Mon, Oct 31, 2022 at 04:58:38PM -0500, Kalra, Ashish wrote:
>>>>>>        if (snp_lookup_rmpentry(pfn, &rmp_level)) {
>>>>>>               do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
>>>>>>               return RMP_PF_RETRY;
>>>>>
>>>>> Does this issue some halfway understandable error message why the
>>>>> process got killed?
>>>>>
>>>>>> Will look at adding our own recovery function for the same, but that will
>>>>>> again mark the pages as poisoned, right ?
>>>>>
>>>>> Well, not poisoned but PG_offlimits or whatever the mm folks agree upon.
>>>>> Semantically, it'll be handled the same way, ofc.
>>>>
>>>> Added a new PG_offlimits flag and a simple corresponding handler for it.
>>>
>>> One thing is, there's not enough page flags to be adding more (except
>>> aliases for existing) for cases that can avoid it, but as Boris says, if
>>> using alias to PG_hwpoison it depends what will become confused with the
>>> actual hwpoison.
>>>
>>>> But there is still added complexity of handling hugepages as part of
>>>> reclamation failures (both HugeTLB and transparent hugepages) and that
>>>> means calling more static functions in mm/memory_failure.c
>>>>
>>>> There is probably a more appropriate handler in mm/memory-failure.c:
>>>>
>>>> soft_offline_page() - this will mark the page as HWPoisoned and also has
>>>> handling for hugepages. And we can avoid adding a new page flag too.
>>>>
>>>> soft_offline_page - Soft offline a page.
>>>> Soft offline a page, by migration or invalidation, without killing
>>>> anything.
>>>>
>>>> So, this looks like a good option to call
>>>> soft_offline_page() instead of memory_failure() in case of
>>>> failure to transition the page back to HV/shared state via SNP_RECLAIM_CMD
>>>> and/or RMPUPDATE instruction.
>>>
>>> So it's a bit unclear to me what exact situation we are handling here. The
>>> original patch here seems to me to be just leaking back pages that are
>>> unsafe for further use. soft_offline_page() seems to fit that scenario of a
>>> graceful leak before something is irrepairably corrupt and we page fault
>>> on it.
>>> But then in the thread you discus PF handling and killing. So what is the
>>> case here? If we detect this need to call snp_leak_pages() does it mean:
>>>
>>> a) nobody that could page fault at them (the guest?) is running anymore, we
>>> are tearing it down, we just can't reuse the pages further on the host
>>
>> The host can page fault on them, if anything on the host tries to write to
>> these pages. Host reads will return garbage data.
>>
>>> - seem like soft_offline_page() could work, but maybe we could just put the
>>> pages on some leaked lists without special page? The only thing that should
>>> matter is not to free the pages to the page allocator so they would be
>>> reused by something else.
>>>
>>> b) something can stil page fault at them (what?) - AFAIU can't be resolved
>>> without killing something, memory_failure() might limit the damage
>>
>> As i mentioned above, host writes will cause RMP violation page fault.
>>
>
> And to add here, if its a guest private page, then the above fault cannot be
> resolved, so the faulting process is terminated.

BTW would this not be mostly resolved as part of rebasing to UPM?
- host will not have these pages mapped in the first place (both kernel
directmap and qemu userspace)
- guest will have them mapped, but I assume that the conversion from private
to shared (that might fail?) can only happen after guest's mappings are
invalidated in the first place?

> Thanks,
> Ashish
>
>>
>>>>
>>>>>
>>>>>> Still waiting for some/more feedback from mm folks on the same.
>>>>>
>>>>> Just send the patch and they'll give it.
>>>>>
>>>>> Thx.
>>>>>
>>>


2022-11-16 10:20:48

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On 11/16/2022 3:08 AM, Vlastimil Babka wrote:
> On 11/15/22 19:15, Kalra, Ashish wrote:
>>
>> On 11/15/2022 11:24 AM, Kalra, Ashish wrote:
>>> Hello Vlastimil,
>>>
>>> On 11/15/2022 9:14 AM, Vlastimil Babka wrote:
>>>> Cc'ing memory failure folks, the beinning of this subthread is here:
>>>>
>>>> https://lore.kernel.org/all/3a51840f6a80c87b39632dc728dbd9b5dd444cd7.1655761627.git.ashish.kalra@amd.com/
>>>>
>>>> On 11/15/22 00:36, Kalra, Ashish wrote:
>>>>> Hello Boris,
>>>>>
>>>>> On 11/2/2022 6:22 AM, Borislav Petkov wrote:
>>>>>> On Mon, Oct 31, 2022 at 04:58:38PM -0500, Kalra, Ashish wrote:
>>>>>>>        if (snp_lookup_rmpentry(pfn, &rmp_level)) {
>>>>>>>               do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
>>>>>>>               return RMP_PF_RETRY;
>>>>>>
>>>>>> Does this issue some halfway understandable error message why the
>>>>>> process got killed?
>>>>>>
>>>>>>> Will look at adding our own recovery function for the same, but that will
>>>>>>> again mark the pages as poisoned, right ?
>>>>>>
>>>>>> Well, not poisoned but PG_offlimits or whatever the mm folks agree upon.
>>>>>> Semantically, it'll be handled the same way, ofc.
>>>>>
>>>>> Added a new PG_offlimits flag and a simple corresponding handler for it.
>>>>
>>>> One thing is, there's not enough page flags to be adding more (except
>>>> aliases for existing) for cases that can avoid it, but as Boris says, if
>>>> using alias to PG_hwpoison it depends what will become confused with the
>>>> actual hwpoison.
>>>>
>>>>> But there is still added complexity of handling hugepages as part of
>>>>> reclamation failures (both HugeTLB and transparent hugepages) and that
>>>>> means calling more static functions in mm/memory_failure.c
>>>>>
>>>>> There is probably a more appropriate handler in mm/memory-failure.c:
>>>>>
>>>>> soft_offline_page() - this will mark the page as HWPoisoned and also has
>>>>> handling for hugepages. And we can avoid adding a new page flag too.
>>>>>
>>>>> soft_offline_page - Soft offline a page.
>>>>> Soft offline a page, by migration or invalidation, without killing
>>>>> anything.
>>>>>
>>>>> So, this looks like a good option to call
>>>>> soft_offline_page() instead of memory_failure() in case of
>>>>> failure to transition the page back to HV/shared state via SNP_RECLAIM_CMD
>>>>> and/or RMPUPDATE instruction.
>>>>
>>>> So it's a bit unclear to me what exact situation we are handling here. The
>>>> original patch here seems to me to be just leaking back pages that are
>>>> unsafe for further use. soft_offline_page() seems to fit that scenario of a
>>>> graceful leak before something is irrepairably corrupt and we page fault
>>>> on it.
>>>> But then in the thread you discus PF handling and killing. So what is the
>>>> case here? If we detect this need to call snp_leak_pages() does it mean:
>>>>
>>>> a) nobody that could page fault at them (the guest?) is running anymore, we
>>>> are tearing it down, we just can't reuse the pages further on the host
>>>
>>> The host can page fault on them, if anything on the host tries to write to
>>> these pages. Host reads will return garbage data.
>>>
>>>> - seem like soft_offline_page() could work, but maybe we could just put the
>>>> pages on some leaked lists without special page? The only thing that should
>>>> matter is not to free the pages to the page allocator so they would be
>>>> reused by something else.
>>>>
>>>> b) something can stil page fault at them (what?) - AFAIU can't be resolved
>>>> without killing something, memory_failure() might limit the damage
>>>
>>> As i mentioned above, host writes will cause RMP violation page fault.
>>>
>>
>> And to add here, if its a guest private page, then the above fault cannot be
>> resolved, so the faulting process is terminated.
>
> BTW would this not be mostly resolved as part of rebasing to UPM?
> - host will not have these pages mapped in the first place (both kernel
> directmap and qemu userspace)
> - guest will have them mapped, but I assume that the conversion from private
> to shared (that might fail?) can only happen after guest's mappings are
> invalidated in the first place?
>

Yes, that will be true for guest private pages. But then there are
host-allocated pages for firmware use which will remain in firmware page
state or reclaim state if they can't be transitioned back to HV/shared
state once the firmware releases them back to the host, and accessing
them at that point can potentially cause an RMP violation #PF.

Again, I don't think this is going to happen regularly or frequently, so
it will be a rare error case where the page reclamation, i.e., the
transition back to HV/shared state, fails and these pages are no longer
safe to be used.

Referring back to your thoughts about putting these pages on some leaked
pages list, does any such leaked pages list exist currently?
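
If not, a minimal sketch of what such a list could look like (names and
locking below are illustrative, not an existing kernel interface):

static LIST_HEAD(snp_leaked_pages_list);
static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);

void snp_leak_pages(unsigned long pfn, unsigned int npages)
{
	struct page *page = pfn_to_page(pfn);

	pr_warn("SEV-SNP: leaking PFN range 0x%lx-0x%lx\n", pfn, pfn + npages);

	spin_lock(&snp_leaked_pages_list_lock);
	while (npages--) {
		/*
		 * Hold an extra reference and park the page on the leaked
		 * list so it can never go back to the page allocator.
		 */
		get_page(page);
		list_add_tail(&page->lru, &snp_leaked_pages_list);
		page++;
	}
	spin_unlock(&snp_leaked_pages_list_lock);
}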

Thanks,
Ashish

>>>
>>>>>
>>>>>>
>>>>>>> Still waiting for some/more feedback from mm folks on the same.
>>>>>>
>>>>>> Just send the patch and they'll give it.
>>>>>>
>>>>>> Thx.
>>>>>>
>>>>
>

2022-11-16 10:35:32

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On 11/16/22 11:19, Kalra, Ashish wrote:
> On 11/16/2022 3:08 AM, Vlastimil Babka wrote:
>> On 11/15/22 19:15, Kalra, Ashish wrote:
>>>
>>> On 11/15/2022 11:24 AM, Kalra, Ashish wrote:
>>>> Hello Vlastimil,
>>>>
>>>> On 11/15/2022 9:14 AM, Vlastimil Babka wrote:
>>>>> Cc'ing memory failure folks, the beginning of this subthread is here:
>>>>>
>>>>> https://lore.kernel.org/all/3a51840f6a80c87b39632dc728dbd9b5dd444cd7.1655761627.git.ashish.kalra@amd.com/
>>>>>
>>>>> On 11/15/22 00:36, Kalra, Ashish wrote:
>>>>>> Hello Boris,
>>>>>>
>>>>>> On 11/2/2022 6:22 AM, Borislav Petkov wrote:
>>>>>>> On Mon, Oct 31, 2022 at 04:58:38PM -0500, Kalra, Ashish wrote:
>>>>>>>>         if (snp_lookup_rmpentry(pfn, &rmp_level)) {
>>>>>>>>                do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
>>>>>>>>                return RMP_PF_RETRY;
>>>>>>>
>>>>>>> Does this issue some halfway understandable error message why the
>>>>>>> process got killed?
>>>>>>>
>>>>>>>> Will look at adding our own recovery function for the same, but that
>>>>>>>> will
>>>>>>>> again mark the pages as poisoned, right ?
>>>>>>>
>>>>>>> Well, not poisoned but PG_offlimits or whatever the mm folks agree upon.
>>>>>>> Semantically, it'll be handled the same way, ofc.
>>>>>>
>>>>>> Added a new PG_offlimits flag and a simple corresponding handler for it.
>>>>>
>>>>> One thing is, there's not enough page flags to be adding more (except
>>>>> aliases for existing) for cases that can avoid it, but as Boris says, if
>>>>> using alias to PG_hwpoison it depends what will become confused with the
>>>>> actual hwpoison.
>>>>>
>>>>>> But there is still added complexity of handling hugepages as part of
>>>>>> reclamation failures (both HugeTLB and transparent hugepages) and that
>>>>>> means calling more static functions in mm/memory_failure.c
>>>>>>
>>>>>> There is probably a more appropriate handler in mm/memory-failure.c:
>>>>>>
>>>>>> soft_offline_page() - this will mark the page as HWPoisoned and also has
>>>>>> handling for hugepages. And we can avoid adding a new page flag too.
>>>>>>
>>>>>> soft_offline_page - Soft offline a page.
>>>>>> Soft offline a page, by migration or invalidation, without killing
>>>>>> anything.
>>>>>>
>>>>>> So, this looks like a good option to call
>>>>>> soft_offline_page() instead of memory_failure() in case of
>>>>>> failure to transition the page back to HV/shared state via
>>>>>> SNP_RECLAIM_CMD
>>>>>> and/or RMPUPDATE instruction.
>>>>>
>>>>> So it's a bit unclear to me what exact situation we are handling here. The
>>>>> original patch here seems to me to be just leaking back pages that are
>>>>> unsafe for further use. soft_offline_page() seems to fit that scenario
>>>>> of a
>>>>> graceful leak before something is irreparably corrupt and we page fault
>>>>> on it.
>>>>> But then in the thread you discuss PF handling and killing. So what is the
>>>>> case here? If we detect this need to call snp_leak_pages() does it mean:
>>>>>
>>>>> a) nobody that could page fault at them (the guest?) is running
>>>>> anymore, we
>>>>> are tearing it down, we just can't reuse the pages further on the host
>>>>
>>>> The host can page fault on them, if anything on the host tries to write to
>>>> these pages. Host reads will return garbage data.
>>>>
>>>>> - seem like soft_offline_page() could work, but maybe we could just put
>>>>> the
>>>>> pages on some leaked lists without special page? The only thing that
>>>>> should
>>>>> matter is not to free the pages to the page allocator so they would be
>>>>> reused by something else.
>>>>>
>>>>> b) something can still page fault at them (what?) - AFAIU can't be resolved
>>>>> without killing something, memory_failure() might limit the damage
>>>>
>>>> As i mentioned above, host writes will cause RMP violation page fault.
>>>>
>>>
>>>> And to add here, if it's a guest private page, then the above fault cannot be
>>> resolved, so the faulting process is terminated.
>>
>> BTW would this not be mostly resolved as part of rebasing to UPM?
>> - host will not have these pages mapped in the first place (both kernel
>> directmap and qemu userspace)
>> - guest will have them mapped, but I assume that the conversion from private
>> to shared (that might fail?) can only happen after guest's mappings are
>> invalidated in the first place?
>>
>
> Yes, that will be true for guest private pages. But then there are host
> allocated pages for firmware use which will remain in firmware page state or
> reclaim state if they can't be transitioned back to HV/shared state once the
> firmware releases them back to the host and accessing them at this point can
> potentially cause RMP violation #PF.
>
> Again i don't think this is going to happen regularly or frequently so it
> will be a rare error case where the page reclamation, i.e., the transition
> back to HV/shared state fails and now these pages are no longer safe to be
> used.
>
> Referring back to your thoughts about putting these pages on some leaked
> pages list, do any such leaked pages list exist currently ?

Not AFAIK, you could just create a list_head somewhere appropriate (some snp
state structure?) and put the pages there, maybe with a counter exposed in
debugfs. The point would mostly be that if something goes so wrong that it is
leaking substantial amounts of memory, we can at least recognize the cause
(though I suppose dmesg will also be full of messages) and e.g. find the
pages in a crash dump.
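
A minimal sketch of such a leaked-pages list (the names below are
illustrative only, not taken from the posted patches; the counter could
additionally be exported via debugfs):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/mm.h>

static LIST_HEAD(snp_leaked_pages_list);
static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
static unsigned long snp_nr_leaked_pages;

void snp_leak_pages(unsigned long pfn, unsigned int npages)
{
	struct page *page = pfn_to_page(pfn);

	pr_warn("leaking PFN range 0x%lx-0x%lx\n", pfn, pfn + npages);

	spin_lock(&snp_leaked_pages_list_lock);
	while (npages--) {
		/*
		 * Reuse page->lru: these pages never go back to the page
		 * allocator, so the list linkage is free for bookkeeping.
		 */
		list_add_tail(&page->lru, &snp_leaked_pages_list);
		snp_nr_leaked_pages++;
		page++;
	}
	spin_unlock(&snp_leaked_pages_list_lock);
}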

> Thanks,
> Ashish
>
>>>>
>>>>>>
>>>>>>>
>>>>>>>> Still waiting for some/more feedback from mm folks on the same.
>>>>>>>
>>>>>>> Just send the patch and they'll give it.
>>>>>>>
>>>>>>> Thx.
>>>>>>>
>>>>>
>>


2022-11-16 10:36:33

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On 11/15/2022 11:19 PM, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Nov 15, 2022 at 04:14:42PM +0100, Vlastimil Babka wrote:
>> Cc'ing memory failure folks, the beginning of this subthread is here:
>>
>> https://lore.kernel.org/all/3a51840f6a80c87b39632dc728dbd9b5dd444cd7.1655761627.git.ashish.kalra@amd.com/
>>
>> On 11/15/22 00:36, Kalra, Ashish wrote:
>>> Hello Boris,
>>>
>>> On 11/2/2022 6:22 AM, Borislav Petkov wrote:
>>>> On Mon, Oct 31, 2022 at 04:58:38PM -0500, Kalra, Ashish wrote:
>>>>>       if (snp_lookup_rmpentry(pfn, &rmp_level)) {
>>>>>              do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
>>>>>              return RMP_PF_RETRY;
>>>>
>>>> Does this issue some halfway understandable error message why the
>>>> process got killed?
>>>>
>>>>> Will look at adding our own recovery function for the same, but that will
>>>>> again mark the pages as poisoned, right ?
>>>>
>>>> Well, not poisoned but PG_offlimits or whatever the mm folks agree upon.
>>>> Semantically, it'll be handled the same way, ofc.
>>>
>>> Added a new PG_offlimits flag and a simple corresponding handler for it.
>>
>> One thing is, there's not enough page flags to be adding more (except
>> aliases for existing) for cases that can avoid it, but as Boris says, if
>> using alias to PG_hwpoison it depends what will become confused with the
>> actual hwpoison.
>
> I agree with this. Just defining PG_offlimits as an alias of PG_hwpoison
> could break current hwpoison workload. So if you finally decide to go
> forward in this direction, you may as well have some indicator to
> distinguish the new kind of leaked pages from hwpoisoned pages.
>
> I don't remember exact thread, but I've read someone writing about similar
> kind of suggestion of using memory_failure() to make pages inaccessible in
> non-memory error usecase. I feel that it could be possible to generalize
> memory_failure() as general-purpose page offlining (by renaming it with

But doesn't memory_failure() also mark the pages as PG_hwpoison, so that
using it for these leaked pages will again cause confusion with actual
hwpoison?

Thanks,
Ashish

> hard_offline_page() and making memory_failure() one of the user of it).
>
> Thanks,
> Naoya Horiguchi
>

2022-11-16 18:12:29

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On 11/16/2022 4:25 AM, Vlastimil Babka wrote:
> On 11/16/22 11:19, Kalra, Ashish wrote:
>> On 11/16/2022 3:08 AM, Vlastimil Babka wrote:
>>> On 11/15/22 19:15, Kalra, Ashish wrote:
>>>>
>>>> On 11/15/2022 11:24 AM, Kalra, Ashish wrote:
>>>>> Hello Vlastimil,
>>>>>
>>>>> On 11/15/2022 9:14 AM, Vlastimil Babka wrote:
>>>>>> Cc'ing memory failure folks, the beginning of this subthread is here:
>>>>>>
>>>>>> https://lore.kernel.org/all/3a51840f6a80c87b39632dc728dbd9b5dd444cd7.1655761627.git.ashish.kalra@amd.com/
>>>>>>
>>>>>> On 11/15/22 00:36, Kalra, Ashish wrote:
>>>>>>> Hello Boris,
>>>>>>>
>>>>>>> On 11/2/2022 6:22 AM, Borislav Petkov wrote:
>>>>>>>> On Mon, Oct 31, 2022 at 04:58:38PM -0500, Kalra, Ashish wrote:
>>>>>>>>>         if (snp_lookup_rmpentry(pfn, &rmp_level)) {
>>>>>>>>>                do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
>>>>>>>>>                return RMP_PF_RETRY;
>>>>>>>>
>>>>>>>> Does this issue some halfway understandable error message why the
>>>>>>>> process got killed?
>>>>>>>>
>>>>>>>>> Will look at adding our own recovery function for the same, but that
>>>>>>>>> will
>>>>>>>>> again mark the pages as poisoned, right ?
>>>>>>>>
>>>>>>>> Well, not poisoned but PG_offlimits or whatever the mm folks agree upon.
>>>>>>>> Semantically, it'll be handled the same way, ofc.
>>>>>>>
>>>>>>> Added a new PG_offlimits flag and a simple corresponding handler for it.
>>>>>>
>>>>>> One thing is, there's not enough page flags to be adding more (except
>>>>>> aliases for existing) for cases that can avoid it, but as Boris says, if
>>>>>> using alias to PG_hwpoison it depends what will become confused with the
>>>>>> actual hwpoison.
>>>>>>
>>>>>>> But there is still added complexity of handling hugepages as part of
>>>>>>> reclamation failures (both HugeTLB and transparent hugepages) and that
>>>>>>> means calling more static functions in mm/memory_failure.c
>>>>>>>
>>>>>>> There is probably a more appropriate handler in mm/memory-failure.c:
>>>>>>>
>>>>>>> soft_offline_page() - this will mark the page as HWPoisoned and also has
>>>>>>> handling for hugepages. And we can avoid adding a new page flag too.
>>>>>>>
>>>>>>> soft_offline_page - Soft offline a page.
>>>>>>> Soft offline a page, by migration or invalidation, without killing
>>>>>>> anything.
>>>>>>>
>>>>>>> So, this looks like a good option to call
>>>>>>> soft_offline_page() instead of memory_failure() in case of
>>>>>>> failure to transition the page back to HV/shared state via
>>>>>>> SNP_RECLAIM_CMD
>>>>>>> and/or RMPUPDATE instruction.
>>>>>>
>>>>>> So it's a bit unclear to me what exact situation we are handling here. The
>>>>>> original patch here seems to me to be just leaking back pages that are
>>>>>> unsafe for further use. soft_offline_page() seems to fit that scenario
>>>>>> of a
>>>>>> graceful leak before something is irreparably corrupt and we page fault
>>>>>> on it.
>>>>>> But then in the thread you discuss PF handling and killing. So what is the
>>>>>> case here? If we detect this need to call snp_leak_pages() does it mean:
>>>>>>
>>>>>> a) nobody that could page fault at them (the guest?) is running
>>>>>> anymore, we
>>>>>> are tearing it down, we just can't reuse the pages further on the host
>>>>>
>>>>> The host can page fault on them, if anything on the host tries to write to
>>>>> these pages. Host reads will return garbage data.
>>>>>
>>>>>> - seem like soft_offline_page() could work, but maybe we could just put
>>>>>> the
>>>>>> pages on some leaked lists without special page? The only thing that
>>>>>> should
>>>>>> matter is not to free the pages to the page allocator so they would be
>>>>>> reused by something else.
>>>>>>
>>>>>> b) something can still page fault at them (what?) - AFAIU can't be resolved
>>>>>> without killing something, memory_failure() might limit the damage
>>>>>
>>>>> As i mentioned above, host writes will cause RMP violation page fault.
>>>>>
>>>>
>>>>> And to add here, if it's a guest private page, then the above fault cannot be
>>>> resolved, so the faulting process is terminated.
>>>
>>> BTW would this not be mostly resolved as part of rebasing to UPM?
>>> - host will not have these pages mapped in the first place (both kernel
>>> directmap and qemu userspace)
>>> - guest will have them mapped, but I assume that the conversion from private
>>> to shared (that might fail?) can only happen after guest's mappings are
>>> invalidated in the first place?
>>>
>>
>> Yes, that will be true for guest private pages. But then there are host
>> allocated pages for firmware use which will remain in firmware page state or
>> reclaim state if they can't be transitioned back to HV/shared state once the
>> firmware releases them back to the host and accessing them at this point can
>> potentially cause RMP violation #PF.
>>
>> Again i don't think this is going to happen regularly or frequently so it
>> will be a rare error case where the page reclamation, i.e., the transition
>> back to HV/shared state fails and now these pages are no longer safe to be
>> used.
>>
>> Referring back to your thoughts about putting these pages on some leaked
>> pages list, do any such leaked pages list exist currently ?
>
> Not AFAIK, you could just create a list_head somewhere appropriate (some snp
> state structure?) and put the pages there, maybe with a counter exposed in
> debugfs. The point would be mostly that if something goes so wrong it would
> be leaking substantial amounts of memory, we can at least recognize the
> cause (but I suppose the dmesg will be also full of messages) and e.g. find
> the pages in a crash dump.
>

Ok, so I will work on implementing this leaked-pages list and put it on
a SEV/SNP-associated structure.

Also to add here, we will actually get a not-present #PF instead of the
RMP violation #PF on writing to these leaked pages, as these pages would
have been removed from the kernel direct map.
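
For illustration, removing such a page from the direct map boils down to
something like the following (simplified, error handling omitted, and
assuming the single-page set_direct_map_invalid_noflush() variant):

#include <linux/set_memory.h>
#include <linux/mm.h>

/* Clear the direct-map PTE; any later host access then faults as not-present */
static int snp_unmap_page_from_directmap(unsigned long pfn)
{
	return set_direct_map_invalid_noflush(pfn_to_page(pfn));
}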

Thanks,
Ashish

>>
>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> Still waiting for some/more feedback from mm folks on the same.
>>>>>>>>
>>>>>>>> Just send the patch and they'll give it.
>>>>>>>>
>>>>>>>> Thx.
>>>>>>>>
>>>>>>
>>>
>

2022-11-16 18:44:53

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On 11/16/22 02:25, Vlastimil Babka wrote:
>> Referring back to your thoughts about putting these pages on some leaked
>> pages list, do any such leaked pages list exist currently ?
> Not AFAIK, you could just create a list_head somewhere appropriate (some snp
> state structure?) and put the pages there, maybe with a counter exposed in
> debugfs. The point would be mostly that if something goes so wrong it would
> be leaking substantial amounts of memory, we can at least recognize the
> cause (but I suppose the dmesg will be also full of messages) and e.g. find
> the pages in a crash dump.

It might also be worth looking through the places that check
PageHWPoison() and making sure that none of them are poking into the
page contents.

It's also the kind of thing that adding some CONFIG_DEBUG_VM checks
might help with. For instance, nobody should ever be kmap*()'ing a
private page. The same might even go for pin_user_pages().
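
For instance, a CONFIG_DEBUG_VM check along those lines could look roughly
like this (snp_page_is_leaked() is an assumed helper over the leaked-pages
list, not an existing kernel API):

#include <linux/mmdebug.h>
#include <linux/mm.h>

/* Hypothetical helper: true if the page was leaked/kept off-limits by SNP */
bool snp_page_is_leaked(struct page *page);

/* To be called from host mapping paths such as kmap*() wrappers */
static inline void snp_debug_check_host_map(struct page *page)
{
	/* Compiles away unless CONFIG_DEBUG_VM is enabled */
	VM_WARN_ON_ONCE(snp_page_is_leaked(page));
}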

2022-11-16 18:48:15

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Wed, Nov 16, 2022 at 12:01:11PM -0600, Kalra, Ashish wrote:
> Ok, so I will work on implementing this leaked pages list and put it on a
> sev/snp associated structure.

See __sgx_sanitize_pages() and the poison list there, for an example.

> Also to add here, we will actually get a not-present #PF instead of the RMP
> violation #PF on writing to these leaked pages, as these pages would have
> been removed from the kernel direct map.

So if you do the list and still have the kernel raise an RMP fault for
those pages, traversing that list in the RMP handler to check whether
the page is on it should be a much faster operation than doing the
#PF thing and removing them from the direct map.
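
A sketch of that lookup (building on the illustrative leaked-pages list
above; not actual driver code):

static bool snp_pfn_is_leaked(unsigned long pfn)
{
	struct page *pos, *page = pfn_to_page(pfn);
	bool leaked = false;

	spin_lock(&snp_leaked_pages_list_lock);
	list_for_each_entry(pos, &snp_leaked_pages_list, lru) {
		if (pos == page) {
			leaked = true;
			break;
		}
	}
	spin_unlock(&snp_leaked_pages_list_lock);

	return leaked;
}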

And sorry for misleading you about UPM - we were thinking wrong
yesterday.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-11-16 18:56:17

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On 11/16/2022 12:33 PM, Borislav Petkov wrote:
> On Wed, Nov 16, 2022 at 12:01:11PM -0600, Kalra, Ashish wrote:
>> Ok, so I will work on implementing this leaked pages list and put it on a
>> sev/snp associated structure.
>
> See __sgx_sanitize_pages() and the poison list there, for an example.
>
>> Also to add here, we will actually get a not-present #PF instead of the RMP
>> violation #PF on writing to these leaked pages, as these pages would have
>> been removed from the kernel direct map.
>
> So if you do the list and still have the kernel raise a RMP fault for
> those pages, traversing that list in the RMP handler to check whether
> the page is there on it, should be a lot faster operation than doing the
> #PF thing and removing them from the direct map.
>

Actually, these host-allocated pages would have already been removed
from the kernel direct map when they were transitioned to the firmware
state. So the not-present #PF will actually happen on any read/write
access to these leaked pages instead of the RMP violation #PF (the
not-present #PF has higher priority than the RMP violation #PF).

If these pages cannot be reclaimed, they are unsafe to use and cannot be
added back to the kernel direct map.

Thanks,
Ashish

> And sorry for misleading you about UPM - we were thinking wrong
> yesterday.
>
> Thx.
>

2022-11-16 19:21:56

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Wed, Nov 16, 2022 at 12:53:36PM -0600, Kalra, Ashish wrote:
> Actually, these host allocated pages would have already been removed from
> the kernel direct map,

And, as I said above, it would be a lot easier to handle any potential
faults resulting from the host touching them by having it raise an *RMP*
fault instead of a normal *PF* fault, as the latter code is a crazy mess.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-11-16 19:26:56

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On 11/16/2022 1:09 PM, Borislav Petkov wrote:
> On Wed, Nov 16, 2022 at 12:53:36PM -0600, Kalra, Ashish wrote:
>> Actually, these host allocated pages would have already been removed from
>> the kernel direct map,
>
> And, as I said above, it would be a lot easier to handle any potential
> faults resulting from the host touching them by having it raise a *RMP*
> fault instead of normal *PF* fault where the latter code is a crazy mess.

Just to reiterate here, we won't be getting a *RMP* fault but will
instead get a normal (not-present) #PF fault when the host touches these
pages.

Sorry for any confusion about the fault signaled: earlier I mentioned we
would get an RMP violation #PF, but since these pages are also removed
from the kernel direct map, we will get the not-present #PF and not the
RMP #PF (the core will check for and signal the not-present #PF before
it performs the RMP checks).

Thanks,
Ashish


Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Wed, Nov 16, 2022 at 04:28:11AM -0600, Kalra, Ashish wrote:
> On 11/15/2022 11:19 PM, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Tue, Nov 15, 2022 at 04:14:42PM +0100, Vlastimil Babka wrote:
> > > Cc'ing memory failure folks, the beginning of this subthread is here:
> > >
> > > https://lore.kernel.org/all/3a51840f6a80c87b39632dc728dbd9b5dd444cd7.1655761627.git.ashish.kalra@amd.com/
> > >
> > > On 11/15/22 00:36, Kalra, Ashish wrote:
> > > > Hello Boris,
> > > >
> > > > On 11/2/2022 6:22 AM, Borislav Petkov wrote:
> > > > > On Mon, Oct 31, 2022 at 04:58:38PM -0500, Kalra, Ashish wrote:
> > > > > >       if (snp_lookup_rmpentry(pfn, &rmp_level)) {
> > > > > >              do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
> > > > > >              return RMP_PF_RETRY;
> > > > >
> > > > > Does this issue some halfway understandable error message why the
> > > > > process got killed?
> > > > >
> > > > > > Will look at adding our own recovery function for the same, but that will
> > > > > > again mark the pages as poisoned, right ?
> > > > >
> > > > > Well, not poisoned but PG_offlimits or whatever the mm folks agree upon.
> > > > > Semantically, it'll be handled the same way, ofc.
> > > >
> > > > Added a new PG_offlimits flag and a simple corresponding handler for it.
> > >
> > > One thing is, there's not enough page flags to be adding more (except
> > > aliases for existing) for cases that can avoid it, but as Boris says, if
> > > using alias to PG_hwpoison it depends what will become confused with the
> > > actual hwpoison.
> >
> > I agree with this. Just defining PG_offlimits as an alias of PG_hwpoison
> > could break current hwpoison workload. So if you finally decide to go
> > forward in this direction, you may as well have some indicator to
> > distinguish the new kind of leaked pages from hwpoisoned pages.
> >
> > I don't remember exact thread, but I've read someone writing about similar
> > kind of suggestion of using memory_failure() to make pages inaccessible in
> > non-memory error usecase. I feel that it could be possible to generalize
> > memory_failure() as general-purpose page offlining (by renaming it with
>
> But, doesn't memory_failure() also mark the pages as PG_hwpoison, and then
> using it for these leaked pages will again cause confusion with actual
> hwpoison ?

Yes, so we might need to modify the memory_failure code for this approach,
e.g. renaming PG_hwpoison to something more generic (although some possible
names like PageOffline and PageIsolated are already used) and/or somehow
recording "which kind of leaked page" info.

Thanks,
Naoya Horiguchi

2022-11-17 20:24:58

by Peter Gonda

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 39/49] KVM: SVM: Introduce ops for the post gfn map and unmap

On Wed, Aug 17, 2022 at 9:47 PM Alper Gun <[email protected]> wrote:
>
> On Mon, Jun 20, 2022 at 4:12 PM Ashish Kalra <[email protected]> wrote:
> >
> > From: Brijesh Singh <[email protected]>
> >
> > When SEV-SNP is enabled in the guest VM, the guest memory pages can
> > either be a private or shared. A write from the hypervisor goes through
> > the RMP checks. If hardware sees that hypervisor is attempting to write
> > to a guest private page, then it triggers an RMP violation #PF.
> >
> > To avoid the RMP violation with GHCB pages, added new post_{map,unmap}_gfn
> > functions to verify if its safe to map GHCB pages. Uses a spinlock to
> > protect against the page state change for existing mapped pages.
> >
> > Need to add generic post_{map,unmap}_gfn() ops that can be used to verify
> > that its safe to map a given guest page in the hypervisor.
> >
> > This patch will need to be revisited later after consensus is reached on
> > how to manage guest private memory as probably UPM private memslots will
> > be able to handle this page state change more gracefully.
> >
> > Signed-off-by: Brijesh Singh <[email protected]>
> > Signed-off by: Ashish Kalra <[email protected]>
> > ---
> > arch/x86/include/asm/kvm-x86-ops.h | 1 +
> > arch/x86/include/asm/kvm_host.h | 3 ++
> > arch/x86/kvm/svm/sev.c | 48 ++++++++++++++++++++++++++++--
> > arch/x86/kvm/svm/svm.c | 3 ++
> > arch/x86/kvm/svm/svm.h | 11 +++++++
> > 5 files changed, 64 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > index e0068e702692..2dd2bc0cf4c3 100644
> > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > @@ -130,6 +130,7 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
> > KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> > KVM_X86_OP(alloc_apic_backing_page)
> > KVM_X86_OP_OPTIONAL(rmp_page_level_adjust)
> > +KVM_X86_OP(update_protected_guest_state)
> >
> > #undef KVM_X86_OP
> > #undef KVM_X86_OP_OPTIONAL
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 49b217dc8d7e..8abc0e724f5c 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1522,7 +1522,10 @@ struct kvm_x86_ops {
> > unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
> >
> > void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
> > +
> > void (*rmp_page_level_adjust)(struct kvm *kvm, kvm_pfn_t pfn, int *level);
> > +
> > + int (*update_protected_guest_state)(struct kvm_vcpu *vcpu);
> > };
> >
> > struct kvm_x86_nested_ops {
> > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > index cb2d1bbb862b..4ed90331bca0 100644
> > --- a/arch/x86/kvm/svm/sev.c
> > +++ b/arch/x86/kvm/svm/sev.c
> > @@ -341,6 +341,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
> > if (ret)
> > goto e_free;
> >
> > + spin_lock_init(&sev->psc_lock);
> > ret = sev_snp_init(&argp->error);
> > } else {
> > ret = sev_platform_init(&argp->error);
> > @@ -2828,19 +2829,28 @@ static inline int svm_map_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map)
> > {
> > struct vmcb_control_area *control = &svm->vmcb->control;
> > u64 gfn = gpa_to_gfn(control->ghcb_gpa);
> > + struct kvm_vcpu *vcpu = &svm->vcpu;
> >
> > - if (kvm_vcpu_map(&svm->vcpu, gfn, map)) {
> > + if (kvm_vcpu_map(vcpu, gfn, map)) {
> > /* Unable to map GHCB from guest */
> > pr_err("error mapping GHCB GFN [%#llx] from guest\n", gfn);
> > return -EFAULT;
> > }
> >
> > + if (sev_post_map_gfn(vcpu->kvm, map->gfn, map->pfn)) {
> > + kvm_vcpu_unmap(vcpu, map, false);
> > + return -EBUSY;
> > + }
> > +
> > return 0;
> > }
> >
> > static inline void svm_unmap_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map)
> > {
> > - kvm_vcpu_unmap(&svm->vcpu, map, true);
> > + struct kvm_vcpu *vcpu = &svm->vcpu;
> > +
> > + kvm_vcpu_unmap(vcpu, map, true);
> > + sev_post_unmap_gfn(vcpu->kvm, map->gfn, map->pfn);
> > }
> >
> > static void dump_ghcb(struct vcpu_svm *svm)
> > @@ -3383,6 +3393,8 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,
> > return PSC_UNDEF_ERR;
> > }
> >
> > + spin_lock(&sev->psc_lock);
> > +
> > write_lock(&kvm->mmu_lock);
> >
> > rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
> > @@ -3417,6 +3429,8 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,
> >
> > write_unlock(&kvm->mmu_lock);
> >
> > + spin_unlock(&sev->psc_lock);
>
> There is a corner case where the psc_lock is not released. If
> kvm_mmu_get_tdp_walk fails, the lock will be kept and will cause soft
> lockup.
>
> > +
> > if (rc) {
> > pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n",
> > op, gpa, pfn, level, rc);
> > @@ -3965,3 +3979,33 @@ void sev_rmp_page_level_adjust(struct kvm *kvm, kvm_pfn_t pfn, int *level)
> > /* Adjust the level to keep the NPT and RMP in sync */
> > *level = min_t(size_t, *level, rmp_level);
> > }
> > +
> > +int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> > +{
> > + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> > + int level;
> > +
> > + if (!sev_snp_guest(kvm))
> > + return 0;
> > +
> > + spin_lock(&sev->psc_lock);
> > +
> > + /* If pfn is not added as private then fail */
> > + if (snp_lookup_rmpentry(pfn, &level) == 1) {
> > + spin_unlock(&sev->psc_lock);
> > + pr_err_ratelimited("failed to map private gfn 0x%llx pfn 0x%llx\n", gfn, pfn);
> > + return -EBUSY;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> > +{
> > + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> > +
> > + if (!sev_snp_guest(kvm))
> > + return;
> > +
> > + spin_unlock(&sev->psc_lock);
> > +}
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index b24e0171cbf2..1c8e035ba011 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -4734,7 +4734,10 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> > .vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
> >
> > .alloc_apic_backing_page = svm_alloc_apic_backing_page,
> > +
> > .rmp_page_level_adjust = sev_rmp_page_level_adjust,
> > +
> > + .update_protected_guest_state = sev_snp_update_protected_guest_state,
> > };

I don't see this function sev_snp_update_protected_guest_state() being
defined anywhere in this series.

Then this line is removed in 'KVM: SVM: Support SEV-SNP AP Creation
NAE event'. Should this line just be removed from this patch in the
first place?

> >
> > /*
> > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > index 54ff56cb6125..3fd95193ed8d 100644
> > --- a/arch/x86/kvm/svm/svm.h
> > +++ b/arch/x86/kvm/svm/svm.h
> > @@ -79,19 +79,25 @@ struct kvm_sev_info {
> > bool active; /* SEV enabled guest */
> > bool es_active; /* SEV-ES enabled guest */
> > bool snp_active; /* SEV-SNP enabled guest */
> > +
> > unsigned int asid; /* ASID used for this guest */
> > unsigned int handle; /* SEV firmware handle */
> > int fd; /* SEV device fd */
> > +
> > unsigned long pages_locked; /* Number of pages locked */
> > struct list_head regions_list; /* List of registered regions */
> > +
> > u64 ap_jump_table; /* SEV-ES AP Jump Table address */
> > +
> > struct kvm *enc_context_owner; /* Owner of copied encryption context */
> > struct list_head mirror_vms; /* List of VMs mirroring */
> > struct list_head mirror_entry; /* Use as a list entry of mirrors */
> > struct misc_cg *misc_cg; /* For misc cgroup accounting */
> > atomic_t migration_in_progress;
> > +
> > u64 snp_init_flags;
> > void *snp_context; /* SNP guest context page */
> > + spinlock_t psc_lock;
> > };
> >
> > struct kvm_svm {
> > @@ -702,6 +708,11 @@ void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa);
> > void sev_es_unmap_ghcb(struct vcpu_svm *svm);
> > struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
> > void sev_rmp_page_level_adjust(struct kvm *kvm, kvm_pfn_t pfn, int *level);
> > +int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn);
> > +void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn);
> > +void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
> > +void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
> > +int sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu);

Ditto should this be removed?

> >
> > /* vmenter.S */
> >
> > --
> > 2.25.1
> >

2022-11-17 20:30:06

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 39/49] KVM: SVM: Introduce ops for the post gfn map and unmap

On 11/17/2022 2:18 PM, Peter Gonda wrote:
> On Wed, Aug 17, 2022 at 9:47 PM Alper Gun <[email protected]> wrote:
>>
>> On Mon, Jun 20, 2022 at 4:12 PM Ashish Kalra <[email protected]> wrote:
>>>
>>> From: Brijesh Singh <[email protected]>
>>>
>>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
>>> either be a private or shared. A write from the hypervisor goes through
>>> the RMP checks. If hardware sees that hypervisor is attempting to write
>>> to a guest private page, then it triggers an RMP violation #PF.
>>>
>>> To avoid the RMP violation with GHCB pages, added new post_{map,unmap}_gfn
>>> functions to verify if its safe to map GHCB pages. Uses a spinlock to
>>> protect against the page state change for existing mapped pages.
>>>
>>> Need to add generic post_{map,unmap}_gfn() ops that can be used to verify
>>> that its safe to map a given guest page in the hypervisor.
>>>
>>> This patch will need to be revisited later after consensus is reached on
>>> how to manage guest private memory as probably UPM private memslots will
>>> be able to handle this page state change more gracefully.
>>>
>>> Signed-off-by: Brijesh Singh <[email protected]>
>>> Signed-off by: Ashish Kalra <[email protected]>
>>> ---
>>> arch/x86/include/asm/kvm-x86-ops.h | 1 +
>>> arch/x86/include/asm/kvm_host.h | 3 ++
>>> arch/x86/kvm/svm/sev.c | 48 ++++++++++++++++++++++++++++--
>>> arch/x86/kvm/svm/svm.c | 3 ++
>>> arch/x86/kvm/svm/svm.h | 11 +++++++
>>> 5 files changed, 64 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
>>> index e0068e702692..2dd2bc0cf4c3 100644
>>> --- a/arch/x86/include/asm/kvm-x86-ops.h
>>> +++ b/arch/x86/include/asm/kvm-x86-ops.h
>>> @@ -130,6 +130,7 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
>>> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
>>> KVM_X86_OP(alloc_apic_backing_page)
>>> KVM_X86_OP_OPTIONAL(rmp_page_level_adjust)
>>> +KVM_X86_OP(update_protected_guest_state)
>>>
>>> #undef KVM_X86_OP
>>> #undef KVM_X86_OP_OPTIONAL
>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>> index 49b217dc8d7e..8abc0e724f5c 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -1522,7 +1522,10 @@ struct kvm_x86_ops {
>>> unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
>>>
>>> void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
>>> +
>>> void (*rmp_page_level_adjust)(struct kvm *kvm, kvm_pfn_t pfn, int *level);
>>> +
>>> + int (*update_protected_guest_state)(struct kvm_vcpu *vcpu);
>>> };
>>>
>>> struct kvm_x86_nested_ops {
>>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>>> index cb2d1bbb862b..4ed90331bca0 100644
>>> --- a/arch/x86/kvm/svm/sev.c
>>> +++ b/arch/x86/kvm/svm/sev.c
>>> @@ -341,6 +341,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>> if (ret)
>>> goto e_free;
>>>
>>> + spin_lock_init(&sev->psc_lock);
>>> ret = sev_snp_init(&argp->error);
>>> } else {
>>> ret = sev_platform_init(&argp->error);
>>> @@ -2828,19 +2829,28 @@ static inline int svm_map_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map)
>>> {
>>> struct vmcb_control_area *control = &svm->vmcb->control;
>>> u64 gfn = gpa_to_gfn(control->ghcb_gpa);
>>> + struct kvm_vcpu *vcpu = &svm->vcpu;
>>>
>>> - if (kvm_vcpu_map(&svm->vcpu, gfn, map)) {
>>> + if (kvm_vcpu_map(vcpu, gfn, map)) {
>>> /* Unable to map GHCB from guest */
>>> pr_err("error mapping GHCB GFN [%#llx] from guest\n", gfn);
>>> return -EFAULT;
>>> }
>>>
>>> + if (sev_post_map_gfn(vcpu->kvm, map->gfn, map->pfn)) {
>>> + kvm_vcpu_unmap(vcpu, map, false);
>>> + return -EBUSY;
>>> + }
>>> +
>>> return 0;
>>> }
>>>
>>> static inline void svm_unmap_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map)
>>> {
>>> - kvm_vcpu_unmap(&svm->vcpu, map, true);
>>> + struct kvm_vcpu *vcpu = &svm->vcpu;
>>> +
>>> + kvm_vcpu_unmap(vcpu, map, true);
>>> + sev_post_unmap_gfn(vcpu->kvm, map->gfn, map->pfn);
>>> }
>>>
>>> static void dump_ghcb(struct vcpu_svm *svm)
>>> @@ -3383,6 +3393,8 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,
>>> return PSC_UNDEF_ERR;
>>> }
>>>
>>> + spin_lock(&sev->psc_lock);
>>> +
>>> write_lock(&kvm->mmu_lock);
>>>
>>> rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
>>> @@ -3417,6 +3429,8 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,
>>>
>>> write_unlock(&kvm->mmu_lock);
>>>
>>> + spin_unlock(&sev->psc_lock);
>>
>> There is a corner case where the psc_lock is not released. If
>> kvm_mmu_get_tdp_walk fails, the lock will be kept and will cause soft
>> lockup.
>>
>>> +
>>> if (rc) {
>>> pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n",
>>> op, gpa, pfn, level, rc);
>>> @@ -3965,3 +3979,33 @@ void sev_rmp_page_level_adjust(struct kvm *kvm, kvm_pfn_t pfn, int *level)
>>> /* Adjust the level to keep the NPT and RMP in sync */
>>> *level = min_t(size_t, *level, rmp_level);
>>> }
>>> +
>>> +int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
>>> +{
>>> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>>> + int level;
>>> +
>>> + if (!sev_snp_guest(kvm))
>>> + return 0;
>>> +
>>> + spin_lock(&sev->psc_lock);
>>> +
>>> + /* If pfn is not added as private then fail */
>>> + if (snp_lookup_rmpentry(pfn, &level) == 1) {
>>> + spin_unlock(&sev->psc_lock);
>>> + pr_err_ratelimited("failed to map private gfn 0x%llx pfn 0x%llx\n", gfn, pfn);
>>> + return -EBUSY;
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
>>> +{
>>> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>>> +
>>> + if (!sev_snp_guest(kvm))
>>> + return;
>>> +
>>> + spin_unlock(&sev->psc_lock);
>>> +}
>>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>>> index b24e0171cbf2..1c8e035ba011 100644
>>> --- a/arch/x86/kvm/svm/svm.c
>>> +++ b/arch/x86/kvm/svm/svm.c
>>> @@ -4734,7 +4734,10 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>>> .vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
>>>
>>> .alloc_apic_backing_page = svm_alloc_apic_backing_page,
>>> +
>>> .rmp_page_level_adjust = sev_rmp_page_level_adjust,
>>> +
>>> + .update_protected_guest_state = sev_snp_update_protected_guest_state,
>>> };
>
> I don't see this function sev_snp_update_protected_guest_state() being
> defined anywhere in this series.
>
> Then this line is removed in 'KVM: SVM: Support SEV-SNP AP Creation
> NAE event'. Should this line just be removed from this patch in the
> first place?

Yes, already fixed for v7.

>
>>>
>>> /*
>>> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
>>> index 54ff56cb6125..3fd95193ed8d 100644
>>> --- a/arch/x86/kvm/svm/svm.h
>>> +++ b/arch/x86/kvm/svm/svm.h
>>> @@ -79,19 +79,25 @@ struct kvm_sev_info {
>>> bool active; /* SEV enabled guest */
>>> bool es_active; /* SEV-ES enabled guest */
>>> bool snp_active; /* SEV-SNP enabled guest */
>>> +
>>> unsigned int asid; /* ASID used for this guest */
>>> unsigned int handle; /* SEV firmware handle */
>>> int fd; /* SEV device fd */
>>> +
>>> unsigned long pages_locked; /* Number of pages locked */
>>> struct list_head regions_list; /* List of registered regions */
>>> +
>>> u64 ap_jump_table; /* SEV-ES AP Jump Table address */
>>> +
>>> struct kvm *enc_context_owner; /* Owner of copied encryption context */
>>> struct list_head mirror_vms; /* List of VMs mirroring */
>>> struct list_head mirror_entry; /* Use as a list entry of mirrors */
>>> struct misc_cg *misc_cg; /* For misc cgroup accounting */
>>> atomic_t migration_in_progress;
>>> +
>>> u64 snp_init_flags;
>>> void *snp_context; /* SNP guest context page */
>>> + spinlock_t psc_lock;
>>> };
>>>
>>> struct kvm_svm {
>>> @@ -702,6 +708,11 @@ void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa);
>>> void sev_es_unmap_ghcb(struct vcpu_svm *svm);
>>> struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
>>> void sev_rmp_page_level_adjust(struct kvm *kvm, kvm_pfn_t pfn, int *level);
>>> +int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn);
>>> +void sev_post_unmap_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn);
>>> +void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
>>> +void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
>>> +int sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu);
>
> Ditto should this be removed?
>
Yes, already fixed for v7.

Thanks,
Ashish

>>>
>>> /* vmenter.S */
>>>
>>> --
>>> 2.25.1
>>>

2022-11-17 21:05:34

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

Hello Boris,

>>
>>> +        if (ret)
>>> +            goto cleanup;
>>> +
>>> +        ret = rmp_make_shared(pfn, PG_LEVEL_4K);
>>> +        if (ret)
>>> +            goto cleanup;
>>> +
>>> +        pfn++;
>>> +        n++;
>>> +    }
>>> +
>>> +    return 0;
>>> +
>>> +cleanup:
>>> +    /*
>>> +     * If failed to reclaim the page then page is no longer safe to
>>> +     * be released, leak it.
>>> +     */
>>> +    snp_leak_pages(pfn, npages - n);
>>
>> So this looks real weird: we go and reclaim pages, we hit an error
>> during reclaiming a page X somewhere in-between and then we go and mark
>> the *remaining* pages as not to be used?!
>>
>> Why?
>>
>> Why not only that *one* page which failed and then we continue with the
>> rest?!
>

I had another look at this while fixing the memory_failure() handling and
realized that we can't do a *partial* recovery here: if we hit an error
while reclaiming a single page, we need to mark the remaining pages as
not usable.

This is because the page could be part of a larger buffer which was
transitioned to firmware state and needs to be restored in *full* to the
HV/shared state. Any access to a partially transitioned buffer will still
cause failures; the callers won't be able to do any kind of recovery or
access on a partially restored/reclaimed, and now potentially fragmented,
buffer, which in any case means failure due to data loss on the
non-reclaimed page(s).

So we need to be able to reclaim all the pages or none.

Also, this failure won't happen regularly or frequently; it is a *rare*
error case, and if there is a reclamation failure on a single page, there
is a high probability that there will be reclamation failures on
subsequent pages.

Thanks,
Ashish

2022-11-17 21:45:13

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 39/49] KVM: SVM: Introduce ops for the post gfn map and unmap

On 11/17/2022 2:18 PM, Peter Gonda wrote:
> On Wed, Aug 17, 2022 at 9:47 PM Alper Gun <[email protected]> wrote:
>>
>> On Mon, Jun 20, 2022 at 4:12 PM Ashish Kalra <[email protected]> wrote:
>>>
>>> From: Brijesh Singh <[email protected]>
>>>
>>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
>>> either be a private or shared. A write from the hypervisor goes through
>>> the RMP checks. If hardware sees that hypervisor is attempting to write
>>> to a guest private page, then it triggers an RMP violation #PF.
>>>
>>> To avoid the RMP violation with GHCB pages, added new post_{map,unmap}_gfn
>>> functions to verify if its safe to map GHCB pages. Uses a spinlock to
>>> protect against the page state change for existing mapped pages.
>>>
>>> Need to add generic post_{map,unmap}_gfn() ops that can be used to verify
>>> that its safe to map a given guest page in the hypervisor.
>>>
>>> This patch will need to be revisited later after consensus is reached on
>>> how to manage guest private memory as probably UPM private memslots will
>>> be able to handle this page state change more gracefully.
>>>
>>> Signed-off-by: Brijesh Singh <[email protected]>
>>> Signed-off by: Ashish Kalra <[email protected]>
>>> ---
>>> arch/x86/include/asm/kvm-x86-ops.h | 1 +
>>> arch/x86/include/asm/kvm_host.h | 3 ++
>>> arch/x86/kvm/svm/sev.c | 48 ++++++++++++++++++++++++++++--
>>> arch/x86/kvm/svm/svm.c | 3 ++
>>> arch/x86/kvm/svm/svm.h | 11 +++++++
>>> 5 files changed, 64 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
>>> index e0068e702692..2dd2bc0cf4c3 100644
>>> --- a/arch/x86/include/asm/kvm-x86-ops.h
>>> +++ b/arch/x86/include/asm/kvm-x86-ops.h
>>> @@ -130,6 +130,7 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
>>> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
>>> KVM_X86_OP(alloc_apic_backing_page)
>>> KVM_X86_OP_OPTIONAL(rmp_page_level_adjust)
>>> +KVM_X86_OP(update_protected_guest_state)
>>>
>>> #undef KVM_X86_OP
>>> #undef KVM_X86_OP_OPTIONAL
>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>> index 49b217dc8d7e..8abc0e724f5c 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -1522,7 +1522,10 @@ struct kvm_x86_ops {
>>> unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
>>>
>>> void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
>>> +
>>> void (*rmp_page_level_adjust)(struct kvm *kvm, kvm_pfn_t pfn, int *level);
>>> +
>>> + int (*update_protected_guest_state)(struct kvm_vcpu *vcpu);
>>> };
>>>
>>> struct kvm_x86_nested_ops {
>>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>>> index cb2d1bbb862b..4ed90331bca0 100644
>>> --- a/arch/x86/kvm/svm/sev.c
>>> +++ b/arch/x86/kvm/svm/sev.c
>>> @@ -341,6 +341,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>> if (ret)
>>> goto e_free;
>>>
>>> + spin_lock_init(&sev->psc_lock);
>>> ret = sev_snp_init(&argp->error);
>>> } else {
>>> ret = sev_platform_init(&argp->error);
>>> @@ -2828,19 +2829,28 @@ static inline int svm_map_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map)
>>> {
>>> struct vmcb_control_area *control = &svm->vmcb->control;
>>> u64 gfn = gpa_to_gfn(control->ghcb_gpa);
>>> + struct kvm_vcpu *vcpu = &svm->vcpu;
>>>
>>> - if (kvm_vcpu_map(&svm->vcpu, gfn, map)) {
>>> + if (kvm_vcpu_map(vcpu, gfn, map)) {
>>> /* Unable to map GHCB from guest */
>>> pr_err("error mapping GHCB GFN [%#llx] from guest\n", gfn);
>>> return -EFAULT;
>>> }
>>>
>>> + if (sev_post_map_gfn(vcpu->kvm, map->gfn, map->pfn)) {
>>> + kvm_vcpu_unmap(vcpu, map, false);
>>> + return -EBUSY;
>>> + }
>>> +
>>> return 0;
>>> }
>>>
>>> static inline void svm_unmap_ghcb(struct vcpu_svm *svm, struct kvm_host_map *map)
>>> {
>>> - kvm_vcpu_unmap(&svm->vcpu, map, true);
>>> + struct kvm_vcpu *vcpu = &svm->vcpu;
>>> +
>>> + kvm_vcpu_unmap(vcpu, map, true);
>>> + sev_post_unmap_gfn(vcpu->kvm, map->gfn, map->pfn);
>>> }
>>>
>>> static void dump_ghcb(struct vcpu_svm *svm)
>>> @@ -3383,6 +3393,8 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,
>>> return PSC_UNDEF_ERR;
>>> }
>>>
>>> + spin_lock(&sev->psc_lock);
>>> +
>>> write_lock(&kvm->mmu_lock);
>>>
>>> rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
>>> @@ -3417,6 +3429,8 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, enum psc_op op,
>>>
>>> write_unlock(&kvm->mmu_lock);
>>>
>>> + spin_unlock(&sev->psc_lock);
>>
>> There is a corner case where the psc_lock is not released. If
>> kvm_mmu_get_tdp_walk fails, the lock will be kept and will cause soft
>> lockup.
>>

This is also already fixed for v7.
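
For reference, one plausible shape of such a fix -- illustrative only, not
necessarily what v7 does -- is to route the kvm_mmu_get_tdp_walk() failure
through the same unlock path:

	spin_lock(&sev->psc_lock);

	write_lock(&kvm->mmu_lock);

	rc = kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &npt_level);
	if (!rc) {
		/* Walk failed: bail out, but drop both locks first */
		rc = PSC_UNDEF_ERR;
		goto out_unlock;
	}

	/* ... existing page state change handling ... */

out_unlock:
	write_unlock(&kvm->mmu_lock);
	spin_unlock(&sev->psc_lock);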

Thanks,
Ashish

2022-11-20 21:46:49

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Thu, Nov 17, 2022 at 02:56:47PM -0600, Kalra, Ashish wrote:
> So we need to be able to reclaim all the pages or none.

/me goes and looks at SNP_PAGE_RECLAIM's retvals:

- INVALID_PLATFORM_STATE - platform is not in INIT state. That's
certainly not a reason to leak pages.

- INVALID_ADDRESS - PAGE_PADDR is not a valid system physical address.
That's a botched command buffer but not a broken page, so no reason to leak
them either.

- INVALID_PAGE_STATE - the page is neither of those types: metadata,
firmware, pre-guest nor pre-swap. So if you issue page reclaim on the
wrong range of pages that looks again like a user error but no need to
leak pages.

- INVALID_PAGE_SIZE - a size mismatch. Still sounds to me like a user
error of sev-guest instead of anything wrong deeper in the FW or HW.

So in all those, if you end up supplying the wrong range of addresses,
you most certainly will end up leaking the wrong pages.

So it sounds to me like you wanna say: "Error reclaiming range, check
your driver" instead of punishing any innocent pages.

Now, if the retval from the fw were FIRMWARE_INTERNAL_ERROR or so, then
sure, by all means. But not for the above. All the error conditions
above sound like the kernel has supplied the wrong range/botched command
buffer to the firmware so there's no need to leak pages.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-11-22 00:44:25

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

Hello Boris,

On 11/20/2022 3:34 PM, Borislav Petkov wrote:
> On Thu, Nov 17, 2022 at 02:56:47PM -0600, Kalra, Ashish wrote:
>> So we need to be able to reclaim all the pages or none.
>
> /me goes and looks at SNP_PAGE_RECLAIM's retvals:
>
> - INVALID_PLATFORM_STATE - platform is not in INIT state. That's
> certainly not a reason to leak pages.

This should not happen, as there are sev->snp_initialized checks before
any firmware page allocation or snp page transitions.

>
> - INVALID_ADDRESS - PAGE_PADDR is not a valid system physical address.
> That's botched command buffer but not a broken page so no reason to leak
> them either.
>
> - INVALID_PAGE_STATE - the page is neither of those types: metadata,
> firmware, pre-guest nor pre-swap. So if you issue page reclaim on the
> wrong range of pages that looks again like a user error but no need to
> leak pages.
>
> - INVALID_PAGE_SIZE - a size mismatch. Still sounds to me like a user
> error of sev-guest instead of anything wrong deeper in the FW or HW.
>
> So in all those, if you end up supplying the wrong range of addresses,
> you most certainly will end up leaking the wrong pages.
>
> So it sounds to me like you wanna say: "Error reclaiming range, check
> your driver" instead of punishing any innocent pages.

I agree, but these pages are not in the right state to be released back
to the system or accessed by the host, because they have already been
transitioned successfully to firmware state and the reclaim has failed.
If we release them back to the page allocator, then whenever the host
accesses them it will get a not-present #PF and it will panic/crash the
host process.

It might be a user/sev-guest error, but these pages are now unsafe to
use. So is a kernel panic justified here, rather than not releasing the
pages back to the host and logging errors for them?

Thanks,
Ashish

>
> Now, if the retval from the fw were FIRMWARE_INTERNAL_ERROR or so, then
> sure, by all means. But not for the above. All the error conditions
> above sound like the kernel has supplied the wrong range/botched command
> buffer to the firmware so there's no need to leak pages.
>
> Thx.
>

2022-11-22 10:20:55

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Mon, Nov 21, 2022 at 06:37:18PM -0600, Kalra, Ashish wrote:
> I agree, but these pages are not in the right state to be released back to

Which pages exactly?

Some pages' state has really changed underneath or you've given the
wrong range?

> It might be a user/sev-guest error, but these pages are now unsafe to use.
> So is a kernel panic justified here, instead of not releasing the pages back
> to host and logging errors for the same.

Ok, there are two cases:

* kernel error: I guess a big fat warning is the least we can issue
here. Not sure about panic considering this should almost never happen
and a warning would allow for people to catch dumps and debug the issue.

* firmware error: I don't think you can know that that is really
the case on a production system without additional fw debugging
capabilities. Dumping a warning would be the least we can do here too,
to signal that something's out of the ordinary and so people can look
into it further.

So yeah, a big fat warning is a good start. And then you don't need any
memory poisoning etc gunk.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-11-22 10:49:37

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

Hello Boris,

On 11/22/2022 4:17 AM, Borislav Petkov wrote:
> On Mon, Nov 21, 2022 at 06:37:18PM -0600, Kalra, Ashish wrote:
>> I agree, but these pages are not in the right state to be released back to
>
> Which pages exactly?
>
> Some pages' state has really changed underneath or you've given the
> wrong range?
>
>> It might be a user/sev-guest error, but these pages are now unsafe to use.
>> So is a kernel panic justified here, instead of not releasing the pages back
>> to host and logging errors for the same.
>
> Ok, there are two cases:
>
> * kernel error: I guess a big fat warning is the least we can issue
> here. Not sure about panic considering this should almost never happen
> and a warning would allow for people to catch dumps and debug the issue.
>
> * firmware error: I don't think you can know that that is really
> the case on a production system without additional fw debugging
> capabilities. Dumping a warning would be the least we can do here too,
> to signal that something's out of the ordinary and so people can look
> into it further.

Please note that in both cases, these non-reclaimed pages cannot be
freed/returned back to the page allocator. Anytime the kernel accesses
these pages it will cause a panic or host process crash.

So along with the warning, the pages will be added to a leaked-pages list,
but there is no poisoning or anything; we only need to ensure that these
pages are not touched/accessed again.

Thanks,
Ashish

>
> So yeah, a big fat warning is a good start. And then you don't need any
> memory poisoning etc gunk.
>

2022-11-22 10:50:25

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Tue, Nov 22, 2022 at 04:32:18AM -0600, Kalra, Ashish wrote:
> Please note that in both cases, these non-reclaimed pages cannot be
> freed/returned back to the page allocator.

You keep repeating "these pages". Which pages?

What if you specify the wrong, innocent pages because the kernel is
wrong?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-11-22 12:06:44

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On 11/22/2022 4:44 AM, Borislav Petkov wrote:
> On Tue, Nov 22, 2022 at 04:32:18AM -0600, Kalra, Ashish wrote:
>> Please note that in both cases, these non-reclaimed pages cannot be
>> freed/returned back to the page allocator.
>
> You keep repeating "these pages". Which pages?

The pages which have been allocated for firmware use (such as for the
SNP_INIT command, TMR memory for SEV-ES usage, etc.), and the command
buffers used for SEV legacy commands when SNP is enabled.

Here is a detailed description of the SEV legacy command handling when
SNP is enabled:
The behavior of the SEV-legacy commands is altered when the SNP firmware
is in the INIT state. When SNP is in the INIT state, any memory that the
firmware will write to as part of an SEV-legacy command must be in the
firmware state before the command is issued. A command buffer may contain
a system physical address that the firmware may write to. If the command
buffer contains a system physical address that points to guest memory, we
need to change the page state to firmware in the RMP table before issuing
the command and restore the state to shared after the command completes.

Then there are host buffers allocated for SNP platform status command,
SNP launch update and SNP launch update vmsa command.

The other pages which can be user or guest provided are SNP guest
requests and SNP guest debug helpers.

It is important to note that if an invalid address/len is supplied, the
failure will happen at the initial stage itself, when transitioning
these pages to the firmware state.

But if the above pages have been successfully transitioned to firmware
state and passed on to the SNP firmware, then after return, they need to
be restored to shared state. If this restoration/reclamation fails, then
accessing these pages will cause the kernel to panic.
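To make that concrete, the command flow is roughly the following (a
sketch only; snp_set_pages_state(), snp_leak_pages() and the state names
are illustrative, not the actual patch API):

	/* Transition the destination pages to firmware state in the RMP table. */
	rc = snp_set_pages_state(paddr, npages, RMP_PG_STATE_FIRMWARE);
	if (rc)
		return rc;	/* failed up front, nothing to reclaim */

	rc = sev_do_cmd(cmd, cmd_buf, error);

	/* Restore to shared state; if that fails the pages must be leaked, not freed. */
	if (snp_set_pages_state(paddr, npages, RMP_PG_STATE_SHARED))
		snp_leak_pages(pfn_to_page(PHYS_PFN(paddr)), npages);

	return rc;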

>
> What if you specify the wrong, innocent pages because the kernel is
> wrong?
>

In such a case the kernel panic is justifiable, but again if incorrect
addresses are supplied, the failure will happen at the initial stage of
transitioning these pages to firmware state and there is no need to
reclaim.

Or, otherwise dump a warning and let the pages not be freed/returned
back to the page allocator.

It is either innocent pages or kernel panic or an innocent host process
crash (these are the choices to make).

Thanks,
Ashish

2022-11-23 12:00:34

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

On Tue, Nov 22, 2022 at 05:44:47AM -0600, Kalra, Ashish wrote:
> It is important to note that if invalid address/len are supplied, the
> failure will happen at the initial stage itself of transitioning these pages
> to firmware state.

/me goes and checks out your v6 tree based on 5.18.

Lemme choose one:

static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
{
...

inpages = sev_pin_memory(kvm, params.uaddr, params.len, &npages, 1);

...

for (i = 0; i < npages; i++) {
pfn = page_to_pfn(inpages[i]);

...

ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, &data, error);
if (ret) {
/*
* If the command failed then need to reclaim the page.
*/
snp_page_reclaim(pfn);

and here it would leak the pages if it cannot reclaim them.

Now how did you get those?

Through params.uaddr and params.len which come from userspace:

if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
return -EFAULT;


Now, think about it, can userspace be trusted?

Exactly.

Yeah, yeah, I see it does is_hva_registered() but userspace can just as
well supply the wrong region which fits.

> In such a case the kernel panic is justifiable,

So userspace can supply whatever it wants and you'd panic?

You surely don't mean that.

> but again if incorrect addresses are supplied, the failure will happen
> at the initial stage of transitioning these pages to firmware state
> and there is no need to reclaim.

See above.

> Or, otherwise dump a warning and let the pages not be freed/returned
> back to the page allocator.
>
> It is either innocent pages or kernel panic or an innocent host
> process crash (these are the choices to make).

No, it is make the kernel as resilient as possible. Which means, no
panic, add the pages to a not-to-be-used-anymore list and scream loudly
with warning messages when it must leak pages so that people can fix the
issue.

Ok?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-11-23 18:33:54

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

Hello Boris,

On 11/23/2022 5:40 AM, Borislav Petkov wrote:
> On Tue, Nov 22, 2022 at 05:44:47AM -0600, Kalra, Ashish wrote:
>> It is important to note that if invalid address/len are supplied, the
>> failure will happen at the initial stage itself of transitioning these pages
>> to firmware state.
>
> /me goes and checks out your v6 tree based on 5.18.
>
> Lemme choose one:
>
> static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> {
> ...
>
> inpages = sev_pin_memory(kvm, params.uaddr, params.len, &npages, 1);
>
> ...
>
> for (i = 0; i < npages; i++) {
> pfn = page_to_pfn(inpages[i]);
>
> ...
>
> ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, &data, error);
> if (ret) {
> /*
> * If the command failed then need to reclaim the page.
> */
> snp_page_reclaim(pfn);
>
> and here it would leak the pages if it cannot reclaim them.
>
> Now how did you get those?
>
> Through params.uaddr and params.len which come from userspace:
>
> if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
> return -EFAULT;
>
>
> Now, think about it, can userspace be trusted?
>
> Exactly.
>
> Yeah, yeah, I see it does is_hva_registered() but userspace can just as
> well supply the wrong region which fits.

Yes, that's right.

Also, before sev_issue_cmd() above, there is a call to
rmp_make_private() to make these pages transition to firmware state
before we issue the LAUNCH_UPDATE command as below:

ret = rmp_make_private(pfn, gfn << PAGE_SHIFT, level,
sev_get_asid(kvm), true);
if (ret) {
ret = -EFAULT;
goto e_unpin;

}
...
ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
&data, error);

So in case the userspace provided an invalid/incorrect range, this
transition would have failed and there would not have been any need to
reclaim, so no pages are leaked here.

This is also the reason why we need to reclaim pages if the subsequent
LAUNCH_UPDATE command fails: the pages are now in F/W state because of
the rmp_make_private() call and are unsafe to be used by the host.

>
>> In such a case the kernel panic is justifiable,
>
> So userspace can supply whatever it wants and you'd panic?
>
> You surely don't mean that.
>

No, we don't want to do that.

>> but again if incorrect addresses are supplied, the failure will happen
>> at the initial stage of transitioning these pages to firmware state
>> and there is no need to reclaim.

This is the case I mentioned above: the rmp_make_private() call is the
initial stage of transitioning the pages to firmware state before
issuing the firmware command.

>
> See above.
>
>> Or, otherwise dump a warning and let the pages not be freed/returned
>> back to the page allocator.
>>
>> It is either innocent pages or kernel panic or an innocent host
>> process crash (these are the choices to make).
>
> No, it is make the kernel as resilient as possible. Which means, no
> panic, add the pages to a not-to-be-used-anymore list and scream loudly
> with warning messages when it must leak pages so that people can fix the
> issue.
>
> Ok?
>

Right, yes, I totally agree.

So now we are adding these pages to an internal not-to-be-used-anymore
list and printing warnings, and we avoid any panics by never allowing
these pages to be released back to the page allocator.
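Concretely, in the snp_launch_update() failure path quoted above this
becomes roughly the following (a sketch, assuming snp_page_reclaim()
returns non-zero on failure and snp_leak_pages() is the illustrative
not-to-be-used-anymore list helper):

	ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, &data, error);
	if (ret) {
		/*
		 * The page is in firmware state at this point; try to
		 * reclaim it, and if even that fails, leak it instead of
		 * ever returning it to the page allocator.
		 */
		if (snp_page_reclaim(pfn))
			snp_leak_pages(pfn_to_page(pfn), 1);
		goto e_unpin;
	}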

Thanks,
Ashish

2022-12-19 15:07:19

by Michael Roth

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 07/49] x86/sev: Invalid pages from direct map when adding it to RMP table

On Wed, Jul 27, 2022 at 07:01:34PM +0200, Borislav Petkov wrote:
> On Mon, Jun 20, 2022 at 11:03:07PM +0000, Ashish Kalra wrote:
>
> > Subject: x86/sev: Invalid pages from direct map when adding it to RMP table
>
> "...: Invalidate pages from the direct map when adding them to the RMP table"
>
> > +static int restore_direct_map(u64 pfn, int npages)
> > +{
> > + int i, ret = 0;
> > +
> > + for (i = 0; i < npages; i++) {
> > + ret = set_direct_map_default_noflush(pfn_to_page(pfn + i));
>
> set_memory_p() ?

We implemented this approach for v7, but it causes a fairly significant
performance regression, particularly for the npages > 1 case which
this change was meant to optimize.

I still need to dig in a bit, but I'm guessing it's related to flushing
behavior.

It would however be nice to have a set_direct_map_default_noflush()
variant that accepted an 'npages' argument, since it would be more
performant here and also would potentially allow for restoring the 2M
direct mapping in some cases. Will look into this more for v8.
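Something along these lines, mirroring the existing single-page helpers
on x86 (a sketch only, not a finalized interface):

int set_direct_map_default_noflush(struct page *page, int numpages)
{
	return __set_pages_p(page, numpages);
}

int set_direct_map_invalid_noflush(struct page *page, int numpages)
{
	return __set_pages_np(page, numpages);
}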

-Mike

>
> > + if (ret)
> > + goto cleanup;
> > + }
> > +
> > +cleanup:
> > + WARN(ret > 0, "Failed to restore direct map for pfn 0x%llx\n", pfn + i);
>
> Warn for each pfn?!
>
> That'll flood dmesg mightily.
>
> > + return ret;
> > +}
> > +
> > +static int invalid_direct_map(unsigned long pfn, int npages)
> > +{
> > + int i, ret = 0;
> > +
> > + for (i = 0; i < npages; i++) {
> > + ret = set_direct_map_invalid_noflush(pfn_to_page(pfn + i));
>
> As above, set_memory_np() doesn't work here instead of looping over each
> page?
>
> > @@ -2462,11 +2494,38 @@ static int rmpupdate(u64 pfn, struct rmpupdate *val)
> > if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> > return -ENXIO;
> >
> > + level = RMP_TO_X86_PG_LEVEL(val->pagesize);
> > + npages = page_level_size(level) / PAGE_SIZE;
> > +
> > + /*
> > + * If page is getting assigned in the RMP table then unmap it from the
> > + * direct map.
> > + */
> > + if (val->assigned) {
> > + if (invalid_direct_map(pfn, npages)) {
> > + pr_err("Failed to unmap pfn 0x%llx pages %d from direct_map\n",
>
> "Failed to unmap %d pages at pfn 0x... from the direct map\n"
>
> > + pfn, npages);
> > + return -EFAULT;
> > + }
> > + }
> > +
> > /* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
> > asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
> > : "=a"(ret)
> > : "a"(paddr), "c"((unsigned long)val)
> > : "memory", "cc");
> > +
> > + /*
> > + * Restore the direct map after the page is removed from the RMP table.
> > + */
> > + if (!ret && !val->assigned) {
> > + if (restore_direct_map(pfn, npages)) {
> > + pr_err("Failed to map pfn 0x%llx pages %d in direct_map\n",
>
> "Failed to map %d pages at pfn 0x... into the direct map\n"
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

2022-12-19 20:18:01

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 07/49] x86/sev: Invalid pages from direct map when adding it to RMP table

On Mon, Dec 19, 2022 at 09:00:26AM -0600, Michael Roth wrote:
> We implemented this approach for v7, but it causes a fairly significant
> performance regression, particularly for the case for npages > 1 which
> this change was meant to optimize.
>
> I still need to dig in a big but I'm guessing it's related to flushing
> behavior.

Well, AFAICT, change_page_attr_set_clr() flushes once at the end.

Don't you need to flush when you modify the direct map?

> It would however be nice to have a set_direct_map_default_noflush()
> variant that accepted a 'npages' argument, since it would be more
> performant here and also would potentially allow for restoring the 2M
> direct mapping in some cases. Will look into this more for v8.

set_pages_direct_map_default_noflush()

I guess.

Although the name is a mouthful so I wouldn't mind having those
shortened.

In any case, as long as that helper is properly defined and documented,
I don't mind.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-27 21:56:08

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 07/49] x86/sev: Invalid pages from direct map when adding it to RMP table

Hello Boris,

On 12/19/2022 2:08 PM, Borislav Petkov wrote:
> On Mon, Dec 19, 2022 at 09:00:26AM -0600, Michael Roth wrote:
>> We implemented this approach for v7, but it causes a fairly significant
>> performance regression, particularly for the case for npages > 1 which
>> this change was meant to optimize.
>>
>> I still need to dig in a big but I'm guessing it's related to flushing
>> behavior.
>
> Well, AFAICT, change_page_attr_set_clr() flushes once at the end.
>
> Don't you need to flush when you modify the direct map?
>

Milan onward, there is H/W support for coherency between mappings of the
same physical page with different encryption keys, so AFAIK, there
should be no need to flush during page state transitions, where we
invoke these direct map interface functions for re-mapping/invalidating
pages.

I don't know if there is any other reason to flush after modifying
the direct map?

Thanks,
Ashish

2022-12-29 17:13:12

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 07/49] x86/sev: Invalid pages from direct map when adding it to RMP table

On Tue, Dec 27, 2022 at 03:49:39PM -0600, Kalra, Ashish wrote:
> Milan onward,

And before ML there's no SNP, right?

> there is H/W support for coherency between mappings of the
> same physical page with different encryption keys, so AFAIK, there should be
> no need to flush during page state transitions, where we invoke these direct
> map interface functions for re-mapping/invalidating pages.

Yah, that rings a bell.

In any case, the fact that flushing is not needed should be stated
somewhere in text so that it is clear why.

> I don't know if there is any other reason to flush after modifying
> the direct map ?

There's

/*
* No need to flush, when we did not set any of the caching
* attributes:
*/
cache = !!pgprot2cachemode(mask_set);


Does the above HW cover this case too?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-30 15:21:00

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 07/49] x86/sev: Invalid pages from direct map when adding it to RMP table

On Mon, Dec 19, 2022 at 09:08:31PM +0100, Borislav Petkov wrote:
> On Mon, Dec 19, 2022 at 09:00:26AM -0600, Michael Roth wrote:
> > We implemented this approach for v7, but it causes a fairly significant
> > performance regression, particularly for the case for npages > 1 which
> > this change was meant to optimize.
> >
> > I still need to dig in a big but I'm guessing it's related to flushing
> > behavior.
>
> Well, AFAICT, change_page_attr_set_clr() flushes once at the end.
>
> Don't you need to flush when you modify the direct map?
>
> > It would however be nice to have a set_direct_map_default_noflush()
> > variant that accepted a 'npages' argument, since it would be more
> > performant here and also would potentially allow for restoring the 2M
> > direct mapping in some cases. Will look into this more for v8.
>
> set_pages_direct_map_default_noflush()
>
> I guess.
>
> Although the name is a mouthful so I wouldn't mind having those
> shortened.

I had a patch that just adds numpages parameter:

https://lore.kernel.org/lkml/[email protected]/

The set_direct_map*() are not too widely used, so it's not a big deal to
update all callers.
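With a numpages parameter, the restore_direct_map() loop from this
patch would collapse to roughly (sketch):

static int restore_direct_map(u64 pfn, int npages)
{
	return set_direct_map_default_noflush(pfn_to_page(pfn), npages);
}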

> In any case, as long as that helper is properly defined and documented,
> I don't mind.
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
>

--
Sincerely yours,
Mike.

2023-01-05 21:52:09

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 07/49] x86/sev: Invalid pages from direct map when adding it to RMP table

Hello Boris,

On 12/29/2022 11:09 AM, Borislav Petkov wrote:
> On Tue, Dec 27, 2022 at 03:49:39PM -0600, Kalra, Ashish wrote:
>> Milan onward,
>
> And before ML there's no SNP, right?
>

Yes, that's correct.

>> there is H/W support for coherency between mappings of the
>> same physical page with different encryption keys, so AFAIK, there should be
>> no need to flush during page state transitions, where we invoke these direct
>> map interface functions for re-mapping/invalidating pages.
>
> Yah, that rings a bell.
>
> In any case, the fact that flushing is not needed should be stated
> somewhere in text so that it is clear why.
>
>> I don't know if there is any other reason to flush after modifying
>> the direct map ?
>
> There's
>
> /*
> * No need to flush, when we did not set any of the caching
> * attributes:
> */
> cache = !!pgprot2cachemode(mask_set);
>
>
> Does the above HW cover this case too?

Actually, as both set_memory_p() and set_memory_np() only set/clear the
_PAGE_PRESENT flag and do not change any of the page caching attributes,
this flush won't be required anyway.
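For reference, set_memory_np() on x86 is essentially the following
(simplified from memory, may not match the exact source), and
set_memory_p() sets the same bit via change_page_attr_set(), so
pgprot2cachemode(mask_set) evaluates to 0 in both cases and the cache
flush is skipped:

int set_memory_np(unsigned long addr, int numpages)
{
	return change_page_attr_clear(&addr, numpages,
				      __pgprot(_PAGE_PRESENT), 0);
}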

Thanks,
Ashish

>
> Thx.
>

2023-01-05 22:11:11

by Marc Orr

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 07/49] x86/sev: Invalid pages from direct map when adding it to RMP table

On Tue, Dec 27, 2022 at 1:49 PM Kalra, Ashish <[email protected]> wrote:
>
> Hello Boris,
>
> On 12/19/2022 2:08 PM, Borislav Petkov wrote:
> > On Mon, Dec 19, 2022 at 09:00:26AM -0600, Michael Roth wrote:
> >> We implemented this approach for v7, but it causes a fairly significant
> >> performance regression, particularly for the case for npages > 1 which
> >> this change was meant to optimize.
> >>
> >> I still need to dig in a big but I'm guessing it's related to flushing
> >> behavior.
> >
> > Well, AFAICT, change_page_attr_set_clr() flushes once at the end.
> >
> > Don't you need to flush when you modify the direct map?
> >
>
> Milan onward, there is H/W support for coherency between mappings of the
> same physical page with different encryption keys, so AFAIK, there
> should be no need to flush during page state transitions, where we
> invoke these direct map interface functions for re-mapping/invalidating
> pages.
>
> I don't know if there is any other reason to flush after modifying
> the direct map ?

Isn't the Milan coherence feature (SME_COHERENT?) about the caches --
not the TLBs? And isn't the flushing being discussed here about the
TLBs?

Also, I thought that Mingwei Zhang <[email protected]> found that the
Milan SEV coherence feature was basically unusable in Linux because it
only works across CPUs. It does not extend to IO (e.g., CPU caches
need to be flushed prior to free'ing a SEV VM's private address and
reallocating that location to a device driver to be used for IO). My
understanding of this feature and its limitations may be too coarse.
But I think we should be very careful about relying on this feature as
it is implemented in Milan.

That being said, I guess I could see an argument to rely on the
feature here, since we're not deallocating the memory and reallocating
it to a device. But again, I thought the feature was about cache
coherence -- not TLB coherence.

2023-01-05 22:28:37

by Ashish Kalra

[permalink] [raw]
Subject: Re: [PATCH Part2 v6 07/49] x86/sev: Invalid pages from direct map when adding it to RMP table

Hello Marc,

On 1/5/2023 4:08 PM, Marc Orr wrote:
> On Tue, Dec 27, 2022 at 1:49 PM Kalra, Ashish <[email protected]> wrote:
>>
>> Hello Boris,
>>
>> On 12/19/2022 2:08 PM, Borislav Petkov wrote:
>>> On Mon, Dec 19, 2022 at 09:00:26AM -0600, Michael Roth wrote:
>>>> We implemented this approach for v7, but it causes a fairly significant
>>>> performance regression, particularly for the case for npages > 1 which
>>>> this change was meant to optimize.
>>>>
>>>> I still need to dig in a big but I'm guessing it's related to flushing
>>>> behavior.
>>>
>>> Well, AFAICT, change_page_attr_set_clr() flushes once at the end.
>>>
>>> Don't you need to flush when you modify the direct map?
>>>
>>
>> Milan onward, there is H/W support for coherency between mappings of the
>> same physical page with different encryption keys, so AFAIK, there
>> should be no need to flush during page state transitions, where we
>> invoke these direct map interface functions for re-mapping/invalidating
>> pages.
>>
>> I don't know if there is any other reason to flush after modifying
>> the direct map ?
>
> Isn't the Milan coherence feature (SME_COHERENT?) about the caches --
> not the TLBs? And isn't the flushing being discussed here about the
> TLBs?

Actually, the flush does both cache and TLB flushing.

Both cpa_flush() and cpa_flush_all() will also do cache flushing if the
cache argument is set. As in this case no page caching attributes are
being changed, there is no need to do cache flushing.

But TLB flushing (as the PTE is updated) is still required and will be done.
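Roughly, the relevant part of cpa_flush() looks like this (paraphrased,
not verbatim kernel code):

	/* The TLB flush happens unconditionally once PTEs were changed. */
	if (cpa->force_flush_all || cpa->numpages > tlb_single_page_flush_ceiling)
		flush_tlb_all();
	else
		on_each_cpu(__cpa_flush_tlb, cpa, 1);

	/* The cache flush is done only when caching attributes changed. */
	if (!cache)
		return;

	/* ... clflush the affected range ... */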

>
> Also, I thought that Mingwei Zhang <[email protected]> found that the
> Milan SEV coherence feature was basically unusable in Linux because it
> only works across CPUs. It does not extend to IO (e.g., CPU caches
> need to be flushed prior to free'ing a SEV VM's private address and
> reallocating that location to a device driver to be used for IO). My
> understanding of this feature and its limitations may be too coarse.
> But I think we should be very careful about relying on this feature as
> it is implemented in Milan.
>
> That being said, I guess I could see an argument to rely on the
> feature here, since we're not deallocating the memory and reallocating
> it to a device. But again, I thought the feature was about cache
> coherence -- not TLB coherence.

Yes, this is just invalidating or re-mapping into the kernel direct map,
so we can rely on this feature for the use case here.

Thanks,
Ashish