This patch series adds support for PCI pass-thru devices to Hyper-V
Confidential VMs (also called "Isolation VMs"). But in preparation, it
first changes how private (encrypted) vs. shared (decrypted) memory is
handled in Hyper-V SEV-SNP guest VMs. The new approach builds on the
confidential computing (coco) mechanisms introduced in the 5.19 kernel
for TDX support and significantly reduces the amount of Hyper-V specific
code. Furthermore, with this new approach a proposed RFC patch set for
generic DMA layer functionality[1] is no longer necessary.
Background
==========
Hyper-V guests on AMD SEV-SNP hardware have the option of using the
"virtual Top Of Memory" (vTOM) feature specified by the SEV-SNP
architecture. With vTOM, shared vs. private memory accesses are
controlled by splitting the guest physical address space into two
halves. vTOM is the dividing line where the uppermost bit of the
physical address space is set; e.g., with 47 bits of guest physical
address space, vTOM is 0x400000000000 (bit 46 is set). Guest physical
memory is accessible at two parallel physical addresses -- one below
vTOM and one above vTOM. Accesses below vTOM are private (encrypted)
while accesses above vTOM are shared (decrypted). In this sense, vTOM
is like the GPA.SHARED bit in Intel TDX.
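As a rough illustration (a hypothetical sketch, not kernel code;
the constant and helper names are made up), moving between the two
views of a page is just setting or clearing the vTOM bit:

  /* Assumes 47 bits of guest physical address space, vTOM at bit 46 */
  #define VTOM_BIT 0x400000000000ULL

  static u64 to_shared(u64 gpa)  { return gpa | VTOM_BIT;  } /* decrypted */
  static u64 to_private(u64 gpa) { return gpa & ~VTOM_BIT; } /* encrypted */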
In Hyper-V's use of vTOM, the normal guest OS runs at VMPL2, while
a Hyper-V provided "paravisor" runs at VMPL0 in the guest VM. (VMPL is
Virtual Machine Privilege Level. See AMD's SEV-SNP spec for more
details.) The paravisor provides emulation for various system devices
like the I/O APIC as part of the guest VM. Accesses to such devices
made by the normal guest OS trap to the paravisor and are emulated in
the guest VM context instead of in the Hyper-V host. This emulation is
invisible to the normal guest OS, but with the quirk that memory mapped
I/O accesses to these devices must be treated as private, not shared as
would be the case for other device accesses.
Support for Hyper-V guests using vTOM was added to the Linux kernel
in two patch sets[2][3]. This support treats the vTOM bit as part of
the physical address. For accessing shared (decrypted) memory, the core
approach is to create a second kernel virtual mapping that maps to
parallel physical addresses above vTOM, while leaving the original
mapping unchanged. Most of the code for creating that second virtual
mapping is confined to Hyper-V specific areas, but there are also
changes to generic swiotlb code.
Changes in this patch set
=========================
In preparation for supporting PCI pass-thru devices, this patch set
changes the core approach for handling vTOM. In the new approach,
the vTOM bit is treated as a protection flag, and not as part of
the physical address. This new approach is like the approach for
the GPA.SHARED bit in Intel TDX. Furthermore, there's no need to
create a second kernel virtual mapping. When memory is changed
between private and shared using set_memory_decrypted() and
set_memory_encrypted(), the PTEs for the existing kernel mapping
are changed to add or remove the vTOM bit just as with TDX. The
hypercalls to change the memory status on the host side are made
using the existing callback mechanism. Everything just works, with
a minor tweak to map the I/O APIC to use private accesses as mentioned
above.
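For example, under the new approach a driver shares a buffer with
the host using only the standard interfaces (a minimal sketch;
"buf" and "npages" are hypothetical, and error handling is omitted):

  /*
   * The kernel virtual address of buf never changes; only the vTOM
   * bit in the PTEs is set or cleared, plus the hypercall to update
   * host visibility.
   */
  ret = set_memory_decrypted((unsigned long)buf, npages); /* share */
  ...
  ret = set_memory_encrypted((unsigned long)buf, npages); /* unshare */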
With the new handling of vTOM in place, existing Hyper-V code that
creates the second kernel virtual mapping still works, but it is now
redundant as the original kernel virtual mapping (as updated) maps
to the same physical address. To simplify things going forward, this
patch set removes the code that creates the second kernel virtual
mapping. And since a second kernel virtual mapping is no longer
needed, changes to the DMA layer proposed as an RFC[1] are no
longer needed.
Finally, to support PCI pass-thru in a Confidential VM, Hyper-V
requires that all accesses to PCI config space be emulated using
a hypercall. This patch set adds functions to invoke those
hypercalls and uses them in the config space access functions
in the Hyper-V PCI driver. Lastly, the Hyper-V PCI driver is
marked as allowed to be used in a Confidential VM; the driver
has been hardened against a malicious Hyper-V in a previous
patch set[4].
Patch Organization
==================
Patches 1 thru 6 are preparatory patches that account for
slightly different assumptions when running in a Hyper-V VM
with vTOM, fix some minor bugs, and make temporary tweaks
to avoid needing a single large patch to make the transition
from the old approach to the new approach.
Patch 7 enables the new approach to handling vTOM for Hyper-V
guest VMs. This is the core patch after which the new approach
is in effect.
Patches 8 thru 11 remove existing code for creating the second
kernel virtual mapping that is no longer necessary with the
new approach.
Patch 12 updates existing code so that it no longer assumes that
the vTOM bit is part of the physical address.
Patches 13 and 14 add new hypercalls for accessing MMIO space
and use those hypercalls for PCI config space. They also enable
the Hyper-V vPCI driver to be used in a Confidential VM.
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/all/[email protected]/
---
Changes in v3:
* Patch 1: Tweak the code fix to cleanly separate the page
alignment and physical address masking [Dave Hansen]
* Patch 2: Change the name of the new CC_ATTR that controls
whether the IO-APIC is mapped decrypted [Dave Hansen]
* Patch 5 (now patch 7): Add CC_ATTR_MEM_ENCRYPT to what
Hyper-V vTOM reports as 'true'. With the addition, Patches
5 and 6 are new to accommodate working correctly with Hyper-V
VMs using vTOM. [Tom Lendacky]
Changes in v2:
* Patch 11: Include more detail in the error message if an MMIO
hypercall fails. [Bjorn Helgaas]
* Patch 12: Restore removed memory barriers. It seems like these
barriers should not be needed because of the spin_unlock() calls,
but commit bdd74440d9e8 indicates that they are. This patch series
will leave the barriers unchanged; whether they are really needed
can be sorted out separately. [Boqun Feng]
Michael Kelley (14):
x86/ioremap: Fix page aligned size calculation in __ioremap_caller()
x86/ioapic: Gate decrypted mapping on cc_platform_has() attribute
x86/hyperv: Reorder code in prep for subsequent patch
Drivers: hv: Explicitly request decrypted in vmap_pfn() calls
x86/mm: Handle decryption/re-encryption of bss_decrypted consistently
init: Call mem_encrypt_init() after Hyper-V hypercall init is done
x86/hyperv: Change vTOM handling to use standard coco mechanisms
swiotlb: Remove bounce buffer remapping for Hyper-V
Drivers: hv: vmbus: Remove second mapping of VMBus monitor pages
Drivers: hv: vmbus: Remove second way of mapping ring buffers
hv_netvsc: Remove second mapping of send and recv buffers
Drivers: hv: Don't remap addresses that are above shared_gpa_boundary
PCI: hv: Add hypercalls to read/write MMIO space
PCI: hv: Enable PCI pass-thru devices in Confidential VMs
arch/x86/coco/core.c | 11 +-
arch/x86/hyperv/hv_init.c | 18 +--
arch/x86/hyperv/ivm.c | 121 +++++++++----------
arch/x86/include/asm/hyperv-tlfs.h | 3 +
arch/x86/include/asm/mshyperv.h | 8 +-
arch/x86/kernel/apic/io_apic.c | 3 +-
arch/x86/kernel/cpu/mshyperv.c | 22 ++--
arch/x86/mm/ioremap.c | 8 +-
arch/x86/mm/mem_encrypt_amd.c | 10 +-
arch/x86/mm/pat/set_memory.c | 3 -
drivers/hv/Kconfig | 1 -
drivers/hv/channel_mgmt.c | 2 +-
drivers/hv/connection.c | 113 +++++-------------
drivers/hv/hv.c | 23 ++--
drivers/hv/hv_common.c | 11 --
drivers/hv/hyperv_vmbus.h | 2 -
drivers/hv/ring_buffer.c | 62 ++++------
drivers/net/hyperv/hyperv_net.h | 2 -
drivers/net/hyperv/netvsc.c | 48 +-------
drivers/pci/controller/pci-hyperv.c | 232 ++++++++++++++++++++++++++----------
include/asm-generic/hyperv-tlfs.h | 22 ++++
include/asm-generic/mshyperv.h | 2 -
include/linux/cc_platform.h | 12 ++
include/linux/swiotlb.h | 2 -
init/main.c | 19 +--
kernel/dma/swiotlb.c | 45 +------
26 files changed, 380 insertions(+), 425 deletions(-)
--
1.8.3.1
Current code always maps the IOAPIC as shared (decrypted) in a
confidential VM. But Hyper-V guest VMs on AMD SEV-SNP with vTOM
enabled use a paravisor running in VMPL0 to emulate the IOAPIC.
In such a case, the IOAPIC must be accessed as private (encrypted).
Fix this by gating the IOAPIC decrypted mapping on a new
cc_platform_has() attribute that a subsequent patch in the series
will set only for Hyper-V guests.
Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Wei Liu <[email protected]>
---
arch/x86/kernel/apic/io_apic.c | 3 ++-
include/linux/cc_platform.h | 12 ++++++++++++
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index a868b76..c65e0cc 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2686,7 +2686,8 @@ static void io_apic_set_fixmap(enum fixed_addresses idx, phys_addr_t phys)
* Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
* bits, just like normal ioremap():
*/
- flags = pgprot_decrypted(flags);
+ if (!cc_platform_has(CC_ATTR_EMULATED_IOAPIC))
+ flags = pgprot_decrypted(flags);
__set_fixmap(idx, phys, flags);
}
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index cb0d6cd..7a0da75 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -90,6 +90,18 @@ enum cc_attr {
* Examples include TDX Guest.
*/
CC_ATTR_HOTPLUG_DISABLED,
+
+ /**
+ * @CC_ATTR_EMULATED_IOAPIC: Guest VM has an emulated I/O APIC
+ *
+ * The platform/OS is running as a guest/virtual machine with
+ * an I/O APIC that is emulated by a paravisor running in the
+ * guest VM context. As such, the I/O APIC is accessed in the
+ * encrypted portion of the guest physical address space.
+ *
+ * Examples include Hyper-V SEV-SNP guests using vTOM.
+ */
+ CC_ATTR_EMULATED_IOAPIC,
};
#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
--
1.8.3.1
Current code re-calculates the size after aligning the starting and
ending physical addresses on a page boundary. But the re-calculation
also embeds the masking of high order bits that exceed the size of
the physical address space (via PHYSICAL_PAGE_MASK). If the masking
removes any high order bits, the size calculation results in a huge
value that is likely to immediately fail.
Fix this by re-calculating the page-aligned size first. Then mask any
high order bits using PHYSICAL_PAGE_MASK.
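As a hypothetical worked example, assume PHYSICAL_PAGE_MASK clears
bit 46 (an encryption bit) and a caller passes in:

  phys_addr = 0x400000001000, last_addr = 0x400000002fff

  Old code: phys_addr &= PHYSICAL_PAGE_MASK       -> 0x1000
            size = PAGE_ALIGN(last_addr + 1) - phys_addr
                 = 0x400000003000 - 0x1000        -> bogus huge size

  New code: size = 0x400000003000 - 0x400000001000 = 0x2000
            phys_addr &= PHYSICAL_PAGE_MASK       -> 0x1000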
Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/mm/ioremap.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 78c5bc6..6453fba 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -217,9 +217,15 @@ static void __ioremap_check_mem(resource_size_t addr, unsigned long size,
* Mappings have to be page-aligned
*/
offset = phys_addr & ~PAGE_MASK;
- phys_addr &= PHYSICAL_PAGE_MASK;
+ phys_addr &= PAGE_MASK;
size = PAGE_ALIGN(last_addr+1) - phys_addr;
+ /*
+ * Mask out any bits not part of the actual physical
+ * address, like memory encryption bits.
+ */
+ phys_addr &= PHYSICAL_PAGE_MASK;
+
retval = memtype_reserve(phys_addr, (u64)phys_addr + size,
pcm, &new_pcm);
if (retval) {
--
1.8.3.1
Reorder some code as preparation for a subsequent patch. No
functional change.
Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Tianyu Lan <[email protected]>
---
arch/x86/hyperv/ivm.c | 68 +++++++++++++++++++++++++--------------------------
1 file changed, 34 insertions(+), 34 deletions(-)
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 1dbcbd9..f33c67e 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -235,40 +235,6 @@ void hv_ghcb_msr_read(u64 msr, u64 *value)
EXPORT_SYMBOL_GPL(hv_ghcb_msr_read);
#endif
-enum hv_isolation_type hv_get_isolation_type(void)
-{
- if (!(ms_hyperv.priv_high & HV_ISOLATION))
- return HV_ISOLATION_TYPE_NONE;
- return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b);
-}
-EXPORT_SYMBOL_GPL(hv_get_isolation_type);
-
-/*
- * hv_is_isolation_supported - Check system runs in the Hyper-V
- * isolation VM.
- */
-bool hv_is_isolation_supported(void)
-{
- if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
- return false;
-
- if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
- return false;
-
- return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
-}
-
-DEFINE_STATIC_KEY_FALSE(isolation_type_snp);
-
-/*
- * hv_isolation_type_snp - Check system runs in the AMD SEV-SNP based
- * isolation VM.
- */
-bool hv_isolation_type_snp(void)
-{
- return static_branch_unlikely(&isolation_type_snp);
-}
-
/*
* hv_mark_gpa_visibility - Set pages visible to host via hvcall.
*
@@ -387,3 +353,37 @@ void hv_unmap_memory(void *addr)
{
vunmap(addr);
}
+
+enum hv_isolation_type hv_get_isolation_type(void)
+{
+ if (!(ms_hyperv.priv_high & HV_ISOLATION))
+ return HV_ISOLATION_TYPE_NONE;
+ return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b);
+}
+EXPORT_SYMBOL_GPL(hv_get_isolation_type);
+
+/*
+ * hv_is_isolation_supported - Check system runs in the Hyper-V
+ * isolation VM.
+ */
+bool hv_is_isolation_supported(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
+ return false;
+
+ if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
+ return false;
+
+ return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
+}
+
+DEFINE_STATIC_KEY_FALSE(isolation_type_snp);
+
+/*
+ * hv_isolation_type_snp - Check system runs in the AMD SEV-SNP based
+ * isolation VM.
+ */
+bool hv_isolation_type_snp(void)
+{
+ return static_branch_unlikely(&isolation_type_snp);
+}
--
1.8.3.1
Current code in sme_postprocess_startup() decrypts the bss_decrypted
section when sme_me_mask is non-zero. But code in
mem_encrypt_free_decrypted_mem() re-encrypts the unused portion based
on CC_ATTR_MEM_ENCRYPT. In a Hyper-V guest VM using vTOM, these
conditions are not equivalent as sme_me_mask is always zero when
using vTOM. Consequently, mem_encrypt_free_decrypted_mem() attempts
to re-encrypt memory that was never decrypted.
Fix this in mem_encrypt_free_decrypted_mem() by conditioning the
re-encryption on the same test for non-zero sme_me_mask. Hyper-V
guests using vTOM don't need the bss_decrypted section to be
decrypted, so skipping the decryption/re-encryption doesn't cause
a problem.
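In sketch form (a simplified rendering of the two paths, not the
actual kernel code), the mismatch before this fix was:

  /* sme_postprocess_startup() -- decrypt side */
  if (sme_me_mask)                          /* zero under vTOM: skipped */
          decrypt(bss_decrypted);

  /* mem_encrypt_free_decrypted_mem() -- re-encrypt side */
  if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) /* true under vTOM: runs */
          set_memory_encrypted(vaddr, npages); /* never decrypted! */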
Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/mm/mem_encrypt_amd.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index 9c4d8db..5a51343 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -513,10 +513,14 @@ void __init mem_encrypt_free_decrypted_mem(void)
npages = (vaddr_end - vaddr) >> PAGE_SHIFT;
/*
- * The unused memory range was mapped decrypted, change the encryption
- * attribute from decrypted to encrypted before freeing it.
+ * If the unused memory range was mapped decrypted, change the encryption
+ * attribute from decrypted to encrypted before freeing it. Base the
+ * re-encryption on the same condition used for the decryption in
+ * sme_postprocess_startup(). Higher level abstractions, such as
+ * CC_ATTR_MEM_ENCRYPT, aren't necessarily equivalent in a Hyper-V VM
+ * using vTOM, where sme_me_mask is always zero.
*/
- if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
+ if (sme_get_me_mask()) {
r = set_memory_encrypted(vaddr, npages);
if (r) {
pr_warn("failed to free unused decrypted pages\n");
--
1.8.3.1
In preparation for a subsequent patch, update vmap_pfn() calls to
explicitly request that the mapping be for decrypted access to
the memory. There's no change in functionality since the PFNs
passed to vmap_pfn() are above the shared_gpa_boundary, implicitly
producing a decrypted mapping. But explicitly requesting decrypted
allows the code to work before and after a subsequent patch
that will cause vmap_pfn() to mask the PFNs to being below the
shared_gpa_boundary. While another subsequent patch removes the
vmap_pfn() calls entirely, this temporary tweak avoids the need
for a large patch that makes all the changes at once.
Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Tianyu Lan <[email protected]>
---
arch/x86/hyperv/ivm.c | 2 +-
drivers/hv/ring_buffer.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index f33c67e..e8be4c2 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -343,7 +343,7 @@ void *hv_map_memory(void *addr, unsigned long size)
pfns[i] = vmalloc_to_pfn(addr + i * PAGE_SIZE) +
(ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
- vaddr = vmap_pfn(pfns, size / PAGE_SIZE, PAGE_KERNEL_IO);
+ vaddr = vmap_pfn(pfns, size / PAGE_SIZE, pgprot_decrypted(PAGE_KERNEL_NOENC));
kfree(pfns);
return vaddr;
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 59a4aa8..b4a91b1 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -211,7 +211,7 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
ring_info->ring_buffer = (struct hv_ring_buffer *)
vmap_pfn(pfns_wraparound, page_cnt * 2 - 1,
- PAGE_KERNEL);
+ pgprot_decrypted(PAGE_KERNEL_NOENC));
kfree(pfns_wraparound);
if (!ring_info->ring_buffer)
--
1.8.3.1
Full Hyper-V initialization, including support for hypercalls, is done
as an apic_post_init callback via late_time_init(). mem_encrypt_init()
needs to make hypercalls when it marks swiotlb memory as decrypted.
But mem_encrypt_init() is currently called a few lines before
late_time_init(), so the hypercalls don't work.
Fix this by moving mem_encrypt_init() after late_time_init() and
related clock initializations. The intervening initializations don't
do any I/O that requires the swiotlb, so moving mem_encrypt_init()
slightly later has no impact.
Signed-off-by: Michael Kelley <[email protected]>
---
init/main.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/init/main.c b/init/main.c
index e1c3911..5a7c466 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1088,14 +1088,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
*/
locking_selftest();
- /*
- * This needs to be called before any devices perform DMA
- * operations that might use the SWIOTLB bounce buffers. It will
- * mark the bounce buffers as decrypted so that their usage will
- * not cause "plain-text" data to be decrypted when accessed.
- */
- mem_encrypt_init();
-
#ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
@@ -1112,6 +1104,17 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
late_time_init();
sched_clock_init();
calibrate_delay();
+
+ /*
+ * This needs to be called before any devices perform DMA
+ * operations that might use the SWIOTLB bounce buffers. It will
+ * mark the bounce buffers as decrypted so that their usage will
+ * not cause "plain-text" data to be decrypted when accessed. It
+ * must be called after late_time_init() so that Hyper-V x86/x64
+ * hypercalls work when the SWIOTLB bounce buffers are decrypted.
+ */
+ mem_encrypt_init();
+
pid_idr_init();
anon_vma_init();
#ifdef CONFIG_X86
--
1.8.3.1
For PCI pass-thru devices in a Confidential VM, Hyper-V requires
that PCI config space be accessed via hypercalls. In normal VMs,
config space accesses are trapped to the Hyper-V host and emulated.
But in a confidential VM, the host can't access guest memory to
decode the instruction for emulation, so an explicit hypercall must
be used.
Update the PCI config space access functions to use the hypercalls
when such use is indicated by Hyper-V flags. Also, set the flag to
allow the Hyper-V PCI driver to be loaded and used in a Confidential
VM (a.k.a., "Isolation VM"). The driver has previously been hardened
against a malicious Hyper-V host[1].
[1] https://lore.kernel.org/all/[email protected]/
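Conceptually, each config space access becomes a short hypercall
sequence (a sketch of the flow implemented below, using the helpers
added in the preceding patch of this series):

  /* Select the function by writing its slot to the first page... */
  hv_pci_write_mmio(dev, hbus->mem_config->start, 4,
                    hpdev->desc.win_slot.slot);
  /* ...then access its config space via the second page. */
  hv_pci_read_mmio(dev, hbus->mem_config->start + CFG_PAGE_OFFSET + where,
                   size, val);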
Co-developed-by: Dexuan Cui <[email protected]>
Signed-off-by: Dexuan Cui <[email protected]>
Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Boqun Feng <[email protected]>
---
drivers/hv/channel_mgmt.c | 2 +-
drivers/pci/controller/pci-hyperv.c | 168 ++++++++++++++++++++++--------------
2 files changed, 105 insertions(+), 65 deletions(-)
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 5b12040..c0f9ac2 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -67,7 +67,7 @@
{ .dev_type = HV_PCIE,
HV_PCIE_GUID,
.perf_device = false,
- .allowed_in_isolated = false,
+ .allowed_in_isolated = true,
},
/* Synthetic Frame Buffer */
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 09b40a1..6ce83e4 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -514,6 +514,7 @@ struct hv_pcibus_device {
/* Highest slot of child device with resources allocated */
int wslot_res_allocated;
+ bool use_calls; /* Use hypercalls to access mmio cfg space */
/* hypercall arg, must not cross page boundary */
struct hv_retarget_device_interrupt retarget_msi_interrupt_params;
@@ -1136,8 +1137,10 @@ static void hv_pci_write_mmio(struct device *dev, phys_addr_t gpa, int size, u32
static void _hv_pcifront_read_config(struct hv_pci_dev *hpdev, int where,
int size, u32 *val)
{
+ struct hv_pcibus_device *hbus = hpdev->hbus;
+ struct device *dev = &hbus->hdev->device;
+ int offset = where + CFG_PAGE_OFFSET;
unsigned long flags;
- void __iomem *addr = hpdev->hbus->cfg_addr + CFG_PAGE_OFFSET + where;
/*
* If the attempt is to read the IDs or the ROM BAR, simulate that.
@@ -1165,56 +1168,79 @@ static void _hv_pcifront_read_config(struct hv_pci_dev *hpdev, int where,
*/
*val = 0;
} else if (where + size <= CFG_PAGE_SIZE) {
- spin_lock_irqsave(&hpdev->hbus->config_lock, flags);
- /* Choose the function to be read. (See comment above) */
- writel(hpdev->desc.win_slot.slot, hpdev->hbus->cfg_addr);
- /* Make sure the function was chosen before we start reading. */
- mb();
- /* Read from that function's config space. */
- switch (size) {
- case 1:
- *val = readb(addr);
- break;
- case 2:
- *val = readw(addr);
- break;
- default:
- *val = readl(addr);
- break;
+
+ spin_lock_irqsave(&hbus->config_lock, flags);
+ if (hbus->use_calls) {
+ phys_addr_t addr = hbus->mem_config->start + offset;
+
+ hv_pci_write_mmio(dev, hbus->mem_config->start, 4,
+ hpdev->desc.win_slot.slot);
+ hv_pci_read_mmio(dev, addr, size, val);
+ } else {
+ void __iomem *addr = hbus->cfg_addr + offset;
+
+ /* Choose the function to be read. (See comment above) */
+ writel(hpdev->desc.win_slot.slot, hbus->cfg_addr);
+ /* Make sure the function was chosen before reading. */
+ mb();
+ /* Read from that function's config space. */
+ switch (size) {
+ case 1:
+ *val = readb(addr);
+ break;
+ case 2:
+ *val = readw(addr);
+ break;
+ default:
+ *val = readl(addr);
+ break;
+ }
+ /*
+ * Make sure the read was done before we release the
+ * spinlock allowing consecutive reads/writes.
+ */
+ mb();
}
- /*
- * Make sure the read was done before we release the spinlock
- * allowing consecutive reads/writes.
- */
- mb();
- spin_unlock_irqrestore(&hpdev->hbus->config_lock, flags);
+ spin_unlock_irqrestore(&hbus->config_lock, flags);
} else {
- dev_err(&hpdev->hbus->hdev->device,
- "Attempt to read beyond a function's config space.\n");
+ dev_err(dev, "Attempt to read beyond a function's config space.\n");
}
}
static u16 hv_pcifront_get_vendor_id(struct hv_pci_dev *hpdev)
{
+ struct hv_pcibus_device *hbus = hpdev->hbus;
+ struct device *dev = &hbus->hdev->device;
+ u32 val;
u16 ret;
unsigned long flags;
- void __iomem *addr = hpdev->hbus->cfg_addr + CFG_PAGE_OFFSET +
- PCI_VENDOR_ID;
- spin_lock_irqsave(&hpdev->hbus->config_lock, flags);
+ spin_lock_irqsave(&hbus->config_lock, flags);
- /* Choose the function to be read. (See comment above) */
- writel(hpdev->desc.win_slot.slot, hpdev->hbus->cfg_addr);
- /* Make sure the function was chosen before we start reading. */
- mb();
- /* Read from that function's config space. */
- ret = readw(addr);
- /*
- * mb() is not required here, because the spin_unlock_irqrestore()
- * is a barrier.
- */
+ if (hbus->use_calls) {
+ phys_addr_t addr = hbus->mem_config->start +
+ CFG_PAGE_OFFSET + PCI_VENDOR_ID;
+
+ hv_pci_write_mmio(dev, hbus->mem_config->start, 4,
+ hpdev->desc.win_slot.slot);
+ hv_pci_read_mmio(dev, addr, 2, &val);
+ ret = val; /* Truncates to 16 bits */
+ } else {
+ void __iomem *addr = hbus->cfg_addr + CFG_PAGE_OFFSET +
+ PCI_VENDOR_ID;
+ /* Choose the function to be read. (See comment above) */
+ writel(hpdev->desc.win_slot.slot, hbus->cfg_addr);
+ /* Make sure the function was chosen before we start reading. */
+ mb();
+ /* Read from that function's config space. */
+ ret = readw(addr);
+ /*
+ * mb() is not required here, because the
+ * spin_unlock_irqrestore() is a barrier.
+ */
+ }
- spin_unlock_irqrestore(&hpdev->hbus->config_lock, flags);
+ spin_unlock_irqrestore(&hbus->config_lock, flags);
return ret;
}
@@ -1229,39 +1255,51 @@ static u16 hv_pcifront_get_vendor_id(struct hv_pci_dev *hpdev)
static void _hv_pcifront_write_config(struct hv_pci_dev *hpdev, int where,
int size, u32 val)
{
+ struct hv_pcibus_device *hbus = hpdev->hbus;
+ struct device *dev = &hbus->hdev->device;
+ int offset = where + CFG_PAGE_OFFSET;
unsigned long flags;
- void __iomem *addr = hpdev->hbus->cfg_addr + CFG_PAGE_OFFSET + where;
if (where >= PCI_SUBSYSTEM_VENDOR_ID &&
where + size <= PCI_CAPABILITY_LIST) {
/* SSIDs and ROM BARs are read-only */
} else if (where >= PCI_COMMAND && where + size <= CFG_PAGE_SIZE) {
- spin_lock_irqsave(&hpdev->hbus->config_lock, flags);
- /* Choose the function to be written. (See comment above) */
- writel(hpdev->desc.win_slot.slot, hpdev->hbus->cfg_addr);
- /* Make sure the function was chosen before we start writing. */
- wmb();
- /* Write to that function's config space. */
- switch (size) {
- case 1:
- writeb(val, addr);
- break;
- case 2:
- writew(val, addr);
- break;
- default:
- writel(val, addr);
- break;
+ spin_lock_irqsave(&hbus->config_lock, flags);
+
+ if (hbus->use_calls) {
+ phys_addr_t addr = hbus->mem_config->start + offset;
+
+ hv_pci_write_mmio(dev, hbus->mem_config->start, 4,
+ hpdev->desc.win_slot.slot);
+ hv_pci_write_mmio(dev, addr, size, val);
+ } else {
+ void __iomem *addr = hbus->cfg_addr + offset;
+
+ /* Choose the function to write. (See comment above) */
+ writel(hpdev->desc.win_slot.slot, hbus->cfg_addr);
+ /* Make sure the function was chosen before writing. */
+ wmb();
+ /* Write to that function's config space. */
+ switch (size) {
+ case 1:
+ writeb(val, addr);
+ break;
+ case 2:
+ writew(val, addr);
+ break;
+ default:
+ writel(val, addr);
+ break;
+ }
+ /*
+ * Make sure the write was done before we release the
+ * spinlock allowing consecutive reads/writes.
+ */
+ mb();
}
- /*
- * Make sure the write was done before we release the spinlock
- * allowing consecutive reads/writes.
- */
- mb();
- spin_unlock_irqrestore(&hpdev->hbus->config_lock, flags);
+ spin_unlock_irqrestore(&hbus->config_lock, flags);
} else {
- dev_err(&hpdev->hbus->hdev->device,
- "Attempt to write beyond a function's config space.\n");
+ dev_err(dev, "Attempt to write beyond a function's config space.\n");
}
}
@@ -3580,6 +3618,7 @@ static int hv_pci_probe(struct hv_device *hdev,
hbus->bridge->domain_nr = dom;
#ifdef CONFIG_X86
hbus->sysdata.domain = dom;
+ hbus->use_calls = !!(ms_hyperv.hints & HV_X64_USE_MMIO_HYPERCALLS);
#elif defined(CONFIG_ARM64)
/*
* Set the PCI bus parent to be the corresponding VMbus
@@ -3589,6 +3628,7 @@ static int hv_pci_probe(struct hv_device *hdev,
* information to devices created on the bus.
*/
hbus->sysdata.parent = hdev->device.parent;
+ hbus->use_calls = false;
#endif
hbus->hdev = hdev;
--
1.8.3.1
To support PCI pass-thru devices in Confidential VMs, Hyper-V
has added hypercalls to read and write MMIO space. Add the
appropriate definitions to hyperv-tlfs.h and implement
functions to make the hypercalls. These functions are used
in a subsequent patch.
Co-developed-by: Dexuan Cui <[email protected]>
Signed-off-by: Dexuan Cui <[email protected]>
Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/include/asm/hyperv-tlfs.h | 3 ++
drivers/pci/controller/pci-hyperv.c | 64 +++++++++++++++++++++++++++++++++++++
include/asm-generic/hyperv-tlfs.h | 22 +++++++++++++
3 files changed, 89 insertions(+)
diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 3089ec3..f769b9d 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -117,6 +117,9 @@
/* Recommend using enlightened VMCS */
#define HV_X64_ENLIGHTENED_VMCS_RECOMMENDED BIT(14)
+/* Use hypercalls for MMIO config space access */
+#define HV_X64_USE_MMIO_HYPERCALLS BIT(21)
+
/*
* CPU management features identification.
* These are HYPERV_CPUID_CPU_MANAGEMENT_FEATURES.EAX bits.
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index ba64284..09b40a1 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1054,6 +1054,70 @@ static int wslot_to_devfn(u32 wslot)
return PCI_DEVFN(slot_no.bits.dev, slot_no.bits.func);
}
+static void hv_pci_read_mmio(struct device *dev, phys_addr_t gpa, int size, u32 *val)
+{
+ struct hv_mmio_read_input *in;
+ struct hv_mmio_read_output *out;
+ u64 ret;
+
+ /*
+ * Must be called with interrupts disabled so it is safe
+ * to use the per-cpu input argument page. Use it for
+ * both input and output.
+ */
+ in = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ out = *this_cpu_ptr(hyperv_pcpu_input_arg) + sizeof(*in);
+ in->gpa = gpa;
+ in->size = size;
+
+ ret = hv_do_hypercall(HVCALL_MMIO_READ, in, out);
+ if (hv_result_success(ret)) {
+ switch (size) {
+ case 1:
+ *val = *(u8 *)(out->data);
+ break;
+ case 2:
+ *val = *(u16 *)(out->data);
+ break;
+ default:
+ *val = *(u32 *)(out->data);
+ break;
+ }
+ } else
+ dev_err(dev, "MMIO read hypercall error %llx addr %llx size %d\n",
+ ret, gpa, size);
+}
+
+static void hv_pci_write_mmio(struct device *dev, phys_addr_t gpa, int size, u32 val)
+{
+ struct hv_mmio_write_input *in;
+ u64 ret;
+
+ /*
+ * Must be called with interrupts disabled so it is safe
+ * to use the per-cpu input argument memory.
+ */
+ in = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ in->gpa = gpa;
+ in->size = size;
+ switch (size) {
+ case 1:
+ *(u8 *)(in->data) = val;
+ break;
+ case 2:
+ *(u16 *)(in->data) = val;
+ break;
+ default:
+ *(u32 *)(in->data) = val;
+ break;
+ }
+
+ ret = hv_do_hypercall(HVCALL_MMIO_WRITE, in, NULL);
+ if (!hv_result_success(ret))
+ dev_err(dev, "MMIO write hypercall error %llx addr %llx size %d\n",
+ ret, gpa, size);
+}
+
/*
* PCI Configuration Space for these root PCI buses is implemented as a pair
* of pages in memory-mapped I/O space. Writing to the first page chooses
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index b17c6ee..fcab07b 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -168,6 +168,8 @@ struct ms_hyperv_tsc_page {
#define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af
#define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0
#define HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY 0x00db
+#define HVCALL_MMIO_READ 0x0106
+#define HVCALL_MMIO_WRITE 0x0107
/* Extended hypercalls */
#define HV_EXT_CALL_QUERY_CAPABILITIES 0x8001
@@ -790,4 +792,24 @@ struct hv_memory_hint {
union hv_gpa_page_range ranges[];
} __packed;
+/* Data structures for HVCALL_MMIO_READ and HVCALL_MMIO_WRITE */
+#define HV_HYPERCALL_MMIO_MAX_DATA_LENGTH 64
+
+struct hv_mmio_read_input {
+ u64 gpa;
+ u32 size;
+ u32 reserved;
+} __packed;
+
+struct hv_mmio_read_output {
+ u8 data[HV_HYPERCALL_MMIO_MAX_DATA_LENGTH];
+} __packed;
+
+struct hv_mmio_write_input {
+ u64 gpa;
+ u32 size;
+ u32 reserved;
+ u8 data[HV_HYPERCALL_MMIO_MAX_DATA_LENGTH];
+} __packed;
+
#endif
--
1.8.3.1
With changes to how Hyper-V guest VMs flip memory between private
(encrypted) and shared (decrypted), creating a second kernel virtual
mapping for shared memory is no longer necessary. Everything needed
for the transition to shared is handled by set_memory_decrypted().
As such, remove the code to create and manage the second
mapping for VMBus monitor pages. Because set_memory_decrypted()
and set_memory_encrypted() are no-ops in normal VMs, it's
not even necessary to test for being in a Confidential VM
(a.k.a., "Isolation VM").
Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Tianyu Lan <[email protected]>
---
drivers/hv/connection.c | 113 ++++++++++++----------------------------------
drivers/hv/hyperv_vmbus.h | 2 -
2 files changed, 28 insertions(+), 87 deletions(-)
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 9dc27e5..e473b69 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -104,8 +104,14 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version)
vmbus_connection.msg_conn_id = VMBUS_MESSAGE_CONNECTION_ID;
}
- msg->monitor_page1 = vmbus_connection.monitor_pages_pa[0];
- msg->monitor_page2 = vmbus_connection.monitor_pages_pa[1];
+ /*
+ * shared_gpa_boundary is zero in non-SNP VMs, so it's safe to always
+ * do the add
+ */
+ msg->monitor_page1 = virt_to_phys(vmbus_connection.monitor_pages[0]) +
+ ms_hyperv.shared_gpa_boundary;
+ msg->monitor_page2 = virt_to_phys(vmbus_connection.monitor_pages[1]) +
+ ms_hyperv.shared_gpa_boundary;
msg->target_vcpu = hv_cpu_number_to_vp_number(VMBUS_CONNECT_CPU);
@@ -219,72 +225,27 @@ int vmbus_connect(void)
* Setup the monitor notification facility. The 1st page for
* parent->child and the 2nd page for child->parent
*/
- vmbus_connection.monitor_pages[0] = (void *)hv_alloc_hyperv_zeroed_page();
- vmbus_connection.monitor_pages[1] = (void *)hv_alloc_hyperv_zeroed_page();
+ vmbus_connection.monitor_pages[0] = (void *)hv_alloc_hyperv_page();
+ vmbus_connection.monitor_pages[1] = (void *)hv_alloc_hyperv_page();
if ((vmbus_connection.monitor_pages[0] == NULL) ||
(vmbus_connection.monitor_pages[1] == NULL)) {
ret = -ENOMEM;
goto cleanup;
}
- vmbus_connection.monitor_pages_original[0]
- = vmbus_connection.monitor_pages[0];
- vmbus_connection.monitor_pages_original[1]
- = vmbus_connection.monitor_pages[1];
- vmbus_connection.monitor_pages_pa[0]
- = virt_to_phys(vmbus_connection.monitor_pages[0]);
- vmbus_connection.monitor_pages_pa[1]
- = virt_to_phys(vmbus_connection.monitor_pages[1]);
-
- if (hv_is_isolation_supported()) {
- ret = set_memory_decrypted((unsigned long)
- vmbus_connection.monitor_pages[0],
- 1);
- ret |= set_memory_decrypted((unsigned long)
- vmbus_connection.monitor_pages[1],
- 1);
- if (ret)
- goto cleanup;
-
- /*
- * Isolation VM with AMD SNP needs to access monitor page via
- * address space above shared gpa boundary.
- */
- if (hv_isolation_type_snp()) {
- vmbus_connection.monitor_pages_pa[0] +=
- ms_hyperv.shared_gpa_boundary;
- vmbus_connection.monitor_pages_pa[1] +=
- ms_hyperv.shared_gpa_boundary;
-
- vmbus_connection.monitor_pages[0]
- = memremap(vmbus_connection.monitor_pages_pa[0],
- HV_HYP_PAGE_SIZE,
- MEMREMAP_WB);
- if (!vmbus_connection.monitor_pages[0]) {
- ret = -ENOMEM;
- goto cleanup;
- }
-
- vmbus_connection.monitor_pages[1]
- = memremap(vmbus_connection.monitor_pages_pa[1],
- HV_HYP_PAGE_SIZE,
- MEMREMAP_WB);
- if (!vmbus_connection.monitor_pages[1]) {
- ret = -ENOMEM;
- goto cleanup;
- }
- }
-
- /*
- * Set memory host visibility hvcall smears memory
- * and so zero monitor pages here.
- */
- memset(vmbus_connection.monitor_pages[0], 0x00,
- HV_HYP_PAGE_SIZE);
- memset(vmbus_connection.monitor_pages[1], 0x00,
- HV_HYP_PAGE_SIZE);
+ ret = set_memory_decrypted((unsigned long)
+ vmbus_connection.monitor_pages[0], 1);
+ ret |= set_memory_decrypted((unsigned long)
+ vmbus_connection.monitor_pages[1], 1);
+ if (ret)
+ goto cleanup;
- }
+ /*
+ * Set_memory_decrypted() will change the memory contents if
+ * decryption occurs, so zero monitor pages here.
+ */
+ memset(vmbus_connection.monitor_pages[0], 0x00, HV_HYP_PAGE_SIZE);
+ memset(vmbus_connection.monitor_pages[1], 0x00, HV_HYP_PAGE_SIZE);
msginfo = kzalloc(sizeof(*msginfo) +
sizeof(struct vmbus_channel_initiate_contact),
@@ -376,31 +337,13 @@ void vmbus_disconnect(void)
vmbus_connection.int_page = NULL;
}
- if (hv_is_isolation_supported()) {
- /*
- * memunmap() checks input address is ioremap address or not
- * inside. It doesn't unmap any thing in the non-SNP CVM and
- * so not check CVM type here.
- */
- memunmap(vmbus_connection.monitor_pages[0]);
- memunmap(vmbus_connection.monitor_pages[1]);
-
- set_memory_encrypted((unsigned long)
- vmbus_connection.monitor_pages_original[0],
- 1);
- set_memory_encrypted((unsigned long)
- vmbus_connection.monitor_pages_original[1],
- 1);
- }
+ set_memory_encrypted((unsigned long)vmbus_connection.monitor_pages[0], 1);
+ set_memory_encrypted((unsigned long)vmbus_connection.monitor_pages[1], 1);
- hv_free_hyperv_page((unsigned long)
- vmbus_connection.monitor_pages_original[0]);
- hv_free_hyperv_page((unsigned long)
- vmbus_connection.monitor_pages_original[1]);
- vmbus_connection.monitor_pages_original[0] =
- vmbus_connection.monitor_pages[0] = NULL;
- vmbus_connection.monitor_pages_original[1] =
- vmbus_connection.monitor_pages[1] = NULL;
+ hv_free_hyperv_page((unsigned long)vmbus_connection.monitor_pages[0]);
+ hv_free_hyperv_page((unsigned long)vmbus_connection.monitor_pages[1]);
+ vmbus_connection.monitor_pages[0] = NULL;
+ vmbus_connection.monitor_pages[1] = NULL;
}
/*
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index dc673ed..167ac51 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -241,8 +241,6 @@ struct vmbus_connection {
* is child->parent notification
*/
struct hv_monitor_page *monitor_pages[2];
- void *monitor_pages_original[2];
- phys_addr_t monitor_pages_pa[2];
struct list_head chn_msg_list;
spinlock_t channelmsg_lock;
--
1.8.3.1
With changes to how Hyper-V guest VMs flip memory between private
(encrypted) and shared (decrypted), it's no longer necessary to
have separate code paths for mapping VMBus ring buffers for
normal VMs and for Confidential VMs.
As such, remove the code path that uses vmap_pfn(), and set
the protection flags argument to vmap() to account for the
difference between normal and Confidential VMs.
Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Tianyu Lan <[email protected]>
---
drivers/hv/ring_buffer.c | 62 ++++++++++++++++--------------------------------
1 file changed, 20 insertions(+), 42 deletions(-)
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index b4a91b1..20a0631 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -186,8 +186,6 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
struct page *pages, u32 page_cnt, u32 max_pkt_size)
{
struct page **pages_wraparound;
- unsigned long *pfns_wraparound;
- u64 pfn;
int i;
BUILD_BUG_ON((sizeof(struct hv_ring_buffer) != PAGE_SIZE));
@@ -196,50 +194,30 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
* First page holds struct hv_ring_buffer, do wraparound mapping for
* the rest.
*/
- if (hv_isolation_type_snp()) {
- pfn = page_to_pfn(pages) +
- PFN_DOWN(ms_hyperv.shared_gpa_boundary);
+ pages_wraparound = kcalloc(page_cnt * 2 - 1,
+ sizeof(struct page *),
+ GFP_KERNEL);
+ if (!pages_wraparound)
+ return -ENOMEM;
- pfns_wraparound = kcalloc(page_cnt * 2 - 1,
- sizeof(unsigned long), GFP_KERNEL);
- if (!pfns_wraparound)
- return -ENOMEM;
-
- pfns_wraparound[0] = pfn;
- for (i = 0; i < 2 * (page_cnt - 1); i++)
- pfns_wraparound[i + 1] = pfn + i % (page_cnt - 1) + 1;
-
- ring_info->ring_buffer = (struct hv_ring_buffer *)
- vmap_pfn(pfns_wraparound, page_cnt * 2 - 1,
- pgprot_decrypted(PAGE_KERNEL_NOENC));
- kfree(pfns_wraparound);
-
- if (!ring_info->ring_buffer)
- return -ENOMEM;
-
- /* Zero ring buffer after setting memory host visibility. */
- memset(ring_info->ring_buffer, 0x00, PAGE_SIZE * page_cnt);
- } else {
- pages_wraparound = kcalloc(page_cnt * 2 - 1,
- sizeof(struct page *),
- GFP_KERNEL);
- if (!pages_wraparound)
- return -ENOMEM;
-
- pages_wraparound[0] = pages;
- for (i = 0; i < 2 * (page_cnt - 1); i++)
- pages_wraparound[i + 1] =
- &pages[i % (page_cnt - 1) + 1];
+ pages_wraparound[0] = pages;
+ for (i = 0; i < 2 * (page_cnt - 1); i++)
+ pages_wraparound[i + 1] =
+ &pages[i % (page_cnt - 1) + 1];
- ring_info->ring_buffer = (struct hv_ring_buffer *)
- vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP,
- PAGE_KERNEL);
+ ring_info->ring_buffer = (struct hv_ring_buffer *)
+ vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP,
+ pgprot_decrypted(PAGE_KERNEL_NOENC));
- kfree(pages_wraparound);
- if (!ring_info->ring_buffer)
- return -ENOMEM;
- }
+ kfree(pages_wraparound);
+ if (!ring_info->ring_buffer)
+ return -ENOMEM;
+ /*
+ * Ensure the header page is zero'ed since
+ * encryption status may have changed.
+ */
+ memset(ring_info->ring_buffer, 0, HV_HYP_PAGE_SIZE);
ring_info->ring_buffer->read_index =
ring_info->ring_buffer->write_index = 0;
--
1.8.3.1
With changes to how Hyper-V guest VMs flip memory between private
(encrypted) and shared (decrypted), creating a second kernel virtual
mapping for shared memory is no longer necessary. Everything needed
for the transition to shared is handled by set_memory_decrypted().
As such, remove the code to create and manage the second
mapping for the pre-allocated send and recv buffers. This mapping
is the last user of hv_map_memory()/hv_unmap_memory(), so delete
these functions as well. Finally, hv_map_memory() is the last
user of vmap_pfn() in Hyper-V guest code, so remove the Kconfig
selection of VMAP_PFN.
Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Tianyu Lan <[email protected]>
---
arch/x86/hyperv/ivm.c | 28 ------------------------
drivers/hv/Kconfig | 1 -
drivers/hv/hv_common.c | 11 ----------
drivers/net/hyperv/hyperv_net.h | 2 --
drivers/net/hyperv/netvsc.c | 48 ++---------------------------------------
include/asm-generic/mshyperv.h | 2 --
6 files changed, 2 insertions(+), 90 deletions(-)
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 29ccbe8..5e4b8b0 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -349,34 +349,6 @@ void __init hv_vtom_init(void)
#endif /* CONFIG_AMD_MEM_ENCRYPT */
-/*
- * hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.
- */
-void *hv_map_memory(void *addr, unsigned long size)
-{
- unsigned long *pfns = kcalloc(size / PAGE_SIZE,
- sizeof(unsigned long), GFP_KERNEL);
- void *vaddr;
- int i;
-
- if (!pfns)
- return NULL;
-
- for (i = 0; i < size / PAGE_SIZE; i++)
- pfns[i] = vmalloc_to_pfn(addr + i * PAGE_SIZE) +
- (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
-
- vaddr = vmap_pfn(pfns, size / PAGE_SIZE, pgprot_decrypted(PAGE_KERNEL_NOENC));
- kfree(pfns);
-
- return vaddr;
-}
-
-void hv_unmap_memory(void *addr)
-{
- vunmap(addr);
-}
-
enum hv_isolation_type hv_get_isolation_type(void)
{
if (!(ms_hyperv.priv_high & HV_ISOLATION))
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 0747a8f..9a074cb 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -8,7 +8,6 @@ config HYPERV
|| (ARM64 && !CPU_BIG_ENDIAN))
select PARAVIRT
select X86_HV_CALLBACK_VECTOR if X86
- select VMAP_PFN
help
Select this option to run Linux as a Hyper-V client operating
system.
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index ae68298..566735f 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -308,14 +308,3 @@ u64 __weak hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_s
return HV_STATUS_INVALID_PARAMETER;
}
EXPORT_SYMBOL_GPL(hv_ghcb_hypercall);
-
-void __weak *hv_map_memory(void *addr, unsigned long size)
-{
- return NULL;
-}
-EXPORT_SYMBOL_GPL(hv_map_memory);
-
-void __weak hv_unmap_memory(void *addr)
-{
-}
-EXPORT_SYMBOL_GPL(hv_unmap_memory);
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index dd5919e..33d51e3 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -1139,7 +1139,6 @@ struct netvsc_device {
/* Receive buffer allocated by us but manages by NetVSP */
void *recv_buf;
- void *recv_original_buf;
u32 recv_buf_size; /* allocated bytes */
struct vmbus_gpadl recv_buf_gpadl_handle;
u32 recv_section_cnt;
@@ -1148,7 +1147,6 @@ struct netvsc_device {
/* Send buffer allocated by us */
void *send_buf;
- void *send_original_buf;
u32 send_buf_size;
struct vmbus_gpadl send_buf_gpadl_handle;
u32 send_section_cnt;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 9352dad..661bbe6 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -154,17 +154,8 @@ static void free_netvsc_device(struct rcu_head *head)
int i;
kfree(nvdev->extension);
-
- if (nvdev->recv_original_buf)
- vfree(nvdev->recv_original_buf);
- else
- vfree(nvdev->recv_buf);
-
- if (nvdev->send_original_buf)
- vfree(nvdev->send_original_buf);
- else
- vfree(nvdev->send_buf);
-
+ vfree(nvdev->recv_buf);
+ vfree(nvdev->send_buf);
bitmap_free(nvdev->send_section_map);
for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
@@ -347,7 +338,6 @@ static int netvsc_init_buf(struct hv_device *device,
struct nvsp_message *init_packet;
unsigned int buf_size;
int i, ret = 0;
- void *vaddr;
/* Get receive buffer area. */
buf_size = device_info->recv_sections * device_info->recv_section_size;
@@ -383,17 +373,6 @@ static int netvsc_init_buf(struct hv_device *device,
goto cleanup;
}
- if (hv_isolation_type_snp()) {
- vaddr = hv_map_memory(net_device->recv_buf, buf_size);
- if (!vaddr) {
- ret = -ENOMEM;
- goto cleanup;
- }
-
- net_device->recv_original_buf = net_device->recv_buf;
- net_device->recv_buf = vaddr;
- }
-
/* Notify the NetVsp of the gpadl handle */
init_packet = &net_device->channel_init_pkt;
memset(init_packet, 0, sizeof(struct nvsp_message));
@@ -497,17 +476,6 @@ static int netvsc_init_buf(struct hv_device *device,
goto cleanup;
}
- if (hv_isolation_type_snp()) {
- vaddr = hv_map_memory(net_device->send_buf, buf_size);
- if (!vaddr) {
- ret = -ENOMEM;
- goto cleanup;
- }
-
- net_device->send_original_buf = net_device->send_buf;
- net_device->send_buf = vaddr;
- }
-
/* Notify the NetVsp of the gpadl handle */
init_packet = &net_device->channel_init_pkt;
memset(init_packet, 0, sizeof(struct nvsp_message));
@@ -762,12 +730,6 @@ void netvsc_device_remove(struct hv_device *device)
netvsc_teardown_send_gpadl(device, net_device, ndev);
}
- if (net_device->recv_original_buf)
- hv_unmap_memory(net_device->recv_buf);
-
- if (net_device->send_original_buf)
- hv_unmap_memory(net_device->send_buf);
-
/* Release all resources */
free_netvsc_device_rcu(net_device);
}
@@ -1831,12 +1793,6 @@ struct netvsc_device *netvsc_device_add(struct hv_device *device,
netif_napi_del(&net_device->chan_table[0].napi);
cleanup2:
- if (net_device->recv_original_buf)
- hv_unmap_memory(net_device->recv_buf);
-
- if (net_device->send_original_buf)
- hv_unmap_memory(net_device->send_buf);
-
free_netvsc_device(&net_device->rcu);
return ERR_PTR(ret);
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index bfb9eb9..6fabc4a 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -267,8 +267,6 @@ static inline int cpumask_to_vpset_noself(struct hv_vpset *vpset,
void hyperv_cleanup(void);
bool hv_query_ext_cap(u64 cap_query);
void hv_setup_dma_ops(struct device *dev, bool coherent);
-void *hv_map_memory(void *addr, unsigned long size);
-void hv_unmap_memory(void *addr);
#else /* CONFIG_HYPERV */
static inline bool hv_is_hyperv_initialized(void) { return false; }
static inline bool hv_is_hibernation_supported(void) { return false; }
--
1.8.3.1
With changes to how Hyper-V guest VMs flip memory between private
(encrypted) and shared (decrypted), creating a second kernel virtual
mapping for shared memory is no longer necessary. Everything needed
for the transition to shared is handled by set_memory_decrypted().
As such, remove swiotlb_unencrypted_base and the associated
code.
Signed-off-by: Michael Kelley <[email protected]>
Acked-by: Christoph Hellwig <[email protected]>
---
arch/x86/kernel/cpu/mshyperv.c | 7 +------
include/linux/swiotlb.h | 2 --
kernel/dma/swiotlb.c | 45 +-----------------------------------------
3 files changed, 2 insertions(+), 52 deletions(-)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index b080795..c87e411 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -18,7 +18,6 @@
#include <linux/kexec.h>
#include <linux/i8253.h>
#include <linux/random.h>
-#include <linux/swiotlb.h>
#include <asm/processor.h>
#include <asm/hypervisor.h>
#include <asm/hyperv-tlfs.h>
@@ -332,12 +331,8 @@ static void __init ms_hyperv_init_platform(void)
pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n",
ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b);
- if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP) {
+ if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP)
static_branch_enable(&isolation_type_snp);
-#ifdef CONFIG_SWIOTLB
- swiotlb_unencrypted_base = ms_hyperv.shared_gpa_boundary;
-#endif
- }
}
if (hv_max_functions_eax >= HYPERV_CPUID_NESTED_FEATURES) {
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 35bc4e2..13d7075 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -185,6 +185,4 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
}
#endif /* CONFIG_DMA_RESTRICTED_POOL */
-extern phys_addr_t swiotlb_unencrypted_base;
-
#endif /* __LINUX_SWIOTLB_H */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index a34c38b..d3d6be0 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -73,8 +73,6 @@ struct io_tlb_slot {
struct io_tlb_mem io_tlb_default_mem;
-phys_addr_t swiotlb_unencrypted_base;
-
static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
static unsigned long default_nareas;
@@ -210,34 +208,6 @@ static inline unsigned long nr_slots(u64 val)
}
/*
- * Remap swioltb memory in the unencrypted physical address space
- * when swiotlb_unencrypted_base is set. (e.g. for Hyper-V AMD SEV-SNP
- * Isolation VMs).
- */
-#ifdef CONFIG_HAS_IOMEM
-static void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
-{
- void *vaddr = NULL;
-
- if (swiotlb_unencrypted_base) {
- phys_addr_t paddr = mem->start + swiotlb_unencrypted_base;
-
- vaddr = memremap(paddr, bytes, MEMREMAP_WB);
- if (!vaddr)
- pr_err("Failed to map the unencrypted memory %pa size %lx.\n",
- &paddr, bytes);
- }
-
- return vaddr;
-}
-#else
-static void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
-{
- return NULL;
-}
-#endif
-
-/*
* Early SWIOTLB allocation may be too early to allow an architecture to
* perform the desired operations. This function allows the architecture to
* call SWIOTLB when the operations are possible. It needs to be called
@@ -246,18 +216,12 @@ static void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
void __init swiotlb_update_mem_attributes(void)
{
struct io_tlb_mem *mem = &io_tlb_default_mem;
- void *vaddr;
unsigned long bytes;
if (!mem->nslabs || mem->late_alloc)
return;
- vaddr = phys_to_virt(mem->start);
bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
- set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT);
-
- mem->vaddr = swiotlb_mem_remap(mem, bytes);
- if (!mem->vaddr)
- mem->vaddr = vaddr;
+ set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
}
static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
@@ -288,13 +252,6 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
mem->slots[i].alloc_size = 0;
}
- /*
- * If swiotlb_unencrypted_base is set, the bounce buffer memory will
- * be remapped and cleared in swiotlb_update_mem_attributes.
- */
- if (swiotlb_unencrypted_base)
- return;
-
memset(vaddr, 0, bytes);
mem->vaddr = vaddr;
return;
--
1.8.3.1
Hyper-V guests on AMD SEV-SNP hardware have the option of using the
"virtual Top Of Memory" (vTOM) feature specified by the SEV-SNP
architecture. With vTOM, shared vs. private memory accesses are
controlled by splitting the guest physical address space into two
halves. vTOM is the dividing line where the uppermost bit of the
physical address space is set; e.g., with 47 bits of guest physical
address space, vTOM is 0x400000000000 (bit 46 is set). Guest physical
memory is accessible at two parallel physical addresses -- one below
vTOM and one above vTOM. Accesses below vTOM are private (encrypted)
while accesses above vTOM are shared (decrypted). In this sense, vTOM
is like the GPA.SHARED bit in Intel TDX.
Support for Hyper-V guests using vTOM was added to the Linux kernel in
two patch sets[1][2]. This support treats the vTOM bit as part of
the physical address. For accessing shared (decrypted) memory, these
patch sets create a second kernel virtual mapping that maps to physical
addresses above vTOM.
A better approach is to treat the vTOM bit as a protection flag, not
as part of the physical address. This new approach is like the approach
for the GPA.SHARED bit in Intel TDX. Rather than creating a second kernel
virtual mapping, the existing mapping is updated using recently added
coco mechanisms. When memory is changed between private and shared using
set_memory_decrypted() and set_memory_encrypted(), the PTEs for the
existing kernel mapping are changed to add or remove the vTOM bit
in the guest physical address, just as with TDX. The hypercalls to
change the memory status on the host side are made using the existing
callback mechanism. Everything just works, with a minor tweak to map
the I/O APIC to use private accesses.
To accomplish the switch in approach, the following must be done in
this single patch:
* Update Hyper-V initialization to set the cc_mask based on vTOM
and do other coco initialization.
* Update physical_mask so the vTOM bit is no longer treated as part
of the physical address
* Update cc_mkenc() and cc_mkdec() to be active for Hyper-V guests.
This makes the vTOM bit part of the protection flags.
* Code already exists to make hypercalls to inform Hyper-V about pages
changing between shared and private. Update this code to run as a
callback from __set_memory_enc_pgtable().
* Remove the Hyper-V special case from __set_memory_enc_dec()
* Remove the Hyper-V specific call to swiotlb_update_mem_attributes()
since mem_encrypt_init() will now do it.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/coco/core.c | 11 +++++++++-
arch/x86/hyperv/hv_init.c | 11 ----------
arch/x86/hyperv/ivm.c | 45 +++++++++++++++++++++++++++++++----------
arch/x86/include/asm/mshyperv.h | 8 ++------
arch/x86/kernel/cpu/mshyperv.c | 15 +++++++-------
arch/x86/mm/pat/set_memory.c | 3 ---
6 files changed, 53 insertions(+), 40 deletions(-)
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 49b44f8..f5e1f2d 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -78,7 +78,14 @@ static bool amd_cc_platform_has(enum cc_attr attr)
static bool hyperv_cc_platform_has(enum cc_attr attr)
{
- return attr == CC_ATTR_GUEST_MEM_ENCRYPT;
+ switch (attr) {
+ case CC_ATTR_GUEST_MEM_ENCRYPT:
+ case CC_ATTR_MEM_ENCRYPT:
+ case CC_ATTR_EMULATED_IOAPIC:
+ return true;
+ default:
+ return false;
+ }
}
bool cc_platform_has(enum cc_attr attr)
@@ -108,6 +115,7 @@ u64 cc_mkenc(u64 val)
switch (vendor) {
case CC_VENDOR_AMD:
return val | cc_mask;
+ case CC_VENDOR_HYPERV:
case CC_VENDOR_INTEL:
return val & ~cc_mask;
default:
@@ -121,6 +129,7 @@ u64 cc_mkdec(u64 val)
switch (vendor) {
case CC_VENDOR_AMD:
return val & ~cc_mask;
+ case CC_VENDOR_HYPERV:
case CC_VENDOR_INTEL:
return val | cc_mask;
default:
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index f49bc3e..89a97d7 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -29,7 +29,6 @@
#include <linux/syscore_ops.h>
#include <clocksource/hyperv_timer.h>
#include <linux/highmem.h>
-#include <linux/swiotlb.h>
int hyperv_init_cpuhp;
u64 hv_current_partition_id = ~0ull;
@@ -504,16 +503,6 @@ void __init hyperv_init(void)
/* Query the VMs extended capability once, so that it can be cached. */
hv_query_ext_cap(0);
-#ifdef CONFIG_SWIOTLB
- /*
- * Swiotlb bounce buffer needs to be mapped in extra address
- * space. Map function doesn't work in the early place and so
- * call swiotlb_update_mem_attributes() here.
- */
- if (hv_is_isolation_supported())
- swiotlb_update_mem_attributes();
-#endif
-
return;
clean_guest_os_id:
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index e8be4c2..29ccbe8 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -13,6 +13,7 @@
#include <asm/svm.h>
#include <asm/sev.h>
#include <asm/io.h>
+#include <asm/coco.h>
#include <asm/mshyperv.h>
#include <asm/hypervisor.h>
@@ -233,7 +234,6 @@ void hv_ghcb_msr_read(u64 msr, u64 *value)
local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(hv_ghcb_msr_read);
-#endif
/*
* hv_mark_gpa_visibility - Set pages visible to host via hvcall.
@@ -286,27 +286,25 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
}
/*
- * hv_set_mem_host_visibility - Set specified memory visible to host.
+ * hv_vtom_set_host_visibility - Set specified memory visible to host.
*
* In Isolation VM, all guest memory is encrypted from host and guest
* needs to set memory visible to host via hvcall before sharing memory
* with host. This function works as wrap of hv_mark_gpa_visibility()
* with memory base and size.
*/
-int hv_set_mem_host_visibility(unsigned long kbuffer, int pagecount, bool visible)
+static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bool enc)
{
- enum hv_mem_host_visibility visibility = visible ?
- VMBUS_PAGE_VISIBLE_READ_WRITE : VMBUS_PAGE_NOT_VISIBLE;
+ enum hv_mem_host_visibility visibility = enc ?
+ VMBUS_PAGE_NOT_VISIBLE : VMBUS_PAGE_VISIBLE_READ_WRITE;
u64 *pfn_array;
int ret = 0;
+ bool result = true;
int i, pfn;
- if (!hv_is_isolation_supported() || !hv_hypercall_pg)
- return 0;
-
pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
if (!pfn_array)
- return -ENOMEM;
+ return false;
for (i = 0, pfn = 0; i < pagecount; i++) {
pfn_array[pfn] = virt_to_hvpfn((void *)kbuffer + i * HV_HYP_PAGE_SIZE);
@@ -315,17 +313,42 @@ int hv_set_mem_host_visibility(unsigned long kbuffer, int pagecount, bool visibl
if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
ret = hv_mark_gpa_visibility(pfn, pfn_array,
visibility);
- if (ret)
+ if (ret) {
+ result = false;
goto err_free_pfn_array;
+ }
pfn = 0;
}
}
err_free_pfn_array:
kfree(pfn_array);
- return ret;
+ return result;
+}
+
+static bool hv_vtom_tlb_flush_required(bool private)
+{
+ return true;
}
+static bool hv_vtom_cache_flush_required(void)
+{
+ return false;
+}
+
+void __init hv_vtom_init(void)
+{
+ cc_set_vendor(CC_VENDOR_HYPERV);
+ cc_set_mask(ms_hyperv.shared_gpa_boundary);
+ physical_mask &= ms_hyperv.shared_gpa_boundary - 1;
+
+ x86_platform.guest.enc_cache_flush_required = hv_vtom_cache_flush_required;
+ x86_platform.guest.enc_tlb_flush_required = hv_vtom_tlb_flush_required;
+ x86_platform.guest.enc_status_change_finish = hv_vtom_set_host_visibility;
+}
+
+#endif /* CONFIG_AMD_MEM_ENCRYPT */
+
/*
* hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.
*/
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 61f0c20..59b3310 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -174,18 +174,19 @@ static inline void hv_apic_init(void) {}
int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
struct hv_interrupt_entry *entry);
int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
-int hv_set_mem_host_visibility(unsigned long addr, int numpages, bool visible);
#ifdef CONFIG_AMD_MEM_ENCRYPT
void hv_ghcb_msr_write(u64 msr, u64 value);
void hv_ghcb_msr_read(u64 msr, u64 *value);
bool hv_ghcb_negotiate_protocol(void);
void hv_ghcb_terminate(unsigned int set, unsigned int reason);
+void hv_vtom_init(void);
#else
static inline void hv_ghcb_msr_write(u64 msr, u64 value) {}
static inline void hv_ghcb_msr_read(u64 msr, u64 *value) {}
static inline bool hv_ghcb_negotiate_protocol(void) { return false; }
static inline void hv_ghcb_terminate(unsigned int set, unsigned int reason) {}
+static inline void hv_vtom_init(void) {}
#endif
extern bool hv_isolation_type_snp(void);
@@ -241,11 +242,6 @@ static inline int hyperv_flush_guest_mapping_range(u64 as,
}
static inline void hv_set_register(unsigned int reg, u64 value) { }
static inline u64 hv_get_register(unsigned int reg) { return 0; }
-static inline int hv_set_mem_host_visibility(unsigned long addr, int numpages,
- bool visible)
-{
- return -1;
-}
#endif /* CONFIG_HYPERV */
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 8316139..b080795 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -33,7 +33,6 @@
#include <asm/nmi.h>
#include <clocksource/hyperv_timer.h>
#include <asm/numa.h>
-#include <asm/coco.h>
/* Is Linux running as the root partition? */
bool hv_root_partition;
@@ -325,8 +324,10 @@ static void __init ms_hyperv_init_platform(void)
if (ms_hyperv.priv_high & HV_ISOLATION) {
ms_hyperv.isolation_config_a = cpuid_eax(HYPERV_CPUID_ISOLATION_CONFIG);
ms_hyperv.isolation_config_b = cpuid_ebx(HYPERV_CPUID_ISOLATION_CONFIG);
- ms_hyperv.shared_gpa_boundary =
- BIT_ULL(ms_hyperv.shared_gpa_boundary_bits);
+
+ if (ms_hyperv.shared_gpa_boundary_active)
+ ms_hyperv.shared_gpa_boundary =
+ BIT_ULL(ms_hyperv.shared_gpa_boundary_bits);
pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n",
ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b);
@@ -337,11 +338,6 @@ static void __init ms_hyperv_init_platform(void)
swiotlb_unencrypted_base = ms_hyperv.shared_gpa_boundary;
#endif
}
- /* Isolation VMs are unenlightened SEV-based VMs, thus this check: */
- if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
- if (hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE)
- cc_set_vendor(CC_VENDOR_HYPERV);
- }
}
if (hv_max_functions_eax >= HYPERV_CPUID_NESTED_FEATURES) {
@@ -410,6 +406,9 @@ static void __init ms_hyperv_init_platform(void)
i8253_clear_counter_on_shutdown = false;
#if IS_ENABLED(CONFIG_HYPERV)
+ if ((hv_get_isolation_type() == HV_ISOLATION_TYPE_VBS) ||
+ (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP))
+ hv_vtom_init();
/*
* Setup the hook to get control post apic initialization.
*/
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 06eb8910..0757cfe 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2126,9 +2126,6 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
{
- if (hv_is_isolation_supported())
- return hv_set_mem_host_visibility(addr, numpages, !enc);
-
if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
return __set_memory_enc_pgtable(addr, numpages, enc);
--
1.8.3.1
With the vTOM bit now treated as a protection flag and not part of
the physical address, avoid remapping physical addresses with vTOM set
since technically such addresses aren't valid. Use ioremap_cache()
instead of memremap() to ensure that the mapping provides decrypted
access, which will correctly set the vTOM bit as a protection flag.
While this change is not required for correctness with the current
implementation of memremap(), for general code hygiene it's better to
not depend on the mapping functions doing something reasonable with
a physical address that is out of range.
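Expressed as code, the pattern applied at each call site in this patch
is roughly the following (a sketch of the common idiom, not any one
hunk verbatim):

	/* Mask out the vTOM bit; ioremap_cache() maps decrypted */
	gpa &= ~ms_hyperv.shared_gpa_boundary;
	va = (void *)ioremap_cache(gpa, HV_HYP_PAGE_SIZE);

with the matching unmap done via iounmap() instead of memunmap().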
While here, fix typos in two error messages.
Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Tianyu Lan <[email protected]>
---
arch/x86/hyperv/hv_init.c | 7 +++++--
drivers/hv/hv.c | 23 +++++++++++++----------
2 files changed, 18 insertions(+), 12 deletions(-)
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 89a97d7..1346627 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -63,7 +63,10 @@ static int hyperv_init_ghcb(void)
* memory boundary and map it here.
*/
rdmsrl(MSR_AMD64_SEV_ES_GHCB, ghcb_gpa);
- ghcb_va = memremap(ghcb_gpa, HV_HYP_PAGE_SIZE, MEMREMAP_WB);
+
+ /* Mask out vTOM bit. ioremap_cache() maps decrypted */
+ ghcb_gpa &= ~ms_hyperv.shared_gpa_boundary;
+ ghcb_va = (void *)ioremap_cache(ghcb_gpa, HV_HYP_PAGE_SIZE);
if (!ghcb_va)
return -ENOMEM;
@@ -219,7 +222,7 @@ static int hv_cpu_die(unsigned int cpu)
if (hv_ghcb_pg) {
ghcb_va = (void **)this_cpu_ptr(hv_ghcb_pg);
if (*ghcb_va)
- memunmap(*ghcb_va);
+ iounmap(*ghcb_va);
*ghcb_va = NULL;
}
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 4d6480d..410e6c4 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -217,11 +217,13 @@ void hv_synic_enable_regs(unsigned int cpu)
simp.simp_enabled = 1;
if (hv_isolation_type_snp()) {
+ /* Mask out vTOM bit. ioremap_cache() maps decrypted */
+ u64 base = (simp.base_simp_gpa << HV_HYP_PAGE_SHIFT) &
+ ~ms_hyperv.shared_gpa_boundary;
hv_cpu->synic_message_page
- = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
- HV_HYP_PAGE_SIZE, MEMREMAP_WB);
+ = (void *)ioremap_cache(base, HV_HYP_PAGE_SIZE);
if (!hv_cpu->synic_message_page)
- pr_err("Fail to map syinc message page.\n");
+ pr_err("Fail to map synic message page.\n");
} else {
simp.base_simp_gpa = virt_to_phys(hv_cpu->synic_message_page)
>> HV_HYP_PAGE_SHIFT;
@@ -234,12 +236,13 @@ void hv_synic_enable_regs(unsigned int cpu)
siefp.siefp_enabled = 1;
if (hv_isolation_type_snp()) {
- hv_cpu->synic_event_page =
- memremap(siefp.base_siefp_gpa << HV_HYP_PAGE_SHIFT,
- HV_HYP_PAGE_SIZE, MEMREMAP_WB);
-
+ /* Mask out vTOM bit. ioremap_cache() maps decrypted */
+ u64 base = (siefp.base_siefp_gpa << HV_HYP_PAGE_SHIFT) &
+ ~ms_hyperv.shared_gpa_boundary;
+ hv_cpu->synic_event_page
+ = (void *)ioremap_cache(base, HV_HYP_PAGE_SIZE);
if (!hv_cpu->synic_event_page)
- pr_err("Fail to map syinc event page.\n");
+ pr_err("Fail to map synic event page.\n");
} else {
siefp.base_siefp_gpa = virt_to_phys(hv_cpu->synic_event_page)
>> HV_HYP_PAGE_SHIFT;
@@ -316,7 +319,7 @@ void hv_synic_disable_regs(unsigned int cpu)
*/
simp.simp_enabled = 0;
if (hv_isolation_type_snp())
- memunmap(hv_cpu->synic_message_page);
+ iounmap(hv_cpu->synic_message_page);
else
simp.base_simp_gpa = 0;
@@ -326,7 +329,7 @@ void hv_synic_disable_regs(unsigned int cpu)
siefp.siefp_enabled = 0;
if (hv_isolation_type_snp())
- memunmap(hv_cpu->synic_event_page);
+ iounmap(hv_cpu->synic_event_page);
else
siefp.base_siefp_gpa = 0;
--
1.8.3.1
On 11/16/22 12:41, Michael Kelley wrote:
> Current code in sme_postprocess_startup() decrypts the bss_decrypted
> section when sme_me_mask is non-zero. But code in
> mem_encrypt_free_decrytped_mem() re-encrypts the unused portion based
> on CC_ATTR_MEM_ENCRYPT. In a Hyper-V guest VM using vTOM, these
> conditions are not equivalent as sme_me_mask is always zero when
> using vTOM. Consequently, mem_encrypt_free_decrypted_mem() attempts
> to re-encrypt memory that was never decrypted.
>
> Fix this in mem_encrypt_free_decrypted_mem() by conditioning the
> re-encryption on the same test for non-zero sme_me_mask. Hyper-V
> guests using vTOM don't need the bss_decrypted section to be
> decrypted, so skipping the decryption/re-encryption doesn't cause
> a problem.
>
> Signed-off-by: Michael Kelley <[email protected]>
> ---
> arch/x86/mm/mem_encrypt_amd.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
> index 9c4d8db..5a51343 100644
> --- a/arch/x86/mm/mem_encrypt_amd.c
> +++ b/arch/x86/mm/mem_encrypt_amd.c
> @@ -513,10 +513,14 @@ void __init mem_encrypt_free_decrypted_mem(void)
> npages = (vaddr_end - vaddr) >> PAGE_SHIFT;
>
> /*
> - * The unused memory range was mapped decrypted, change the encryption
> - * attribute from decrypted to encrypted before freeing it.
> + * If the unused memory range was mapped decrypted, change the encryption
> + * attribute from decrypted to encrypted before freeing it. Base the
> + * re-encryption on the same condition used for the decryption in
> + * sme_postprocess_startup(). Higher level abstractions, such as
> + * CC_ATTR_MEM_ENCRYPT, aren't necessarily equivalent in a Hyper-V VM
> + * using vTOM, where sme_me_mask is always zero.
> */
> - if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> + if (sme_get_me_mask()) {
To be consistent within this file, you should use sme_me_mask directly.
Thanks,
Tom
> r = set_memory_encrypted(vaddr, npages);
> if (r) {
> pr_warn("failed to free unused decrypted pages\n");
On 11/16/22 12:41, Michael Kelley wrote:
> Full Hyper-V initialization, including support for hypercalls, is done
> as an apic_post_init callback via late_time_init(). mem_encrypt_init()
> needs to make hypercalls when it marks swiotlb memory as decrypted.
> But mem_encrypt_init() is currently called a few lines before
> late_time_init(), so the hypercalls don't work.
>
> Fix this by moving mem_encrypt_init() after late_time_init() and
> related clock initializations. The intervening initializations don't
> do any I/O that requires the swiotlb, so moving mem_encrypt_init()
> slightly later has no impact.
>
> Signed-off-by: Michael Kelley <[email protected]>
Some quick testing with mem_encrypt_init() in the new location hasn't
shown any problems under SME/SEV.
Reviewed-by: Tom Lendacky <[email protected]>
> ---
> init/main.c | 19 +++++++++++--------
> 1 file changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/init/main.c b/init/main.c
> index e1c3911..5a7c466 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1088,14 +1088,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
> */
> locking_selftest();
>
> - /*
> - * This needs to be called before any devices perform DMA
> - * operations that might use the SWIOTLB bounce buffers. It will
> - * mark the bounce buffers as decrypted so that their usage will
> - * not cause "plain-text" data to be decrypted when accessed.
> - */
> - mem_encrypt_init();
> -
> #ifdef CONFIG_BLK_DEV_INITRD
> if (initrd_start && !initrd_below_start_ok &&
> page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
> @@ -1112,6 +1104,17 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
> late_time_init();
> sched_clock_init();
> calibrate_delay();
> +
> + /*
> + * This needs to be called before any devices perform DMA
> + * operations that might use the SWIOTLB bounce buffers. It will
> + * mark the bounce buffers as decrypted so that their usage will
> + * not cause "plain-text" data to be decrypted when accessed. It
> + * must be called after late_time_init() so that Hyper-V x86/x64
> + * hypercalls work when the SWIOTLB bounce buffers are decrypted.
> + */
> + mem_encrypt_init();
> +
> pid_idr_init();
> anon_vma_init();
> #ifdef CONFIG_X86
On 11/16/22 14:35, Tom Lendacky wrote:
> On 11/16/22 12:41, Michael Kelley wrote:
>> Current code in sme_postprocess_startup() decrypts the bss_decrypted
>> section when sme_me_mask is non-zero. But code in
>> mem_encrypt_free_decrytped_mem() re-encrypts the unused portion based
>> on CC_ATTR_MEM_ENCRYPT. In a Hyper-V guest VM using vTOM, these
>> conditions are not equivalent as sme_me_mask is always zero when
>> using vTOM. Consequently, mem_encrypt_free_decrypted_mem() attempts
>> to re-encrypt memory that was never decrypted.
>>
>> Fix this in mem_encrypt_free_decrypted_mem() by conditioning the
>> re-encryption on the same test for non-zero sme_me_mask. Hyper-V
>> guests using vTOM don't need the bss_decrypted section to be
>> decrypted, so skipping the decryption/re-encryption doesn't cause
>> a problem.
>>
>> Signed-off-by: Michael Kelley <[email protected]>
Meant to add this in the previous reply...
With the change to use sme_me_mask directly
Reviewed-by: Tom Lendacky <[email protected]>
>> ---
>>  arch/x86/mm/mem_encrypt_amd.c | 10 +++++++---
>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
>> index 9c4d8db..5a51343 100644
>> --- a/arch/x86/mm/mem_encrypt_amd.c
>> +++ b/arch/x86/mm/mem_encrypt_amd.c
>> @@ -513,10 +513,14 @@ void __init mem_encrypt_free_decrypted_mem(void)
>>      npages = (vaddr_end - vaddr) >> PAGE_SHIFT;
>>      /*
>> -     * The unused memory range was mapped decrypted, change the encryption
>> -     * attribute from decrypted to encrypted before freeing it.
>> +     * If the unused memory range was mapped decrypted, change the encryption
>> +     * attribute from decrypted to encrypted before freeing it. Base the
>> +     * re-encryption on the same condition used for the decryption in
>> +     * sme_postprocess_startup(). Higher level abstractions, such as
>> +     * CC_ATTR_MEM_ENCRYPT, aren't necessarily equivalent in a Hyper-V VM
>> +     * using vTOM, where sme_me_mask is always zero.
>>       */
>> -    if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
>> +    if (sme_get_me_mask()) {
>
> To be consistent within this file, you should use sme_me_mask directly.
>
> Thanks,
> Tom
>
>>          r = set_memory_encrypted(vaddr, npages);
>>          if (r) {
>>              pr_warn("failed to free unused decrypted pages\n");
On 11/17/2022 2:41 AM, Michael Kelley wrote:
> Hyper-V guests on AMD SEV-SNP hardware have the option of using the
> "virtual Top Of Memory" (vTOM) feature specified by the SEV-SNP
> architecture. With vTOM, shared vs. private memory accesses are
> controlled by splitting the guest physical address space into two
> halves. vTOM is the dividing line where the uppermost bit of the
> physical address space is set; e.g., with 47 bits of guest physical
> address space, vTOM is 0x400000000000 (bit 46 is set). Guest phyiscal
> memory is accessible at two parallel physical addresses -- one below
> vTOM and one above vTOM. Accesses below vTOM are private (encrypted)
> while accesses above vTOM are shared (decrypted). In this sense, vTOM
> is like the GPA.SHARED bit in Intel TDX.
>
> Support for Hyper-V guests using vTOM was added to the Linux kernel in
> two patch sets[1][2]. This support treats the vTOM bit as part of
> the physical address. For accessing shared (decrypted) memory, these
> patch sets create a second kernel virtual mapping that maps to physical
> addresses above vTOM.
>
> A better approach is to treat the vTOM bit as a protection flag, not
> as part of the physical address. This new approach is like the approach
> for the GPA.SHARED bit in Intel TDX. Rather than creating a second kernel
> virtual mapping, the existing mapping is updated using recently added
> coco mechanisms. When memory is changed between private and shared using
> set_memory_decrypted() and set_memory_encrypted(), the PTEs for the
> existing kernel mapping are changed to add or remove the vTOM bit
> in the guest physical address, just as with TDX. The hypercalls to
> change the memory status on the host side are made using the existing
> callback mechanism. Everything just works, with a minor tweak to map
> the I/O APIC to use private accesses.
>
> To accomplish the switch in approach, the following must be done in
> this single patch:
>
> * Update Hyper-V initialization to set the cc_mask based on vTOM
> and do other coco initialization.
>
> * Update physical_mask so the vTOM bit is no longer treated as part
> of the physical address
>
> * Update cc_mkenc() and cc_mkdec() to be active for Hyper-V guests.
> This makes the vTOM bit part of the protection flags.
>
> * Code already exists to make hypercalls to inform Hyper-V about pages
> changing between shared and private. Update this code to run as a
> callback from __set_memory_enc_pgtable().
>
> * Remove the Hyper-V special case from __set_memory_enc_dec()
>
> * Remove the Hyper-V specific call to swiotlb_update_mem_attributes()
> since mem_encrypt_init() will now do it.
>
> [1]https://lore.kernel.org/all/[email protected]/
> [2]https://lore.kernel.org/all/[email protected]/
>
> Signed-off-by: Michael Kelley<[email protected]>
> ---
> arch/x86/coco/core.c | 11 +++++++++-
> arch/x86/hyperv/hv_init.c | 11 ----------
> arch/x86/hyperv/ivm.c | 45 +++++++++++++++++++++++++++++++----------
> arch/x86/include/asm/mshyperv.h | 8 ++------
> arch/x86/kernel/cpu/mshyperv.c | 15 +++++++-------
> arch/x86/mm/pat/set_memory.c | 3 ---
> 6 files changed, 53 insertions(+), 40 deletions(-)
Reviewed-by: Tianyu Lan <[email protected]>
On Wed, Nov 16, 2022 at 10:41:36AM -0800, Michael Kelley wrote:
[...]
>
> +static void hv_pci_read_mmio(struct device *dev, phys_addr_t gpa, int size, u32 *val)
> +{
> + struct hv_mmio_read_input *in;
> + struct hv_mmio_read_output *out;
> + u64 ret;
> +
> + /*
> + * Must be called with interrupts disabled so it is safe
> + * to use the per-cpu input argument page. Use it for
> + * both input and output.
> + */
Perhaps adding something along this line?
WARN_ON(!irqs_disabled());
I can fold this in if you agree.
> + in = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + out = *this_cpu_ptr(hyperv_pcpu_input_arg) + sizeof(*in);
> + in->gpa = gpa;
> + in->size = size;
> +
> + ret = hv_do_hypercall(HVCALL_MMIO_READ, in, out);
> + if (hv_result_success(ret)) {
> + switch (size) {
> + case 1:
> + *val = *(u8 *)(out->data);
> + break;
> + case 2:
> + *val = *(u16 *)(out->data);
> + break;
> + default:
> + *val = *(u32 *)(out->data);
> + break;
> + }
> + } else
> + dev_err(dev, "MMIO read hypercall error %llx addr %llx size %d\n",
> + ret, gpa, size);
> +}
> +
> +static void hv_pci_write_mmio(struct device *dev, phys_addr_t gpa, int size, u32 val)
> +{
> + struct hv_mmio_write_input *in;
> + u64 ret;
> +
> + /*
> + * Must be called with interrupts disabled so it is safe
> + * to use the per-cpu input argument memory.
> + */
Ditto.
Thanks,
Wei.
From: Wei Liu <[email protected]> Sent: Thursday, November 17, 2022 7:17 AM
>
> On Wed, Nov 16, 2022 at 10:41:36AM -0800, Michael Kelley wrote:
> [...]
> >
> > +static void hv_pci_read_mmio(struct device *dev, phys_addr_t gpa, int size, u32
> *val)
> > +{
> > + struct hv_mmio_read_input *in;
> > + struct hv_mmio_read_output *out;
> > + u64 ret;
> > +
> > + /*
> > + * Must be called with interrupts disabled so it is safe
> > + * to use the per-cpu input argument page. Use it for
> > + * both input and output.
> > + */
>
> Perhaps adding something along this line?
>
> WARN_ON(!irqs_disabled());
>
> I can fold this in if you agree.
These two new functions are only called within this module from code
that already has interrupts disabled (as added in Patch 14 of the series),
so I didn't do the extra check. But I'm OK with adding it. These functions
make a hypercall, so the additional check doesn't have enough perf
impact to matter.
Michael
>
> > + in = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > + out = *this_cpu_ptr(hyperv_pcpu_input_arg) + sizeof(*in);
> > + in->gpa = gpa;
> > + in->size = size;
> > +
> > + ret = hv_do_hypercall(HVCALL_MMIO_READ, in, out);
> > + if (hv_result_success(ret)) {
> > + switch (size) {
> > + case 1:
> > + *val = *(u8 *)(out->data);
> > + break;
> > + case 2:
> > + *val = *(u16 *)(out->data);
> > + break;
> > + default:
> > + *val = *(u32 *)(out->data);
> > + break;
> > + }
> > + } else
> > + dev_err(dev, "MMIO read hypercall error %llx addr %llx size %d\n",
> > + ret, gpa, size);
> > +}
> > +
> > +static void hv_pci_write_mmio(struct device *dev, phys_addr_t gpa, int size, u32
> val)
> > +{
> > + struct hv_mmio_write_input *in;
> > + u64 ret;
> > +
> > + /*
> > + * Must be called with interrupts disabled so it is safe
> > + * to use the per-cpu input argument memory.
> > + */
>
> Ditto.
>
> Thanks,
> Wei.
On Thu, Nov 17, 2022, Wei Liu wrote:
> On Wed, Nov 16, 2022 at 10:41:36AM -0800, Michael Kelley wrote:
> [...]
> >
> > +static void hv_pci_read_mmio(struct device *dev, phys_addr_t gpa, int size, u32 *val)
> > +{
> > + struct hv_mmio_read_input *in;
> > + struct hv_mmio_read_output *out;
> > + u64 ret;
> > +
> > + /*
> > + * Must be called with interrupts disabled so it is safe
> > + * to use the per-cpu input argument page. Use it for
> > + * both input and output.
> > + */
There's no need to require interrupts to be disabled to safely use a per-cpu
variable, simply disabling preemption also provides the necessary protection.
And this_cpu_ptr() will complain with CONFIG_DEBUG_PREEMPT=y if preemption isn't
disabled.
IIUC, based on the existing code, what is really being guarded against is an IRQ arriving
and initiating a different hypercall from IRQ context, and thus corrupting the page
from this function's perspective.
> Perhaps adding something along this line?
>
> WARN_ON(!irqs_disabled());
Given that every use of hyperv_pcpu_input_arg except hv_common_cpu_init() disables
IRQs, what about adding a helper to retrieve the pointer and assert that IRQs are
disabled? I.e. add the sanity for all usage, not just this one-off case.
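Something like this, where the helper name is just a placeholder:

	static inline void *hv_pcpu_input_arg(void)
	{
		/* the per-cpu page is only safe to use with IRQs off */
		WARN_ON_ONCE(!irqs_disabled());
		return *this_cpu_ptr(hyperv_pcpu_input_arg);
	}

Callers would then do in = hv_pcpu_input_arg(); instead of open-coding
the this_cpu_ptr() dereference.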
And since CPUHP_AP_ONLINE_DYN => hv_common_cpu_init() runs after scheduling is
activated by CPUHP_AP_SCHED_WAIT_EMPTY, I believe that hv_common_cpu_init() is
theoretically broken. Maybe someone can look at that when fixing he KVM vs.
Hyper-V issue?
https://lore.kernel.org/linux-hyperv/[email protected]
https://lore.kernel.org/all/[email protected]
On Thu, Nov 17, 2022 at 04:14:44PM +0000, Michael Kelley (LINUX) wrote:
> From: Wei Liu <[email protected]> Sent: Thursday, November 17, 2022 7:17 AM
> >
> > On Wed, Nov 16, 2022 at 10:41:36AM -0800, Michael Kelley wrote:
> > [...]
> > >
> > > +static void hv_pci_read_mmio(struct device *dev, phys_addr_t gpa, int size, u32
> > *val)
> > > +{
> > > + struct hv_mmio_read_input *in;
> > > + struct hv_mmio_read_output *out;
> > > + u64 ret;
> > > +
> > > + /*
> > > + * Must be called with interrupts disabled so it is safe
> > > + * to use the per-cpu input argument page. Use it for
> > > + * both input and output.
> > > + */
> >
> > Perhaps adding something along this line?
> >
> > WARN_ON(!irqs_disabled());
> >
> > I can fold this in if you agree.
>
> These two new functions are only called within this module from code
> that already has interrupts disabled (as added in Patch 14 of the series),
> so I didn't do the extra check. But I'm OK with adding it. These functions
> make a hypercall, so the additional check doesn't have enough perf
> impact to matter.
Okay, not adding them is fine too.
Thanks,
Wei.
On Thu, Nov 17, 2022 at 04:32:00PM +0000, Sean Christopherson wrote:
> On Thu, Nov 17, 2022, Wei Liu wrote:
> > On Wed, Nov 16, 2022 at 10:41:36AM -0800, Michael Kelley wrote:
> > [...]
> > >
> > > +static void hv_pci_read_mmio(struct device *dev, phys_addr_t gpa, int size, u32 *val)
> > > +{
> > > + struct hv_mmio_read_input *in;
> > > + struct hv_mmio_read_output *out;
> > > + u64 ret;
> > > +
> > > + /*
> > > + * Must be called with interrupts disabled so it is safe
> > > + * to use the per-cpu input argument page. Use it for
> > > + * both input and output.
> > > + */
>
> There's no need to require interrupts to be disabled to safely use a per-cpu
> variable, simply disabling preemption also provides the necessary protection.
> And this_cpu_ptr() will complain with CONFIG_DEBUG_PREEMPT=y if preemption isn't
> disabled.
>
> IIUC, based on the existing code, what is really being guarded against is an IRQ arriving
> and initiating a different hypercall from IRQ context, and thus corrupting the page
> from this function's perspective.
Exactly. Michael's comment did not say this explicitly but that's what's
being guarded.
>
> > Perhaps adding something along this line?
> >
> > WARN_ON(!irqs_disabled());
>
> Given that every use of hyperv_pcpu_input_arg except hv_common_cpu_init() disables
> IRQs, what about adding a helper to retrieve the pointer and assert that IRQs are
> disabled? I.e. add the sanity for all usage, not just this one-off case.
>
We can potentially introduce a pair of get/put functions for these pages,
but let's not feature-creep here...
> And since CPUHP_AP_ONLINE_DYN => hv_common_cpu_init() runs after scheduling is
> activated by CPUHP_AP_SCHED_WAIT_EMPTY, I believe that hv_common_cpu_init() is
> theoretically broken. Maybe someone can look at that when fixing the KVM vs.
> Hyper-V issue?
>
> https://lore.kernel.org/linux-hyperv/[email protected]
> https://lore.kernel.org/all/[email protected]
I read the mails before but have not looked into those since they are only
theoretical per those threads. Sorry.
The only scenario I can think of for CPU hotplug right now is the
(upcoming) Linux root kernel, so I guess we will cross that bridge when
we get there.
Thanks,
Wei.
> -----Original Message-----
> From: Michael Kelley (LINUX) <[email protected]>
> Sent: Wednesday, November 16, 2022 1:42 PM
> To: [email protected]; KY Srinivasan <[email protected]>; Haiyang Zhang
> <[email protected]>; [email protected]; Dexuan Cui
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; Tianyu Lan
> <[email protected]>; [email protected];
> [email protected]; [email protected];
> [email protected]; Williams, Dan J <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]
> Cc: Michael Kelley (LINUX) <[email protected]>
> Subject: [Patch v3 13/14] PCI: hv: Add hypercalls to read/write MMIO space
>
> To support PCI pass-thru devices in Confidential VMs, Hyper-V
> has added hypercalls to read and write MMIO space. Add the
> appropriate definitions to hyperv-tlfs.h and implement
> functions to make the hypercalls. These functions are used
> in a subsequent patch.
>
> Co-developed-by: Dexuan Cui <[email protected]>
> Signed-off-by: Dexuan Cui <[email protected]>
> Signed-off-by: Michael Kelley <[email protected]>
> ---
Reviewed-by: Haiyang Zhang <[email protected]>
One question - Will you put the document in patch 0 directly into some place of
the src tree?
On 11/16/22 10:41 AM, Michael Kelley wrote:
> Current code in sme_postprocess_startup() decrypts the bss_decrypted
> section when sme_me_mask is non-zero. But code in
> mem_encrypt_free_decrytped_mem() re-encrypts the unused portion based
> on CC_ATTR_MEM_ENCRYPT. In a Hyper-V guest VM using vTOM, these
> conditions are not equivalent as sme_me_mask is always zero when
> using vTOM. Consequently, mem_encrypt_free_decrypted_mem() attempts
> to re-encrypt memory that was never decrypted.
>
> Fix this in mem_encrypt_free_decrypted_mem() by conditioning the
> re-encryption on the same test for non-zero sme_me_mask. Hyper-V
> guests using vTOM don't need the bss_decrypted section to be
> decrypted, so skipping the decryption/re-encryption doesn't cause
> a problem.
>
Do you think it needs Fixes tag?
> Signed-off-by: Michael Kelley <[email protected]>
> ---
> arch/x86/mm/mem_encrypt_amd.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
> index 9c4d8db..5a51343 100644
> --- a/arch/x86/mm/mem_encrypt_amd.c
> +++ b/arch/x86/mm/mem_encrypt_amd.c
> @@ -513,10 +513,14 @@ void __init mem_encrypt_free_decrypted_mem(void)
> npages = (vaddr_end - vaddr) >> PAGE_SHIFT;
>
> /*
> - * The unused memory range was mapped decrypted, change the encryption
> - * attribute from decrypted to encrypted before freeing it.
> + * If the unused memory range was mapped decrypted, change the encryption
> + * attribute from decrypted to encrypted before freeing it. Base the
> + * re-encryption on the same condition used for the decryption in
> + * sme_postprocess_startup(). Higher level abstractions, such as
> + * CC_ATTR_MEM_ENCRYPT, aren't necessarily equivalent in a Hyper-V VM
> + * using vTOM, where sme_me_mask is always zero.
> */
> - if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> + if (sme_get_me_mask()) {
> r = set_memory_encrypted(vaddr, npages);
> if (r) {
> pr_warn("failed to free unused decrypted pages\n");
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On 11/16/22 10:41 AM, Michael Kelley wrote:
> Current code always maps the IOAPIC as shared (decrypted) in a
> confidential VM. But Hyper-V guest VMs on AMD SEV-SNP with vTOM
> enabled use a paravisor running in VMPL0 to emulate the IOAPIC.
> In such a case, the IOAPIC must be accessed as private (encrypted).
>
> Fix this by gating the IOAPIC decrypted mapping on a new
> cc_platform_has() attribute that a subsequent patch in the series
> will set only for Hyper-V guests.
>
> Signed-off-by: Michael Kelley <[email protected]>
> Reviewed-by: Wei Liu <[email protected]>
> ---
Looks fine to me.
Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
> arch/x86/kernel/apic/io_apic.c | 3 ++-
> include/linux/cc_platform.h | 12 ++++++++++++
> 2 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> index a868b76..c65e0cc 100644
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -2686,7 +2686,8 @@ static void io_apic_set_fixmap(enum fixed_addresses idx, phys_addr_t phys)
> * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
> * bits, just like normal ioremap():
> */
> - flags = pgprot_decrypted(flags);
> + if (!cc_platform_has(CC_ATTR_EMULATED_IOAPIC))
> + flags = pgprot_decrypted(flags);
>
> __set_fixmap(idx, phys, flags);
> }
> diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
> index cb0d6cd..7a0da75 100644
> --- a/include/linux/cc_platform.h
> +++ b/include/linux/cc_platform.h
> @@ -90,6 +90,18 @@ enum cc_attr {
> * Examples include TDX Guest.
> */
> CC_ATTR_HOTPLUG_DISABLED,
> +
> + /**
> + * @CC_ATTR_EMULATED_IOAPIC: Guest VM has an emulated I/O APIC
> + *
> + * The platform/OS is running as a guest/virtual machine with
> + * an I/O APIC that is emulated by a paravisor running in the
> + * guest VM context. As such, the I/O APIC is accessed in the
> + * encrypted portion of the guest physical address space.
> + *
> + * Examples include Hyper-V SEV-SNP guests using vTOM.
> + */
> + CC_ATTR_EMULATED_IOAPIC,
> };
>
> #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
From: Haiyang Zhang <[email protected]> Sent: Thursday, November 17, 2022 10:33 AM
> >
> > From: Michael Kelley (LINUX) <[email protected]> Sent: Wednesday, November 16, 2022 1:42 PM
> >
> > To support PCI pass-thru devices in Confidential VMs, Hyper-V
> > has added hypercalls to read and write MMIO space. Add the
> > appropriate definitions to hyperv-tlfs.h and implement
> > functions to make the hypercalls. These functions are used
> > in a subsequent patch.
> >
> > Co-developed-by: Dexuan Cui <[email protected]>
> > Signed-off-by: Dexuan Cui <[email protected]>
> > Signed-off-by: Michael Kelley <[email protected]>
> > ---
>
> Reviewed-by: Haiyang Zhang <[email protected]>
>
> One question - Will you put the document in patch 0 directly into some place of
> the src tree?
>
Not directly. Patch 0 is supposed to be a summary of a patch set, and it doesn't
have a place in the 'git' repo, though it will live on in the email archives. But the
details are captured in the individual patch commit messages, particularly in
Patch 7 for this series.
Separately, I want to add Confidential VMs as a topic in the Hyper-V documentation
under Documentation/virt/hyperv. I'll do that once our work related to confidential
computing is relatively stable.
Michael
From: Sathyanarayanan Kuppuswamy <[email protected]>
>
> On 11/16/22 10:41 AM, Michael Kelley wrote:
> > Current code in sme_postprocess_startup() decrypts the bss_decrypted
> > section when sme_me_mask is non-zero. But code in
> > mem_encrypt_free_decrytped_mem() re-encrypts the unused portion based
> > on CC_ATTR_MEM_ENCRYPT. In a Hyper-V guest VM using vTOM, these
> > conditions are not equivalent as sme_me_mask is always zero when
> > using vTOM. Consequently, mem_encrypt_free_decrypted_mem() attempts
> > to re-encrypt memory that was never decrypted.
> >
> > Fix this in mem_encrypt_free_decrypted_mem() by conditioning the
> > re-encryption on the same test for non-zero sme_me_mask. Hyper-V
> > guests using vTOM don't need the bss_decrypted section to be
> > decrypted, so skipping the decryption/re-encryption doesn't cause
> > a problem.
> >
>
> Do you think it needs Fixes tag?
>
At least for my purposes, it doesn't. The original assumption that non-zero
sme_me_mask and CC_ATTR_MEM_ENCRYPT are equivalent was valid until
this patch series where Hyper-V guests are reporting CC_ATTR_MEM_ENCRYPT
as "true" but sme_me_mask is zero. This patch series won't be backported,
so the old assumption remains valid for older kernels. There's no benefit in
backporting the change.
But I had not thought about TDX. In the TDX case, it appears that
sme_postprocess_startup() will not decrypt the bss_decrypted section.
The corresponding mem_encrypt_free_decrypted_mem() is a no-op unless
CONFIG_AMD_MEM_ENCRYPT is set. But maybe if someone builds a
kernel image that supports both TDX and AMD encryption, it could break
at runtime on a TDX system. I would also note that on a TDX system
without CONFIG_AMD_MEM_ENCRYPT, the unused memory in the
bss_decrypted section never gets freed.
But check my logic. :-) I'm not averse to adding the Fixes: tag if there's a
scenario for TDX where doing the backport will solve a real problem.
And thanks for reviewing the code!
Michael
From: Tom Lendacky <[email protected]> Sent: Wednesday, November 16, 2022 1:16 PM
>
> On 11/16/22 14:35, Tom Lendacky wrote:
> > On 11/16/22 12:41, Michael Kelley wrote:
> >> Current code in sme_postprocess_startup() decrypts the bss_decrypted
> >> section when sme_me_mask is non-zero. But code in
> >> mem_encrypt_free_decrytped_mem() re-encrypts the unused portion based
> >> on CC_ATTR_MEM_ENCRYPT. In a Hyper-V guest VM using vTOM, these
> >> conditions are not equivalent as sme_me_mask is always zero when
> >> using vTOM. Consequently, mem_encrypt_free_decrypted_mem() attempts
> >> to re-encrypt memory that was never decrypted.
> >>
> >> Fix this in mem_encrypt_free_decrypted_mem() by conditioning the
> >> re-encryption on the same test for non-zero sme_me_mask. Hyper-V
> >> guests using vTOM don't need the bss_decrypted section to be
> >> decrypted, so skipping the decryption/re-encryption doesn't cause
> >> a problem.
> >>
> >> Signed-off-by: Michael Kelley <[email protected]>
>
> Meant to add this in the previous reply...
>
> With the change to use sme_me_mask directly
>
> Reviewed-by: Tom Lendacky <[email protected]>
>
Thanks for the reviews. And I see your point about sme_me_mask. I had
not previously noticed that it is defined in the module, so no need to use
a getter function.
Michael
On Wed, Nov 16, 2022 at 10:41:25AM -0800, Michael Kelley wrote:
> Current code always maps the IOAPIC as shared (decrypted) in a
> confidential VM. But Hyper-V guest VMs on AMD SEV-SNP with vTOM
> enabled use a paravisor running in VMPL0 to emulate the IOAPIC.
"IO-APIC" I guess, in all your text.
> In such a case, the IOAPIC must be accessed as private (encrypted).
So the condition for the IO-APIC is pretty specific but the naming
CC_ATTR_EMULATED_IOAPIC is too generic. Other HVs emulate IO-APICs too,
right?
If you have to be precise, the proper check should be (pseudo code):
if (cc_vendor(HYPERV) &&
SNP enabled &&
SNP features has vTOM &&
paravisor in use)
so I guess you're probably better off calling it
CC_ATTR_ACCESS_IOAPIC_ENCRYPTED
which then gets set on exactly those guests and nothing else.
I'd say.
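The io_apic_set_fixmap() test from the earlier patch would then read
(same logic, just the suggested name):

	if (!cc_platform_has(CC_ATTR_ACCESS_IOAPIC_ENCRYPTED))
		flags = pgprot_decrypted(flags);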
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, Nov 16, 2022 at 10:41:24AM -0800, Michael Kelley wrote:
> Current code re-calculates the size after aligning the starting and
> ending physical addresses on a page boundary. But the re-calculation
> also embeds the masking of high order bits that exceed the size of
> the physical address space (via PHYSICAL_PAGE_MASK). If the masking
> removes any high order bits, the size calculation results in a huge
> value that is likely to immediately fail.
>
> Fix this by re-calculating the page-aligned size first. Then mask any
> high order bits using PHYSICAL_PAGE_MASK.
>
> Signed-off-by: Michael Kelley <[email protected]>
> ---
> arch/x86/mm/ioremap.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index 78c5bc6..6453fba 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -217,9 +217,15 @@ static void __ioremap_check_mem(resource_size_t addr, unsigned long size,
> * Mappings have to be page-aligned
> */
> offset = phys_addr & ~PAGE_MASK;
> - phys_addr &= PHYSICAL_PAGE_MASK;
> + phys_addr &= PAGE_MASK;
> size = PAGE_ALIGN(last_addr+1) - phys_addr;
>
> + /*
> + * Mask out any bits not part of the actual physical
> + * address, like memory encryption bits.
> + */
> + phys_addr &= PHYSICAL_PAGE_MASK;
> +
> retval = memtype_reserve(phys_addr, (u64)phys_addr + size,
> pcm, &new_pcm);
> if (retval) {
> --
This looks like a fix to me that needs to go to independently to stable.
And it would need a Fixes tag.
/me does some git archeology...
I guess this one:
ffa71f33a820 ("x86, ioremap: Fix incorrect physical address handling in PAE mode")
should be old enough so that it goes to all relevant stable kernels...
Hmm?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, Nov 16, 2022 at 10:41:29AM -0800, Michael Kelley wrote:
> Fix this by moving mem_encrypt_init() after late_time_init() and
> related clock initializations. The intervening initializations don't
> do any I/O that requires the swiotlb, so moving mem_encrypt_init()
> slightly later has no impact.
I hope you're right. Our boot ordering is fragile as hell. But
mem_encrypt_init() doesn't do a whole lot of important setup - that has
happened a lot earlier already - so I'm not too worried.
But we'll see what breaks in wider testing.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, Nov 16, 2022 at 10:41:30AM -0800, Michael Kelley wrote:
> Hyper-V guests on AMD SEV-SNP hardware have the option of using the
> "virtual Top Of Memory" (vTOM) feature specified by the SEV-SNP
> architecture. With vTOM, shared vs. private memory accesses are
> controlled by splitting the guest physical address space into two
> halves. vTOM is the dividing line where the uppermost bit of the
> physical address space is set; e.g., with 47 bits of guest physical
> address space, vTOM is 0x400000000000 (bit 46 is set). Guest phyiscal
Unknown word [phyiscal] in commit message.
Suggestions: ['physical', 'physically', ... ]
Please introduce a spellchecker into your patch creation workflow.
...
> @@ -108,6 +115,7 @@ u64 cc_mkenc(u64 val)
> switch (vendor) {
> case CC_VENDOR_AMD:
> return val | cc_mask;
> + case CC_VENDOR_HYPERV:
> case CC_VENDOR_INTEL:
> return val & ~cc_mask;
> default:
> @@ -121,6 +129,7 @@ u64 cc_mkdec(u64 val)
> switch (vendor) {
> case CC_VENDOR_AMD:
> return val & ~cc_mask;
> + case CC_VENDOR_HYPERV:
> case CC_VENDOR_INTEL:
> return val | cc_mask;
> default:
Uuuh, this needs a BIG FAT COMMENT.
You're running on SNP and yet the enc/dec meaning is flipped. And that's
because of vTOM.
What happens if you have other types of SNP-based VMs on HyperV? The
isolation VMs thing? Or is that the same?
What happens when you do TDX guests with HyperV?
This becomes wrong then.
I think you need a finer-grained check here in the sense of "is it
a HyperV guest using a paravisor and vTOM is enabled" or so.
Otherwise, I like the removal of the HyperV-specific checks ofc.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, Nov 16, 2022 at 10:41:28AM -0800, Michael Kelley wrote:
> Current code in sme_postprocess_startup() decrypts the bss_decrypted
> section when sme_me_mask is non-zero. But code in
> mem_encrypt_free_decrytped_mem() re-encrypts the unused portion based
^^
letters flipped.
> @@ -513,10 +513,14 @@ void __init mem_encrypt_free_decrypted_mem(void)
> npages = (vaddr_end - vaddr) >> PAGE_SHIFT;
>
> /*
> - * The unused memory range was mapped decrypted, change the encryption
> - * attribute from decrypted to encrypted before freeing it.
> + * If the unused memory range was mapped decrypted, change the encryption
> + * attribute from decrypted to encrypted before freeing it. Base the
> + * re-encryption on the same condition used for the decryption in
> + * sme_postprocess_startup(). Higher level abstractions, such as
> + * CC_ATTR_MEM_ENCRYPT, aren't necessarily equivalent in a Hyper-V VM
> + * using vTOM, where sme_me_mask is always zero.
Good, an example why one needs to pay attention here.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Fri, Nov 18, 2022 at 02:55:32AM +0000, Michael Kelley (LINUX) wrote:
> But I had not thought about TDX. In the TDX case, it appears that
> sme_postprocess_startup() will not decrypt the bss_decrypted section.
> The corresponding mem_encrypt_free_decrypted_mem() is a no-op unless
> CONFIG_AMD_MEM_ENCRYPT is set. But maybe if someone builds a
> kernel image that supports both TDX and AMD encryption, it could break
sme_me_mask better be 0 on a kernel with both built in and running as a
TDX guest.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
From: Borislav Petkov <[email protected]> Sent: Monday, November 21, 2022 5:51 AM
>
> On Wed, Nov 16, 2022 at 10:41:25AM -0800, Michael Kelley wrote:
> > Current code always maps the IOAPIC as shared (decrypted) in a
> > confidential VM. But Hyper-V guest VMs on AMD SEV-SNP with vTOM
> > enabled use a paravisor running in VMPL0 to emulate the IOAPIC.
>
> "IO-APIC" I guess, in all your text.
>
> > In such a case, the IOAPIC must be accessed as private (encrypted).
>
> So the condition for the IO-APIC is pretty specific but the naming
> CC_ATTR_EMULATED_IOAPIC too generic. Other HVs emulate IO-APICs too,
> right?
>
> If you have to be precise, the proper check should be (pseudo code):
>
> if (cc_vendor(HYPERV) &&
> SNP enabled &&
> SNP features has vTOM &&
> paravisor in use)
>
> so I guess you're probably better off calling it
>
> CC_ATTR_ACCESS_IOAPIC_ENCRYPTED
>
> which then gets set on exactly those guests and nothing else.
>
> I'd say.
>
I'm OK with naming it very narrowly. When/if there's a more general
case later, we can generalize to whatever degree is appropriate.
Michael
From: Borislav Petkov <[email protected]> Sent: Monday, November 21, 2022 5:33 AM
>
> On Wed, Nov 16, 2022 at 10:41:24AM -0800, Michael Kelley wrote:
> > Current code re-calculates the size after aligning the starting and
> > ending physical addresses on a page boundary. But the re-calculation
> > also embeds the masking of high order bits that exceed the size of
> > the physical address space (via PHYSICAL_PAGE_MASK). If the masking
> > removes any high order bits, the size calculation results in a huge
> > value that is likely to immediately fail.
> >
> > Fix this by re-calculating the page-aligned size first. Then mask any
> > high order bits using PHYSICAL_PAGE_MASK.
> >
> > Signed-off-by: Michael Kelley <[email protected]>
> > ---
> > arch/x86/mm/ioremap.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> > index 78c5bc6..6453fba 100644
> > --- a/arch/x86/mm/ioremap.c
> > +++ b/arch/x86/mm/ioremap.c
> > @@ -217,9 +217,15 @@ static void __ioremap_check_mem(resource_size_t addr,
> unsigned long size,
> > * Mappings have to be page-aligned
> > */
> > offset = phys_addr & ~PAGE_MASK;
> > - phys_addr &= PHYSICAL_PAGE_MASK;
> > + phys_addr &= PAGE_MASK;
> > size = PAGE_ALIGN(last_addr+1) - phys_addr;
> >
> > + /*
> > + * Mask out any bits not part of the actual physical
> > + * address, like memory encryption bits.
> > + */
> > + phys_addr &= PHYSICAL_PAGE_MASK;
> > +
> > retval = memtype_reserve(phys_addr, (u64)phys_addr + size,
> > pcm, &new_pcm);
> > if (retval) {
> > --
>
> This looks like a fix to me that needs to go to independently to stable.
> And it would need a Fixes tag.
>
> /me does some git archeology...
>
> I guess this one:
>
> ffa71f33a820 ("x86, ioremap: Fix incorrect physical address handling in PAE mode")
>
> should be old enough so that it goes to all relevant stable kernels...
>
> Hmm?
>
As discussed in a parallel thread [1], the incorrect code here doesn't have
any real impact in already released Linux kernels. It only affects the
transition that my patch series implements to change the way vTOM
is handled.
I don't know what the tradeoffs are for backporting a fix that doesn't solve
a real problem vs. just letting it be. Every backport carries some overhead
in the process and there's always a non-zero risk of breaking something.
I've leaned away from adding the "Fixes:" tag in such cases. But if it's better
to go ahead and add the "Fixes:" tag for what's only a theoretical problem,
I'm OK with doing so.
Michael
[1] https://lkml.org/lkml/2022/11/11/1348
On 11/16/22 10:41, Michael Kelley wrote:
> Current code re-calculates the size after aligning the starting and
> ending physical addresses on a page boundary. But the re-calculation
> also embeds the masking of high order bits that exceed the size of
> the physical address space (via PHYSICAL_PAGE_MASK). If the masking
> removes any high order bits, the size calculation results in a huge
> value that is likely to immediately fail.
>
> Fix this by re-calculating the page-aligned size first. Then mask any
> high order bits using PHYSICAL_PAGE_MASK.
>
> Signed-off-by: Michael Kelley <[email protected]>
Looks good:
Acked-by: Dave Hansen <[email protected]>
Although I do agree with Boris that this superficially looks like
something that's important to backport. It would be best to either beef
up the changelog to explain why that's not the case, or to treat this as
an actual fix and submit separately.
On Mon, Nov 21, 2022 at 04:40:16PM +0000, Michael Kelley (LINUX) wrote:
> As discussed in a parallel thread [1], the incorrect code here doesn't have
> any real impact in already released Linux kernels. It only affects the
> transition that my patch series implements to change the way vTOM
> is handled.
Are you sure?
PHYSICAL_PAGE_MASK is controlled by __PHYSICAL_MASK which is determined
by CONFIG_DYNAMIC_PHYSICAL_MASK and __PHYSICAL_MASK_SHIFT which all
differ depending on configurations and also dynamic.
It is probably still ok, in probably all possible cases even though I
wouldn't bet on it.
And this fix is simple and all clear so lemme ask it differently: what
would be any downsides in backporting it to stable, just in case?
> I don't know what the tradeoffs are for backporting a fix that doesn't solve
> a real problem vs. just letting it be. Every backport carries some overhead
> in the process
Have you seen the deluge of stable fixes? :-)
> and there's always a non-zero risk of breaking something.
I don't see how this one would cause any breakage...
> I've leaned away from adding the "Fixes:" tag in such cases. But if
> it's better to go ahead and add the "Fixes:" tag for what's only a
> theoretical problem, I'm OK with doing so.
I think this is a good to have fix anyway as it is Obviously
Correct(tm).
Unless you have any reservations you haven't shared yet...
> [1] https://lkml.org/lkml/2022/11/11/1348
Btw, the proper way to reference a mail message now is simply to do:
https://lore.kernel.org/r/<Message-ID>
as long as it has been posted on some ML which lore archives. And I
think it archives all.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Nov 21, 2022 at 04:43:01PM +0000, Michael Kelley (LINUX) wrote:
> I'm OK with naming it very narrowly. When/if there's a more general
> case later, we can generalize to whatever degree is appropriate.
Exactly.
Those defines are free to change when we see fit.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
From: Borislav Petkov <[email protected]> Sent: Monday, November 21, 2022 11:45 AM
>
> On Mon, Nov 21, 2022 at 04:40:16PM +0000, Michael Kelley (LINUX) wrote:
> > As discussed in a parallel thread [1], the incorrect code here doesn't have
> > any real impact in already released Linux kernels. It only affects the
> > transition that my patch series implements to change the way vTOM
> > is handled.
>
> Are you sure?
>
> PHYSICAL_PAGE_MASK is controlled by __PHYSICAL_MASK which is determined
> by CONFIG_DYNAMIC_PHYSICAL_MASK and __PHYSICAL_MASK_SHIFT which all
> differ depending on configurations and also dynamic.
>
> It is probably still ok, in probably all possible cases even though I
> wouldn't bet on it.
>
> And this fix is simple and all clear so lemme ask it differently: what
> would be any downsides in backporting it to stable, just in case?
None
>
> > I don't know what the tradeoffs are for backporting a fix that doesn't solve
> > a real problem vs. just letting it be. Every backport carries some overhead
> > in the process
>
> Have you seen the deluge of stable fixes? :-)
>
> > and there's always a non-zero risk of breaking something.
>
> I don't see how this one would cause any breakage...
>
> > I've leaned away from adding the "Fixes:" tag in such cases. But if
> > it's better to go ahead and add the "Fixes:" tag for what's only a
> > theoretical problem, I'm OK with doing so.
>
> I think this is a good to have fix anyway as it is Obviously
> Correct(tm).
>
> Unless you have any reservations you haven't shared yet...
>
No reservations. I'll add the "Fixes:" tag.
Michael
From: Dave Hansen <[email protected]> Sent: Monday, November 21, 2022 10:14 AM
>
> On 11/16/22 10:41, Michael Kelley wrote:
> > Current code re-calculates the size after aligning the starting and
> > ending physical addresses on a page boundary. But the re-calculation
> > also embeds the masking of high order bits that exceed the size of
> > the physical address space (via PHYSICAL_PAGE_MASK). If the masking
> > removes any high order bits, the size calculation results in a huge
> > value that is likely to immediately fail.
> >
> > Fix this by re-calculating the page-aligned size first. Then mask any
> > high order bits using PHYSICAL_PAGE_MASK.
> >
> > Signed-off-by: Michael Kelley <[email protected]>
>
> Looks good:
>
> Acked-by: Dave Hansen <[email protected]>
>
> Although I do agree with Boris that this superficially looks like
> something that's important to backport. It would be best to either beef
> up the changelog to explain why that's not the case, or to treat this as
> an actual fix and submit separately.
You and Boris agree and I have no objection, so I'll add the "Fixes:" tag.
I'd like to keep the patch as part of this series because it *is* needed to
make the series work.
Michael
On Mon, Nov 21, 2022 at 09:04:06PM +0000, Michael Kelley (LINUX) wrote:
> You and Boris agree and I have no objection, so I'll add the "Fixes:" tag.
> I'd like to keep the patch as part of this series because it *is* needed to
> make the series work.
Yeah, no worries. I can take it tomorrow through urgent and send it to
Linus this week so whatever you rebase your tree on, it should already
have it.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 11/21/22 6:39 AM, Borislav Petkov wrote:
> On Fri, Nov 18, 2022 at 02:55:32AM +0000, Michael Kelley (LINUX) wrote:
>> But I had not thought about TDX. In the TDX case, it appears that
>> sme_postprocess_startup() will not decrypt the bss_decrypted section.
>> The corresponding mem_encrypt_free_decrypted_mem() is a no-op unless
>> CONFIG_AMD_MEM_ENCRYPT is set. But maybe if someone builds a
>> kernel image that supports both TDX and AMD encryption, it could break
>
> sme_me_mask better be 0 on a kernel with both built in and running as a
> TDX guest.
>
Yes. It will be 0 in TDX. In sme_enable(), AMD code checks for CPUID
support before updating the sme_me_mask.
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 4dbd6a3e90e03130973688fd79e19425f720d999
Gitweb: https://git.kernel.org/tip/4dbd6a3e90e03130973688fd79e19425f720d999
Author: Michael Kelley <[email protected]>
AuthorDate: Wed, 16 Nov 2022 10:41:24 -08:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Tue, 22 Nov 2022 12:21:16 +01:00
x86/ioremap: Fix page aligned size calculation in __ioremap_caller()
Current code re-calculates the size after aligning the starting and
ending physical addresses on a page boundary. But the re-calculation
also embeds the masking of high order bits that exceed the size of
the physical address space (via PHYSICAL_PAGE_MASK). If the masking
removes any high order bits, the size calculation results in a huge
value that is likely to immediately fail.
Fix this by re-calculating the page-aligned size first. Then mask any
high order bits using PHYSICAL_PAGE_MASK.
Fixes: ffa71f33a820 ("x86, ioremap: Fix incorrect physical address handling in PAE mode")
Signed-off-by: Michael Kelley <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/mm/ioremap.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 78c5bc6..6453fba 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -217,9 +217,15 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
 	 * Mappings have to be page-aligned
 	 */
 	offset = phys_addr & ~PAGE_MASK;
-	phys_addr &= PHYSICAL_PAGE_MASK;
+	phys_addr &= PAGE_MASK;
 	size = PAGE_ALIGN(last_addr+1) - phys_addr;
 
+	/*
+	 * Mask out any bits not part of the actual physical
+	 * address, like memory encryption bits.
+	 */
+	phys_addr &= PHYSICAL_PAGE_MASK;
+
 	retval = memtype_reserve(phys_addr, (u64)phys_addr + size,
 				 pcm, &new_pcm);
 	if (retval) {
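To make the failure mode concrete, here is a worked example with
hypothetical addresses (bit 46 stands in for a high-order bit that
PHYSICAL_PAGE_MASK removes; the numbers are illustrative, not taken
from the patch):

	phys_addr = 0x400000001000;		/* bit 46 set, page aligned */
	size      = 0x2000;
	last_addr = phys_addr + size - 1;	/* 0x400000002fff */

	/* Old code: the mask strips bit 46 from phys_addr only */
	phys_addr &= PHYSICAL_PAGE_MASK;	/* -> 0x1000 */
	size = PAGE_ALIGN(last_addr + 1) - phys_addr;
	/* = 0x400000003000 - 0x1000: a huge size, memtype_reserve() fails */

	/* Fixed code: size is computed before the mask is applied */
	phys_addr &= PAGE_MASK;			/* -> 0x400000001000 */
	size = PAGE_ALIGN(last_addr + 1) - phys_addr;	/* = 0x2000 */
	phys_addr &= PHYSICAL_PAGE_MASK;	/* -> 0x1000 */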
From: Sathyanarayanan Kuppuswamy <[email protected]>
>
> On 11/21/22 6:39 AM, Borislav Petkov wrote:
> > On Fri, Nov 18, 2022 at 02:55:32AM +0000, Michael Kelley (LINUX) wrote:
> >> But I had not thought about TDX. In the TDX case, it appears that
> >> sme_postprocess_startup() will not decrypt the bss_decrypted section.
> >> The corresponding mem_encrypt_free_decrypted_mem() is a no-op unless
> >> CONFIG_AMD_MEM_ENCRYPT is set. But maybe if someone builds a
> >> kernel image that supports both TDX and AMD encryption, it could break
> >
> > sme_me_mask better be 0 on a kernel with both built in and running as a
> > TDX guest.
> >
>
> Yes. It will be 0 in TDX. In sme_enable(), AMD code checks for CPUID
> support before updating the sme_me_mask.
>
Right. But here's my point: With current code and an image built with
CONFIG_AMD_MEM_ENCRYPT=y and running as a TDX guest,
sme_postprocess_startup() will not decrypt the bss_decrypted section.
Then later mem_encrypt_free_decrypted_mem() will run, see that
CC_ATTR_MEM_ENCRYPT is true, and try to re-encrypt the memory.
In other words, a TDX guest would break in the same way as a Hyper-V
vTOM guest would break. This patch fixes the problem for both cases.
The only things I see in the bss_decrypted section are two clock structures
in arch/x86/kernel/kvmclock.c, which aren't needed when Hyper-V is the
hypervisor. But with a TDX guest on KVM, will *not* decrypting the
bss_decrypted section be a problem? I don't know the kvmclock
code or why the two clock structures need to be decrypted for
AMD mem encryption.
Michael
From: Borislav Petkov <[email protected]> Sent: Monday, November 21, 2022 7:03 AM
>
> On Wed, Nov 16, 2022 at 10:41:30AM -0800, Michael Kelley wrote:
> ...
>
> > @@ -108,6 +115,7 @@ u64 cc_mkenc(u64 val)
> >  	switch (vendor) {
> >  	case CC_VENDOR_AMD:
> >  		return val | cc_mask;
> > +	case CC_VENDOR_HYPERV:
> >  	case CC_VENDOR_INTEL:
> >  		return val & ~cc_mask;
> >  	default:
> > @@ -121,6 +129,7 @@ u64 cc_mkdec(u64 val)
> >  	switch (vendor) {
> >  	case CC_VENDOR_AMD:
> >  		return val & ~cc_mask;
> > +	case CC_VENDOR_HYPERV:
> >  	case CC_VENDOR_INTEL:
> >  		return val | cc_mask;
> >  	default:
>
> Uuuh, this needs a BIG FAT COMMENT.
>
> You're running on SNP and yet the enc/dec meaning is flipped. And that's
> because of vTOM.
>
> What happens if you have other types of SNP-based VMs on HyperV? The
> isolation VMs thing? Or is that the same?
>
> What happens when you do TDX guests with HyperV?
>
> This becomes wrong then.
>
> I think you need a finer-grained check here in the sense of "is it
> a HyperV guest using a paravisor and vTOM is enabled" or so.
>
> Otherwise, I like the removal of the HyperV-specific checks ofc.
>
I think the core problem here is the naming and meaning of
CC_VENDOR_HYPERV. The name was created originally when the
Hyper-V vTOM handling code was a lot of special cases. With the
changes in this patch series that make the vTOM functionality more
mainstream, the name would be better as CC_VENDOR_AMD_VTOM.
vTOM is part of the AMD SEV-SNP spec, and it's a different way of
doing the encryption from the "C-bit" based approach. As much as
possible, I'm trying to not make it be Hyper-V specific, though currently
we have N=1 for hypervisors that offer the vTOM option, so it's a little
hard to generalize.
With the thinking oriented that way, a Linux guest on Hyper-V using
TDX will run with CC_VENDOR_INTEL. A Linux guest on Hyper-V that
is fully enlightened to use the "C-bit" will run with CC_VENDOR_AMD.
Dexuan Cui just posted a patch set for initial TDX support on Hyper-V,
and I think that runs with CC_VENDOR_INTEL (Dexuan -- correct me if
I'm wrong about that -- I haven't reviewed your patches yet). Tianyu Lan
has a patch set out for Hyper-V guests using the "C-bit". That patch set
still uses CC_VENDOR_HYPERV. Tianyu and I need to work through
whether his patch set can run with CC_VENDOR_AMD like everyone
else using the "C-bit" approach.
Yes, the polarity of the AMD vTOM bit matches the polarity of the
TDX GPA.SHARED bit, and is the opposite polarity of the AMD "C-bit".
I'll add a comment to that effect.
Anyway, that's where I think this should go. Does it make sense?
Other thoughts?
Michael
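One possible shape of that comment, sketched against the hunk quoted
above (illustrative only; the final wording is up to the patch):

	u64 cc_mkenc(u64 val)
	{
		switch (vendor) {
		case CC_VENDOR_AMD:
			/* C-bit: bit *set* means encrypted */
			return val | cc_mask;
		/*
		 * The AMD vTOM bit has the same polarity as the TDX
		 * GPA.SHARED bit: bit *set* means shared (decrypted),
		 * the opposite of the AMD C-bit above. So "make
		 * encrypted" clears the bit in both cases.
		 */
		case CC_VENDOR_HYPERV:
		case CC_VENDOR_INTEL:
			return val & ~cc_mask;
		default:
			return val;
		}
	}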
On 11/22/22 10:22, Michael Kelley (LINUX) wrote:
> Anyway, that's where I think this should go. Does it make sense?
> Other thoughts?
I think hard-coding the C-bit behavior and/or position to a vendor was
probably a bad idea. Even the comment:
u64 cc_mkenc(u64 val)
{
	/*
	 * Both AMD and Intel use a bit in the page table to indicate
	 * encryption status of the page.
	 *
	 * - for AMD, bit *set* means the page is encrypted
	 * - for Intel *clear* means encrypted.
	 */
doesn't make a lot of sense now. Maybe we should just have a:
CC_ATTR_ENC_SET
which gets set for the "AMD" behavior and is clear for the "Intel"
behavior. Hyper-V code running on AMD can set the attribute to get the
"Intel" behavior.
That sure beats having a Hyper-V vendor.
On Tue, Nov 22, 2022 at 06:22:46PM +0000, Michael Kelley (LINUX) wrote:
> I think the core problem here is the naming and meaning of
> CC_VENDOR_HYPERV. The name was created originally when the
> Hyper-V vTOM handling code was a lot of special cases. With the
> changes in this patch series that make the vTOM functionality more
> mainstream, the name would be better as CC_VENDOR_AMD_VTOM.
No, if CC_VENDOR_HYPERV means different things depending on what kind of
guests you're doing, then you should not use a CC_VENDOR at all.
> vTOM is part of the AMD SEV-SNP spec, and it's a different way of
> doing the encryption from the "C-bit" based approach. As much as
> possible, I'm trying to not make it be Hyper-V specific, though
> currently we have N=1 for hypervisors that offer the vTOM option, so
> it's a little hard to generalize.
Actually, it is very simple to generalize: vTOM and the paravisor and
VMPL are all part of the effort to support unenlightened, unmodified
guests with SNP.
So, if KVM wants to run Windows NT 4.0 guests as SNP guests, then it
probably would need the same contraptions.
> With the thinking oriented that way, a Linux guest on Hyper-V using
> TDX will run with CC_VENDOR_INTEL. A Linux guest on Hyper-V that
> is fully enlightened to use the "C-bit" will run with CC_VENDOR_AMD.
Right.
> Dexuan Cui just posted a patch set for initial TDX support on Hyper-V,
> and I think that runs with CC_VENDOR_INTEL (Dexuan -- correct me if
> I'm wrong about that -- I haven't reviewed your patches yet). Tianyu Lan
> has a patch set out for Hyper-V guests using the "C-bit". That patch set
> still uses CC_VENDOR_HYPERV. Tianyu and I need to work through
> whether his patch set can run with CC_VENDOR_AMD like everyone
> else using the "C-bit" approach.
So I'm not sure the vendor is the right approach here. I guess we need
to specify the *type* of guest being supported.
> Yes, the polarity of the AMD vTOM bit matches the polarity of the
> TDX GPA.SHARED bit, and is the opposite polarity of the AMD "C-bit".
> I'll add a comment to that effect.
>
> Anyway, that's where I think this should go. Does it make sense?
> Other thoughts?
I think all that polarity doesn't matter as long as we abstract it away
with, "mark encrypted" and "mark decrypted".
But it is late now and I could be wrong - I'll look at the rest
tomorrow-ish.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
From: Dave Hansen <[email protected]> Sent: Tuesday, November 22, 2022 10:30 AM
>
> On 11/22/22 10:22, Michael Kelley (LINUX) wrote:
> > Anyway, that's where I think this should go. Does it make sense?
> > Other thoughts?
>
> I think hard-coding the C-bit behavior and/or position to a vendor was
> probably a bad idea. Even the comment:
>
> u64 cc_mkenc(u64 val)
> {
> 	/*
> 	 * Both AMD and Intel use a bit in the page table to indicate
> 	 * encryption status of the page.
> 	 *
> 	 * - for AMD, bit *set* means the page is encrypted
> 	 * - for Intel *clear* means encrypted.
> 	 */
>
> doesn't make a lot of sense now. Maybe we should just have a:
>
> CC_ATTR_ENC_SET
>
> which gets set for the "AMD" behavior and is clear for the "Intel"
> behavior. Hyper-V code running on AMD can set the attribute to get the
> "Intel" behavior.
>
> That sure beats having a Hyper-V vendor.
To me, figuring out the encryption bit polarity and position isn't
the hard part to set up. We've got three technologies: TDX,
AMD "C-bit", and arguably, AMD vTOM. Future silicon and architectural
enhancements will most likely be variations and improvements on
these rather than something completely new (not that I'm necessarily
aware of what the processor vendors may have planned). The AMD
"C-bit" technology already has sub-cases (SME, SEV, SEV-ES, SEV-SNP)
because of the architectural history. Any of the technologies may
get additional subcases over time. Whether those subcases are
represented as new CC_VENDOR_* options, or CC_ATTR_* options
on an existing CC_VENDOR_* should be driven by the level of
commonality with what already exists. There's no hard-and-fast
rule. Just do whatever makes the most sense.
I'm proposing AMD vTOM as a separate CC_VENDOR_AMD_VTOM
because it is rather different from CC_VENDOR_AMD, and not just
in the polarity and position of the encryption bit. But if we really
wanted to just make it a sub-case of CC_VENDOR_AMD, I could
probably be convinced. The key is cleanly handling all the other
attributes like CC_ATTR_GUEST_STATE_ENCRYPT,
CC_ATTR_ACCESS_IOAPIC_ENCRYPTED (that I add in this patch set),
CC_ATTR_GUEST_UNROLL_STRING_IO, etc. where we want to avoid
too many hacky special cases. The current code structure where the
CC_VENDOR_* selection determines the CC_ATTR_* values seems
to work OK.
Anyway, that's my thinking. CC_VENDOR_HYPERV goes away, but
the behavior is still essentially the same when replaced by
CC_VENDOR_AMD_VTOM.
Michael
From: Borislav Petkov <[email protected]> Sent: Tuesday, November 22, 2022 2:18 PM
>
> On Tue, Nov 22, 2022 at 06:22:46PM +0000, Michael Kelley (LINUX) wrote:
> > I think the core problem here is the naming and meaning of
> > CC_VENDOR_HYPERV. The name was created originally when the
> > Hyper-V vTOM handling code was a lot of special cases. With the
> > changes in this patch series that make the vTOM functionality more
> > mainstream, the name would be better as CC_VENDOR_AMD_VTOM.
>
> No, if CC_VENDOR_HYPERV means different things depending on what kind of
> guests you're doing, then you should not use a CC_VENDOR at all.
Agreed. My proposal is to drop CC_VENDOR_HYPERV entirely.
Replace it with CC_VENDOR_AMD_VTOM (or something like that) that
is set *only* by Linux guests that are running on AMD SEV-SNP processors
and using the vTOM scheme instead of the AMD C-bit scheme.
>
> > vTOM is part of the AMD SEV-SNP spec, and it's a different way of
> > doing the encryption from the "C-bit" based approach. As much as
> > possible, I'm trying to not make it be Hyper-V specific, though
> > currently we have N=1 for hypervisors that offer the vTOM option, so
> > it's a little hard to generalize.
>
> Actually, it is very simple to generalize: vTOM and the paravisor and
> VMPL are all part of the effort to support unenlightened, unmodified
> guests with SNP.
>
> So, if KVM wants to run Windows NT 4.0 guests as SNP guests, then it
> probably would need the same contraptions.
Yes, agreed. My point about generalization is that Hyper-V is the only
actual implementation today. Edge cases, like whether the IO-APIC is
accessed as encrypted or as decrypted, don't have a pattern yet. But
that's not a blocker. Such cases can be resolved or special-cased later
when/if N > 1.
>
> > With the thinking oriented that way, a Linux guest on Hyper-V using
> > TDX will run with CC_VENDOR_INTEL. A Linux guest on Hyper-V that
> > is fully enlightened to use the "C-bit" will run with CC_VENDOR_AMD.
>
> Right.
Good. We're in agreement. :-)
>
> > Dexuan Cui just posted a patch set for initial TDX support on Hyper-V,
> > and I think that runs with CC_VENDOR_INTEL (Dexuan -- correct me if
> > I'm wrong about that -- I haven't reviewed your patches yet).
I confirmed with Dexuan that his new patch set for TDX guests on Hyper-V
has the guest running with CC_VENDOR_INTEL, which is what we want.
> > Tianyu Lan
> > has a patch set out for Hyper-V guests using the "C-bit". That patch set
> > still uses CC_VENDOR_HYPERV. Tianyu and I need to work through
> > whether his patch set can run with CC_VENDOR_AMD like everyone
> > else using the "C-bit" approach.
I haven't followed up with Tianyu yet.
>
> So I'm not sure the vendor is the right approach here. I guess we need
> to specify the *type* of guest being supported.
Yes, calling it the "vendor" turns out to not quite be right because in
the AMD case, the technology/architecture/scheme/"type" (or
whatever you want to call it) is not 1:1 with the vendor. Intel has just
one (TDX) while AMD has two (C-bit and vTOM). "vendor" is just a label,
but we should get the label right to avoid future confusion. The key point
is that we'll have three top-level types:
* TDX
* AMD with C-bit (and this has some sub-types)
* AMD with vTOM
The CC_ATTR_* values are then derived from the "type" (sketched below).
>
> > Yes, the polarity of the AMD vTOM bit matches the polarity of the
> > TDX GPA.SHARED bit, and is the opposite polarity of the AMD "C-bit".
> > I'll add a comment to that effect.
> >
> > Anyway, that's where I think this should go. Does it make sense?
> > Other thoughts?
>
> I think all that polarity doesn't matter as long as we abstract it away
> with, "mark encrypted" and "mark decrypted".
Agreed.
Michael
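Sketched as code, the three top-level types described above might look
like this (CC_VENDOR_AMD_VTOM is the name proposed in this thread, not
merged code):

	enum cc_vendor {
		CC_VENDOR_NONE,
		CC_VENDOR_AMD,		/* C-bit scheme: SME/SEV/SEV-ES/SEV-SNP */
		CC_VENDOR_AMD_VTOM,	/* vTOM scheme, as proposed here */
		CC_VENDOR_INTEL,	/* TDX */
	};

	/* The CC_ATTR_* values would then be derived from the type. */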
> From: Michael Kelley (LINUX) <[email protected]>
> Sent: Thursday, November 17, 2022 6:56 PM
> [...]
>
> But I had not thought about TDX. In the TDX case, it appears that
> sme_postprocess_startup() will not decrypt the bss_decrypted section.
> The corresponding mem_encrypt_free_decrypted_mem() is a no-op unless
> CONFIG_AMD_MEM_ENCRYPT is set. But maybe if someone builds a
> kernel image that supports both TDX and AMD encryption, it could break
> at runtime on a TDX system. I would also note that on a TDX system
> without CONFIG_AMD_MEM_ENCRYPT, the unused memory in the
> bss_decrypted section never gets freed.
On a TDX system *with* CONFIG_AMD_MEM_ENCRYPT, the unused
memory in the bss_decrypted section also never gets freed due to the
below "return;"
I'd suggest a Fixes tag should be added to make sure the distro vendors
notice the patch and backport it :-)
BTW, I just posted a similar patch as I didn't notice this patch. I have
replied to my patch email, asking people to ignore my patch.
Fixes: b3f0907c71e0 ("x86/mm: Add .bss..decrypted section to hold shared variables")
void __init mem_encrypt_free_decrypted_mem(void)
{
	unsigned long vaddr, vaddr_end, npages;
	int r;

	vaddr = (unsigned long)__start_bss_decrypted_unused;
	vaddr_end = (unsigned long)__end_bss_decrypted;
	npages = (vaddr_end - vaddr) >> PAGE_SHIFT;

	/*
	 * The unused memory range was mapped decrypted, change the encryption
	 * attribute from decrypted to encrypted before freeing it.
	 */
	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
		r = set_memory_encrypted(vaddr, npages);
		if (r) {
			pr_warn("failed to free unused decrypted pages\n");
			return;
		}
	}

	free_init_pages("unused decrypted", vaddr, vaddr_end);
}
On Tue, Nov 22, 2022 at 05:59:04PM +0000, Michael Kelley (LINUX) wrote:
> Right. But here's my point: With current code and an image built with
> CONFIG_AMD_MEM_ENCRYPT=y and running as a TDX guest,
> sme_postprocess_startup() will not decrypt the bss_decrypted section.
> Then later mem_encrypt_free_decrypted_mem() will run, see that
> CC_ATTR_MEM_ENCRYPT is true, and try to re-encrypt the memory.
> In other words, a TDX guest would break in the same way as a Hyper-V
> vTOM guest would break. This patch fixes the problem for both cases.
I guess making the check more concrete by checking sme_me_mask directly
along with a comment makes sense.
We need to be very careful here not to fragment the code too much for
all the different guest types.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
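A minimal sketch of the more concrete check Boris suggests, applied to
the function quoted above (illustrative; the actual fix may be shaped
differently):

	void __init mem_encrypt_free_decrypted_mem(void)
	{
		unsigned long vaddr, vaddr_end, npages;
		int r;

		vaddr = (unsigned long)__start_bss_decrypted_unused;
		vaddr_end = (unsigned long)__end_bss_decrypted;
		npages = (vaddr_end - vaddr) >> PAGE_SHIFT;

		/*
		 * Only re-encrypt if sme_postprocess_startup() actually
		 * decrypted the range. sme_me_mask is 0 for TDX and
		 * vTOM guests, so they skip a re-encryption they never
		 * needed.
		 */
		if (sme_me_mask) {
			r = set_memory_encrypted(vaddr, npages);
			if (r) {
				pr_warn("failed to free unused decrypted pages\n");
				return;
			}
		}

		free_init_pages("unused decrypted", vaddr, vaddr_end);
	}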
On 11/27/22 20:52, Dexuan Cui wrote:
>> From: Michael Kelley (LINUX) <[email protected]>
>> Sent: Thursday, November 17, 2022 6:56 PM
>> [...]
>>
>> But I had not thought about TDX. In the TDX case, it appears that
>> sme_postprocess_startup() will not decrypt the bss_decrypted section.
>> The corresponding mem_encrypt_free_decrypted_mem() is a no-op unless
>> CONFIG_AMD_MEM_ENCRYPT is set. But maybe if someone builds a
>> kernel image that supports both TDX and AMD encryption, it could break
>> at runtime on a TDX system. I would also note that on a TDX system
>> without CONFIG_AMD_MEM_ENCRYPT, the unused memory in the
>> bss_decrypted section never gets freed.
>
> On a TDX system *with* CONFIG_AMD_MEM_ENCRYPT, the unused
> memory in the bss_decrypted section also never gets freed due to the
> below "return;"
>
> I'd suggest a Fixes tag should be added to make sure the distro vendors
> notice the patch and backport it :-)
>
> BTW, I just posted a similar patch as I didn't notice this patch. I have
> replied to my patch email, asking people to ignore my patch.
>
> Fixes: b3f0907c71e0 ("x86/mm: Add .bss..decrypted section to hold shared variables")
I think the Fixes: tag should really be:
e9d1d2bb75b2 ("treewide: Replace the use of mem_encrypt_active() with cc_platform_has()")
since mem_encrypt_active() used to return sme_me_mask, so the checks were
balanced at that point.
Thanks,
Tom
>
> void __init mem_encrypt_free_decrypted_mem(void)
> {
> 	unsigned long vaddr, vaddr_end, npages;
> 	int r;
>
> 	vaddr = (unsigned long)__start_bss_decrypted_unused;
> 	vaddr_end = (unsigned long)__end_bss_decrypted;
> 	npages = (vaddr_end - vaddr) >> PAGE_SHIFT;
>
> 	/*
> 	 * The unused memory range was mapped decrypted, change the encryption
> 	 * attribute from decrypted to encrypted before freeing it.
> 	 */
> 	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> 		r = set_memory_encrypted(vaddr, npages);
> 		if (r) {
> 			pr_warn("failed to free unused decrypted pages\n");
> 			return;
> 		}
> 	}
>
> 	free_init_pages("unused decrypted", vaddr, vaddr_end);
> }
>
From: Michael Kelley (LINUX) <[email protected]> Sent: Tuesday, November 22, 2022 4:59 PM
>
> From: Borislav Petkov <[email protected]> Sent: Tuesday, November 22, 2022 2:18 PM
> >
> > On Tue, Nov 22, 2022 at 06:22:46PM +0000, Michael Kelley (LINUX) wrote:
> > > I think the core problem here is the naming and meaning of
> > > CC_VENDOR_HYPERV. The name was created originally when the
> > > Hyper-V vTOM handling code was a lot of special cases. With the
> > > changes in this patch series that make the vTOM functionality more
> > > mainstream, the name would be better as CC_VENDOR_AMD_VTOM.
> >
> > No, if CC_VENDOR_HYPERV means different things depending on what kind of
> > guests you're doing, then you should not use a CC_VENDOR at all.
>
> Agreed. My proposal is to drop CC_VENDOR_HYPERV entirely.
> Replace it with CC_VENDOR_AMD_VTOM (or something like that) that
> is set *only* by Linux guests that are running on AMD SEV-SNP processors
> and using the vTOM scheme instead of the AMD C-bit scheme.
>
> >
> > > vTOM is part of the AMD SEV-SNP spec, and it's a different way of
> > > doing the encryption from the "C-bit" based approach. As much as
> > > possible, I'm trying to not make it be Hyper-V specific, though
> > > currently we have N=1 for hypervisors that offer the vTOM option, so
> > > it's a little hard to generalize.
> >
> > Actually, it is very simple to generalize: vTOM and the paravisor and
> > VMPL are all part of the effort to support unenlightened, unmodified
> > guests with SNP.
> >
> > So, if KVM wants to run Windows NT 4.0 guests as SNP guests, then it
> > probably would need the same contraptions.
>
> Yes, agreed. My point about generalization is that Hyper-V is the only
> actual implementation today. Edge cases, like whether the IO-APIC is
> accessed as encrypted or as decrypted, don't have a pattern yet. But
> that's not a blocker. Such cases can be resolved or special-cased later
> when/if N > 1.
>
> >
> > > With the thinking oriented that way, a Linux guest on Hyper-V using
> > > TDX will run with CC_VENDOR_INTEL. A Linux guest on Hyper-V that
> > > is fully enlightened to use the "C-bit" will run with CC_VENDOR_AMD.
> >
> > Right.
>
> Good. We're in agreement. :-)
>
> >
> > > Dexuan Cui just posted a patch set for initial TDX support on Hyper-V,
> > > and I think that runs with CC_VENDOR_INTEL (Dexuan -- correct me if
> > > I'm wrong about that -- I haven't reviewed your patches yet).
>
> I confirmed with Dexuan that his new patch set for TDX guests on Hyper-V
> has the guest running with CC_VENDOR_INTEL, which is what we want.
>
> > > Tianyu Lan
> > > has a patch set out for Hyper-V guests using the "C-bit". That patch set
> > > still uses CC_VENDOR_HYPERV. Tianyu and I need to work through
> > > whether his patch set can run with CC_VENDOR_AMD like everyone
> > > else using the "C-bit" approach.
>
> I haven't followed up with Tianyu yet.
>
> >
> > So I'm not sure the vendor is the right approach here. I guess we need
> > to specify the *type* of guest being supported.
>
> Yes, calling it the "vendor" turns out to not quite be right because in
> the AMD case, the technology/architecture/scheme/"type" (or
> whatever you want to call it) is not 1:1 with the vendor. Intel has just
> one (TDX) while AMD has two (C-bit and vTOM). "vendor" is just a label,
> but we should get the label right to avoid future confusion. The key point
> is that we'll have three top-level types:
>
> * TDX
> * AMD with C-bit (and this has some sub-types)
> * AMD with vTOM
>
> The CC_ATTR_* values are then derived from the "type".
>
> >
> > > Yes, the polarity of the AMD vTOM bit matches the polarity of the
> > > TDX GPA.SHARED bit, and is the opposite polarity of the AMD "C-bit".
> > > I'll add a comment to that effect.
> > >
> > > Anyway, that's where I think this should go. Does it make sense?
> > > Other thoughts?
> >
> > I think all that polarity doesn't matter as long as we abstract it away
> > with, "mark encrypted" and "mark decrypted".
>
> Agreed.
>
Boris --
Any further comment on this patch? I think we're in agreement. For
this patch series I propose to change the symbol "CC_VENDOR_HYPERV"
to "CC_VENDOR_AMD_VTOM" and the function name
hyperv_cc_platform_has() to amd_vtom_cc_platform_has(). There's
no functionality change vs. the current version of this patch. The naming
changes just make clear that the scope is only for the AMD vTOM encryption
scheme.
As discussed, the concept "vendor" isn't quite right either, but let's not
add that change to this patch series. I can follow up with a separate patch
to change symbols with "vendor" or "VENDOR" to something else, which
will probably require some discussion as to what that something else
should be.
Michael
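As a sketch, the rename would leave the function looking something like
this (the attribute list here is abbreviated and illustrative; it stays
whatever hyperv_cc_platform_has() returns today):

	static bool amd_vtom_cc_platform_has(enum cc_attr attr)
	{
		switch (attr) {
		case CC_ATTR_GUEST_MEM_ENCRYPT:
		case CC_ATTR_MEM_ENCRYPT:
			return true;
		default:
			return false;
		}
	}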
On Mon, Nov 28, 2022 at 02:38:11PM +0000, Michael Kelley (LINUX) wrote:
> Any further comment on this patch? I think we're in agreement. For
> this patch series I propose to change the symbol "CC_VENDOR_HYPERV"
> to "CC_VENDOR_AMD_VTOM" and the function name
> hyperv_cc_platform_has() to amd_vtom_cc_platform_has().
That doesn't sound optimal to me.
So, let's clarify things first: those Isolation VMs - are they going to
be the paravisors?
I don't see any other option because the unmodified guest must be some
old windoze....
So, if they're going to be that, then I guess this should be called
CC_ATTR_PARAVISOR
to denote that it is a thin layer of virt gunk between an unmodified
guest and a hypervisor.
And if TDX wants to do that too later, then they can use that flag too.
Yes, no?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
From: Borislav Petkov <[email protected]> Sent: Monday, November 28, 2022 8:34 AM
>
> On Mon, Nov 28, 2022 at 02:38:11PM +0000, Michael Kelley (LINUX) wrote:
> > Any further comment on this patch? I think we're in agreement. For
> > this patch series I propose to change the symbol "CC_VENDOR_HYPERV"
> > to "CC_VENDOR_AMD_VTOM" and the function name
> > hyperv_cc_platform_has() to amd_vtom_cc_platform_has().
>
> That doesn't sound optimal to me.
>
> So, let's clarify things first: those Isolation VMs - are they going to
> be the paravisors?
No. The paravisor exists in another layer (VMPL 0 in the AMD vTOM
architecture) that isn't visible to the Linux guest. The paravisor is
a separate code base that isn't currently open source. All code
in this patch series is code that runs in the Linux guest.
From an encryption standpoint, the Linux guest sees the following:
1) Guest memory is encrypted by default
2) The Linux guest must set the vTOM flag in a PTE to access a page
as unencrypted.
3) The Linux guest must explicitly notify the hypervisor to change a
page from private (encrypted) to shared (decrypted), and vice versa.
(A usage sketch follows at the end of this message.)
>
> I don't see any other option because the unmodified guest must be some
> old windoze....
What Windows guests do isn't really relevant. Again, the code in this patch
series all runs directly in the Linux guest, not the paravisor. And the Linux
guest isn't unmodified. We've added changes to understand vTOM and
the need to communicate with the hypervisor about page changes
between private and shared. But there are other changes for a fully
enlightened guest that we don't have to make when using AMD vTOM,
because the paravisor transparently (to the guest -- Linux or Windows)
handles those issues.
>
> So, if they're going to be that, then I guess this should be called
>
> CC_ATTR_PARAVISOR
>
> to denote that it is a thin layer of virt gunk between an unmodified
> guest and a hypervisor.
No, the code in this patch series is not that at all. It's not code that is
part of the paravisor. It's Linux guest code that understands it is running
in an environment where AMD vTOM is the encryption scheme, which is
different from the AMD C-bit encryption scheme.
>
> And if TDX wants to do that too later, then they can use that flag too.
>
Again, no. What I have proposed as CC_VENDOR_AMD_VTOM is
specific to AMD's virtual-Top-of-Memory architecture. The TDX
architecture doesn't really have a way to use a paravisor.
To summarize, the code in this patch series is about a 3rd encryption
scheme that is used by the guest. It is completely parallel to the AMD
C-bit encryption scheme and the Intel TDX encryption scheme. With
the AMD vTOM scheme, there is a paravisor that transparently emulates
some things for the guest so there are fewer code changes needed in the
guest, but this patch series is not about that paravisor code.
Michael
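A minimal guest-side usage sketch of points 1)-3) above, using the
existing x86 set_memory API (the buffer and error handling are
illustrative, not code from the series):

	/* Flip a buffer to shared (decrypted) for host-visible I/O. */
	if (set_memory_decrypted((unsigned long)buf, npages))
		return -EIO;

	/* ... use buf to communicate with the hypervisor ... */

	/* Flip it back to private (encrypted) when done. */
	if (set_memory_encrypted((unsigned long)buf, npages))
		return -EIO;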
On Mon, Nov 28, 2022 at 04:59:27PM +0000, Michael Kelley (LINUX) wrote:
> 2) The Linux guest must set the vTOM flag in a PTE to access a page
> as unencrypted.
What exactly do you call the "vTOM flag in the PTE"?
I see this here:
"When bit 1 (vTOM) of SEV_FEATURES is set in the VMSA of an SNP-active
VM, the VIRTUAL_TOM field is used to determine the C-bit for data
accesses instead of the guest page table contents. All data accesses
below VIRTUAL_TOM are accessed with an effective C-bit of 1 and all
addresses at or above VIRTUAL_TOM are accessed with an effective C-bit
of 0."
Now you say
"vTOM is the dividing line where the uppermost bit of the physical
address space is set; e.g., with 47 bits of guest physical address
space, vTOM is 0x40000000000 (bit 46 is set)."
So on your guests, is VIRTUAL_TOM == 0x400000000000?
Btw, that 0x4000.. does not have bit 46 set but bit 42. Bit 46 set is
0x400000000000
which means one more '0' at the end.
So before we discuss this further, let's agree on the basics first.
> What Windows guests do isn't really relevant. Again, the code in this patch
> series all runs directly in the Linux guest, not the paravisor. And the Linux
> guest isn't unmodified. We've added changes to understand vTOM and
> the need to communicate with the hypervisor about page changes
> between private and shared. But there are other changes for a fully
> enlightened guest that we don't have to make when using AMD vTOM,
> because the paravisor transparently (to the guest -- Linux or Windows)
> handles those issues.
So this is some other type of guest you wanna run.
Where is the documentation of that thing?
I'd like to know what exactly it is going to use in the kernel.
> Again, no. What I have proposed as CC_VENDOR_AMD_VTOM is
There's no vendor AMD_VTOM!
We did the vendor thing to denote Intel or AMD wrt confidential
computing.
Now you're coming up with something special. It can't be HYPERV because
Hyper-V does other types of confidential solutions too, apparently.
Now before this goes any further I'd like for "something special" to be
defined properly and not just be a one-off Hyper-V thing.
> specific to AMD's virtual-Top-of-Memory architecture. The TDX
> architecture doesn't really have a way to use a paravisor.
>
> To summarize, the code in this patch series is about a 3rd encryption
> scheme that is used by the guest.
Yes, can that third thing be used by other hypervisors or is this
Hyper-V special?
> It is completely parallel to the AMD C-bit encryption scheme and
> the Intel TDX encryption scheme. With the AMD vTOM scheme, there is
> a paravisor that transparently emulates some things for the guest
> so there are fewer code changes needed in the guest, but this patch
> series is not about that paravisor code.
Would other hypervisors want to support such a scheme?
Is this architecture documented somewhere? If so, where?
What would it mean for the kernel to support it?
And so on and so on.
Thanks.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
From: Borislav Petkov <[email protected]> Sent: Monday, November 28, 2022 9:25 AM
>
> On Mon, Nov 28, 2022 at 04:59:27PM +0000, Michael Kelley (LINUX) wrote:
> > 2) The Linux guest must set the vTOM flag in a PTE to access a page
> > as unencrypted.
>
> What exactly do you call the "vTOM flag in the PTE"?
>
> I see this here:
>
> "When bit 1 (vTOM) of SEV_FEATURES is set in the VMSA of an SNP-active
> VM, the VIRTUAL_TOM field is used to determine the C-bit for data
> accesses instead of the guest page table contents. All data accesses
> below VIRTUAL_TOM are accessed with an effective C-bit of 1 and all
> addresses at or above VIRTUAL_TOM are accessed with an effective C-bit
> of 0."
>
> Now you say
>
> "vTOM is the dividing line where the uppermost bit of the physical
> address space is set; e.g., with 47 bits of guest physical address
> space, vTOM is 0x40000000000 (bit 46 is set)."
>
> So on your guests, is VIRTUAL_TOM == 0x400000000000?
>
> Btw, that 0x4000.. does not have bit 46 set but bit 42. Bit 46 set is
>
> 0x400000000000
>
> which means one more '0' at the end.
Yeah, looks like I dropped a '0' in my comment text. Will fix.
>
> So before we discuss this further, let's agree on the basics first.
>
> > What Windows guests do isn't really relevant. Again, the code in this patch
> > series all runs directly in the Linux guest, not the paravisor. And the Linux
> > guest isn't unmodified. We've added changes to understand vTOM and
> > the need to communicate with the hypervisor about page changes
> > between private and shared. But there are other changes for a fully
> > enlightened guest that we don't have to make when using AMD vTOM,
> > because the paravisor transparently (to the guest -- Linux or Windows)
> > handles those issues.
>
> So this is some other type of guest you wanna run.
>
> Where is the documentation of that thing?
>
> I'd like to know what exactly it is going to use in the kernel.
Standard Linux is the guest. It's fully functional for running
general-purpose workloads that want "confidential computing" where guest
memory is encrypted and the data in the guest is not visible to the
host hypervisor. It's a standard Linux kernel. I'm not sure what you
mean by "some other type of guest".
>
> > Again, no. What I have proposed as CC_VENDOR_AMD_VTOM is
>
> There's no vendor AMD_VTOM!
>
> We did the vendor thing to denote Intel or AMD wrt confidential
> computing.
But vendor AMD effectively offers two different encryption schemes that
could be seen by the guest VM. The hypervisor chooses which scheme a
particular guest will see. Hyper-V has chosen to present the vTOM scheme
to guest VMs, including normal Linux and Windows guests, that have been
modestly updated to understand vTOM.
>
> Now you're coming up with something special. It can't be HYPERV because
> Hyper-V does other types of confidential solutions too, apparently.
In the future, Hyper-V may also choose to present the original AMD C-bit scheme
in some guest VMs, depending on the use case. And it will present the Intel
TDX scheme when running on that hardware.
>
> Now before this goes any further I'd like for "something special" to be
> defined properly and not just be a one-off Hyper-V thing.
>
> > specific to AMD's virtual-Top-of-Memory architecture. The TDX
> > architecture doesn't really have a way to use a paravisor.
> >
> > To summarize, the code in this patch series is about a 3rd encryption
> > scheme that is used by the guest.
>
> Yes, can that third thing be used by other hypervisors or is this
> Hyper-V special?
This third thing is part of the AMD SEV-SNP architecture and could be used
by any hypervisor.
>
> > It is completely parallel to the AMD C-bit encryption scheme and
> > the Intel TDX encryption scheme. With the AMD vTOM scheme, there is
> > a paravisor that transparently emulates some things for the guest
> > so there are fewer code changes needed in the guest, but this patch
> > series is not about that paravisor code.
>
> Would other hypervisors want to support such a scheme?
>
> Is this architecture documented somewhere? If so, where?
The AMD vTOM scheme is documented in the AMD SEV-SNP
architecture specs.
>
> What would it mean for the kernel to support it?
The Linux kernel running as a guest already supports it, at least when
running on Hyper-V. The code went into the 5.15 kernel [1][2] about
a year ago. But that code is more Hyper-V specific than is desirable.
This patch set makes the AMD vTOM support more parallel to the Intel
TDX and AMD C-bit support.
To my knowledge, KVM does not support the AMD vTOM scheme.
Someone from AMD may have a better sense whether adding that
support is likely in the future.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
Michael
On Mon, Nov 28, 2022 at 05:55:11PM +0000, Michael Kelley (LINUX) wrote:
> But vendor AMD effectively offers two different encryption schemes that
> could be seen by the guest VM. The hypervisor chooses which scheme a
> particular guest will see. Hyper-V has chosen to present the vTOM scheme
> to guest VMs, including normal Linux and Windows guests, that have been
> modestly updated to understand vTOM.
If this is a standard SNP guest then you can detect vTOM support using
SEV_FEATURES. See this thread here:
https://lore.kernel.org/r/[email protected]
Which then means, you don't need any special gunk except extending this
patch above to check SNP has vTOM support.
> In the future, Hyper-V may also choose to present the original AMD C-bit scheme
> in some guest VMs, depending on the use case. And it will present the Intel
> TDX scheme when running on that hardware.
And all those should JustWork(tm) because we already support such guests.
> To my knowledge, KVM does not support the AMD vTOM scheme.
> Someone from AMD may have a better sense whether adding that
> support is likely in the future.
Yah, see above.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
From: Borislav Petkov <[email protected]> Sent: Monday, November 28, 2022 11:57 AM
>
> On Mon, Nov 28, 2022 at 05:55:11PM +0000, Michael Kelley (LINUX) wrote:
> > But vendor AMD effectively offers two different encryption schemes that
> > could be seen by the guest VM. The hypervisor chooses which scheme a
> > particular guest will see. Hyper-V has chosen to present the vTOM scheme
> > to guest VMs, including normal Linux and Windows guests, that have been
> > modestly updated to understand vTOM.
>
> If this is a standard SNP guest then you can detect vTOM support using
> SEV_FEATURES. See this thread here:
>
> https://lore.kernel.org/all/[email protected]/
>
> Which then means, you don't need any special gunk except extending this
> patch above to check SNP has vTOM support.
Yes, we could do that. But this patch set is still needed to get the preliminaries
in place. The initial code for supporting AMD vTOM that went upstream a
year ago is too Hyper-V specific. This patch set removes the Hyper-V specific
stuff, and integrates vTOM support into the overall confidential computing
framework in arch/x86/coco/core.c. The Hyper-V code was already somewhat
there via CC_VENDOR_HYPERV, but that really should be named something
else now. So that's why I'm suggesting CC_VENDOR_AMD_VTOM.
>
> > In the future, Hyper-V may also choose to present the original AMD C-bit scheme
> > in some guest VMs, depending on the use case. And it will present the Intel
> > TDX scheme when running on that hardware.
>
> And all those should JustWork(tm) because we already support such guests.
Agreed. That's where we want to be.
Of course, when you go from N=1 hypervisors (i.e., KVM) to N=2 (KVM
and Hyper-V), you find some places where incorrect assumptions were made
or some generalizations are needed. Dexuan Cui's patch set for TDX support
is fixing those places that he has encountered. But with those fixes, the TDX
support will JustWork(tm) for Linux guests on Hyper-V.
I haven't gone deeply into the situation with AMD C-bit support on Hyper-V.
Tianyu Lan's set of patches for that support is a bit bigger, and I'm
planning to look closely to understand whether it's also just fixing incorrect
assumptions and such, or whether there really are some differences with
Hyper-V. If there are differences, I want to understand why.
Michael
On Tue, Nov 29, 2022 at 01:15:39AM +0000, Michael Kelley (LINUX) wrote:
> So that's why I'm suggesting CC_VENDOR_AMD_VTOM.
There's no CC_VENDOR_AMD_VTOM. How many times do I need to explain this?!
CC_VENDOR is, well, the CC vendor - not some special case.
IOW, the goal here is for generic SNP functionality to be the same for
*all* SNP guests. We won't do AMD's version of vTOM enablement,
Hyper-V's version of vTOM enablement and so on. It must be a single
vTOM feature which works on *all* hypervisors as vTOM is a generic SNP
feature - not a Hyper-V feature.
> Of course, when you go from N=1 hypervisors (i.e., KVM) to N=2 (KVM
> and Hyper-V), you find some places where incorrect assumptions were
> made or some generalizations are needed. Dexuan Cui's patch set for
> TDX support is fixing those places that he has encountered. But with
> those fixes, the TDX support will JustWork(tm) for Linux guests on
> Hyper-V.
That sounds like the right direction to take.
> I haven't gone deeply into the situation with AMD C-bit support on
> Hyper-V. Tianyu Lan's set of patches for that support is a bit bigger,
> and I'm planning to look closely to understand whether it's also just
> fixing incorrect assumptions and such, or whether there really are
> some differences with Hyper-V. If there are differences, I want to
> understand why.
You and me too. So I guess we should look at Tianyu's set first.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
From: Borislav Petkov <[email protected]> Sent: Tuesday, November 29, 2022 12:41 AM
>
> On Tue, Nov 29, 2022 at 01:15:39AM +0000, Michael Kelley (LINUX) wrote:
> > So that's why I'm suggesting CC_VENDOR_AMD_VTOM.
>
> There's no CC_VENDOR_AMD_VTOM. How many times do I need to explain this?!
> CC_VENDOR is, well, the CC vendor - not some special case.
Let's back up half a step. We have CC_VENDOR_INTEL and
CC_VENDOR_AMD because that's a convenient way to identify
two fairly different schemes for doing guest encryption. The
schemes have some things in common, but the details are pretty
different. Tagging the schemes based on the vendor makes sense
because the schemes were independently developed by each
processor vendor.
But it turns out that AMD really has two fairly different schemes:
the C-bit scheme and the vTOM scheme. The details of these
two AMD schemes are pretty different. vTOM is *not* just a minor
option on the C-bit scheme. It's an either/or -- a guest VM is either
doing the C-bit scheme or the vTOM scheme, not some combination.
Linux code in coco/core.c could choose to treat C-bit and vTOM as
two sub-schemes under CC_VENDOR_AMD, but that makes the
code a bit messy because we end up with "if" statements to figure
out whether to do things the C-bit way or the vTOM way. The
difference in the C-bit way and the vTOM way is not Hyper-V
specific. Any hypervisor running on an AMD processor can make
the same choice to offer C-bit VMs or vTOM VMs.
Or we could model the two AMD schemes as two different vendors,
which is what I'm suggesting. Doing so recognizes that the two schemes
are fairly disjoint, and it makes the code cleaner.
Tom Lendacky -- any thoughts on this question from AMD's
standpoint?
>
> IOW, the goal here is for generic SNP functionality to be the same for
> *all* SNP guests. We won't do AMD's version of vTOM enablement,
> Hyper-V's version of vTOM enablement and so on. It must be a single
> vTOM feature which works on *all* hypervisors as vTOM is a generic SNP
> feature - not a Hyper-V feature.
Yes, there's only one version of vTOM enablement -- the AMD version.
But the broader AMD SNP functionality is bifurcated: there's the
C-bit scheme and the vTOM scheme. The question is how Linux code
should model those two different schemes. I'm suggesting that things
are cleaner conceptually and in the code if we model the two different
AMD schemes as two different vendors.
Michael
>
> > Of course, when you go from N=1 hypervisors (i.e., KVM) to N=2 (KVM
> > and Hyper-V), you find some places where incorrect assumptions were
> > made or some generalizations are needed. Dexuan Cui's patch set for
> > TDX support is fixing those places that he has encountered. But with
> > those fixes, the TDX support will JustWork(tm) for Linux guests on
> > Hyper-V.
>
> That sounds like the right direction to take.
>
> > I haven't gone deeply into the situation with AMD C-bit support on
> > Hyper-V. Tianyu Lan's set of patches for that support is a bit bigger,
> > and I'm planning to look closely to understand whether it's also just
> > fixing incorrect assumptions and such, or whether there really are
> > some differences with Hyper-V. If there are differences, I want to
> > understand why.
>
> You and me too. So I guess we should look at Tianyu's set first.
>
On Tue, Nov 29, 2022 at 03:49:06PM +0000, Michael Kelley (LINUX) wrote:
> But it turns out that AMD really has two fairly different schemes:
> the C-bit scheme and the vTOM scheme.
Except it doesn't:
"In the VMSA of an SNP-active guest, the VIRTUAL_TOM field designates
a 2MB aligned guest physical address called the virtual top of memory.
When bit 1 (vTOM) of SEV_FEATURES is set in the VMSA of an SNP-active
VM, the VIRTUAL_TOM..."
So SEV_FEATURES[1] is vTOM and it is part of SNP.
Why you keep harping on this being something else is beyond me...
I already pointed you to the patch which adds this along with the other
SEV_FEATURES.
> The details of these two AMD schemes are pretty different. vTOM is
> *not* just a minor option on the C-bit scheme. It's an either/or -- a
> guest VM is either doing the C-bit scheme or the vTOM scheme, not some
> combination. Linux code in coco/core.c could choose to treat C-bit and
> vTOM as two sub-schemes under CC_VENDOR_AMD, but that makes the code a
> bit messy because we end up with "if" statements to figure out whether
> to do things the C-bit way or the vTOM way.
Are you saying that that:
	if (cc_vendor == CC_VENDOR_AMD &&
	    sev_features & MSR_AMD64_SNP_VTOM_ENABLED)
is messy? Why?
We will have to support vTOM sooner or later.
> Or we could model the two AMD schemes as two different vendors,
> which is what I'm suggesting. Doing so recognizes that the two schemes
> are fairly disjoint, and it makes the code cleaner.
How is that any different from the above check?
You *need* some sort of a check to differentiate between the two anyway.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
From: Borislav Petkov <[email protected]> Sent: Tuesday, November 29, 2022 9:47 AM
>
> On Tue, Nov 29, 2022 at 03:49:06PM +0000, Michael Kelley (LINUX) wrote:
> > But it turns out that AMD really has two fairly different schemes:
> > the C-bit scheme and the vTOM scheme.
>
> Except it doesn't:
>
> "In the VMSA of an SNP-active guest, the VIRTUAL_TOM field designates
> a 2MB aligned guest physical address called the virtual top of memory.
> When bit 1 (vTOM) of SEV_FEATURES is set in the VMSA of an SNP-active
> VM, the VIRTUAL_TOM..."
>
> So SEV_FEATURES[1] is vTOM and it is part of SNP.
>
> Why you keep harping on this being something else is beyond me...
>
> I already pointed you to the patch which adds this along with the other
> SEV_FEATURES.
>
> > The details of these two AMD schemes are pretty different. vTOM is
> > *not* just a minor option on the C-bit scheme. It's an either/or -- a
> > guest VM is either doing the C-bit scheme or the vTOM scheme, not some
> > combination. Linux code in coco/core.c could choose to treat C-bit and
> > vTOM as two sub-schemes under CC_VENDOR_AMD, but that makes the code a
> > bit messy because we end up with "if" statements to figure out whether
> > to do things the C-bit way or the vTOM way.
>
> Are you saying that that:
>
> 	if (cc_vendor == CC_VENDOR_AMD &&
> 	    sev_features & MSR_AMD64_SNP_VTOM_ENABLED)
>
> is messy? Why?
>
> We will have to support vTOM sooner or later.
>
> > Or we could model the two AMD schemes as two different vendors,
> > which is what I'm suggesting. Doing so recognizes that the two schemes
> > are fairly disjoint, and it makes the code cleaner.
>
> How is that any different from the above check?
>
> You *need* some sort of a check to differentiate between the two anyway.
>
Alright. Enough conceptual debate. I'll do a v4 of the patch series with
the AMD C-bit and vTOM schemes folded under CC_VENDOR_AMD and
we can see if there's any further feedback. I should have that v4 out later
today or tomorrow.
Michael
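For reference, a minimal sketch of what folding vTOM under
CC_VENDOR_AMD could look like, using the check Boris outlined above
(MSR_AMD64_SNP_VTOM_ENABLED is the name used earlier in the thread;
the actual v4 may differ):

	u64 cc_mkdec(u64 val)
	{
		switch (vendor) {
		case CC_VENDOR_AMD:
			/*
			 * vTOM polarity matches TDX GPA.SHARED: bit set
			 * means shared (decrypted), the opposite of the
			 * C-bit.
			 */
			if (sev_features & MSR_AMD64_SNP_VTOM_ENABLED)
				return val | cc_mask;
			return val & ~cc_mask;
		case CC_VENDOR_INTEL:
			return val | cc_mask;
		default:
			return val;
		}
	}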