2022-11-11 06:34:17

by Michael Kelley (LINUX)

Subject: [PATCH v2 00/12] Drivers: hv: Add PCI pass-thru support to Hyper-V Confidential VMs

This patch series adds support for PCI pass-thru devices to Hyper-V
Confidential VMs (also called "Isolation VMs"). But in preparation, it
first changes how private (encrypted) vs. shared (decrypted) memory is
handled in Hyper-V SEV-SNP guest VMs. The new approach builds on the
confidential computing (coco) mechanisms introduced in the 5.19 kernel
for TDX support and significantly reduces the amount of Hyper-V specific
code. Furthermore, with this new approach a proposed RFC patch set for
generic DMA layer functionality[1] is no longer necessary.

Background
==========
Hyper-V guests on AMD SEV-SNP hardware have the option of using the
"virtual Top Of Memory" (vTOM) feature specified by the SEV-SNP
architecture. With vTOM, shared vs. private memory accesses are
controlled by splitting the guest physical address space into two
halves. vTOM is the dividing line where the uppermost bit of the
physical address space is set; e.g., with 47 bits of guest physical
address space, vTOM is 0x400000000000 (bit 46 is set). Guest physical
memory is accessible at two parallel physical addresses -- one below
vTOM and one above vTOM. Accesses below vTOM are private (encrypted)
while accesses above vTOM are shared (decrypted). In this sense, vTOM
is like the GPA.SHARED bit in Intel TDX.
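
As a rough illustration of the address arithmetic described above (not
code from this series), the following userspace C sketch computes the
vTOM boundary for a given address width and the shared alias of a
private address. The helper names vtom_boundary(), shared_alias() and
gpa_is_shared() are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: vTOM is the uppermost bit of a guest physical
 * address space that is gpa_bits wide.
 */
static uint64_t vtom_boundary(unsigned int gpa_bits)
{
	assert(gpa_bits >= 1 && gpa_bits <= 64);
	return UINT64_C(1) << (gpa_bits - 1);
}

/* Shared (decrypted) alias of a private address that lies below vTOM. */
static uint64_t shared_alias(uint64_t private_gpa, uint64_t vtom)
{
	assert(private_gpa < vtom);
	return private_gpa | vtom;
}

/* An access is shared when the vTOM bit is set in the address. */
static bool gpa_is_shared(uint64_t gpa, uint64_t vtom)
{
	return (gpa & vtom) != 0;
}
```

With 47 address bits, vtom_boundary(47) yields bit 46, and the private
address 0x1000 has its shared alias at that boundary plus 0x1000.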

In Hyper-V's use of vTOM, the normal guest OS runs at VMPL2, while
a Hyper-V provided "paravisor" runs at VMPL0 in the guest VM. (VMPL is
Virtual Machine Privilege Level. See AMD's SEV-SNP spec for more
details.) The paravisor provides emulation for various system devices
like the I/O APIC as part of the guest VM. Accesses to such devices
made by the normal guest OS trap to the paravisor and are emulated in
the guest VM context instead of in the Hyper-V host. This emulation is
invisible to the normal guest OS, but with the quirk that memory mapped
I/O accesses to these devices must be treated as private, not shared as
would be the case for other device accesses.

Support for Hyper-V guests using vTOM was added to the Linux kernel
in two patch sets[2][3]. This support treats the vTOM bit as part of
the physical address. For accessing shared (decrypted) memory, the core
approach is to create a second kernel virtual mapping that maps to
parallel physical addresses above vTOM, while leaving the original
mapping unchanged. Most of the code for creating that second virtual
mapping is confined to Hyper-V specific areas, but there are also
changes to generic swiotlb code.

Changes in this patch set
=========================
In preparation for supporting PCI pass-thru devices, this patch set
changes the core approach for handling vTOM. In the new approach,
the vTOM bit is treated as a protection flag, and not as part of
the physical address. This new approach is like the approach for
the GPA.SHARED bit in Intel TDX. Furthermore, there's no need to
create a second kernel virtual mapping. When memory is changed
between private and shared using set_memory_decrypted() and
set_memory_encrypted(), the PTEs for the existing kernel mapping
are changed to add or remove the vTOM bit just as with TDX. The
hypercalls to change the memory status on the host side are made
using the existing callback mechanism. Everything just works, with
a minor tweak to map the I/O APIC to use private accesses as mentioned
above.
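
To make the "protection flag" idea concrete, here is a minimal userspace
model, assuming a 64-bit PTE whose address field occupies bits 12-51
and a vTOM at bit 46. The type and helpers (pte_model_t,
pte_model_mk_decrypted(), pte_model_mk_encrypted(), pte_model_phys())
are invented for illustration; in the kernel the PTE update happens as
part of set_memory_decrypted() and set_memory_encrypted().

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model only: a PTE is treated as a plain 64-bit value. */
typedef uint64_t pte_model_t;

/* Assumed layout: physical address field in bits 12..51. */
#define MODEL_PTE_ADDR_MASK UINT64_C(0x000FFFFFFFFFF000)

/* Mark a mapping shared by setting the vTOM bit as a protection flag. */
static pte_model_t pte_model_mk_decrypted(pte_model_t pte, uint64_t vtom)
{
	return pte | vtom;
}

/* Mark a mapping private again by clearing the vTOM bit. */
static pte_model_t pte_model_mk_encrypted(pte_model_t pte, uint64_t vtom)
{
	return pte & ~vtom;
}

/*
 * The physical address excludes the vTOM bit, so it is the same for
 * the private and shared forms of the PTE.
 */
static uint64_t pte_model_phys(pte_model_t pte, uint64_t vtom)
{
	assert((vtom & MODEL_PTE_ADDR_MASK) == vtom);
	return pte & MODEL_PTE_ADDR_MASK & ~vtom;
}
```

The point of the model is that flipping a page between private and
shared changes only one bit in the existing PTE; the physical address
and the kernel virtual address both stay the same, which is why no
second mapping is required.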

With the new handling of vTOM in place, existing Hyper-V code that
creates the second kernel virtual mapping still works, but it is now
redundant as the original kernel virtual mapping (as updated) maps
to the same physical address. To simplify things going forward, this
patch set removes the code that creates the second kernel virtual
mapping. And since a second kernel virtual mapping is no longer
needed, changes to the DMA layer proposed as an RFC[1] are no
longer needed.

Finally, to support PCI pass-thru in a Confidential VM, Hyper-V
requires that all accesses to PCI config space be emulated using
a hypercall. This patch set adds functions to invoke those
hypercalls and uses them in the config space access functions
in the Hyper-V PCI driver. Lastly, the Hyper-V PCI driver is
marked as allowed to be used in a Confidential VM. The Hyper-V
PCI driver has been hardened against a malicious Hyper-V in a
previous patch set.[4]

Patch Organization
==================
Patch 1 fixes a bug in __ioremap_caller() that affects the
existing Hyper-V code after the change to treat the vTOM bit as
a protection flag. Fixing the bug allows the old code to continue
to run until later patches in the series remove or update it.
This sequencing avoids the need to enable the new approach and
remove the old code in a single large patch.

Patch 2 handles the I/O APIC quirk by defining a new CC_ATTR enum
member that is set only when running on Hyper-V.

Patch 3 does some simple reordering of code to facilitate Patch 5.

Patch 4 tweaks calls to vmap_pfn() in the old Hyper-V code that
are deleted in later patches in the series. Like Patch 1, this
patch helps avoid the need to enable the new approach and remove
the old code in a single large patch.

Patch 5 enables the new approach to handling vTOM for Hyper-V
guest VMs.

Patches 6 thru 9 remove existing code for creating a second
kernel virtual mapping.

Patch 10 updates existing code so that it no longer assumes that
the vTOM bit is part of the physical address.

Patch 11 adds the new hypercalls for accessing MMIO Config Space.

Patch 12 updates the PCI Hyper-V driver to use the new hypercalls
and enables the PCI Hyper-V driver to be used in a Confidential VM.

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/all/[email protected]/

---

Changes in v2:
* Patch 11: Include more detail in the error message if an MMIO
hypercall fails. [Bjorn Helgaas]

* Patch 12: Restore removed memory barriers. It seems like these
barriers should not be needed because of the spin_unlock() calls,
but commit bdd74440d9e8 indicates that they are. This patch series
will leave the barriers unchanged; whether they are really needed
can be sorted out separately. [Boqun Feng]


Michael Kelley (12):
x86/ioremap: Fix page aligned size calculation in __ioremap_caller()
x86/ioapic: Gate decrypted mapping on cc_platform_has() attribute
x86/hyperv: Reorder code in prep for subsequent patch
Drivers: hv: Explicitly request decrypted in vmap_pfn() calls
x86/hyperv: Change vTOM handling to use standard coco mechanisms
swiotlb: Remove bounce buffer remapping for Hyper-V
Drivers: hv: vmbus: Remove second mapping of VMBus monitor pages
Drivers: hv: vmbus: Remove second way of mapping ring buffers
hv_netvsc: Remove second mapping of send and recv buffers
Drivers: hv: Don't remap addresses that are above shared_gpa_boundary
PCI: hv: Add hypercalls to read/write MMIO space
PCI: hv: Enable PCI pass-thru devices in Confidential VMs

arch/x86/coco/core.c | 10 +-
arch/x86/hyperv/hv_init.c | 7 +-
arch/x86/hyperv/ivm.c | 121 +++++++++----------
arch/x86/include/asm/hyperv-tlfs.h | 3 +
arch/x86/include/asm/mshyperv.h | 8 +-
arch/x86/kernel/apic/io_apic.c | 3 +-
arch/x86/kernel/cpu/mshyperv.c | 22 ++--
arch/x86/mm/ioremap.c | 2 +-
arch/x86/mm/pat/set_memory.c | 6 +-
drivers/hv/Kconfig | 1 -
drivers/hv/channel_mgmt.c | 2 +-
drivers/hv/connection.c | 113 +++++-------------
drivers/hv/hv.c | 23 ++--
drivers/hv/hv_common.c | 11 --
drivers/hv/hyperv_vmbus.h | 2 -
drivers/hv/ring_buffer.c | 62 ++++------
drivers/net/hyperv/hyperv_net.h | 2 -
drivers/net/hyperv/netvsc.c | 48 +-------
drivers/pci/controller/pci-hyperv.c | 232 ++++++++++++++++++++++++++----------
include/asm-generic/hyperv-tlfs.h | 22 ++++
include/asm-generic/mshyperv.h | 2 -
include/linux/cc_platform.h | 13 ++
include/linux/swiotlb.h | 2 -
kernel/dma/swiotlb.c | 45 +------
24 files changed, 358 insertions(+), 404 deletions(-)

--
1.8.3.1



2022-11-11 06:34:29

by Michael Kelley (LINUX)

Subject: [PATCH v2 09/12] hv_netvsc: Remove second mapping of send and recv buffers

With changes to how Hyper-V guest VMs flip memory between private
(encrypted) and shared (decrypted), creating a second kernel virtual
mapping for shared memory is no longer necessary. Everything needed
for the transition to shared is handled by set_memory_decrypted().

As such, remove the code to create and manage the second
mapping for the pre-allocated send and recv buffers. This mapping
is the last user of hv_map_memory()/hv_unmap_memory(), so delete
these functions as well. Finally, hv_map_memory() is the last
user of vmap_pfn() in Hyper-V guest code, so remove the Kconfig
selection of VMAP_PFN.

Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Tianyu Lan <[email protected]>
---
arch/x86/hyperv/ivm.c | 28 ------------------------
drivers/hv/Kconfig | 1 -
drivers/hv/hv_common.c | 11 ----------
drivers/net/hyperv/hyperv_net.h | 2 --
drivers/net/hyperv/netvsc.c | 48 ++---------------------------------------
include/asm-generic/mshyperv.h | 2 --
6 files changed, 2 insertions(+), 90 deletions(-)

diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 29ccbe8..5e4b8b0 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -349,34 +349,6 @@ void __init hv_vtom_init(void)

#endif /* CONFIG_AMD_MEM_ENCRYPT */

-/*
- * hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.
- */
-void *hv_map_memory(void *addr, unsigned long size)
-{
- unsigned long *pfns = kcalloc(size / PAGE_SIZE,
- sizeof(unsigned long), GFP_KERNEL);
- void *vaddr;
- int i;
-
- if (!pfns)
- return NULL;
-
- for (i = 0; i < size / PAGE_SIZE; i++)
- pfns[i] = vmalloc_to_pfn(addr + i * PAGE_SIZE) +
- (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
-
- vaddr = vmap_pfn(pfns, size / PAGE_SIZE, pgprot_decrypted(PAGE_KERNEL_NOENC));
- kfree(pfns);
-
- return vaddr;
-}
-
-void hv_unmap_memory(void *addr)
-{
- vunmap(addr);
-}
-
enum hv_isolation_type hv_get_isolation_type(void)
{
if (!(ms_hyperv.priv_high & HV_ISOLATION))
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 0747a8f..9a074cb 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -8,7 +8,6 @@ config HYPERV
|| (ARM64 && !CPU_BIG_ENDIAN))
select PARAVIRT
select X86_HV_CALLBACK_VECTOR if X86
- select VMAP_PFN
help
Select this option to run Linux as a Hyper-V client operating
system.
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index ae68298..566735f 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -308,14 +308,3 @@ u64 __weak hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_s
return HV_STATUS_INVALID_PARAMETER;
}
EXPORT_SYMBOL_GPL(hv_ghcb_hypercall);
-
-void __weak *hv_map_memory(void *addr, unsigned long size)
-{
- return NULL;
-}
-EXPORT_SYMBOL_GPL(hv_map_memory);
-
-void __weak hv_unmap_memory(void *addr)
-{
-}
-EXPORT_SYMBOL_GPL(hv_unmap_memory);
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index dd5919e..33d51e3 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -1139,7 +1139,6 @@ struct netvsc_device {

/* Receive buffer allocated by us but manages by NetVSP */
void *recv_buf;
- void *recv_original_buf;
u32 recv_buf_size; /* allocated bytes */
struct vmbus_gpadl recv_buf_gpadl_handle;
u32 recv_section_cnt;
@@ -1148,7 +1147,6 @@ struct netvsc_device {

/* Send buffer allocated by us */
void *send_buf;
- void *send_original_buf;
u32 send_buf_size;
struct vmbus_gpadl send_buf_gpadl_handle;
u32 send_section_cnt;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 9352dad..661bbe6 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -154,17 +154,8 @@ static void free_netvsc_device(struct rcu_head *head)
int i;

kfree(nvdev->extension);
-
- if (nvdev->recv_original_buf)
- vfree(nvdev->recv_original_buf);
- else
- vfree(nvdev->recv_buf);
-
- if (nvdev->send_original_buf)
- vfree(nvdev->send_original_buf);
- else
- vfree(nvdev->send_buf);
-
+ vfree(nvdev->recv_buf);
+ vfree(nvdev->send_buf);
bitmap_free(nvdev->send_section_map);

for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
@@ -347,7 +338,6 @@ static int netvsc_init_buf(struct hv_device *device,
struct nvsp_message *init_packet;
unsigned int buf_size;
int i, ret = 0;
- void *vaddr;

/* Get receive buffer area. */
buf_size = device_info->recv_sections * device_info->recv_section_size;
@@ -383,17 +373,6 @@ static int netvsc_init_buf(struct hv_device *device,
goto cleanup;
}

- if (hv_isolation_type_snp()) {
- vaddr = hv_map_memory(net_device->recv_buf, buf_size);
- if (!vaddr) {
- ret = -ENOMEM;
- goto cleanup;
- }
-
- net_device->recv_original_buf = net_device->recv_buf;
- net_device->recv_buf = vaddr;
- }
-
/* Notify the NetVsp of the gpadl handle */
init_packet = &net_device->channel_init_pkt;
memset(init_packet, 0, sizeof(struct nvsp_message));
@@ -497,17 +476,6 @@ static int netvsc_init_buf(struct hv_device *device,
goto cleanup;
}

- if (hv_isolation_type_snp()) {
- vaddr = hv_map_memory(net_device->send_buf, buf_size);
- if (!vaddr) {
- ret = -ENOMEM;
- goto cleanup;
- }
-
- net_device->send_original_buf = net_device->send_buf;
- net_device->send_buf = vaddr;
- }
-
/* Notify the NetVsp of the gpadl handle */
init_packet = &net_device->channel_init_pkt;
memset(init_packet, 0, sizeof(struct nvsp_message));
@@ -762,12 +730,6 @@ void netvsc_device_remove(struct hv_device *device)
netvsc_teardown_send_gpadl(device, net_device, ndev);
}

- if (net_device->recv_original_buf)
- hv_unmap_memory(net_device->recv_buf);
-
- if (net_device->send_original_buf)
- hv_unmap_memory(net_device->send_buf);
-
/* Release all resources */
free_netvsc_device_rcu(net_device);
}
@@ -1831,12 +1793,6 @@ struct netvsc_device *netvsc_device_add(struct hv_device *device,
netif_napi_del(&net_device->chan_table[0].napi);

cleanup2:
- if (net_device->recv_original_buf)
- hv_unmap_memory(net_device->recv_buf);
-
- if (net_device->send_original_buf)
- hv_unmap_memory(net_device->send_buf);
-
free_netvsc_device(&net_device->rcu);

return ERR_PTR(ret);
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index bfb9eb9..6fabc4a 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -267,8 +267,6 @@ static inline int cpumask_to_vpset_noself(struct hv_vpset *vpset,
void hyperv_cleanup(void);
bool hv_query_ext_cap(u64 cap_query);
void hv_setup_dma_ops(struct device *dev, bool coherent);
-void *hv_map_memory(void *addr, unsigned long size);
-void hv_unmap_memory(void *addr);
#else /* CONFIG_HYPERV */
static inline bool hv_is_hyperv_initialized(void) { return false; }
static inline bool hv_is_hibernation_supported(void) { return false; }
--
1.8.3.1


2022-11-11 06:35:23

by Michael Kelley (LINUX)

Subject: [PATCH v2 10/12] Drivers: hv: Don't remap addresses that are above shared_gpa_boundary

With the vTOM bit now treated as a protection flag and not part of
the physical address, avoid remapping physical addresses with vTOM set
since technically such addresses aren't valid. Use ioremap_cache()
instead of memremap() to ensure that the mapping provides decrypted
access, which will correctly set the vTOM bit as a protection flag.

While this change is not required for correctness with the current
implementation of memremap(), for general code hygiene it's better to
not depend on the mapping functions doing something reasonable with
a physical address that is out-of-range.
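
The masking step can be sketched in isolation as follows. This is a
userspace model, not the kernel code: strip_vtom() and
synic_page_base() are hypothetical names, and the page shift of 12
stands in for HV_HYP_PAGE_SHIFT under the assumption of 4 KiB Hyper-V
pages.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-in for HV_HYP_PAGE_SHIFT (4 KiB Hyper-V pages). */
#define MODEL_HV_HYP_PAGE_SHIFT 12

/*
 * Clear the vTOM bit so that the result is a valid physical address.
 * shared_gpa_boundary is expected to be a single bit (the vTOM bit).
 */
static uint64_t strip_vtom(uint64_t gpa, uint64_t shared_gpa_boundary)
{
	/* A single-bit boundary is a nonzero power of two. */
	assert(shared_gpa_boundary != 0 &&
	       (shared_gpa_boundary & (shared_gpa_boundary - 1)) == 0);
	return gpa & ~shared_gpa_boundary;
}

/*
 * Convert a page frame number as reported by the hypervisor into the
 * base address that would be handed to the mapping function.
 */
static uint64_t synic_page_base(uint64_t base_pfn, uint64_t shared_gpa_boundary)
{
	return strip_vtom(base_pfn << MODEL_HV_HYP_PAGE_SHIFT,
			  shared_gpa_boundary);
}
```

Stripping is idempotent, so an address that never had the vTOM bit set
passes through unchanged.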

While here, fix typos in two error messages.

Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Tianyu Lan <[email protected]>
---
arch/x86/hyperv/hv_init.c | 7 +++++--
drivers/hv/hv.c | 23 +++++++++++++----------
2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index f49bc3e..7f46e12 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -64,7 +64,10 @@ static int hyperv_init_ghcb(void)
* memory boundary and map it here.
*/
rdmsrl(MSR_AMD64_SEV_ES_GHCB, ghcb_gpa);
- ghcb_va = memremap(ghcb_gpa, HV_HYP_PAGE_SIZE, MEMREMAP_WB);
+
+ /* Mask out vTOM bit. ioremap_cache() maps decrypted */
+ ghcb_gpa &= ~ms_hyperv.shared_gpa_boundary;
+ ghcb_va = (void *)ioremap_cache(ghcb_gpa, HV_HYP_PAGE_SIZE);
if (!ghcb_va)
return -ENOMEM;

@@ -220,7 +223,7 @@ static int hv_cpu_die(unsigned int cpu)
if (hv_ghcb_pg) {
ghcb_va = (void **)this_cpu_ptr(hv_ghcb_pg);
if (*ghcb_va)
- memunmap(*ghcb_va);
+ iounmap(*ghcb_va);
*ghcb_va = NULL;
}

diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 4d6480d..410e6c4 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -217,11 +217,13 @@ void hv_synic_enable_regs(unsigned int cpu)
simp.simp_enabled = 1;

if (hv_isolation_type_snp()) {
+ /* Mask out vTOM bit. ioremap_cache() maps decrypted */
+ u64 base = (simp.base_simp_gpa << HV_HYP_PAGE_SHIFT) &
+ ~ms_hyperv.shared_gpa_boundary;
hv_cpu->synic_message_page
- = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
- HV_HYP_PAGE_SIZE, MEMREMAP_WB);
+ = (void *)ioremap_cache(base, HV_HYP_PAGE_SIZE);
if (!hv_cpu->synic_message_page)
- pr_err("Fail to map syinc message page.\n");
+ pr_err("Fail to map synic message page.\n");
} else {
simp.base_simp_gpa = virt_to_phys(hv_cpu->synic_message_page)
>> HV_HYP_PAGE_SHIFT;
@@ -234,12 +236,13 @@ void hv_synic_enable_regs(unsigned int cpu)
siefp.siefp_enabled = 1;

if (hv_isolation_type_snp()) {
- hv_cpu->synic_event_page =
- memremap(siefp.base_siefp_gpa << HV_HYP_PAGE_SHIFT,
- HV_HYP_PAGE_SIZE, MEMREMAP_WB);
-
+ /* Mask out vTOM bit. ioremap_cache() maps decrypted */
+ u64 base = (siefp.base_siefp_gpa << HV_HYP_PAGE_SHIFT) &
+ ~ms_hyperv.shared_gpa_boundary;
+ hv_cpu->synic_event_page
+ = (void *)ioremap_cache(base, HV_HYP_PAGE_SIZE);
if (!hv_cpu->synic_event_page)
- pr_err("Fail to map syinc event page.\n");
+ pr_err("Fail to map synic event page.\n");
} else {
siefp.base_siefp_gpa = virt_to_phys(hv_cpu->synic_event_page)
>> HV_HYP_PAGE_SHIFT;
@@ -316,7 +319,7 @@ void hv_synic_disable_regs(unsigned int cpu)
*/
simp.simp_enabled = 0;
if (hv_isolation_type_snp())
- memunmap(hv_cpu->synic_message_page);
+ iounmap(hv_cpu->synic_message_page);
else
simp.base_simp_gpa = 0;

@@ -326,7 +329,7 @@ void hv_synic_disable_regs(unsigned int cpu)
siefp.siefp_enabled = 0;

if (hv_isolation_type_snp())
- memunmap(hv_cpu->synic_event_page);
+ iounmap(hv_cpu->synic_event_page);
else
siefp.base_siefp_gpa = 0;

--
1.8.3.1


2022-11-11 06:37:52

by Michael Kelley (LINUX)

Subject: [PATCH v2 02/12] x86/ioapic: Gate decrypted mapping on cc_platform_has() attribute

Current code always maps the IOAPIC as shared (decrypted) in a
confidential VM. But Hyper-V guest VMs on AMD SEV-SNP with vTOM
enabled use a paravisor running in VMPL0 to emulate the IOAPIC.
In such a case, the IOAPIC must be accessed as private (encrypted).

Fix this by gating the IOAPIC decrypted mapping on a new
cc_platform_has() attribute that a subsequent patch in the series
will set only for Hyper-V guests. The new attribute is named
somewhat generically because similar paravisor emulation cases
may arise in the future.

Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Wei Liu <[email protected]>
---
arch/x86/kernel/apic/io_apic.c | 3 ++-
include/linux/cc_platform.h | 13 +++++++++++++
2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index a868b76..d2c1bf7 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2686,7 +2686,8 @@ static void io_apic_set_fixmap(enum fixed_addresses idx, phys_addr_t phys)
* Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
* bits, just like normal ioremap():
*/
- flags = pgprot_decrypted(flags);
+ if (!cc_platform_has(CC_ATTR_HAS_PARAVISOR))
+ flags = pgprot_decrypted(flags);

__set_fixmap(idx, phys, flags);
}
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index cb0d6cd..b6c4a79 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -90,6 +90,19 @@ enum cc_attr {
* Examples include TDX Guest.
*/
CC_ATTR_HOTPLUG_DISABLED,
+
+ /**
+ * @CC_ATTR_HAS_PARAVISOR: Guest VM is running with a paravisor
+ *
+ * The platform/OS is running as a guest/virtual machine with
+ * a paravisor in VMPL0. Having a paravisor affects things
+ * like whether the I/O APIC is emulated and operates in the
+ * encrypted or decrypted portion of the guest physical address
+ * space.
+ *
+ * Examples include Hyper-V SEV-SNP guests using vTOM.
+ */
+ CC_ATTR_HAS_PARAVISOR,
};

#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
--
1.8.3.1


2022-11-11 06:50:57

by Michael Kelley (LINUX)

Subject: [PATCH v2 12/12] PCI: hv: Enable PCI pass-thru devices in Confidential VMs

For PCI pass-thru devices in a Confidential VM, Hyper-V requires
that PCI config space be accessed via hypercalls. In normal VMs,
config space accesses are trapped to the Hyper-V host and emulated.
But in a confidential VM, the host can't access guest memory to
decode the instruction for emulation, so an explicit hypercall must
be used.

Update the PCI config space access functions to use the hypercalls
when such use is indicated by Hyper-V flags. Also, set the flag to
allow the Hyper-V PCI driver to be loaded and used in a Confidential
VM (a.k.a., "Isolation VM"). The driver has previously been hardened
against a malicious Hyper-V host[1].
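
The choice between the two access paths can be modeled roughly as
below. This is a simplified userspace sketch: enum cfg_path,
cfg_select_path() and cfg_access_gpa() are hypothetical names, the
0x1000 values for the config page offset and size are assumptions made
for the example, and the special cases in the real accessors (such as
simulated ID reads and read-only ranges) are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for this model only. */
#define MODEL_CFG_PAGE_OFFSET 0x1000
#define MODEL_CFG_PAGE_SIZE   0x1000

enum cfg_path {
	CFG_PATH_MMIO,       /* direct readb/readw/readl style access */
	CFG_PATH_HYPERCALL,  /* explicit MMIO read/write hypercall */
	CFG_PATH_REJECT,     /* access falls outside the config page */
};

/* Pick the access path for a config space access of 'size' bytes. */
static enum cfg_path cfg_select_path(bool use_calls, int where, int size)
{
	if (where < 0 || size <= 0 || where + size > MODEL_CFG_PAGE_SIZE)
		return CFG_PATH_REJECT;
	return use_calls ? CFG_PATH_HYPERCALL : CFG_PATH_MMIO;
}

/*
 * Guest physical address passed to the hypercall path: the start of
 * the config window, plus the config page offset, plus the register.
 */
static uint64_t cfg_access_gpa(uint64_t mem_config_start, int where)
{
	assert(where >= 0 && where < MODEL_CFG_PAGE_SIZE);
	return mem_config_start + MODEL_CFG_PAGE_OFFSET + (uint64_t)where;
}
```

In both paths the slot is selected first by a 4-byte write to the start
of the config window, and the whole sequence runs under the config
lock so that the select and the access are not interleaved with
another caller's.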

[1] https://lore.kernel.org/all/[email protected]/

Co-developed-by: Dexuan Cui <[email protected]>
Signed-off-by: Dexuan Cui <[email protected]>
Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Boqun Feng <[email protected]>
---
drivers/hv/channel_mgmt.c | 2 +-
drivers/pci/controller/pci-hyperv.c | 168 ++++++++++++++++++++++--------------
2 files changed, 105 insertions(+), 65 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 5b12040..c0f9ac2 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -67,7 +67,7 @@
{ .dev_type = HV_PCIE,
HV_PCIE_GUID,
.perf_device = false,
- .allowed_in_isolated = false,
+ .allowed_in_isolated = true,
},

/* Synthetic Frame Buffer */
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 09b40a1..6ce83e4 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -514,6 +514,7 @@ struct hv_pcibus_device {

/* Highest slot of child device with resources allocated */
int wslot_res_allocated;
+ bool use_calls; /* Use hypercalls to access mmio cfg space */

/* hypercall arg, must not cross page boundary */
struct hv_retarget_device_interrupt retarget_msi_interrupt_params;
@@ -1136,8 +1137,10 @@ static void hv_pci_write_mmio(struct device *dev, phys_addr_t gpa, int size, u32
static void _hv_pcifront_read_config(struct hv_pci_dev *hpdev, int where,
int size, u32 *val)
{
+ struct hv_pcibus_device *hbus = hpdev->hbus;
+ struct device *dev = &hbus->hdev->device;
+ int offset = where + CFG_PAGE_OFFSET;
unsigned long flags;
- void __iomem *addr = hpdev->hbus->cfg_addr + CFG_PAGE_OFFSET + where;

/*
* If the attempt is to read the IDs or the ROM BAR, simulate that.
@@ -1165,56 +1168,79 @@ static void _hv_pcifront_read_config(struct hv_pci_dev *hpdev, int where,
*/
*val = 0;
} else if (where + size <= CFG_PAGE_SIZE) {
- spin_lock_irqsave(&hpdev->hbus->config_lock, flags);
- /* Choose the function to be read. (See comment above) */
- writel(hpdev->desc.win_slot.slot, hpdev->hbus->cfg_addr);
- /* Make sure the function was chosen before we start reading. */
- mb();
- /* Read from that function's config space. */
- switch (size) {
- case 1:
- *val = readb(addr);
- break;
- case 2:
- *val = readw(addr);
- break;
- default:
- *val = readl(addr);
- break;
+
+ spin_lock_irqsave(&hbus->config_lock, flags);
+ if (hbus->use_calls) {
+ phys_addr_t addr = hbus->mem_config->start + offset;
+
+ hv_pci_write_mmio(dev, hbus->mem_config->start, 4,
+ hpdev->desc.win_slot.slot);
+ hv_pci_read_mmio(dev, addr, size, val);
+ } else {
+ void __iomem *addr = hbus->cfg_addr + offset;
+
+ /* Choose the function to be read. (See comment above) */
+ writel(hpdev->desc.win_slot.slot, hbus->cfg_addr);
+ /* Make sure the function was chosen before reading. */
+ mb();
+ /* Read from that function's config space. */
+ switch (size) {
+ case 1:
+ *val = readb(addr);
+ break;
+ case 2:
+ *val = readw(addr);
+ break;
+ default:
+ *val = readl(addr);
+ break;
+ }
+ /*
+ * Make sure the read was done before we release the
+ * spinlock allowing consecutive reads/writes.
+ */
+ mb();
}
- /*
- * Make sure the read was done before we release the spinlock
- * allowing consecutive reads/writes.
- */
- mb();
- spin_unlock_irqrestore(&hpdev->hbus->config_lock, flags);
+ spin_unlock_irqrestore(&hbus->config_lock, flags);
} else {
- dev_err(&hpdev->hbus->hdev->device,
- "Attempt to read beyond a function's config space.\n");
+ dev_err(dev, "Attempt to read beyond a function's config space.\n");
}
}

static u16 hv_pcifront_get_vendor_id(struct hv_pci_dev *hpdev)
{
+ struct hv_pcibus_device *hbus = hpdev->hbus;
+ struct device *dev = &hbus->hdev->device;
+ u32 val;
u16 ret;
unsigned long flags;
- void __iomem *addr = hpdev->hbus->cfg_addr + CFG_PAGE_OFFSET +
- PCI_VENDOR_ID;

- spin_lock_irqsave(&hpdev->hbus->config_lock, flags);
+ spin_lock_irqsave(&hbus->config_lock, flags);

- /* Choose the function to be read. (See comment above) */
- writel(hpdev->desc.win_slot.slot, hpdev->hbus->cfg_addr);
- /* Make sure the function was chosen before we start reading. */
- mb();
- /* Read from that function's config space. */
- ret = readw(addr);
- /*
- * mb() is not required here, because the spin_unlock_irqrestore()
- * is a barrier.
- */
+ if (hbus->use_calls) {
+ phys_addr_t addr = hbus->mem_config->start +
+ CFG_PAGE_OFFSET + PCI_VENDOR_ID;
+
+ hv_pci_write_mmio(dev, hbus->mem_config->start, 4,
+ hpdev->desc.win_slot.slot);
+ hv_pci_read_mmio(dev, addr, 2, &val);
+ ret = val; /* Truncates to 16 bits */
+ } else {
+ void __iomem *addr = hbus->cfg_addr + CFG_PAGE_OFFSET +
+ PCI_VENDOR_ID;
+ /* Choose the function to be read. (See comment above) */
+ writel(hpdev->desc.win_slot.slot, hbus->cfg_addr);
+ /* Make sure the function was chosen before we start reading. */
+ mb();
+ /* Read from that function's config space. */
+ ret = readw(addr);
+ /*
+ * mb() is not required here, because the
+ * spin_unlock_irqrestore() is a barrier.
+ */
+ }

- spin_unlock_irqrestore(&hpdev->hbus->config_lock, flags);
+ spin_unlock_irqrestore(&hbus->config_lock, flags);

return ret;
}
@@ -1229,39 +1255,51 @@ static u16 hv_pcifront_get_vendor_id(struct hv_pci_dev *hpdev)
static void _hv_pcifront_write_config(struct hv_pci_dev *hpdev, int where,
int size, u32 val)
{
+ struct hv_pcibus_device *hbus = hpdev->hbus;
+ struct device *dev = &hbus->hdev->device;
+ int offset = where + CFG_PAGE_OFFSET;
unsigned long flags;
- void __iomem *addr = hpdev->hbus->cfg_addr + CFG_PAGE_OFFSET + where;

if (where >= PCI_SUBSYSTEM_VENDOR_ID &&
where + size <= PCI_CAPABILITY_LIST) {
/* SSIDs and ROM BARs are read-only */
} else if (where >= PCI_COMMAND && where + size <= CFG_PAGE_SIZE) {
- spin_lock_irqsave(&hpdev->hbus->config_lock, flags);
- /* Choose the function to be written. (See comment above) */
- writel(hpdev->desc.win_slot.slot, hpdev->hbus->cfg_addr);
- /* Make sure the function was chosen before we start writing. */
- wmb();
- /* Write to that function's config space. */
- switch (size) {
- case 1:
- writeb(val, addr);
- break;
- case 2:
- writew(val, addr);
- break;
- default:
- writel(val, addr);
- break;
+ spin_lock_irqsave(&hbus->config_lock, flags);
+
+ if (hbus->use_calls) {
+ phys_addr_t addr = hbus->mem_config->start + offset;
+
+ hv_pci_write_mmio(dev, hbus->mem_config->start, 4,
+ hpdev->desc.win_slot.slot);
+ hv_pci_write_mmio(dev, addr, size, val);
+ } else {
+ void __iomem *addr = hbus->cfg_addr + offset;
+
+ /* Choose the function to write. (See comment above) */
+ writel(hpdev->desc.win_slot.slot, hbus->cfg_addr);
+ /* Make sure the function was chosen before writing. */
+ wmb();
+ /* Write to that function's config space. */
+ switch (size) {
+ case 1:
+ writeb(val, addr);
+ break;
+ case 2:
+ writew(val, addr);
+ break;
+ default:
+ writel(val, addr);
+ break;
+ }
+ /*
+ * Make sure the write was done before we release the
+ * spinlock allowing consecutive reads/writes.
+ */
+ mb();
}
- /*
- * Make sure the write was done before we release the spinlock
- * allowing consecutive reads/writes.
- */
- mb();
- spin_unlock_irqrestore(&hpdev->hbus->config_lock, flags);
+ spin_unlock_irqrestore(&hbus->config_lock, flags);
} else {
- dev_err(&hpdev->hbus->hdev->device,
- "Attempt to write beyond a function's config space.\n");
+ dev_err(dev, "Attempt to write beyond a function's config space.\n");
}
}

@@ -3580,6 +3618,7 @@ static int hv_pci_probe(struct hv_device *hdev,
hbus->bridge->domain_nr = dom;
#ifdef CONFIG_X86
hbus->sysdata.domain = dom;
+ hbus->use_calls = !!(ms_hyperv.hints & HV_X64_USE_MMIO_HYPERCALLS);
#elif defined(CONFIG_ARM64)
/*
* Set the PCI bus parent to be the corresponding VMbus
@@ -3589,6 +3628,7 @@ static int hv_pci_probe(struct hv_device *hdev,
* information to devices created on the bus.
*/
hbus->sysdata.parent = hdev->device.parent;
+ hbus->use_calls = false;
#endif

hbus->hdev = hdev;
--
1.8.3.1


2022-11-11 07:08:30

by Michael Kelley (LINUX)

Subject: [PATCH v2 08/12] Drivers: hv: vmbus: Remove second way of mapping ring buffers

With changes to how Hyper-V guest VMs flip memory between private
(encrypted) and shared (decrypted), it's no longer necessary to
have separate code paths for mapping VMBus ring buffers for
normal VMs and for Confidential VMs.

As such, remove the code path that uses vmap_pfn(), and set
the protection flags argument to vmap() to account for the
difference between normal and Confidential VMs.
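
For readers unfamiliar with the wraparound mapping, the index pattern
used to build the page array can be shown with a small userspace
sketch. wraparound_indices() is a hypothetical helper that records
which source page lands in each slot of the mapping; the loop mirrors
the one in hv_ringbuffer_init() in the diff below.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill idx[] with the source page index for each slot of the
 * wraparound mapping and return the number of slots filled, which is
 * 2 * page_cnt - 1. Page 0 (the ring buffer header) is mapped once;
 * the data pages 1..page_cnt-1 are mapped twice, back to back.
 * The caller must provide room for 2 * page_cnt - 1 entries.
 */
static size_t wraparound_indices(size_t page_cnt, size_t *idx)
{
	size_t i;

	assert(page_cnt >= 2);
	assert(idx != NULL);

	idx[0] = 0;
	for (i = 0; i < 2 * (page_cnt - 1); i++)
		idx[i + 1] = i % (page_cnt - 1) + 1;

	return 2 * page_cnt - 1;
}
```

For page_cnt = 4 the slots hold pages 0, 1, 2, 3, 1, 2, 3, so a packet
that runs past the end of the data area can be accessed contiguously
without special handling at the wrap point.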

Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Tianyu Lan <[email protected]>
---
drivers/hv/ring_buffer.c | 62 ++++++++++++++++--------------------------------
1 file changed, 20 insertions(+), 42 deletions(-)

diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index b4a91b1..20a0631 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -186,8 +186,6 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
struct page *pages, u32 page_cnt, u32 max_pkt_size)
{
struct page **pages_wraparound;
- unsigned long *pfns_wraparound;
- u64 pfn;
int i;

BUILD_BUG_ON((sizeof(struct hv_ring_buffer) != PAGE_SIZE));
@@ -196,50 +194,30 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
* First page holds struct hv_ring_buffer, do wraparound mapping for
* the rest.
*/
- if (hv_isolation_type_snp()) {
- pfn = page_to_pfn(pages) +
- PFN_DOWN(ms_hyperv.shared_gpa_boundary);
+ pages_wraparound = kcalloc(page_cnt * 2 - 1,
+ sizeof(struct page *),
+ GFP_KERNEL);
+ if (!pages_wraparound)
+ return -ENOMEM;

- pfns_wraparound = kcalloc(page_cnt * 2 - 1,
- sizeof(unsigned long), GFP_KERNEL);
- if (!pfns_wraparound)
- return -ENOMEM;
-
- pfns_wraparound[0] = pfn;
- for (i = 0; i < 2 * (page_cnt - 1); i++)
- pfns_wraparound[i + 1] = pfn + i % (page_cnt - 1) + 1;
-
- ring_info->ring_buffer = (struct hv_ring_buffer *)
- vmap_pfn(pfns_wraparound, page_cnt * 2 - 1,
- pgprot_decrypted(PAGE_KERNEL_NOENC));
- kfree(pfns_wraparound);
-
- if (!ring_info->ring_buffer)
- return -ENOMEM;
-
- /* Zero ring buffer after setting memory host visibility. */
- memset(ring_info->ring_buffer, 0x00, PAGE_SIZE * page_cnt);
- } else {
- pages_wraparound = kcalloc(page_cnt * 2 - 1,
- sizeof(struct page *),
- GFP_KERNEL);
- if (!pages_wraparound)
- return -ENOMEM;
-
- pages_wraparound[0] = pages;
- for (i = 0; i < 2 * (page_cnt - 1); i++)
- pages_wraparound[i + 1] =
- &pages[i % (page_cnt - 1) + 1];
+ pages_wraparound[0] = pages;
+ for (i = 0; i < 2 * (page_cnt - 1); i++)
+ pages_wraparound[i + 1] =
+ &pages[i % (page_cnt - 1) + 1];

- ring_info->ring_buffer = (struct hv_ring_buffer *)
- vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP,
- PAGE_KERNEL);
+ ring_info->ring_buffer = (struct hv_ring_buffer *)
+ vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP,
+ pgprot_decrypted(PAGE_KERNEL_NOENC));

- kfree(pages_wraparound);
- if (!ring_info->ring_buffer)
- return -ENOMEM;
- }
+ kfree(pages_wraparound);
+ if (!ring_info->ring_buffer)
+ return -ENOMEM;

+ /*
+ * Ensure the header page is zero'ed since
+ * encryption status may have changed.
+ */
+ memset(ring_info->ring_buffer, 0, HV_HYP_PAGE_SIZE);

ring_info->ring_buffer->read_index =
ring_info->ring_buffer->write_index = 0;
--
1.8.3.1
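
For readers following the diff above, the wraparound layout that hv_ringbuffer_init() passes to vmap() can be sketched in user space. This is only an illustration under stated assumptions: build_wraparound_index() is a hypothetical helper that is not part of the patch, and it records page indexes rather than struct page pointers. It reproduces the same loop arithmetic: the header page appears once, then the data pages appear twice in a row, so a read that runs past the end of the ring continues into the second copy at contiguous virtual addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * User-space illustration only, not kernel code.
 *
 * build_wraparound_index() is a hypothetical helper. It returns an array
 * of page indexes laid out the way the patch fills pages_wraparound[]:
 *   slot 0            -> page 0 (the struct hv_ring_buffer header page)
 *   slots 1 .. n-1    -> data pages 1 .. n-1
 *   slots n .. 2n-2   -> data pages 1 .. n-1 again (the wraparound copy)
 * The caller owns the returned array and must free() it.
 */
size_t *build_wraparound_index(size_t page_cnt, size_t *out_len)
{
	size_t len, i;
	size_t *idx;

	/* A ring needs the header page plus at least one data page. */
	assert(page_cnt >= 2);

	len = page_cnt * 2 - 1;
	idx = calloc(len, sizeof(*idx));
	if (!idx)
		return NULL;

	idx[0] = 0;
	for (i = 0; i < 2 * (page_cnt - 1); i++)
		idx[i + 1] = i % (page_cnt - 1) + 1;

	*out_len = len;
	return idx;
}
```

With page_cnt = 4 the array has 7 entries: the header page followed by data pages 1, 2, 3 and then 1, 2, 3 again.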


2022-11-11 17:21:10

by Wei Liu

[permalink] [raw]
Subject: Re: [PATCH v2 00/12] Drivers: hv: Add PCI pass-thru support to Hyper-V Confidential VMs

On Thu, Nov 10, 2022 at 10:21:29PM -0800, Michael Kelley wrote:
[...]
> Patch Organization
> ==================
> Patch 1 fixes a bug in __ioremap_caller() that affects the
> existing Hyper-V code after the change to treat the vTOM bit as
> a protection flag. Fixing the bug allows the old code to continue
> to run until later patches in the series remove or update it.
> This sequencing avoids the need to enable the new approach and
> remove the old code in a single large patch.
>
> Patch 2 handles the I/O APIC quirk by defining a new CC_ATTR enum
> member that is set only when running on Hyper-V.

I'm waiting for x86 maintainers acks on these two patches before merging
this series.

Thanks,
Wei.

2022-11-12 00:56:54

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 02/12] x86/ioapic: Gate decrypted mapping on cc_platform_has() attribute

On 11/10/22 22:21, Michael Kelley wrote:
> * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
> * bits, just like normal ioremap():
> */
> - flags = pgprot_decrypted(flags);
> + if (!cc_platform_has(CC_ATTR_HAS_PARAVISOR))
> + flags = pgprot_decrypted(flags);

This begs the question whether *all* paravisors will want to avoid a
decrypted ioapic mapping. Is this _fundamental_ to paravisors, or is it
an implementation detail of this _individual_ paravisor?

2022-11-12 05:21:32

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH v2 02/12] x86/ioapic: Gate decrypted mapping on cc_platform_has() attribute

From: Dave Hansen <[email protected]> Sent: Friday, November 11, 2022 4:22 PM
>
> On 11/10/22 22:21, Michael Kelley wrote:
> > * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
> > * bits, just like normal ioremap():
> > */
> > - flags = pgprot_decrypted(flags);
> > + if (!cc_platform_has(CC_ATTR_HAS_PARAVISOR))
> > + flags = pgprot_decrypted(flags);
>
> This begs the question whether *all* paravisors will want to avoid a
> decrypted ioapic mapping. Is this _fundamental_ to paravisors, or is it
> an implementation detail of this _individual_ paravisor?

Hard to say. The paravisor that Hyper-V provides for use with the vTOM
option in a SEV SNP VM is the only paravisor I've seen. At least as defined
by Hyper-V and AMD SNP Virtual Machine Privilege Levels (VMPLs), the
paravisor resides within the VM trust boundary. Anything that a paravisor
emulates would be in the "private" (i.e., encrypted) memory so it can be
accessed by both the guest OS and the paravisor. But nothing fundamental
says that IOAPIC emulation *must* be done in the paravisor.

I originally thought about naming this attribute HAS_EMULATED_IOAPIC, but
that felt a bit narrow as other emulated hardware might need similar treatment
in the future, at least with the Hyper-V and AMD SEV SNP vTOM paravisor.

Net, we currently have N=1 for paravisors, and we won't know what the more
generalized case looks like until N >= 2. If/when that happens, additional logic
might be needed here, and the name of this attribute might need adjustment
to support broader usage. But if there's consensus on a different name now,
or on the narrower HAS_EMULATED_IOAPIC name, it doesn’t really matter
to me.

Michael

2022-11-14 17:24:49

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH v2 02/12] x86/ioapic: Gate decrypted mapping on cc_platform_has() attribute

From: Dave Hansen <[email protected]> Sent: Monday, November 14, 2022 8:23 AM
>
> On 11/11/22 20:48, Michael Kelley (LINUX) wrote:
> > From: Dave Hansen <[email protected]> Sent: Friday, November 11, 2022 4:22 PM
> >> On 11/10/22 22:21, Michael Kelley wrote:
> >>> * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
> >>> * bits, just like normal ioremap():
> >>> */
> >>> - flags = pgprot_decrypted(flags);
> >>> + if (!cc_platform_has(CC_ATTR_HAS_PARAVISOR))
> >>> + flags = pgprot_decrypted(flags);
> >> This begs the question whether *all* paravisors will want to avoid a
> >> decrypted ioapic mapping. Is this _fundamental_ to paravisors, or is it
> >> an implementation detail of this _individual_ paravisor?
> > Hard to say. The paravisor that Hyper-V provides for use with the vTOM
> > option in a SEV SNP VM is the only paravisor I've seen. At least as defined
> > by Hyper-V and AMD SNP Virtual Machine Privilege Levels (VMPLs), the
> > paravisor resides within the VM trust boundary. Anything that a paravisor
> > emulates would be in the "private" (i.e., encrypted) memory so it can be
> > accessed by both the guest OS and the paravisor. But nothing fundamental
> > says that IOAPIC emulation *must* be done in the paravisor.
>
> Please just make this check more specific. Either make this a specific
> Hyper-V+SVM check, or rename it HAS_EMULATED_IOAPIC, like you were
> thinking. If paravisors catch on and we end up with ten more of these
> things across five different paravisors and see a pattern, *then* a
> paravisor-specific one makes sense.

I'm good with that. I'll use HAS_EMULATED_IOAPIC in v3.

Michael
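
As a rough illustration of the narrower check agreed on above, the gate can be modeled in user space. Everything here is a stand-in and an assumption: the sketch_ names, the position of the "private" bit, and the attribute spelling are hypothetical and are not the kernel's definitions, and the actual v3 patch may look different. The only point is the control flow: the IOAPIC fixmap flags are left untouched when the platform reports an emulated IOAPIC, and are marked decrypted otherwise.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the renamed confidential-computing attribute. */
enum sketch_cc_attr {
	SKETCH_CC_ATTR_HAS_EMULATED_IOAPIC,
};

/* Hypothetical "private" (encrypted) bit in a page-protection value. */
#define SKETCH_PRIVATE_BIT (1ULL << 51)

/* Test knob standing in for platform detection; false by default. */
static bool sketch_has_emulated_ioapic;

void sketch_set_emulated_ioapic(bool present)
{
	sketch_has_emulated_ioapic = present;
}

/* Simplified model of cc_platform_has(): one attribute, one answer. */
bool sketch_cc_platform_has(enum sketch_cc_attr attr)
{
	return attr == SKETCH_CC_ATTR_HAS_EMULATED_IOAPIC &&
	       sketch_has_emulated_ioapic;
}

/* Simplified model of pgprot_decrypted(): clear the private bit. */
uint64_t sketch_pgprot_decrypted(uint64_t flags)
{
	return flags & ~SKETCH_PRIVATE_BIT;
}

/*
 * Model of the gated mapping: an IOAPIC emulated inside the VM trust
 * boundary keeps a private mapping, any other IOAPIC is mapped decrypted.
 */
uint64_t sketch_ioapic_fixmap_flags(uint64_t flags)
{
	if (!sketch_cc_platform_has(SKETCH_CC_ATTR_HAS_EMULATED_IOAPIC))
		flags = sketch_pgprot_decrypted(flags);
	return flags;
}
```

In this model, calling sketch_set_emulated_ioapic(true) before sketch_ioapic_fixmap_flags() leaves the private bit in place, which mirrors the behavior the patch wants for the Hyper-V paravisor case.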

2022-11-14 17:36:51

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 02/12] x86/ioapic: Gate decrypted mapping on cc_platform_has() attribute

On 11/11/22 20:48, Michael Kelley (LINUX) wrote:
> From: Dave Hansen <[email protected]> Sent: Friday, November 11, 2022 4:22 PM
>> On 11/10/22 22:21, Michael Kelley wrote:
>>> * Ensure fixmaps for IOAPIC MMIO respect memory encryption pgprot
>>> * bits, just like normal ioremap():
>>> */
>>> - flags = pgprot_decrypted(flags);
>>> + if (!cc_platform_has(CC_ATTR_HAS_PARAVISOR))
>>> + flags = pgprot_decrypted(flags);
>> This begs the question whether *all* paravisors will want to avoid a
>> decrypted ioapic mapping. Is this _fundamental_ to paravisors, or is it
>> an implementation detail of this _individual_ paravisor?
> Hard to say. The paravisor that Hyper-V provides for use with the vTOM
> option in a SEV SNP VM is the only paravisor I've seen. At least as defined
> by Hyper-V and AMD SNP Virtual Machine Privilege Levels (VMPLs), the
> paravisor resides within the VM trust boundary. Anything that a paravisor
> emulates would be in the "private" (i.e., encrypted) memory so it can be
> accessed by both the guest OS and the paravisor. But nothing fundamental
> says that IOAPIC emulation *must* be done in the paravisor.

Please just make this check more specific. Either make this a specific
Hyper-V+SVM check, or rename it HAS_EMULATED_IOAPIC, like you were
thinking. If paravisors catch on and we end up with ten more of these
things across five different paravisors and see a pattern, *then* a
paravisor-specific one makes sense.