From: Tianyu Lan <[email protected]>
Hyper-V provides two kinds of Isolation VMs: VBS (Virtualization-Based
Security) and AMD SEV-SNP unenlightened Isolation VMs. This patchset
adds support for these Isolation VMs in Linux.
The memory of these VMs is encrypted, and the host can't access guest
memory directly. Hyper-V provides a new host-visibility hvcall, and
the guest needs to call it to mark memory visible to the host
before sharing memory with the host. For security, network/storage
stack memory should not be shared with the host, so bounce
buffers are required.
The VMbus channel ring buffer already plays the bounce buffer role because
all data from/to the host is copied between the ring buffer
and IO stack memory. So mark the VMbus channel ring buffer visible.
There are two exceptions - packets sent by vmbus_sendpacket_
pagebuffer() and vmbus_sendpacket_mpb_desc(). These packets
contain IO stack memory addresses and the host will access that memory.
So add bounce buffer allocation support in VMbus for these packets.
For SNP Isolation VMs, the guest needs to access the shared memory via
an extra address space which is specified by the Hyper-V CPUID leaf
HYPERV_CPUID_ISOLATION_CONFIG. The physical address used to access
the shared memory should be the bounce buffer memory GPA plus the
shared_gpa_boundary reported by CPUID.
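The GPA-plus-boundary rule above can be sketched in user-space C (a hedged illustration; host_visible_pa() is a hypothetical helper, not a kernel function):

```c
#include <stdint.h>

/*
 * Illustrative only: in an SNP Isolation VM, the host-visible alias of a
 * shared page is its guest physical address plus the shared_gpa_boundary
 * reported by HYPERV_CPUID_ISOLATION_CONFIG.
 */
static inline uint64_t host_visible_pa(uint64_t gpa, uint64_t shared_gpa_boundary)
{
	return gpa + shared_gpa_boundary;
}
```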
This patchset is based on 5.15-rc1.
Change since v4:
- Hide hv_mark_gpa_visibility() and set memory visibility via
set_memory_encrypted/decrypted()
- Change gpadl handle in netvsc and uio driver from u32 to
struct vmbus_gpadl.
- Change vmbus_establish_gpadl()'s gpadl_handle parameter
to vmbus_gpadl data structure.
- Remove the hv_get_simp(), hv_get_siefp() and hv_get_synint_*()
 helper functions. Move the logic into hv_get/set_register().
- Use scsi_dma_map/unmap() instead of dma_map/unmap_sg() in storvsc driver.
- Allocate rx/tx ring buffer via alloc_pages() in Isolation VM
Change since V3:
- Initialize the GHCB page in the cpu init callback.
- Change vmbus_teardown_gpadl()'s parameters in order to
 set the memory back to non-visible to the host.
- Merge hv_ringbuffer_post_init() into hv_ringbuffer_init().
- Keep the Hyper-V bounce buffer size the same as for AMD SEV VMs
- Use dma_map_sg() instead of dma_map_page() in the storvsc driver.
Change since V2:
- Drop the x86_set_memory_enc static call and use a platform check
 in __set_memory_enc_dec() to run the platform callback for
 setting memory encrypted or decrypted.
Change since V1:
- Introduce the x86_set_memory_enc static call so that platforms can
 override __set_memory_enc_dec() with their own implementation
- Introduce sev_es_ghcb_hv_call_simple() and share code
between SEV and Hyper-V code.
- Do not remap monitor pages in non-SNP Isolation VMs
- Make swiotlb_init_io_tlb_mem() return error code and return
error when dma_map_decrypted() fails.
Change since RFC V4:
- Introduce dma map decrypted function to remap bounce buffer
and provide dma map decrypted ops for platform to hook callback.
- Split swiotlb and dma map decrypted change into two patches
- Replace vstart with vaddr in swiotlb changes.
Change since RFC v3:
- Add interface set_memory_decrypted_map() to decrypt memory and
map bounce buffer in extra address space
- Remove swiotlb remap function and store the remap address
returned by set_memory_decrypted_map() in swiotlb mem data structure.
- Introduce hv_set_mem_enc() to make code more readable in the __set_memory_enc_dec().
Change since RFC v2:
- Remove the "not load UIO driver in Isolation VM" patch
- Use vmap_pfn() instead of ioremap_page_range() in
 order to avoid exporting the ioremap_page_range() symbol
- Call the Hyper-V set-memory-host-visibility hvcall in set_memory_encrypted/decrypted()
- Enable swiotlb force mode instead of adding Hyper-V dma map/unmap hook
- Fix code style
Tianyu Lan (12):
x86/hyperv: Initialize GHCB page in Isolation VM
x86/hyperv: Initialize shared memory boundary in the Isolation VM.
x86/hyperv: Add new hvcall guest address host visibility support
Drivers: hv: vmbus: Mark vmbus ring buffer visible to host in
Isolation VM
x86/hyperv: Add Write/Read MSR registers via ghcb page
x86/hyperv: Add ghcb hvcall support for SNP VM
Drivers: hv: vmbus: Add SNP support for VMbus channel initiate
message
Drivers: hv: vmbus: Initialize VMbus ring buffer for Isolation VM
x86/Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM
hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM
scsi: storvsc: Add Isolation VM support for storvsc driver
net: netvsc: Add Isolation VM support for netvsc driver
arch/x86/hyperv/Makefile | 2 +-
arch/x86/hyperv/hv_init.c | 78 ++++++--
arch/x86/hyperv/ivm.c | 282 ++++++++++++++++++++++++++
arch/x86/include/asm/hyperv-tlfs.h | 17 ++
arch/x86/include/asm/mshyperv.h | 62 ++++--
arch/x86/include/asm/sev.h | 6 +
arch/x86/kernel/cpu/mshyperv.c | 5 +
arch/x86/kernel/sev-shared.c | 63 +++---
arch/x86/mm/mem_encrypt.c | 3 +-
arch/x86/mm/pat/set_memory.c | 19 +-
arch/x86/xen/pci-swiotlb-xen.c | 3 +-
drivers/hv/Kconfig | 1 +
drivers/hv/channel.c | 73 ++++---
drivers/hv/connection.c | 96 ++++++++-
drivers/hv/hv.c | 85 ++++++--
drivers/hv/hv_common.c | 12 ++
drivers/hv/hyperv_vmbus.h | 2 +
drivers/hv/ring_buffer.c | 55 ++++--
drivers/hv/vmbus_drv.c | 4 +
drivers/iommu/hyperv-iommu.c | 60 ++++++
drivers/net/hyperv/hyperv_net.h | 12 +-
drivers/net/hyperv/netvsc.c | 304 +++++++++++++++++++++++++++--
drivers/net/hyperv/netvsc_drv.c | 1 +
drivers/net/hyperv/rndis_filter.c | 2 +
drivers/scsi/storvsc_drv.c | 24 ++-
drivers/uio/uio_hv_generic.c | 20 +-
include/asm-generic/hyperv-tlfs.h | 1 +
include/asm-generic/mshyperv.h | 17 +-
include/linux/hyperv.h | 19 +-
include/linux/swiotlb.h | 6 +
kernel/dma/swiotlb.c | 41 +++-
31 files changed, 1204 insertions(+), 171 deletions(-)
create mode 100644 arch/x86/hyperv/ivm.c
--
2.25.1
From: Tianyu Lan <[email protected]>
Add new hvcall guest address host visibility support to mark
memory visible to the host. Call it inside set_memory_decrypted
/encrypted(). Add a HYPERVISOR feature check in
hv_is_isolation_supported() to optimize for non-virtualization
environments.
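As a hedged, user-space sketch of the batching this patch performs (PFNs are marked visible in chunks capped at the hypercall's rep count; flush() stands in for the real hypercall and all names here are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096
/* Mirrors HV_MAX_MODIFY_GPA_REP_COUNT: input page minus two header u64s. */
#define SKETCH_MAX_REP ((SKETCH_PAGE_SIZE / sizeof(uint64_t)) - 2)

static size_t total_flushed;
static size_t flush_calls;

/* Stand-in for the hypercall: just count what would be submitted. */
static int count_flush(const uint64_t *pfns, size_t n)
{
	(void)pfns;
	total_flushed += n;
	flush_calls++;
	return 0;
}

/*
 * Illustrative batching loop: accumulate PFNs and flush a batch whenever
 * it reaches the rep limit or the range ends, as the patch does.
 */
static int mark_visible(uint64_t start_pfn, int pagecount,
			int (*flush)(const uint64_t *pfns, size_t n))
{
	uint64_t batch[SKETCH_MAX_REP];
	size_t n = 0;
	int i, ret;

	for (i = 0; i < pagecount; i++) {
		batch[n++] = start_pfn + i;
		if (n == SKETCH_MAX_REP || i == pagecount - 1) {
			ret = flush(batch, n);
			if (ret)
				return ret;
			n = 0;
		}
	}
	return 0;
}
```

With a 4K input page, SKETCH_MAX_REP is 510, so 1000 pages go out as two batches.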
Acked-by: Dave Hansen <[email protected]>
Signed-off-by: Tianyu Lan <[email protected]>
---
Change since v4:
* Fix typo in the comment
* Make hv_mark_gpa_visibility() a static function
* Merge __hv_set_mem_host_visibility() and hv_set_mem_host_visibility()
Change since v3:
* Fix error code handle in the __hv_set_mem_host_visibility().
* Move HvCallModifySparseGpaPageHostVisibility near to enum
hv_mem_host_visibility.
Change since v2:
* Rework __set_memory_enc_dec() and call Hyper-V and AMD function
according to platform check.
Change since v1:
* Use the new static call x86_set_memory_enc to avoid adding a
 Hyper-V specific check in the set_memory code.
---
arch/x86/hyperv/Makefile | 2 +-
arch/x86/hyperv/hv_init.c | 6 ++
arch/x86/hyperv/ivm.c | 105 +++++++++++++++++++++++++++++
arch/x86/include/asm/hyperv-tlfs.h | 17 +++++
arch/x86/include/asm/mshyperv.h | 2 +-
arch/x86/mm/pat/set_memory.c | 19 ++++--
include/asm-generic/hyperv-tlfs.h | 1 +
include/asm-generic/mshyperv.h | 1 +
8 files changed, 146 insertions(+), 7 deletions(-)
create mode 100644 arch/x86/hyperv/ivm.c
diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
index 48e2c51464e8..5d2de10809ae 100644
--- a/arch/x86/hyperv/Makefile
+++ b/arch/x86/hyperv/Makefile
@@ -1,5 +1,5 @@
# SPDX-License-Identifier: GPL-2.0-only
-obj-y := hv_init.o mmu.o nested.o irqdomain.o
+obj-y := hv_init.o mmu.o nested.o irqdomain.o ivm.o
obj-$(CONFIG_X86_64) += hv_apic.o hv_proc.o
ifdef CONFIG_X86_64
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index a7e922755ad1..d57df6825527 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -603,6 +603,12 @@ EXPORT_SYMBOL_GPL(hv_get_isolation_type);
bool hv_is_isolation_supported(void)
{
+ if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
+ return false;
+
+ if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
+ return false;
+
return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
}
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
new file mode 100644
index 000000000000..79e7fb83472a
--- /dev/null
+++ b/arch/x86/hyperv/ivm.c
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Hyper-V Isolation VM interface with paravisor and hypervisor
+ *
+ * Author:
+ * Tianyu Lan <[email protected]>
+ */
+
+#include <linux/hyperv.h>
+#include <linux/types.h>
+#include <linux/bitfield.h>
+#include <linux/slab.h>
+#include <asm/io.h>
+#include <asm/mshyperv.h>
+
+/*
+ * hv_mark_gpa_visibility - Set pages visible to host via hvcall.
+ *
+ * In Isolation VM, all guest memory is encrypted from host and guest
+ * needs to set memory visible to host via hvcall before sharing memory
+ * with host.
+ */
+static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
+ enum hv_mem_host_visibility visibility)
+{
+ struct hv_gpa_range_for_visibility **input_pcpu, *input;
+ u16 pages_processed;
+ u64 hv_status;
+ unsigned long flags;
+
+ /* no-op if partition isolation is not enabled */
+ if (!hv_is_isolation_supported())
+ return 0;
+
+ if (count > HV_MAX_MODIFY_GPA_REP_COUNT) {
+ pr_err("Hyper-V: GPA count:%d exceeds supported:%lu\n", count,
+ HV_MAX_MODIFY_GPA_REP_COUNT);
+ return -EINVAL;
+ }
+
+ local_irq_save(flags);
+ input_pcpu = (struct hv_gpa_range_for_visibility **)
+ this_cpu_ptr(hyperv_pcpu_input_arg);
+ input = *input_pcpu;
+ if (unlikely(!input)) {
+ local_irq_restore(flags);
+ return -EINVAL;
+ }
+
+ input->partition_id = HV_PARTITION_ID_SELF;
+ input->host_visibility = visibility;
+ input->reserved0 = 0;
+ input->reserved1 = 0;
+ memcpy((void *)input->gpa_page_list, pfn, count * sizeof(*pfn));
+ hv_status = hv_do_rep_hypercall(
+ HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY, count,
+ 0, input, &pages_processed);
+ local_irq_restore(flags);
+
+ if (hv_result_success(hv_status))
+ return 0;
+ else
+ return -EFAULT;
+}
+
+/*
+ * hv_set_mem_host_visibility - Set specified memory visible to host.
+ *
+ * In Isolation VM, all guest memory is encrypted from host and guest
+ * needs to set memory visible to host via hvcall before sharing memory
+ * with host. This function works as wrap of hv_mark_gpa_visibility()
+ * with memory base and size.
+ */
+int hv_set_mem_host_visibility(unsigned long kbuffer, int pagecount, bool visible)
+{
+ enum hv_mem_host_visibility visibility = visible ?
+ VMBUS_PAGE_VISIBLE_READ_WRITE : VMBUS_PAGE_NOT_VISIBLE;
+ u64 *pfn_array;
+ int ret = 0;
+ int i, pfn;
+
+ if (!hv_is_isolation_supported() || !hv_hypercall_pg)
+ return 0;
+
+ pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
+ if (!pfn_array)
+ return -ENOMEM;
+
+ for (i = 0, pfn = 0; i < pagecount; i++) {
+ pfn_array[pfn] = virt_to_hvpfn((void *)kbuffer + i * HV_HYP_PAGE_SIZE);
+ pfn++;
+
+ if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
+ ret = hv_mark_gpa_visibility(pfn, pfn_array,
+ visibility);
+ if (ret)
+ goto err_free_pfn_array;
+ pfn = 0;
+ }
+ }
+
+ err_free_pfn_array:
+ kfree(pfn_array);
+ return ret;
+}
diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 2322d6bd5883..381e88122a5f 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -276,6 +276,23 @@ enum hv_isolation_type {
#define HV_X64_MSR_TIME_REF_COUNT HV_REGISTER_TIME_REF_COUNT
#define HV_X64_MSR_REFERENCE_TSC HV_REGISTER_REFERENCE_TSC
+/* Hyper-V memory host visibility */
+enum hv_mem_host_visibility {
+ VMBUS_PAGE_NOT_VISIBLE = 0,
+ VMBUS_PAGE_VISIBLE_READ_ONLY = 1,
+ VMBUS_PAGE_VISIBLE_READ_WRITE = 3
+};
+
+/* HvCallModifySparseGpaPageHostVisibility hypercall */
+#define HV_MAX_MODIFY_GPA_REP_COUNT ((PAGE_SIZE / sizeof(u64)) - 2)
+struct hv_gpa_range_for_visibility {
+ u64 partition_id;
+ u32 host_visibility:2;
+ u32 reserved0:30;
+ u32 reserved1;
+ u64 gpa_page_list[HV_MAX_MODIFY_GPA_REP_COUNT];
+} __packed;
+
/*
* Declare the MSR used to setup pages used to communicate with the hypervisor.
*/
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 37739a277ac6..ede440f9a1e2 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -192,7 +192,7 @@ struct irq_domain *hv_create_pci_msi_domain(void);
int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
struct hv_interrupt_entry *entry);
int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
-
+int hv_set_mem_host_visibility(unsigned long addr, int numpages, bool visible);
#else /* CONFIG_HYPERV */
static inline void hyperv_init(void) {}
static inline void hyperv_setup_mmu_ops(void) {}
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index ad8a5c586a35..1e4a0882820a 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -29,6 +29,8 @@
#include <asm/proto.h>
#include <asm/memtype.h>
#include <asm/set_memory.h>
+#include <asm/hyperv-tlfs.h>
+#include <asm/mshyperv.h>
#include "../mm_internal.h"
@@ -1980,15 +1982,11 @@ int set_memory_global(unsigned long addr, int numpages)
__pgprot(_PAGE_GLOBAL), 0);
}
-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
+static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
{
struct cpa_data cpa;
int ret;
- /* Nothing to do if memory encryption is not active */
- if (!mem_encrypt_active())
- return 0;
-
/* Should not be working on unaligned addresses */
if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
addr &= PAGE_MASK;
@@ -2023,6 +2021,17 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
return ret;
}
+static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
+{
+ if (hv_is_isolation_supported())
+ return hv_set_mem_host_visibility(addr, numpages, !enc);
+
+ if (mem_encrypt_active())
+ return __set_memory_enc_pgtable(addr, numpages, enc);
+
+ return 0;
+}
+
int set_memory_encrypted(unsigned long addr, int numpages)
{
return __set_memory_enc_dec(addr, numpages, true);
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 56348a541c50..8ed6733d5146 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -158,6 +158,7 @@ struct ms_hyperv_tsc_page {
#define HVCALL_RETARGET_INTERRUPT 0x007e
#define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af
#define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0
+#define HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY 0x00db
/* Extended hypercalls */
#define HV_EXT_CALL_QUERY_CAPABILITIES 0x8001
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index e04efb87fee5..cb529c85c0ad 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -254,6 +254,7 @@ bool hv_query_ext_cap(u64 cap_query);
static inline bool hv_is_hyperv_initialized(void) { return false; }
static inline bool hv_is_hibernation_supported(void) { return false; }
static inline void hyperv_cleanup(void) {}
+static inline bool hv_is_isolation_supported(void) { return false; }
#endif /* CONFIG_HYPERV */
#endif
--
2.25.1
From: Tianyu Lan <[email protected]>
The monitor pages in the CHANNELMSG_INITIATE_CONTACT msg are shared
with the host in Isolation VMs, so it's necessary to use an hvcall to
make them visible to the host. In Isolation VMs with AMD SEV-SNP, the
access address should be in the extra address space above the shared
gpa boundary, so remap these pages into the extra address space (pa +
shared_gpa_boundary).
Introduce monitor_pages_original[] in struct vmbus_connection
to store the monitor page virtual addresses returned by hv_alloc_hyperv_
zeroed_page(), and free the monitor pages via monitor_pages_original in
vmbus_disconnect(). monitor_pages[] is used to access the
monitor pages and is initialized to be equal to monitor_pages_
original. monitor_pages[] is overridden in Isolation VMs
with the VAs of the extra address space. Introduce monitor_pages_pa[]
to store the monitor pages' physical addresses and use them to
populate the PAs in the initiate msg.
Signed-off-by: Tianyu Lan <[email protected]>
---
Change since v4:
* Introduce monitor_pages_pa[] to store monitor pages' physical
address and use it to populate pa in the initiate msg.
* Move the code mapping monitor pages into the extra address space
 into vmbus_connect().
Change since v3:
* Rename monitor_pages_va to monitor_pages_original
* Free monitor pages via monitor_pages_original;
 monitor_pages is used to access the monitor pages.
Change since v1:
* Do not remap monitor pages in non-SNP Isolation VMs.
---
drivers/hv/connection.c | 90 ++++++++++++++++++++++++++++++++++++---
drivers/hv/hyperv_vmbus.h | 2 +
2 files changed, 86 insertions(+), 6 deletions(-)
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 8820ae68f20f..edd8f7dd169f 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -19,6 +19,8 @@
#include <linux/vmalloc.h>
#include <linux/hyperv.h>
#include <linux/export.h>
+#include <linux/io.h>
+#include <linux/set_memory.h>
#include <asm/mshyperv.h>
#include "hyperv_vmbus.h"
@@ -102,8 +104,9 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version)
vmbus_connection.msg_conn_id = VMBUS_MESSAGE_CONNECTION_ID;
}
- msg->monitor_page1 = virt_to_phys(vmbus_connection.monitor_pages[0]);
- msg->monitor_page2 = virt_to_phys(vmbus_connection.monitor_pages[1]);
+ msg->monitor_page1 = vmbus_connection.monitor_pages_pa[0];
+ msg->monitor_page2 = vmbus_connection.monitor_pages_pa[1];
+
msg->target_vcpu = hv_cpu_number_to_vp_number(VMBUS_CONNECT_CPU);
/*
@@ -216,6 +219,65 @@ int vmbus_connect(void)
goto cleanup;
}
+ vmbus_connection.monitor_pages_original[0]
+ = vmbus_connection.monitor_pages[0];
+ vmbus_connection.monitor_pages_original[1]
+ = vmbus_connection.monitor_pages[1];
+ vmbus_connection.monitor_pages_pa[0]
+ = virt_to_phys(vmbus_connection.monitor_pages[0]);
+ vmbus_connection.monitor_pages_pa[1]
+ = virt_to_phys(vmbus_connection.monitor_pages[1]);
+
+ if (hv_is_isolation_supported()) {
+ vmbus_connection.monitor_pages_pa[0] +=
+ ms_hyperv.shared_gpa_boundary;
+ vmbus_connection.monitor_pages_pa[1] +=
+ ms_hyperv.shared_gpa_boundary;
+
+ ret = set_memory_decrypted((unsigned long)
+ vmbus_connection.monitor_pages[0],
+ 1);
+ ret |= set_memory_decrypted((unsigned long)
+ vmbus_connection.monitor_pages[1],
+ 1);
+ if (ret)
+ goto cleanup;
+
+ /*
+ * Isolation VM with AMD SNP needs to access monitor page via
+ * address space above shared gpa boundary.
+ */
+ if (hv_isolation_type_snp()) {
+ vmbus_connection.monitor_pages[0]
+ = memremap(vmbus_connection.monitor_pages_pa[0],
+ HV_HYP_PAGE_SIZE,
+ MEMREMAP_WB);
+ if (!vmbus_connection.monitor_pages[0]) {
+ ret = -ENOMEM;
+ goto cleanup;
+ }
+
+ vmbus_connection.monitor_pages[1]
+ = memremap(vmbus_connection.monitor_pages_pa[1],
+ HV_HYP_PAGE_SIZE,
+ MEMREMAP_WB);
+ if (!vmbus_connection.monitor_pages[1]) {
+ ret = -ENOMEM;
+ goto cleanup;
+ }
+ }
+
+ /*
+ * Set memory host visibility hvcall smears memory
+ * and so zero monitor pages here.
+ */
+ memset(vmbus_connection.monitor_pages[0], 0x00,
+ HV_HYP_PAGE_SIZE);
+ memset(vmbus_connection.monitor_pages[1], 0x00,
+ HV_HYP_PAGE_SIZE);
+
+ }
+
msginfo = kzalloc(sizeof(*msginfo) +
sizeof(struct vmbus_channel_initiate_contact),
GFP_KERNEL);
@@ -303,10 +365,26 @@ void vmbus_disconnect(void)
vmbus_connection.int_page = NULL;
}
- hv_free_hyperv_page((unsigned long)vmbus_connection.monitor_pages[0]);
- hv_free_hyperv_page((unsigned long)vmbus_connection.monitor_pages[1]);
- vmbus_connection.monitor_pages[0] = NULL;
- vmbus_connection.monitor_pages[1] = NULL;
+ if (hv_is_isolation_supported()) {
+ memunmap(vmbus_connection.monitor_pages[0]);
+ memunmap(vmbus_connection.monitor_pages[1]);
+
+ set_memory_encrypted((unsigned long)
+ vmbus_connection.monitor_pages_original[0],
+ 1);
+ set_memory_encrypted((unsigned long)
+ vmbus_connection.monitor_pages_original[1],
+ 1);
+ }
+
+ hv_free_hyperv_page((unsigned long)
+ vmbus_connection.monitor_pages_original[0]);
+ hv_free_hyperv_page((unsigned long)
+ vmbus_connection.monitor_pages_original[1]);
+ vmbus_connection.monitor_pages_original[0] =
+ vmbus_connection.monitor_pages[0] = NULL;
+ vmbus_connection.monitor_pages_original[1] =
+ vmbus_connection.monitor_pages[1] = NULL;
}
/*
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 42f3d9d123a1..560cba916d1d 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -240,6 +240,8 @@ struct vmbus_connection {
* is child->parent notification
*/
struct hv_monitor_page *monitor_pages[2];
+ void *monitor_pages_original[2];
+ unsigned long monitor_pages_pa[2];
struct list_head chn_msg_list;
spinlock_t channelmsg_lock;
--
2.25.1
From: Tianyu Lan <[email protected]>
Hyper-V exposes the shared memory boundary via the CPUID leaf
HYPERV_CPUID_ISOLATION_CONFIG; store it in the
shared_gpa_boundary field of struct ms_hyperv. This prepares
for sharing memory with the host for SNP guests.
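The boundary derivation can be illustrated with a small user-space sketch of the EBX bit layout this patch introduces (shared_gpa_boundary = BIT_ULL(shared_gpa_boundary_bits); the decode helper below is hypothetical, not kernel code):

```c
#include <stdint.h>

/*
 * Illustrative decode of HYPERV_CPUID_ISOLATION_CONFIG EBX, per the
 * bitfields added to struct ms_hyperv_info: bits 0-3 cvm_type, bit 5
 * shared_gpa_boundary_active, bits 6-11 shared_gpa_boundary_bits.
 */
static uint64_t shared_gpa_boundary_from_ebx(uint32_t ebx)
{
	uint32_t boundary_bits = (ebx >> 6) & 0x3f;

	/* Equivalent of BIT_ULL(shared_gpa_boundary_bits). */
	return 1ULL << boundary_bits;
}
```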
Signed-off-by: Tianyu Lan <[email protected]>
---
Change since v4:
* Rename reserved field.
Change since v3:
* Use BIT_ULL to get shared_gpa_boundary
* Rename field Reserved* to reserved
---
arch/x86/kernel/cpu/mshyperv.c | 2 ++
include/asm-generic/mshyperv.h | 12 +++++++++++-
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index b09ade389040..4794b716ec79 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -313,6 +313,8 @@ static void __init ms_hyperv_init_platform(void)
if (ms_hyperv.priv_high & HV_ISOLATION) {
ms_hyperv.isolation_config_a = cpuid_eax(HYPERV_CPUID_ISOLATION_CONFIG);
ms_hyperv.isolation_config_b = cpuid_ebx(HYPERV_CPUID_ISOLATION_CONFIG);
+ ms_hyperv.shared_gpa_boundary =
+ BIT_ULL(ms_hyperv.shared_gpa_boundary_bits);
pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n",
ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b);
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 0924bbd8458e..e04efb87fee5 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -35,7 +35,17 @@ struct ms_hyperv_info {
u32 max_vp_index;
u32 max_lp_index;
u32 isolation_config_a;
- u32 isolation_config_b;
+ union {
+ u32 isolation_config_b;
+ struct {
+ u32 cvm_type : 4;
+ u32 reserved1 : 1;
+ u32 shared_gpa_boundary_active : 1;
+ u32 shared_gpa_boundary_bits : 6;
+ u32 reserved2 : 20;
+ };
+ };
+ u64 shared_gpa_boundary;
};
extern struct ms_hyperv_info ms_hyperv;
--
2.25.1
From: Tianyu Lan <[email protected]>
Hyper-V provides a ghcb hvcall to handle the VMBus
HVCALL_SIGNAL_EVENT and HVCALL_POST_MESSAGE
messages in SNP Isolation VMs. Add such support.
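The hypercall-input word the GHCB path fills in can be sketched as below (illustrative only; the struct mirrors the hypercallinput bitfield in union hv_ghcb, and the test assumes a little-endian build as on x86):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative copy of the hypercall input layout from union hv_ghcb. */
struct hv_input_sketch {
	uint32_t callcode : 16;
	uint32_t isfast : 1;
	uint32_t reserved1 : 14;
	uint32_t isnested : 1;
	uint32_t countofelements : 12;
	uint32_t reserved2 : 4;
	uint32_t repstartindex : 12;
	uint32_t reserved3 : 4;
};

/*
 * Build the 64-bit input word with only the call code set, as
 * hv_ghcb_hypercall() does for simple (non-rep) calls. Hypothetical
 * helper; endianness-dependent by construction.
 */
static uint64_t encode_control(uint16_t callcode)
{
	struct hv_input_sketch in;
	uint64_t word = 0;

	memset(&in, 0, sizeof(in));
	in.callcode = callcode;
	memcpy(&word, &in, sizeof(word));
	return word;
}
```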
Signed-off-by: Tianyu Lan <[email protected]>
---
Change since v3:
* Add hv_ghcb_hypercall() stub function to avoid
compile error for ARM.
---
arch/x86/hyperv/ivm.c | 74 ++++++++++++++++++++++++++++++++++
drivers/hv/connection.c | 6 ++-
drivers/hv/hv.c | 8 +++-
drivers/hv/hv_common.c | 6 +++
include/asm-generic/mshyperv.h | 1 +
5 files changed, 93 insertions(+), 2 deletions(-)
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 5439723446c9..dfdac3a40036 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -18,10 +18,84 @@
#include <asm/mshyperv.h>
#include <asm/hypervisor.h>
+#define GHCB_USAGE_HYPERV_CALL 1
+
union hv_ghcb {
struct ghcb ghcb;
+ struct {
+ u64 hypercalldata[509];
+ u64 outputgpa;
+ union {
+ union {
+ struct {
+ u32 callcode : 16;
+ u32 isfast : 1;
+ u32 reserved1 : 14;
+ u32 isnested : 1;
+ u32 countofelements : 12;
+ u32 reserved2 : 4;
+ u32 repstartindex : 12;
+ u32 reserved3 : 4;
+ };
+ u64 asuint64;
+ } hypercallinput;
+ union {
+ struct {
+ u16 callstatus;
+ u16 reserved1;
+ u32 elementsprocessed : 12;
+ u32 reserved2 : 20;
+ };
+ u64 asunit64;
+ } hypercalloutput;
+ };
+ u64 reserved2;
+ } hypercall;
} __packed __aligned(HV_HYP_PAGE_SIZE);
+u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size)
+{
+ union hv_ghcb *hv_ghcb;
+ void **ghcb_base;
+ unsigned long flags;
+ u64 status;
+
+ if (!hv_ghcb_pg)
+ return -EFAULT;
+
+ WARN_ON(in_nmi());
+
+ local_irq_save(flags);
+ ghcb_base = (void **)this_cpu_ptr(hv_ghcb_pg);
+ hv_ghcb = (union hv_ghcb *)*ghcb_base;
+ if (!hv_ghcb) {
+ local_irq_restore(flags);
+ return -EFAULT;
+ }
+
+ hv_ghcb->ghcb.protocol_version = GHCB_PROTOCOL_MAX;
+ hv_ghcb->ghcb.ghcb_usage = GHCB_USAGE_HYPERV_CALL;
+
+ hv_ghcb->hypercall.outputgpa = (u64)output;
+ hv_ghcb->hypercall.hypercallinput.asuint64 = 0;
+ hv_ghcb->hypercall.hypercallinput.callcode = control;
+
+ if (input_size)
+ memcpy(hv_ghcb->hypercall.hypercalldata, input, input_size);
+
+ VMGEXIT();
+
+ hv_ghcb->ghcb.ghcb_usage = 0xffffffff;
+ memset(hv_ghcb->ghcb.save.valid_bitmap, 0,
+ sizeof(hv_ghcb->ghcb.save.valid_bitmap));
+
+ status = hv_ghcb->hypercall.hypercalloutput.callstatus;
+
+ local_irq_restore(flags);
+
+ return status;
+}
+
void hv_ghcb_msr_write(u64 msr, u64 value)
{
union hv_ghcb *hv_ghcb;
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 5e479d54918c..8820ae68f20f 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -447,6 +447,10 @@ void vmbus_set_event(struct vmbus_channel *channel)
++channel->sig_events;
- hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, channel->sig_event);
+ if (hv_isolation_type_snp())
+ hv_ghcb_hypercall(HVCALL_SIGNAL_EVENT, &channel->sig_event,
+ NULL, sizeof(channel->sig_event));
+ else
+ hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, channel->sig_event);
}
EXPORT_SYMBOL_GPL(vmbus_set_event);
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index dee1a96bc535..5644ba2bfa5c 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -98,7 +98,13 @@ int hv_post_message(union hv_connection_id connection_id,
aligned_msg->payload_size = payload_size;
memcpy((void *)aligned_msg->payload, payload, payload_size);
- status = hv_do_hypercall(HVCALL_POST_MESSAGE, aligned_msg, NULL);
+ if (hv_isolation_type_snp())
+ status = hv_ghcb_hypercall(HVCALL_POST_MESSAGE,
+ (void *)aligned_msg, NULL,
+ sizeof(*aligned_msg));
+ else
+ status = hv_do_hypercall(HVCALL_POST_MESSAGE,
+ aligned_msg, NULL);
/* Preemption must remain disabled until after the hypercall
* so some other thread can't get scheduled onto this cpu and
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 1fc82d237161..7be173a99f27 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -289,3 +289,9 @@ void __weak hyperv_cleanup(void)
{
}
EXPORT_SYMBOL_GPL(hyperv_cleanup);
+
+u64 __weak hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size)
+{
+ return HV_STATUS_INVALID_PARAMETER;
+}
+EXPORT_SYMBOL_GPL(hv_ghcb_hypercall);
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 94750bafd4cc..a0ec607a2fd6 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -250,6 +250,7 @@ bool hv_is_hibernation_supported(void);
enum hv_isolation_type hv_get_isolation_type(void);
bool hv_is_isolation_supported(void);
bool hv_isolation_type_snp(void);
+u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size);
void hyperv_cleanup(void);
bool hv_query_ext_cap(u64 cap_query);
#else /* CONFIG_HYPERV */
--
2.25.1
From: Tianyu Lan <[email protected]>
VMbus ring buffers are shared with the host and need to
be accessed via the extra address space in Isolation VMs with
AMD SNP support. This patch maps the ring buffer
addresses in the extra address space via vmap_pfn(). Hyper-V's
set-memory-host-visibility hvcall smears data in the ring buffer,
so reset the ring buffer memory to zero after mapping.
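The wraparound double-mapping can be illustrated in user-space (build_wraparound_pfns() is a hypothetical helper mirroring the PFN-list construction in hv_ringbuffer_init(): the header page appears once, the data pages twice, so accesses that wrap past the end of the ring stay virtually contiguous):

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Illustrative: build the PFN list passed to vmap_pfn(). Entry 0 is the
 * struct hv_ring_buffer header page; the (page_cnt - 1) data pages are
 * then listed twice. Caller frees the result.
 */
static uint64_t *build_wraparound_pfns(uint64_t first_pfn, int page_cnt)
{
	uint64_t *pfns = calloc(page_cnt * 2 - 1, sizeof(*pfns));
	int i;

	if (!pfns)
		return NULL;

	pfns[0] = first_pfn;
	for (i = 0; i < 2 * (page_cnt - 1); i++)
		pfns[i + 1] = first_pfn + i % (page_cnt - 1) + 1;
	return pfns;
}
```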
Signed-off-by: Tianyu Lan <[email protected]>
---
Change since v4:
* Use PFN_DOWN instead of HVPFN_DOWN in the hv_ringbuffer_init()
Change since v3:
* Remove hv_ringbuffer_post_init(), merge map
operation for Isolation VM into hv_ringbuffer_init()
* Call hv_ringbuffer_init() after __vmbus_establish_gpadl().
---
drivers/hv/Kconfig | 1 +
drivers/hv/channel.c | 19 +++++++-------
drivers/hv/ring_buffer.c | 55 ++++++++++++++++++++++++++++++----------
3 files changed, 53 insertions(+), 22 deletions(-)
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index d1123ceb38f3..dd12af20e467 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -8,6 +8,7 @@ config HYPERV
|| (ARM64 && !CPU_BIG_ENDIAN))
select PARAVIRT
select X86_HV_CALLBACK_VECTOR if X86
+ select VMAP_PFN
help
Select this option to run Linux as a Hyper-V client operating
system.
diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index cf419eb1de77..ec847bd14119 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -684,15 +684,6 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
if (!newchannel->max_pkt_size)
newchannel->max_pkt_size = VMBUS_DEFAULT_MAX_PKT_SIZE;
- err = hv_ringbuffer_init(&newchannel->outbound, page, send_pages, 0);
- if (err)
- goto error_clean_ring;
-
- err = hv_ringbuffer_init(&newchannel->inbound, &page[send_pages],
- recv_pages, newchannel->max_pkt_size);
- if (err)
- goto error_clean_ring;
-
/* Establish the gpadl for the ring buffer */
newchannel->ringbuffer_gpadlhandle.gpadl_handle = 0;
@@ -704,6 +695,16 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
if (err)
goto error_clean_ring;
+ err = hv_ringbuffer_init(&newchannel->outbound,
+ page, send_pages, 0);
+ if (err)
+ goto error_free_gpadl;
+
+ err = hv_ringbuffer_init(&newchannel->inbound, &page[send_pages],
+ recv_pages, newchannel->max_pkt_size);
+ if (err)
+ goto error_free_gpadl;
+
/* Create and init the channel open message */
open_info = kzalloc(sizeof(*open_info) +
sizeof(struct vmbus_channel_open_channel),
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 2aee356840a2..5e014d23a7ad 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -17,6 +17,8 @@
#include <linux/vmalloc.h>
#include <linux/slab.h>
#include <linux/prefetch.h>
+#include <linux/io.h>
+#include <asm/mshyperv.h>
#include "hyperv_vmbus.h"
@@ -183,8 +185,10 @@ void hv_ringbuffer_pre_init(struct vmbus_channel *channel)
int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
struct page *pages, u32 page_cnt, u32 max_pkt_size)
{
- int i;
struct page **pages_wraparound;
+ unsigned long *pfns_wraparound;
+ u64 pfn;
+ int i;
BUILD_BUG_ON((sizeof(struct hv_ring_buffer) != PAGE_SIZE));
@@ -192,23 +196,48 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
* First page holds struct hv_ring_buffer, do wraparound mapping for
* the rest.
*/
- pages_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(struct page *),
- GFP_KERNEL);
- if (!pages_wraparound)
- return -ENOMEM;
+ if (hv_isolation_type_snp()) {
+ pfn = page_to_pfn(pages) +
+ PFN_DOWN(ms_hyperv.shared_gpa_boundary);
+
+ pfns_wraparound = kcalloc(page_cnt * 2 - 1,
+ sizeof(unsigned long), GFP_KERNEL);
+ if (!pfns_wraparound)
+ return -ENOMEM;
+
+ pfns_wraparound[0] = pfn;
+ for (i = 0; i < 2 * (page_cnt - 1); i++)
+ pfns_wraparound[i + 1] = pfn + i % (page_cnt - 1) + 1;
- pages_wraparound[0] = pages;
- for (i = 0; i < 2 * (page_cnt - 1); i++)
- pages_wraparound[i + 1] = &pages[i % (page_cnt - 1) + 1];
+ ring_info->ring_buffer = (struct hv_ring_buffer *)
+ vmap_pfn(pfns_wraparound, page_cnt * 2 - 1,
+ PAGE_KERNEL);
+ kfree(pfns_wraparound);
- ring_info->ring_buffer = (struct hv_ring_buffer *)
- vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP, PAGE_KERNEL);
+ if (!ring_info->ring_buffer)
+ return -ENOMEM;
+
+ /* Zero ring buffer after setting memory host visibility. */
+ memset(ring_info->ring_buffer, 0x00, PAGE_SIZE * page_cnt);
+ } else {
+ pages_wraparound = kcalloc(page_cnt * 2 - 1,
+ sizeof(struct page *),
+ GFP_KERNEL);
+
+ pages_wraparound[0] = pages;
+ for (i = 0; i < 2 * (page_cnt - 1); i++)
+ pages_wraparound[i + 1] =
+ &pages[i % (page_cnt - 1) + 1];
- kfree(pages_wraparound);
+ ring_info->ring_buffer = (struct hv_ring_buffer *)
+ vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP,
+ PAGE_KERNEL);
+ kfree(pages_wraparound);
+ if (!ring_info->ring_buffer)
+ return -ENOMEM;
+ }
- if (!ring_info->ring_buffer)
- return -ENOMEM;
ring_info->ring_buffer->read_index =
ring_info->ring_buffer->write_index = 0;
--
2.25.1
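The SNP branch above builds a PFN array in which entry 0 is the header page and each data page appears twice, so that a packet crossing the end of the ring maps to virtually contiguous addresses. A minimal userspace sketch of that layout (plain C, not kernel code; `base_pfn` and `boundary_pfn` are hypothetical stand-ins for `page_to_pfn(pages)` and `PFN_DOWN(ms_hyperv.shared_gpa_boundary)`):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Build the wraparound PFN list used for the SNP ring-buffer mapping:
 * the base PFN is shifted above the shared GPA boundary (vTOM), entry 0
 * is the header page, and the data pages are listed twice so the ring
 * wraps transparently in the virtual mapping.
 */
static uint64_t *build_wraparound_pfns(uint64_t base_pfn,
                                       uint64_t boundary_pfn,
                                       unsigned int page_cnt)
{
	uint64_t pfn = base_pfn + boundary_pfn; /* alias above vTOM */
	uint64_t *pfns;
	unsigned int i;

	pfns = calloc(page_cnt * 2 - 1, sizeof(*pfns));
	if (!pfns)
		return NULL;

	pfns[0] = pfn; /* header page, mapped once */
	for (i = 0; i < 2 * (page_cnt - 1); i++)
		pfns[i + 1] = pfn + i % (page_cnt - 1) + 1;
	return pfns;
}
```

The resulting array is what `vmap_pfn()` consumes in the patch; the double listing of the data pages is the same trick the non-SNP path plays with `struct page` pointers.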
From: Tianyu Lan <[email protected]>
Hyper-V exposes the GHCB page via the SEV-ES GHCB MSR so that
an SNP guest can communicate with the hypervisor. Map the GHCB
page for all CPUs so they can read/write MSR registers and
submit hvcall requests via the GHCB page.
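The per-CPU bookkeeping this patch introduces (store the mapped GHCB address on CPU init, unmap and clear it on CPU offline) can be sketched in plain userspace C; `NR_CPUS_DEMO` and the slot array are hypothetical stand-ins for `alloc_percpu()`/`memremap()`/`memunmap()`:

```c
#include <assert.h>
#include <stddef.h>

/* Per-CPU GHCB slots, mirroring hv_ghcb_pg in the patch. */
#define NR_CPUS_DEMO 4

static void *ghcb_slot[NR_CPUS_DEMO];

/* Like hyperv_init_ghcb(): record the mapped GHCB for this CPU. */
static int demo_ghcb_cpu_init(unsigned int cpu, void *mapped_ghcb)
{
	if (cpu >= NR_CPUS_DEMO || !mapped_ghcb)
		return -1;
	ghcb_slot[cpu] = mapped_ghcb; /* *this_cpu_ptr(hv_ghcb_pg) = ghcb_va */
	return 0;
}

/* Like the hv_cpu_die() hunk: unmap and clear the slot. */
static void demo_ghcb_cpu_die(unsigned int cpu)
{
	if (cpu < NR_CPUS_DEMO) {
		/* memunmap(*ghcb_va) would happen here */
		ghcb_slot[cpu] = NULL;
	}
}
```

This mirrors why `hyperv_init_ghcb()` is called from `hv_cpu_init()`: each onlined CPU gets its own mapping, and the teardown in `hv_cpu_die()` keeps the slot consistent across offline/online cycles.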
Signed-off-by: Tianyu Lan <[email protected]>
---
Change since v4:
* Fix typo in comment
Change since v3:
* Rename ghcb_base to hv_ghcb_pg and move it out of
struct ms_hyperv_info.
* Allocate hv_ghcb_pg before cpuhp_setup_state() and leverage
hv_cpu_init() to initialize ghcb page.
---
arch/x86/hyperv/hv_init.c | 68 +++++++++++++++++++++++++++++----
arch/x86/include/asm/mshyperv.h | 4 ++
arch/x86/kernel/cpu/mshyperv.c | 3 ++
include/asm-generic/mshyperv.h | 1 +
4 files changed, 69 insertions(+), 7 deletions(-)
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 708a2712a516..a7e922755ad1 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -20,6 +20,7 @@
#include <linux/kexec.h>
#include <linux/version.h>
#include <linux/vmalloc.h>
+#include <linux/io.h>
#include <linux/mm.h>
#include <linux/hyperv.h>
#include <linux/slab.h>
@@ -36,12 +37,42 @@ EXPORT_SYMBOL_GPL(hv_current_partition_id);
void *hv_hypercall_pg;
EXPORT_SYMBOL_GPL(hv_hypercall_pg);
+void __percpu **hv_ghcb_pg;
+
/* Storage to save the hypercall page temporarily for hibernation */
static void *hv_hypercall_pg_saved;
struct hv_vp_assist_page **hv_vp_assist_page;
EXPORT_SYMBOL_GPL(hv_vp_assist_page);
+static int hyperv_init_ghcb(void)
+{
+ u64 ghcb_gpa;
+ void *ghcb_va;
+ void **ghcb_base;
+
+ if (!hv_isolation_type_snp())
+ return 0;
+
+ if (!hv_ghcb_pg)
+ return -EINVAL;
+
+ /*
+ * The GHCB page is allocated by the paravisor. The address
+ * returned by MSR_AMD64_SEV_ES_GHCB is above the shared
+ * memory boundary, so map it here.
+ */
+ rdmsrl(MSR_AMD64_SEV_ES_GHCB, ghcb_gpa);
+ ghcb_va = memremap(ghcb_gpa, HV_HYP_PAGE_SIZE, MEMREMAP_WB);
+ if (!ghcb_va)
+ return -ENOMEM;
+
+ ghcb_base = (void **)this_cpu_ptr(hv_ghcb_pg);
+ *ghcb_base = ghcb_va;
+
+ return 0;
+}
+
static int hv_cpu_init(unsigned int cpu)
{
union hv_vp_assist_msr_contents msr = { 0 };
@@ -85,7 +116,7 @@ static int hv_cpu_init(unsigned int cpu)
}
}
- return 0;
+ return hyperv_init_ghcb();
}
static void (*hv_reenlightenment_cb)(void);
@@ -177,6 +208,14 @@ static int hv_cpu_die(unsigned int cpu)
{
struct hv_reenlightenment_control re_ctrl;
unsigned int new_cpu;
+ void **ghcb_va;
+
+ if (hv_ghcb_pg) {
+ ghcb_va = (void **)this_cpu_ptr(hv_ghcb_pg);
+ if (*ghcb_va)
+ memunmap(*ghcb_va);
+ *ghcb_va = NULL;
+ }
hv_common_cpu_die(cpu);
@@ -366,10 +405,16 @@ void __init hyperv_init(void)
goto common_free;
}
+ if (hv_isolation_type_snp()) {
+ hv_ghcb_pg = alloc_percpu(void *);
+ if (!hv_ghcb_pg)
+ goto free_vp_assist_page;
+ }
+
cpuhp = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/hyperv_init:online",
hv_cpu_init, hv_cpu_die);
if (cpuhp < 0)
- goto free_vp_assist_page;
+ goto free_ghcb_page;
/*
* Setup the hypercall page and enable hypercalls.
@@ -383,10 +428,8 @@ void __init hyperv_init(void)
VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_ROX,
VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
__builtin_return_address(0));
- if (hv_hypercall_pg == NULL) {
- wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
- goto remove_cpuhp_state;
- }
+ if (hv_hypercall_pg == NULL)
+ goto clean_guest_os_id;
rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
hypercall_msr.enable = 1;
@@ -456,8 +499,11 @@ void __init hyperv_init(void)
hv_query_ext_cap(0);
return;
-remove_cpuhp_state:
+clean_guest_os_id:
+ wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
cpuhp_remove_state(cpuhp);
+free_ghcb_page:
+ free_percpu(hv_ghcb_pg);
free_vp_assist_page:
kfree(hv_vp_assist_page);
hv_vp_assist_page = NULL;
@@ -559,3 +605,11 @@ bool hv_is_isolation_supported(void)
{
return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
}
+
+DEFINE_STATIC_KEY_FALSE(isolation_type_snp);
+
+bool hv_isolation_type_snp(void)
+{
+ return static_branch_unlikely(&isolation_type_snp);
+}
+EXPORT_SYMBOL_GPL(hv_isolation_type_snp);
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index adccbc209169..37739a277ac6 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -11,6 +11,8 @@
#include <asm/paravirt.h>
#include <asm/mshyperv.h>
+DECLARE_STATIC_KEY_FALSE(isolation_type_snp);
+
typedef int (*hyperv_fill_flush_list_func)(
struct hv_guest_mapping_flush_list *flush,
void *data);
@@ -39,6 +41,8 @@ extern void *hv_hypercall_pg;
extern u64 hv_current_partition_id;
+extern void __percpu **hv_ghcb_pg;
+
int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index e095c28d27ae..b09ade389040 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -316,6 +316,9 @@ static void __init ms_hyperv_init_platform(void)
pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n",
ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b);
+
+ if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP)
+ static_branch_enable(&isolation_type_snp);
}
if (hv_max_functions_eax >= HYPERV_CPUID_NESTED_FEATURES) {
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index c1ab6a6e72b5..0924bbd8458e 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -237,6 +237,7 @@ bool hv_is_hyperv_initialized(void);
bool hv_is_hibernation_supported(void);
enum hv_isolation_type hv_get_isolation_type(void);
bool hv_is_isolation_supported(void);
+bool hv_isolation_type_snp(void);
void hyperv_cleanup(void);
bool hv_query_ext_cap(u64 cap_query);
#else /* CONFIG_HYPERV */
--
2.25.1
From: Tianyu Lan <[email protected]>
In an Isolation VM with AMD SEV, the bounce buffer must be accessed via
an extra address space above shared_gpa_boundary
(e.g. bit 39 of the address line), which is reported by the Hyper-V CPUID
ISOLATION_CONFIG leaf. The accessed physical address is the original
physical address plus shared_gpa_boundary. In the AMD SEV-SNP
spec, shared_gpa_boundary is called the virtual top of memory (vTOM).
Memory addresses below vTOM are automatically treated as private, while
memory above vTOM is treated as shared.
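The vTOM rule described above can be stated as a short sketch (plain C, not kernel code; the helper names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An address at or above vTOM aliases shared (unencrypted) memory. */
static bool is_shared_alias(uint64_t paddr, uint64_t vtom)
{
	return paddr >= vtom;
}

/* Shared alias of a private address: original PA + shared_gpa_boundary. */
static uint64_t shared_alias(uint64_t paddr, uint64_t vtom)
{
	return paddr + vtom;
}
```

This is the address arithmetic the rest of the patch relies on when it remaps the swiotlb pool above the boundary.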
Expose swiotlb_unencrypted_base so platforms can set the unencrypted
memory base offset. When it is set, the swiotlb code calls memremap()
to map the bounce buffer, stores the mapped address, and uses that
address to copy data to/from the swiotlb bounce buffer.
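Because the pool may now be a memremap()ed alias, `phys_to_virt(tlb_addr)` is no longer valid in `swiotlb_bounce()`; the copy address is derived from the stored remapped base instead. A minimal sketch of that computation (plain C; names mirror the patch's `io_tlb_mem` fields):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Replacement for phys_to_virt(tlb_addr) once the pool is remapped:
 * take the slot's offset within the pool and apply it to the stored
 * remapped base address (mem->vaddr in the patch).
 */
static char *bounce_vaddr(char *mem_vaddr, uint64_t tlb_addr,
                          uint64_t mem_start)
{
	return mem_vaddr + (tlb_addr - mem_start);
}
```

This is why `swiotlb_init_io_tlb_mem()` records `mem->vaddr` even in the non-remapped case: the bounce path then works identically whether or not `swiotlb_unencrypted_base` is set.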
Signed-off-by: Tianyu Lan <[email protected]>
---
Change since v4:
* Expose swiotlb_unencrypted_base to set unencrypted memory
offset.
* Use memremap() to map bounce buffer if swiotlb_unencrypted_
base is set.
Change since v1:
* Make swiotlb_init_io_tlb_mem() return error code and return
error when dma_map_decrypted() fails.
---
include/linux/swiotlb.h | 6 ++++++
kernel/dma/swiotlb.c | 41 +++++++++++++++++++++++++++++++++++------
2 files changed, 41 insertions(+), 6 deletions(-)
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index b0cb2a9973f4..4998ed44ae3d 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -72,6 +72,9 @@ extern enum swiotlb_force swiotlb_force;
* @end: The end address of the swiotlb memory pool. Used to do a quick
* range check to see if the memory was in fact allocated by this
* API.
+ * @vaddr: The virtual address of the swiotlb memory pool. The pool
+ * may be remapped in the memory encrypted case, and this address
+ * is used for bounce buffer copy operations.
* @nslabs: The number of IO TLB blocks (in groups of 64) between @start and
* @end. For default swiotlb, this is command line adjustable via
* setup_io_tlb_npages.
@@ -91,6 +94,7 @@ extern enum swiotlb_force swiotlb_force;
struct io_tlb_mem {
phys_addr_t start;
phys_addr_t end;
+ void *vaddr;
unsigned long nslabs;
unsigned long used;
unsigned int index;
@@ -185,4 +189,6 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
}
#endif /* CONFIG_DMA_RESTRICTED_POOL */
+extern phys_addr_t swiotlb_unencrypted_base;
+
#endif /* __LINUX_SWIOTLB_H */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 87c40517e822..9e30cc4bd872 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -50,6 +50,7 @@
#include <asm/io.h>
#include <asm/dma.h>
+#include <linux/io.h>
#include <linux/init.h>
#include <linux/memblock.h>
#include <linux/iommu-helper.h>
@@ -72,6 +73,8 @@ enum swiotlb_force swiotlb_force;
struct io_tlb_mem io_tlb_default_mem;
+phys_addr_t swiotlb_unencrypted_base;
+
/*
* Max segment that we can provide which (if pages are contingous) will
* not be bounced (unless SWIOTLB_FORCE is set).
@@ -175,7 +178,7 @@ void __init swiotlb_update_mem_attributes(void)
memset(vaddr, 0, bytes);
}
-static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
+static int swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
unsigned long nslabs, bool late_alloc)
{
void *vaddr = phys_to_virt(start);
@@ -196,13 +199,34 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
mem->slots[i].alloc_size = 0;
}
+
+ if (set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT))
+ return -EFAULT;
+
+ /*
+ * Map memory in the unencrypted physical address space when requested
+ * (e.g. for Hyper-V AMD SEV-SNP Isolation VMs).
+ */
+ if (swiotlb_unencrypted_base) {
+ phys_addr_t paddr = __pa(vaddr) + swiotlb_unencrypted_base;
+
+ vaddr = memremap(paddr, bytes, MEMREMAP_WB);
+ if (!vaddr) {
+ pr_err("Failed to map the unencrypted memory.\n");
+ return -ENOMEM;
+ }
+ }
+
memset(vaddr, 0, bytes);
+ mem->vaddr = vaddr;
+ return 0;
}
int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
{
struct io_tlb_mem *mem = &io_tlb_default_mem;
size_t alloc_size;
+ int ret;
if (swiotlb_force == SWIOTLB_NO_FORCE)
return 0;
@@ -217,7 +241,11 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
__func__, alloc_size, PAGE_SIZE);
- swiotlb_init_io_tlb_mem(mem, __pa(tlb), nslabs, false);
+ ret = swiotlb_init_io_tlb_mem(mem, __pa(tlb), nslabs, false);
+ if (ret) {
+ memblock_free(__pa(mem), alloc_size);
+ return ret;
+ }
if (verbose)
swiotlb_print_info();
@@ -304,7 +332,7 @@ int
swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
{
struct io_tlb_mem *mem = &io_tlb_default_mem;
- unsigned long bytes = nslabs << IO_TLB_SHIFT;
+ int ret;
if (swiotlb_force == SWIOTLB_NO_FORCE)
return 0;
@@ -318,8 +346,9 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
if (!mem->slots)
return -ENOMEM;
- set_memory_decrypted((unsigned long)tlb, bytes >> PAGE_SHIFT);
- swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true);
+ ret = swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true);
+ if (ret)
+ return ret;
swiotlb_print_info();
swiotlb_set_max_segment(mem->nslabs << IO_TLB_SHIFT);
@@ -371,7 +400,7 @@ static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size
phys_addr_t orig_addr = mem->slots[index].orig_addr;
size_t alloc_size = mem->slots[index].alloc_size;
unsigned long pfn = PFN_DOWN(orig_addr);
- unsigned char *vaddr = phys_to_virt(tlb_addr);
+ unsigned char *vaddr = mem->vaddr + tlb_addr - mem->start;
unsigned int tlb_offset, orig_addr_offset;
if (orig_addr == INVALID_PHYS_ADDR)
--
2.25.1
From: Tianyu Lan <[email protected]>
Mark the vmbus ring buffer visible to the host with set_memory_decrypted()
when establishing the GPADL handle.
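The visibility calls in this patch take a page count derived from the buffer size with `HVPFN_UP()`. A plain C re-statement of that rounding, assuming the Hyper-V page size of 4096 bytes (the `_DEMO` names are illustrative, not kernel identifiers):

```c
#include <assert.h>

#define HV_HYP_PAGE_SHIFT_DEMO 12
#define HV_HYP_PAGE_SIZE_DEMO (1UL << HV_HYP_PAGE_SHIFT_DEMO)

/* Round a byte size up to a whole number of Hyper-V pages. */
static unsigned long hvpfn_up_demo(unsigned long size)
{
	return (size + HV_HYP_PAGE_SIZE_DEMO - 1) >> HV_HYP_PAGE_SHIFT_DEMO;
}
```

The same count is used for both `set_memory_decrypted()` on establish and `set_memory_encrypted()` on teardown, so the two transitions always cover an identical range.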
Signed-off-by: Tianyu Lan <[email protected]>
---
Change since v4:
* Change gpadl handle in netvsc and uio driver from u32 to
struct vmbus_gpadl.
* Change vmbus_establish_gpadl()'s gpadl_handle parameter
to vmbus_gpadl data structure.
Change since v3:
* Change vmbus_teardown_gpadl() parameter and put gpadl handle,
buffer and buffer size in the struct vmbus_gpadl.
---
drivers/hv/channel.c | 54 ++++++++++++++++++++++++---------
drivers/net/hyperv/hyperv_net.h | 5 +--
drivers/net/hyperv/netvsc.c | 17 ++++++-----
drivers/uio/uio_hv_generic.c | 20 ++++++------
include/linux/hyperv.h | 12 ++++++--
5 files changed, 71 insertions(+), 37 deletions(-)
diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index f3761c73b074..cf419eb1de77 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -17,6 +17,7 @@
#include <linux/hyperv.h>
#include <linux/uio.h>
#include <linux/interrupt.h>
+#include <linux/set_memory.h>
#include <asm/page.h>
#include <asm/mshyperv.h>
@@ -456,7 +457,7 @@ static int create_gpadl_header(enum hv_gpadl_type type, void *kbuffer,
static int __vmbus_establish_gpadl(struct vmbus_channel *channel,
enum hv_gpadl_type type, void *kbuffer,
u32 size, u32 send_offset,
- u32 *gpadl_handle)
+ struct vmbus_gpadl *gpadl)
{
struct vmbus_channel_gpadl_header *gpadlmsg;
struct vmbus_channel_gpadl_body *gpadl_body;
@@ -474,6 +475,15 @@ static int __vmbus_establish_gpadl(struct vmbus_channel *channel,
if (ret)
return ret;
+ ret = set_memory_decrypted((unsigned long)kbuffer,
+ HVPFN_UP(size));
+ if (ret) {
+ dev_warn(&channel->device_obj->device,
+ "Failed to set host visibility for new GPADL %d.\n",
+ ret);
+ return ret;
+ }
+
init_completion(&msginfo->waitevent);
msginfo->waiting_channel = channel;
@@ -537,7 +547,10 @@ static int __vmbus_establish_gpadl(struct vmbus_channel *channel,
}
/* At this point, we received the gpadl created msg */
- *gpadl_handle = gpadlmsg->gpadl;
+ gpadl->gpadl_handle = gpadlmsg->gpadl;
+ gpadl->buffer = kbuffer;
+ gpadl->size = size;
+
cleanup:
spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
@@ -549,6 +562,11 @@ static int __vmbus_establish_gpadl(struct vmbus_channel *channel,
}
kfree(msginfo);
+
+ if (ret)
+ set_memory_encrypted((unsigned long)kbuffer,
+ HVPFN_UP(size));
+
return ret;
}
@@ -561,10 +579,10 @@ static int __vmbus_establish_gpadl(struct vmbus_channel *channel,
* @gpadl_handle: some funky thing
*/
int vmbus_establish_gpadl(struct vmbus_channel *channel, void *kbuffer,
- u32 size, u32 *gpadl_handle)
+ u32 size, struct vmbus_gpadl *gpadl)
{
return __vmbus_establish_gpadl(channel, HV_GPADL_BUFFER, kbuffer, size,
- 0U, gpadl_handle);
+ 0U, gpadl);
}
EXPORT_SYMBOL_GPL(vmbus_establish_gpadl);
@@ -639,6 +657,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
struct vmbus_channel_open_channel *open_msg;
struct vmbus_channel_msginfo *open_info = NULL;
struct page *page = newchannel->ringbuffer_page;
+ struct vmbus_gpadl gpadl;
u32 send_pages, recv_pages;
unsigned long flags;
int err;
@@ -675,7 +694,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
goto error_clean_ring;
/* Establish the gpadl for the ring buffer */
- newchannel->ringbuffer_gpadlhandle = 0;
+ newchannel->ringbuffer_gpadlhandle.gpadl_handle = 0;
err = __vmbus_establish_gpadl(newchannel, HV_GPADL_RING,
page_address(newchannel->ringbuffer_page),
@@ -701,7 +720,8 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
open_msg->header.msgtype = CHANNELMSG_OPENCHANNEL;
open_msg->openid = newchannel->offermsg.child_relid;
open_msg->child_relid = newchannel->offermsg.child_relid;
- open_msg->ringbuffer_gpadlhandle = newchannel->ringbuffer_gpadlhandle;
+ open_msg->ringbuffer_gpadlhandle
+ = newchannel->ringbuffer_gpadlhandle.gpadl_handle;
/*
* The unit of ->downstream_ringbuffer_pageoffset is HV_HYP_PAGE and
* the unit of ->ringbuffer_send_offset (i.e. send_pages) is PAGE, so
@@ -759,8 +779,8 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
error_free_info:
kfree(open_info);
error_free_gpadl:
- vmbus_teardown_gpadl(newchannel, newchannel->ringbuffer_gpadlhandle);
- newchannel->ringbuffer_gpadlhandle = 0;
+ vmbus_teardown_gpadl(newchannel, &newchannel->ringbuffer_gpadlhandle);
+ newchannel->ringbuffer_gpadlhandle.gpadl_handle = 0;
error_clean_ring:
hv_ringbuffer_cleanup(&newchannel->outbound);
hv_ringbuffer_cleanup(&newchannel->inbound);
@@ -806,7 +826,7 @@ EXPORT_SYMBOL_GPL(vmbus_open);
/*
* vmbus_teardown_gpadl -Teardown the specified GPADL handle
*/
-int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
+int vmbus_teardown_gpadl(struct vmbus_channel *channel, struct vmbus_gpadl *gpadl)
{
struct vmbus_channel_gpadl_teardown *msg;
struct vmbus_channel_msginfo *info;
@@ -825,7 +845,7 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
msg->header.msgtype = CHANNELMSG_GPADL_TEARDOWN;
msg->child_relid = channel->offermsg.child_relid;
- msg->gpadl = gpadl_handle;
+ msg->gpadl = gpadl->gpadl_handle;
spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
list_add_tail(&info->msglistentry,
@@ -859,6 +879,12 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
kfree(info);
+
+ ret = set_memory_encrypted((unsigned long)gpadl->buffer,
+ HVPFN_UP(gpadl->size));
+ if (ret)
+ pr_warn("Fail to set mem host visibility in GPADL teardown %d.\n", ret);
+
return ret;
}
EXPORT_SYMBOL_GPL(vmbus_teardown_gpadl);
@@ -896,6 +922,7 @@ void vmbus_reset_channel_cb(struct vmbus_channel *channel)
static int vmbus_close_internal(struct vmbus_channel *channel)
{
struct vmbus_channel_close_channel *msg;
+ struct vmbus_gpadl gpadl;
int ret;
vmbus_reset_channel_cb(channel);
@@ -933,9 +960,8 @@ static int vmbus_close_internal(struct vmbus_channel *channel)
}
/* Tear down the gpadl for the channel's ring buffer */
- else if (channel->ringbuffer_gpadlhandle) {
- ret = vmbus_teardown_gpadl(channel,
- channel->ringbuffer_gpadlhandle);
+ else if (channel->ringbuffer_gpadlhandle.gpadl_handle) {
+ ret = vmbus_teardown_gpadl(channel, &channel->ringbuffer_gpadlhandle);
if (ret) {
pr_err("Close failed: teardown gpadl return %d\n", ret);
/*
@@ -944,7 +970,7 @@ static int vmbus_close_internal(struct vmbus_channel *channel)
*/
}
- channel->ringbuffer_gpadlhandle = 0;
+ channel->ringbuffer_gpadlhandle.gpadl_handle = 0;
}
if (!ret)
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index bc48855dff10..315278a7cf88 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -1075,14 +1075,15 @@ struct netvsc_device {
/* Receive buffer allocated by us but manages by NetVSP */
void *recv_buf;
u32 recv_buf_size; /* allocated bytes */
- u32 recv_buf_gpadl_handle;
+ struct vmbus_gpadl recv_buf_gpadl_handle;
u32 recv_section_cnt;
u32 recv_section_size;
u32 recv_completion_cnt;
/* Send buffer allocated by us */
void *send_buf;
- u32 send_buf_gpadl_handle;
+ u32 send_buf_size;
+ struct vmbus_gpadl send_buf_gpadl_handle;
u32 send_section_cnt;
u32 send_section_size;
unsigned long *send_section_map;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 7bd935412853..1f87e570ed2b 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -278,9 +278,9 @@ static void netvsc_teardown_recv_gpadl(struct hv_device *device,
{
int ret;
- if (net_device->recv_buf_gpadl_handle) {
+ if (net_device->recv_buf_gpadl_handle.gpadl_handle) {
ret = vmbus_teardown_gpadl(device->channel,
- net_device->recv_buf_gpadl_handle);
+ &net_device->recv_buf_gpadl_handle);
/* If we failed here, we might as well return and have a leak
* rather than continue and a bugchk
@@ -290,7 +290,7 @@ static void netvsc_teardown_recv_gpadl(struct hv_device *device,
"unable to teardown receive buffer's gpadl\n");
return;
}
- net_device->recv_buf_gpadl_handle = 0;
+ net_device->recv_buf_gpadl_handle.gpadl_handle = 0;
}
}
@@ -300,9 +300,9 @@ static void netvsc_teardown_send_gpadl(struct hv_device *device,
{
int ret;
- if (net_device->send_buf_gpadl_handle) {
+ if (net_device->send_buf_gpadl_handle.gpadl_handle) {
ret = vmbus_teardown_gpadl(device->channel,
- net_device->send_buf_gpadl_handle);
+ &net_device->send_buf_gpadl_handle);
/* If we failed here, we might as well return and have a leak
* rather than continue and a bugchk
@@ -312,7 +312,7 @@ static void netvsc_teardown_send_gpadl(struct hv_device *device,
"unable to teardown send buffer's gpadl\n");
return;
}
- net_device->send_buf_gpadl_handle = 0;
+ net_device->send_buf_gpadl_handle.gpadl_handle = 0;
}
}
@@ -380,7 +380,7 @@ static int netvsc_init_buf(struct hv_device *device,
memset(init_packet, 0, sizeof(struct nvsp_message));
init_packet->hdr.msg_type = NVSP_MSG1_TYPE_SEND_RECV_BUF;
init_packet->msg.v1_msg.send_recv_buf.
- gpadl_handle = net_device->recv_buf_gpadl_handle;
+ gpadl_handle = net_device->recv_buf_gpadl_handle.gpadl_handle;
init_packet->msg.v1_msg.
send_recv_buf.id = NETVSC_RECEIVE_BUFFER_ID;
@@ -463,6 +463,7 @@ static int netvsc_init_buf(struct hv_device *device,
ret = -ENOMEM;
goto cleanup;
}
+ net_device->send_buf_size = buf_size;
/* Establish the gpadl handle for this buffer on this
* channel. Note: This call uses the vmbus connection rather
@@ -482,7 +483,7 @@ static int netvsc_init_buf(struct hv_device *device,
memset(init_packet, 0, sizeof(struct nvsp_message));
init_packet->hdr.msg_type = NVSP_MSG1_TYPE_SEND_SEND_BUF;
init_packet->msg.v1_msg.send_send_buf.gpadl_handle =
- net_device->send_buf_gpadl_handle;
+ net_device->send_buf_gpadl_handle.gpadl_handle;
init_packet->msg.v1_msg.send_send_buf.id = NETVSC_SEND_BUFFER_ID;
trace_nvsp_send(ndev, init_packet);
diff --git a/drivers/uio/uio_hv_generic.c b/drivers/uio/uio_hv_generic.c
index 652fe2547587..548243dcd895 100644
--- a/drivers/uio/uio_hv_generic.c
+++ b/drivers/uio/uio_hv_generic.c
@@ -58,11 +58,11 @@ struct hv_uio_private_data {
atomic_t refcnt;
void *recv_buf;
- u32 recv_gpadl;
+ struct vmbus_gpadl recv_gpadl;
char recv_name[32]; /* "recv_4294967295" */
void *send_buf;
- u32 send_gpadl;
+ struct vmbus_gpadl send_gpadl;
char send_name[32];
};
@@ -179,15 +179,15 @@ hv_uio_new_channel(struct vmbus_channel *new_sc)
static void
hv_uio_cleanup(struct hv_device *dev, struct hv_uio_private_data *pdata)
{
- if (pdata->send_gpadl) {
- vmbus_teardown_gpadl(dev->channel, pdata->send_gpadl);
- pdata->send_gpadl = 0;
+ if (pdata->send_gpadl.gpadl_handle) {
+ vmbus_teardown_gpadl(dev->channel, &pdata->send_gpadl);
+ pdata->send_gpadl.gpadl_handle = 0;
vfree(pdata->send_buf);
}
- if (pdata->recv_gpadl) {
- vmbus_teardown_gpadl(dev->channel, pdata->recv_gpadl);
- pdata->recv_gpadl = 0;
+ if (pdata->recv_gpadl.gpadl_handle) {
+ vmbus_teardown_gpadl(dev->channel, &pdata->recv_gpadl);
+ pdata->recv_gpadl.gpadl_handle = 0;
vfree(pdata->recv_buf);
}
}
@@ -303,7 +303,7 @@ hv_uio_probe(struct hv_device *dev,
/* put Global Physical Address Label in name */
snprintf(pdata->recv_name, sizeof(pdata->recv_name),
- "recv:%u", pdata->recv_gpadl);
+ "recv:%u", pdata->recv_gpadl.gpadl_handle);
pdata->info.mem[RECV_BUF_MAP].name = pdata->recv_name;
pdata->info.mem[RECV_BUF_MAP].addr
= (uintptr_t)pdata->recv_buf;
@@ -324,7 +324,7 @@ hv_uio_probe(struct hv_device *dev,
}
snprintf(pdata->send_name, sizeof(pdata->send_name),
- "send:%u", pdata->send_gpadl);
+ "send:%u", pdata->send_gpadl.gpadl_handle);
pdata->info.mem[SEND_BUF_MAP].name = pdata->send_name;
pdata->info.mem[SEND_BUF_MAP].addr
= (uintptr_t)pdata->send_buf;
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index ddc8713ce57b..a9e0bc3b1511 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -803,6 +803,12 @@ struct vmbus_device {
#define VMBUS_DEFAULT_MAX_PKT_SIZE 4096
+struct vmbus_gpadl {
+ u32 gpadl_handle;
+ u32 size;
+ void *buffer;
+};
+
struct vmbus_channel {
struct list_head listentry;
@@ -822,7 +828,7 @@ struct vmbus_channel {
bool rescind_ref; /* got rescind msg, got channel reference */
struct completion rescind_event;
- u32 ringbuffer_gpadlhandle;
+ struct vmbus_gpadl ringbuffer_gpadlhandle;
/* Allocated memory for ring buffer */
struct page *ringbuffer_page;
@@ -1192,10 +1198,10 @@ extern int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel,
extern int vmbus_establish_gpadl(struct vmbus_channel *channel,
void *kbuffer,
u32 size,
- u32 *gpadl_handle);
+ struct vmbus_gpadl *gpadl);
extern int vmbus_teardown_gpadl(struct vmbus_channel *channel,
- u32 gpadl_handle);
+ struct vmbus_gpadl *gpadl);
void vmbus_reset_channel_cb(struct vmbus_channel *channel);
--
2.25.1
From: Tianyu Lan <[email protected]>
Hyper-V provides the GHCB protocol to write Synthetic Interrupt
Controller MSR registers in an Isolation VM with AMD SEV-SNP;
these registers are emulated directly by the hypervisor.
Hyper-V requires the SINTx MSR registers to be written twice:
first via the GHCB page to communicate with the hypervisor,
and then with the wrmsr instruction to talk to the paravisor,
which runs in VMPL0. The guest OS ID MSR also needs to be set
via the GHCB page.
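The double write described above can be sketched in plain C (not kernel code; `demo_set_sint` is a hypothetical helper, and bit 20 is the proxy bit written via wrmsr for the paravisor):

```c
#include <assert.h>
#include <stdint.h>

#define SINT_PROXY_BIT (1ULL << 20)

/*
 * Model the SINTx double write: the raw value goes to the hypervisor
 * via the GHCB page (reported through *ghcb_out here), and the value
 * with the proxy bit set is what the wrmsr instruction would write
 * for the VMPL0 paravisor (returned here).
 */
static uint64_t demo_set_sint(uint64_t value, uint64_t *ghcb_out)
{
	*ghcb_out = value;              /* written via the GHCB page */
	return value | SINT_PROXY_BIT;  /* written via wrmsrl */
}
```

This corresponds to the `hv_set_register()` hunk later in the patch, which calls `hv_ghcb_msr_write()` and then `wrmsrl(reg, value | 1 << 20)` for SINT0..SINT15.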
Signed-off-by: Tianyu Lan <[email protected]>
---
Change since v4:
* Remove hv_get_simp(), hv_get_siefp() hv_get_synint_*()
helper function. Move the logic into hv_get/set_register().
Change since v3:
* Pass old_msg_type to hv_signal_eom() as parameter.
* Use HV_REGISTER_* macros instead of HV_X64_MSR_*
* Add hv_isolation_type_snp() weak function.
* Add macros to set synic registers in ARM code.
Change since v1:
* Introduce sev_es_ghcb_hv_call_simple() and share code
between SEV and Hyper-V code.
Fix for hyperv: Add Write/Read MSR registers via ghcb page
---
arch/x86/hyperv/hv_init.c | 36 +++--------
arch/x86/hyperv/ivm.c | 103 ++++++++++++++++++++++++++++++++
arch/x86/include/asm/mshyperv.h | 56 ++++++++++++-----
arch/x86/include/asm/sev.h | 6 ++
arch/x86/kernel/sev-shared.c | 63 +++++++++++--------
drivers/hv/hv.c | 77 +++++++++++++++++++-----
drivers/hv/hv_common.c | 6 ++
include/asm-generic/mshyperv.h | 2 +
8 files changed, 266 insertions(+), 83 deletions(-)
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index d57df6825527..a16a83e46a30 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -37,7 +37,7 @@ EXPORT_SYMBOL_GPL(hv_current_partition_id);
void *hv_hypercall_pg;
EXPORT_SYMBOL_GPL(hv_hypercall_pg);
-void __percpu **hv_ghcb_pg;
+union hv_ghcb __percpu **hv_ghcb_pg;
/* Storage to save the hypercall page temporarily for hibernation */
static void *hv_hypercall_pg_saved;
@@ -406,7 +406,7 @@ void __init hyperv_init(void)
}
if (hv_isolation_type_snp()) {
- hv_ghcb_pg = alloc_percpu(void *);
+ hv_ghcb_pg = alloc_percpu(union hv_ghcb *);
if (!hv_ghcb_pg)
goto free_vp_assist_page;
}
@@ -424,6 +424,9 @@ void __init hyperv_init(void)
guest_id = generate_guest_id(0, LINUX_VERSION_CODE, 0);
wrmsrl(HV_X64_MSR_GUEST_OS_ID, guest_id);
+ /* Hyper-V requires the guest OS ID to be written via the GHCB in an SNP IVM. */
+ hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, guest_id);
+
hv_hypercall_pg = __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START,
VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_ROX,
VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
@@ -501,6 +504,7 @@ void __init hyperv_init(void)
clean_guest_os_id:
wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
+ hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
cpuhp_remove_state(cpuhp);
free_ghcb_page:
free_percpu(hv_ghcb_pg);
@@ -522,6 +526,7 @@ void hyperv_cleanup(void)
/* Reset our OS id */
wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
+ hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
/*
* Reset hypercall page reference before reset the page,
@@ -592,30 +597,3 @@ bool hv_is_hyperv_initialized(void)
return hypercall_msr.enable;
}
EXPORT_SYMBOL_GPL(hv_is_hyperv_initialized);
-
-enum hv_isolation_type hv_get_isolation_type(void)
-{
- if (!(ms_hyperv.priv_high & HV_ISOLATION))
- return HV_ISOLATION_TYPE_NONE;
- return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b);
-}
-EXPORT_SYMBOL_GPL(hv_get_isolation_type);
-
-bool hv_is_isolation_supported(void)
-{
- if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
- return false;
-
- if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
- return false;
-
- return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
-}
-
-DEFINE_STATIC_KEY_FALSE(isolation_type_snp);
-
-bool hv_isolation_type_snp(void)
-{
- return static_branch_unlikely(&isolation_type_snp);
-}
-EXPORT_SYMBOL_GPL(hv_isolation_type_snp);
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 79e7fb83472a..5439723446c9 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -6,12 +6,115 @@
* Tianyu Lan <[email protected]>
*/
+#include <linux/types.h>
+#include <linux/bitfield.h>
#include <linux/hyperv.h>
#include <linux/types.h>
#include <linux/bitfield.h>
#include <linux/slab.h>
+#include <asm/svm.h>
+#include <asm/sev.h>
#include <asm/io.h>
#include <asm/mshyperv.h>
+#include <asm/hypervisor.h>
+
+union hv_ghcb {
+ struct ghcb ghcb;
+} __packed __aligned(HV_HYP_PAGE_SIZE);
+
+void hv_ghcb_msr_write(u64 msr, u64 value)
+{
+ union hv_ghcb *hv_ghcb;
+ void **ghcb_base;
+ unsigned long flags;
+
+ if (!hv_ghcb_pg)
+ return;
+
+ WARN_ON(in_nmi());
+
+ local_irq_save(flags);
+ ghcb_base = (void **)this_cpu_ptr(hv_ghcb_pg);
+ hv_ghcb = (union hv_ghcb *)*ghcb_base;
+ if (!hv_ghcb) {
+ local_irq_restore(flags);
+ return;
+ }
+
+ ghcb_set_rcx(&hv_ghcb->ghcb, msr);
+ ghcb_set_rax(&hv_ghcb->ghcb, lower_32_bits(value));
+ ghcb_set_rdx(&hv_ghcb->ghcb, upper_32_bits(value));
+
+ if (sev_es_ghcb_hv_call_simple(&hv_ghcb->ghcb, SVM_EXIT_MSR, 1, 0))
+ pr_warn("Fail to write msr via ghcb %llx.\n", msr);
+
+ local_irq_restore(flags);
+}
+
+void hv_ghcb_msr_read(u64 msr, u64 *value)
+{
+ union hv_ghcb *hv_ghcb;
+ void **ghcb_base;
+ unsigned long flags;
+
+ /* Check size of union hv_ghcb here. */
+ BUILD_BUG_ON(sizeof(union hv_ghcb) != HV_HYP_PAGE_SIZE);
+
+ if (!hv_ghcb_pg)
+ return;
+
+ WARN_ON(in_nmi());
+
+ local_irq_save(flags);
+ ghcb_base = (void **)this_cpu_ptr(hv_ghcb_pg);
+ hv_ghcb = (union hv_ghcb *)*ghcb_base;
+ if (!hv_ghcb) {
+ local_irq_restore(flags);
+ return;
+ }
+
+ ghcb_set_rcx(&hv_ghcb->ghcb, msr);
+ if (sev_es_ghcb_hv_call_simple(&hv_ghcb->ghcb, SVM_EXIT_MSR, 0, 0))
+ pr_warn("Fail to read msr via ghcb %llx.\n", msr);
+ else
+ *value = (u64)lower_32_bits(hv_ghcb->ghcb.save.rax)
+ | ((u64)lower_32_bits(hv_ghcb->ghcb.save.rdx) << 32);
+ local_irq_restore(flags);
+}
+
+enum hv_isolation_type hv_get_isolation_type(void)
+{
+ if (!(ms_hyperv.priv_high & HV_ISOLATION))
+ return HV_ISOLATION_TYPE_NONE;
+ return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b);
+}
+EXPORT_SYMBOL_GPL(hv_get_isolation_type);
+
+/*
+ * hv_is_isolation_supported - Check whether the system runs in a
+ * Hyper-V isolation VM.
+ */
+bool hv_is_isolation_supported(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
+ return false;
+
+ if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
+ return false;
+
+ return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
+}
+
+DEFINE_STATIC_KEY_FALSE(isolation_type_snp);
+
+/*
+ * hv_isolation_type_snp - Check whether the system runs in an AMD
+ * SEV-SNP based isolation VM.
+ */
+bool hv_isolation_type_snp(void)
+{
+ return static_branch_unlikely(&isolation_type_snp);
+}
/*
* hv_mark_gpa_visibility - Set pages visible to host via hvcall.
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index ede440f9a1e2..165423e8b67a 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -11,25 +11,14 @@
#include <asm/paravirt.h>
#include <asm/mshyperv.h>
+union hv_ghcb;
+
DECLARE_STATIC_KEY_FALSE(isolation_type_snp);
typedef int (*hyperv_fill_flush_list_func)(
struct hv_guest_mapping_flush_list *flush,
void *data);
-static inline void hv_set_register(unsigned int reg, u64 value)
-{
- wrmsrl(reg, value);
-}
-
-static inline u64 hv_get_register(unsigned int reg)
-{
- u64 value;
-
- rdmsrl(reg, value);
- return value;
-}
-
#define hv_get_raw_timer() rdtsc_ordered()
void hyperv_vector_handler(struct pt_regs *regs);
@@ -41,7 +30,7 @@ extern void *hv_hypercall_pg;
extern u64 hv_current_partition_id;
-extern void __percpu **hv_ghcb_pg;
+extern union hv_ghcb __percpu **hv_ghcb_pg;
int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
@@ -193,6 +182,8 @@ int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
struct hv_interrupt_entry *entry);
int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
int hv_set_mem_host_visibility(unsigned long addr, int numpages, bool visible);
+void hv_ghcb_msr_write(u64 msr, u64 value);
+void hv_ghcb_msr_read(u64 msr, u64 *value);
#else /* CONFIG_HYPERV */
static inline void hyperv_init(void) {}
static inline void hyperv_setup_mmu_ops(void) {}
@@ -209,9 +200,46 @@ static inline int hyperv_flush_guest_mapping_range(u64 as,
{
return -1;
}
+
+static inline void hv_ghcb_msr_write(u64 msr, u64 value) {}
+static inline void hv_ghcb_msr_read(u64 msr, u64 *value) {}
#endif /* CONFIG_HYPERV */
+static inline void hv_set_register(unsigned int reg, u64 value);
#include <asm-generic/mshyperv.h>
+static inline bool hv_is_synic_reg(unsigned int reg)
+{
+ if ((reg >= HV_REGISTER_SCONTROL) &&
+ (reg <= HV_REGISTER_SINT15))
+ return true;
+ return false;
+}
+
+static inline u64 hv_get_register(unsigned int reg)
+{
+ u64 value;
+
+ if (hv_is_synic_reg(reg) && hv_isolation_type_snp())
+ hv_ghcb_msr_read(reg, &value);
+ else
+ rdmsrl(reg, value);
+ return value;
+}
+
+static inline void hv_set_register(unsigned int reg, u64 value)
+{
+ if (hv_is_synic_reg(reg) && hv_isolation_type_snp()) {
+ hv_ghcb_msr_write(reg, value);
+
+ /* Write proxy bit via wrmsrl instruction */
+ if (reg >= HV_REGISTER_SINT0 &&
+ reg <= HV_REGISTER_SINT15)
+ wrmsrl(reg, value | 1 << 20);
+ } else {
+ wrmsrl(reg, value);
+ }
+}
+
#endif
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index fa5cd05d3b5b..60bfdbd141b1 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -81,12 +81,18 @@ static __always_inline void sev_es_nmi_complete(void)
__sev_es_nmi_complete();
}
extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
+extern enum es_result sev_es_ghcb_hv_call_simple(struct ghcb *ghcb,
+ u64 exit_code, u64 exit_info_1,
+ u64 exit_info_2);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
static inline void sev_es_nmi_complete(void) { }
static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
+static inline enum es_result sev_es_ghcb_hv_call_simple(struct ghcb *ghcb,
+ u64 exit_code, u64 exit_info_1,
+ u64 exit_info_2) { return ES_VMM_ERROR; }
#endif
#endif
diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
index 9f90f460a28c..dd7f37de640b 100644
--- a/arch/x86/kernel/sev-shared.c
+++ b/arch/x86/kernel/sev-shared.c
@@ -94,10 +94,9 @@ static void vc_finish_insn(struct es_em_ctxt *ctxt)
ctxt->regs->ip += ctxt->insn.length;
}
-static enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
- struct es_em_ctxt *ctxt,
- u64 exit_code, u64 exit_info_1,
- u64 exit_info_2)
+enum es_result sev_es_ghcb_hv_call_simple(struct ghcb *ghcb,
+ u64 exit_code, u64 exit_info_1,
+ u64 exit_info_2)
{
enum es_result ret;
@@ -109,29 +108,45 @@ static enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
ghcb_set_sw_exit_info_1(ghcb, exit_info_1);
ghcb_set_sw_exit_info_2(ghcb, exit_info_2);
- sev_es_wr_ghcb_msr(__pa(ghcb));
VMGEXIT();
- if ((ghcb->save.sw_exit_info_1 & 0xffffffff) == 1) {
- u64 info = ghcb->save.sw_exit_info_2;
- unsigned long v;
-
- info = ghcb->save.sw_exit_info_2;
- v = info & SVM_EVTINJ_VEC_MASK;
-
- /* Check if exception information from hypervisor is sane. */
- if ((info & SVM_EVTINJ_VALID) &&
- ((v == X86_TRAP_GP) || (v == X86_TRAP_UD)) &&
- ((info & SVM_EVTINJ_TYPE_MASK) == SVM_EVTINJ_TYPE_EXEPT)) {
- ctxt->fi.vector = v;
- if (info & SVM_EVTINJ_VALID_ERR)
- ctxt->fi.error_code = info >> 32;
- ret = ES_EXCEPTION;
- } else {
- ret = ES_VMM_ERROR;
- }
- } else {
+ if ((ghcb->save.sw_exit_info_1 & 0xffffffff) == 1)
+ ret = ES_VMM_ERROR;
+ else
ret = ES_OK;
+
+ return ret;
+}
+
+static enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
+ struct es_em_ctxt *ctxt,
+ u64 exit_code, u64 exit_info_1,
+ u64 exit_info_2)
+{
+ unsigned long v;
+ enum es_result ret;
+ u64 info;
+
+ sev_es_wr_ghcb_msr(__pa(ghcb));
+
+ ret = sev_es_ghcb_hv_call_simple(ghcb, exit_code, exit_info_1,
+ exit_info_2);
+ if (ret == ES_OK)
+ return ret;
+
+ info = ghcb->save.sw_exit_info_2;
+ v = info & SVM_EVTINJ_VEC_MASK;
+
+ /* Check if exception information from hypervisor is sane. */
+ if ((info & SVM_EVTINJ_VALID) &&
+ ((v == X86_TRAP_GP) || (v == X86_TRAP_UD)) &&
+ ((info & SVM_EVTINJ_TYPE_MASK) == SVM_EVTINJ_TYPE_EXEPT)) {
+ ctxt->fi.vector = v;
+ if (info & SVM_EVTINJ_VALID_ERR)
+ ctxt->fi.error_code = info >> 32;
+ ret = ES_EXCEPTION;
+ } else {
+ ret = ES_VMM_ERROR;
}
return ret;
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index e83507f49676..dee1a96bc535 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -8,6 +8,7 @@
*/
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/slab.h>
@@ -136,17 +137,24 @@ int hv_synic_alloc(void)
tasklet_init(&hv_cpu->msg_dpc,
vmbus_on_msg_dpc, (unsigned long) hv_cpu);
- hv_cpu->synic_message_page =
- (void *)get_zeroed_page(GFP_ATOMIC);
- if (hv_cpu->synic_message_page == NULL) {
- pr_err("Unable to allocate SYNIC message page\n");
- goto err;
- }
+ /*
+ * SynIC message and event pages are allocated by the paravisor.
+ * Skip allocating these pages here.
+ */
+ if (!hv_isolation_type_snp()) {
+ hv_cpu->synic_message_page =
+ (void *)get_zeroed_page(GFP_ATOMIC);
+ if (hv_cpu->synic_message_page == NULL) {
+ pr_err("Unable to allocate SYNIC message page\n");
+ goto err;
+ }
- hv_cpu->synic_event_page = (void *)get_zeroed_page(GFP_ATOMIC);
- if (hv_cpu->synic_event_page == NULL) {
- pr_err("Unable to allocate SYNIC event page\n");
- goto err;
+ hv_cpu->synic_event_page =
+ (void *)get_zeroed_page(GFP_ATOMIC);
+ if (hv_cpu->synic_event_page == NULL) {
+ pr_err("Unable to allocate SYNIC event page\n");
+ goto err;
+ }
}
hv_cpu->post_msg_page = (void *)get_zeroed_page(GFP_ATOMIC);
@@ -201,16 +209,35 @@ void hv_synic_enable_regs(unsigned int cpu)
/* Setup the Synic's message page */
simp.as_uint64 = hv_get_register(HV_REGISTER_SIMP);
simp.simp_enabled = 1;
- simp.base_simp_gpa = virt_to_phys(hv_cpu->synic_message_page)
- >> HV_HYP_PAGE_SHIFT;
+
+ if (hv_isolation_type_snp()) {
+ hv_cpu->synic_message_page
+ = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
+ HV_HYP_PAGE_SIZE, MEMREMAP_WB);
+ if (!hv_cpu->synic_message_page)
+ pr_err("Failed to map SynIC message page.\n");
+ } else {
+ simp.base_simp_gpa = virt_to_phys(hv_cpu->synic_message_page)
+ >> HV_HYP_PAGE_SHIFT;
+ }
hv_set_register(HV_REGISTER_SIMP, simp.as_uint64);
/* Setup the Synic's event page */
siefp.as_uint64 = hv_get_register(HV_REGISTER_SIEFP);
siefp.siefp_enabled = 1;
- siefp.base_siefp_gpa = virt_to_phys(hv_cpu->synic_event_page)
- >> HV_HYP_PAGE_SHIFT;
+
+ if (hv_isolation_type_snp()) {
+ hv_cpu->synic_event_page =
+ memremap(siefp.base_siefp_gpa << HV_HYP_PAGE_SHIFT,
+ HV_HYP_PAGE_SIZE, MEMREMAP_WB);
+
+ if (!hv_cpu->synic_event_page)
+ pr_err("Failed to map SynIC event page.\n");
+ } else {
+ siefp.base_siefp_gpa = virt_to_phys(hv_cpu->synic_event_page)
+ >> HV_HYP_PAGE_SHIFT;
+ }
hv_set_register(HV_REGISTER_SIEFP, siefp.as_uint64);
@@ -257,30 +284,48 @@ int hv_synic_init(unsigned int cpu)
*/
void hv_synic_disable_regs(unsigned int cpu)
{
+ struct hv_per_cpu_context *hv_cpu
+ = per_cpu_ptr(hv_context.cpu_context, cpu);
union hv_synic_sint shared_sint;
union hv_synic_simp simp;
union hv_synic_siefp siefp;
union hv_synic_scontrol sctrl;
+
shared_sint.as_uint64 = hv_get_register(HV_REGISTER_SINT0 +
VMBUS_MESSAGE_SINT);
shared_sint.masked = 1;
+
+
/* Need to correctly cleanup in the case of SMP!!! */
/* Disable the interrupt */
hv_set_register(HV_REGISTER_SINT0 + VMBUS_MESSAGE_SINT,
shared_sint.as_uint64);
simp.as_uint64 = hv_get_register(HV_REGISTER_SIMP);
+ /*
+ * In Isolation VMs, the SIMP and SIEFP pages are allocated by
+ * the paravisor. These pages will also be used by the kdump
+ * kernel, so just clear the enable bit here and keep the page
+ * addresses.
+ */
simp.simp_enabled = 0;
- simp.base_simp_gpa = 0;
+ if (hv_isolation_type_snp())
+ memunmap(hv_cpu->synic_message_page);
+ else
+ simp.base_simp_gpa = 0;
hv_set_register(HV_REGISTER_SIMP, simp.as_uint64);
siefp.as_uint64 = hv_get_register(HV_REGISTER_SIEFP);
siefp.siefp_enabled = 0;
- siefp.base_siefp_gpa = 0;
+
+ if (hv_isolation_type_snp())
+ memunmap(hv_cpu->synic_event_page);
+ else
+ siefp.base_siefp_gpa = 0;
hv_set_register(HV_REGISTER_SIEFP, siefp.as_uint64);
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index c0d9048a4112..1fc82d237161 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -249,6 +249,12 @@ bool __weak hv_is_isolation_supported(void)
}
EXPORT_SYMBOL_GPL(hv_is_isolation_supported);
+bool __weak hv_isolation_type_snp(void)
+{
+ return false;
+}
+EXPORT_SYMBOL_GPL(hv_isolation_type_snp);
+
void __weak hv_setup_vmbus_handler(void (*handler)(void))
{
}
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index cb529c85c0ad..94750bafd4cc 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -24,6 +24,7 @@
#include <linux/cpumask.h>
#include <linux/nmi.h>
#include <asm/ptrace.h>
+#include <asm/mshyperv.h>
#include <asm/hyperv-tlfs.h>
struct ms_hyperv_info {
@@ -54,6 +55,7 @@ extern void __percpu **hyperv_pcpu_output_arg;
extern u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
extern u64 hv_do_fast_hypercall8(u16 control, u64 input8);
+extern bool hv_isolation_type_snp(void);
/* Helper functions that provide a consistent pattern for checking Hyper-V hypercall status. */
static inline int hv_result(u64 status)
--
2.25.1
From: Tianyu Lan <[email protected]>
Hyper-V Isolation VMs require bounce buffer support to copy data
from/to encrypted memory, so enable swiotlb force mode to use the
swiotlb bounce buffer for DMA transactions.
In Isolation VMs with AMD SEV, the bounce buffer needs to be accessed
via an extra address space above the shared_gpa_boundary (e.g. bit 39
of the address) reported by the Hyper-V ISOLATION_CONFIG CPUID leaf.
The access physical address is the original physical address plus the
shared_gpa_boundary. The AMD SEV-SNP spec calls the shared_gpa_boundary
the virtual top of memory (vTOM). Memory addresses below vTOM are
automatically treated as private while memory above vTOM is treated
as shared.
Hyper-V initializes the swiotlb bounce buffer itself and the default
swiotlb needs to be disabled. pci_swiotlb_detect_override() and
pci_swiotlb_detect_4gb() enable the default one. To override that
setting, hyperv_swiotlb_detect() needs to run before these detect
functions. Make pci_xen_swiotlb_detect() depend on
hyperv_swiotlb_detect() to keep the order.
The swiotlb bounce buffer code calls set_memory_decrypted() to mark
the bounce buffer visible to the host and maps it in the extra address
space via memremap(). Populate the shared_gpa_boundary (vTOM) via the
swiotlb_unencrypted_base variable.
The mapping function memremap() can't run at the early stage when
hyperv_iommu_swiotlb_init() is called, so initialize the swiotlb
bounce buffer in hyperv_iommu_swiotlb_later_init().
Signed-off-by: Tianyu Lan <[email protected]>
---
Change since v4:
* Use the swiotlb_unencrypted_base variable to pass
shared_gpa_boundary and map the bounce buffer inside the swiotlb code.
Change since v3:
* Get the Hyper-V bounce buffer size via the default swiotlb
bounce buffer size function and keep the default size the
same as in an AMD SEV VM.
---
arch/x86/include/asm/mshyperv.h | 2 ++
arch/x86/mm/mem_encrypt.c | 3 +-
arch/x86/xen/pci-swiotlb-xen.c | 3 +-
drivers/hv/vmbus_drv.c | 3 ++
drivers/iommu/hyperv-iommu.c | 60 +++++++++++++++++++++++++++++++++
include/linux/hyperv.h | 1 +
6 files changed, 70 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 165423e8b67a..2d22f29f90c9 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -182,6 +182,8 @@ int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
struct hv_interrupt_entry *entry);
int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
int hv_set_mem_host_visibility(unsigned long addr, int numpages, bool visible);
+void *hv_map_memory(void *addr, unsigned long size);
+void hv_unmap_memory(void *addr);
void hv_ghcb_msr_write(u64 msr, u64 value);
void hv_ghcb_msr_read(u64 msr, u64 *value);
#else /* CONFIG_HYPERV */
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index ff08dc463634..e2db0b8ed938 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -30,6 +30,7 @@
#include <asm/processor-flags.h>
#include <asm/msr.h>
#include <asm/cmdline.h>
+#include <asm/mshyperv.h>
#include "mm_internal.h"
@@ -202,7 +203,7 @@ void __init sev_setup_arch(void)
phys_addr_t total_mem = memblock_phys_mem_size();
unsigned long size;
- if (!sev_active())
+ if (!sev_active() && !hv_is_isolation_supported())
return;
/*
diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
index 54f9aa7e8457..43bd031aa332 100644
--- a/arch/x86/xen/pci-swiotlb-xen.c
+++ b/arch/x86/xen/pci-swiotlb-xen.c
@@ -4,6 +4,7 @@
#include <linux/dma-map-ops.h>
#include <linux/pci.h>
+#include <linux/hyperv.h>
#include <xen/swiotlb-xen.h>
#include <asm/xen/hypervisor.h>
@@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
- NULL,
+ hyperv_swiotlb_detect,
pci_xen_swiotlb_init,
NULL);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 392c1ac4f819..b0be287e9a32 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -23,6 +23,7 @@
#include <linux/cpu.h>
#include <linux/sched/task_stack.h>
+#include <linux/dma-map-ops.h>
#include <linux/delay.h>
#include <linux/notifier.h>
#include <linux/panic_notifier.h>
@@ -2078,6 +2079,7 @@ struct hv_device *vmbus_device_create(const guid_t *type,
return child_device_obj;
}
+static u64 vmbus_dma_mask = DMA_BIT_MASK(64);
/*
* vmbus_device_register - Register the child device
*/
@@ -2118,6 +2120,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
}
hv_debug_add_dev_dir(child_device_obj);
+ child_device_obj->device.dma_mask = &vmbus_dma_mask;
return 0;
err_kset_unregister:
diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index e285a220c913..a8ac2239de0f 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -13,14 +13,22 @@
#include <linux/irq.h>
#include <linux/iommu.h>
#include <linux/module.h>
+#include <linux/hyperv.h>
+#include <linux/io.h>
#include <asm/apic.h>
#include <asm/cpu.h>
#include <asm/hw_irq.h>
#include <asm/io_apic.h>
+#include <asm/iommu.h>
+#include <asm/iommu_table.h>
#include <asm/irq_remapping.h>
#include <asm/hypervisor.h>
#include <asm/mshyperv.h>
+#include <asm/swiotlb.h>
+#include <linux/dma-map-ops.h>
+#include <linux/dma-direct.h>
+#include <linux/set_memory.h>
#include "irq_remapping.h"
@@ -36,6 +44,9 @@
static cpumask_t ioapic_max_cpumask = { CPU_BITS_NONE };
static struct irq_domain *ioapic_ir_domain;
+static unsigned long hyperv_io_tlb_size;
+static void *hyperv_io_tlb_start;
+
static int hyperv_ir_set_affinity(struct irq_data *data,
const struct cpumask *mask, bool force)
{
@@ -337,4 +348,53 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
.free = hyperv_root_irq_remapping_free,
};
+static void __init hyperv_iommu_swiotlb_init(void)
+{
+ /*
+ * Allocate the Hyper-V swiotlb bounce buffer early in boot
+ * to reserve large contiguous memory.
+ */
+ hyperv_io_tlb_size = swiotlb_size_or_default();
+ hyperv_io_tlb_start = memblock_alloc(
+ hyperv_io_tlb_size, PAGE_SIZE);
+
+ if (!hyperv_io_tlb_start) {
+ pr_warn("Failed to allocate Hyper-V swiotlb buffer.\n");
+ return;
+ }
+}
+
+int __init hyperv_swiotlb_detect(void)
+{
+ if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
+ return 0;
+
+ if (!hv_is_isolation_supported())
+ return 0;
+
+ /*
+ * Enable swiotlb force mode in Isolation VMs to use the
+ * swiotlb bounce buffer for DMA transactions.
+ */
+ swiotlb_unencrypted_base = ms_hyperv.shared_gpa_boundary;
+ swiotlb_force = SWIOTLB_FORCE;
+ return 1;
+}
+
+static void __init hyperv_iommu_swiotlb_later_init(void)
+{
+ /*
+ * The swiotlb bounce buffer needs to be mapped in the extra
+ * address space. The mapping function doesn't work at the
+ * early stage, so call swiotlb_late_init_with_tbl() here.
+ */
+ if (swiotlb_late_init_with_tbl(hyperv_io_tlb_start,
+ hyperv_io_tlb_size >> IO_TLB_SHIFT))
+ panic("Failed to initialize Hyper-V swiotlb.\n");
+}
+
+IOMMU_INIT_FINISH(hyperv_swiotlb_detect,
+ NULL, hyperv_iommu_swiotlb_init,
+ hyperv_iommu_swiotlb_later_init);
+
#endif
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index a9e0bc3b1511..bb1a1519b93a 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1739,6 +1739,7 @@ int hyperv_write_cfg_blk(struct pci_dev *dev, void *buf, unsigned int len,
int hyperv_reg_block_invalidate(struct pci_dev *dev, void *context,
void (*block_invalidate)(void *context,
u64 block_mask));
+int __init hyperv_swiotlb_detect(void);
struct hyperv_pci_block_ops {
int (*read_block)(struct pci_dev *dev, void *buf, unsigned int buf_len,
--
2.25.1
From: Tianyu Lan <[email protected]>
In Isolation VMs, all memory shared with the host needs to be marked
visible to the host via hvcall. vmbus_establish_gpadl() has already
done that for the storvsc rx/tx ring buffers. The page buffers used by
vmbus_sendpacket_mpb_desc() still need to be handled. Use the DMA API
(scsi_dma_map/unmap) to map this memory when sending/receiving packets
and return the swiotlb bounce buffer DMA address. In Isolation VMs,
the swiotlb bounce buffer is marked visible to the host and swiotlb
force mode is enabled.
Set the device's DMA min align mask to HV_HYP_PAGE_SIZE - 1 in order
to keep the original data offset in the bounce buffer.
Signed-off-by: Tianyu Lan <[email protected]>
---
Change since v4:
* use scsi_dma_map/unmap() instead of dma_map/unmap_sg()
* Add deleted comments back.
* Fix the erroneous calculation of hvpfns_to_add
Change since v3:
* Replace dma_map_page() with dma_map_sg()
* Use for_each_sg() to populate payload->range.pfn_array.
* Remove storvsc_dma_map macro
---
drivers/hv/vmbus_drv.c | 1 +
drivers/scsi/storvsc_drv.c | 24 +++++++++++++++---------
include/linux/hyperv.h | 1 +
3 files changed, 17 insertions(+), 9 deletions(-)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index b0be287e9a32..9c53f823cde1 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2121,6 +2121,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
hv_debug_add_dev_dir(child_device_obj);
child_device_obj->device.dma_mask = &vmbus_dma_mask;
+ child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
return 0;
err_kset_unregister:
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index ebbbc1299c62..d10b450bcf0c 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -21,6 +21,8 @@
#include <linux/device.h>
#include <linux/hyperv.h>
#include <linux/blkdev.h>
+#include <linux/dma-mapping.h>
+
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>
@@ -1322,6 +1324,7 @@ static void storvsc_on_channel_callback(void *context)
continue;
}
request = (struct storvsc_cmd_request *)scsi_cmd_priv(scmnd);
+ scsi_dma_unmap(scmnd);
}
storvsc_on_receive(stor_device, packet, request);
@@ -1735,7 +1738,6 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
struct hv_host_device *host_dev = shost_priv(host);
struct hv_device *dev = host_dev->dev;
struct storvsc_cmd_request *cmd_request = scsi_cmd_priv(scmnd);
- int i;
struct scatterlist *sgl;
unsigned int sg_count;
struct vmscsi_request *vm_srb;
@@ -1817,10 +1819,11 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
payload_sz = sizeof(cmd_request->mpb);
if (sg_count) {
- unsigned int hvpgoff, hvpfns_to_add;
unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
- u64 hvpfn;
+ struct scatterlist *sg;
+ unsigned long hvpfn, hvpfns_to_add;
+ int j, i = 0;
if (hvpg_count > MAX_PAGE_BUFFER_COUNT) {
@@ -1834,8 +1837,11 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
payload->range.len = length;
payload->range.offset = offset_in_hvpg;
+ sg_count = scsi_dma_map(scmnd);
+ if (sg_count < 0)
+ return SCSI_MLQUEUE_DEVICE_BUSY;
- for (i = 0; sgl != NULL; sgl = sg_next(sgl)) {
+ for_each_sg(sgl, sg, sg_count, j) {
/*
* Init values for the current sgl entry. hvpgoff
* and hvpfns_to_add are in units of Hyper-V size
@@ -1845,10 +1851,9 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
* even on other than the first sgl entry, provided
* they are a multiple of PAGE_SIZE.
*/
- hvpgoff = HVPFN_DOWN(sgl->offset);
- hvpfn = page_to_hvpfn(sg_page(sgl)) + hvpgoff;
- hvpfns_to_add = HVPFN_UP(sgl->offset + sgl->length) -
- hvpgoff;
+ hvpfn = HVPFN_DOWN(sg_dma_address(sg));
+ hvpfns_to_add = HVPFN_UP(sg_dma_address(sg) +
+ sg_dma_len(sg)) - hvpfn;
/*
* Fill the next portion of the PFN array with
@@ -1858,7 +1863,7 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
* the PFN array is filled.
*/
while (hvpfns_to_add--)
- payload->range.pfn_array[i++] = hvpfn++;
+ payload->range.pfn_array[i++] = hvpfn++;
}
}
@@ -2002,6 +2007,7 @@ static int storvsc_probe(struct hv_device *device,
stor_device->vmscsi_size_delta = sizeof(struct vmscsi_win8_extension);
spin_lock_init(&stor_device->lock);
hv_set_drvdata(device, stor_device);
+ dma_set_min_align_mask(&device->device, HV_HYP_PAGE_SIZE - 1);
stor_device->port_number = host->host_no;
ret = storvsc_connect_to_vsp(device, storvsc_ringbuffer_size, is_fc);
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index bb1a1519b93a..c94c534a944e 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1274,6 +1274,7 @@ struct hv_device {
struct vmbus_channel *channel;
struct kset *channels_kset;
+ struct device_dma_parameters dma_parms;
/* place holder to keep track of the dir for hv device in debugfs */
struct dentry *debug_dir;
--
2.25.1
From: Tianyu Lan <[email protected]>
In Isolation VMs, all memory shared with the host needs to be marked
visible to the host via hvcall. vmbus_establish_gpadl() has already
done that for the netvsc rx/tx ring buffers. The page buffers used by
vmbus_sendpacket_pagebuffer() still need to be handled. Use the DMA
API to map/unmap this memory when sending/receiving packets; the
Hyper-V swiotlb bounce buffer DMA address will be returned. The
swiotlb bounce buffer has been marked visible to the host during boot.
Allocate the rx/tx ring buffers via alloc_pages() in Isolation VMs and
map these pages via vmap(). After calling vmbus_establish_gpadl(),
which marks these pages visible to the host, unmap the pages to
release the virtual addresses mapped to physical addresses below
shared_gpa_boundary, and remap them in the extra address space via
vmap_pfn().
Signed-off-by: Tianyu Lan <[email protected]>
---
Change since v4:
* Allocate rx/tx ring buffer via alloc_pages() in Isolation VM
* Map pages after calling vmbus_establish_gpadl().
* Set the DMA min align mask for the netvsc driver.
Change since v3:
* Add a comment to explain why dma_map_sg() is not used
* Fix some error handling.
---
drivers/net/hyperv/hyperv_net.h | 7 +
drivers/net/hyperv/netvsc.c | 287 +++++++++++++++++++++++++++++-
drivers/net/hyperv/netvsc_drv.c | 1 +
drivers/net/hyperv/rndis_filter.c | 2 +
include/linux/hyperv.h | 5 +
5 files changed, 296 insertions(+), 6 deletions(-)
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 315278a7cf88..87e8c74398a5 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -164,6 +164,7 @@ struct hv_netvsc_packet {
u32 total_bytes;
u32 send_buf_index;
u32 total_data_buflen;
+ struct hv_dma_range *dma_range;
};
#define NETVSC_HASH_KEYLEN 40
@@ -1074,6 +1075,8 @@ struct netvsc_device {
/* Receive buffer allocated by us but manages by NetVSP */
void *recv_buf;
+ struct page **recv_pages;
+ u32 recv_page_count;
u32 recv_buf_size; /* allocated bytes */
struct vmbus_gpadl recv_buf_gpadl_handle;
u32 recv_section_cnt;
@@ -1082,6 +1085,8 @@ struct netvsc_device {
/* Send buffer allocated by us */
void *send_buf;
+ struct page **send_pages;
+ u32 send_page_count;
u32 send_buf_size;
struct vmbus_gpadl send_buf_gpadl_handle;
u32 send_section_cnt;
@@ -1731,4 +1736,6 @@ struct rndis_message {
#define RETRY_US_HI 10000
#define RETRY_MAX 2000 /* >10 sec */
+void netvsc_dma_unmap(struct hv_device *hv_dev,
+ struct hv_netvsc_packet *packet);
#endif /* _HYPERV_NET_H */
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 1f87e570ed2b..7d5254bf043e 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -20,6 +20,7 @@
#include <linux/vmalloc.h>
#include <linux/rtnetlink.h>
#include <linux/prefetch.h>
+#include <linux/gfp.h>
#include <asm/sync_bitops.h>
#include <asm/mshyperv.h>
@@ -150,11 +151,33 @@ static void free_netvsc_device(struct rcu_head *head)
{
struct netvsc_device *nvdev
= container_of(head, struct netvsc_device, rcu);
+ unsigned int alloc_unit;
int i;
kfree(nvdev->extension);
- vfree(nvdev->recv_buf);
- vfree(nvdev->send_buf);
+
+ if (nvdev->recv_pages) {
+ alloc_unit = (nvdev->recv_buf_size /
+ nvdev->recv_page_count) >> PAGE_SHIFT;
+
+ vunmap(nvdev->recv_buf);
+ for (i = 0; i < nvdev->recv_page_count; i++)
+ __free_pages(nvdev->recv_pages[i], get_order(alloc_unit << PAGE_SHIFT));
+ } else {
+ vfree(nvdev->recv_buf);
+ }
+
+ if (nvdev->send_pages) {
+ alloc_unit = (nvdev->send_buf_size /
+ nvdev->send_page_count) >> PAGE_SHIFT;
+
+ vunmap(nvdev->send_buf);
+ for (i = 0; i < nvdev->send_page_count; i++)
+ __free_pages(nvdev->send_pages[i], get_order(alloc_unit << PAGE_SHIFT));
+ } else {
+ vfree(nvdev->send_buf);
+ }
+
kfree(nvdev->send_section_map);
for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
@@ -330,6 +353,108 @@ int netvsc_alloc_recv_comp_ring(struct netvsc_device *net_device, u32 q_idx)
return nvchan->mrc.slots ? 0 : -ENOMEM;
}
+void *netvsc_alloc_pages(struct page ***pages_array, unsigned int *array_len,
+ unsigned long size)
+{
+ struct page *page, **pages, **vmap_pages;
+ unsigned long pg_count = size >> PAGE_SHIFT;
+ int alloc_unit = MAX_ORDER_NR_PAGES;
+ int i, j, vmap_page_index = 0;
+ void *vaddr;
+
+ if (pg_count < alloc_unit)
+ alloc_unit = 1;
+
+ /* vmap() takes an array of order-0 page pointers, while high-order
+ * pages are allocated here to keep the page array small.
+ * vmap_pages[] is the input parameter of vmap(); pages[] stores the
+ * allocated (possibly compound) pages so they can be freed later.
+ */
+ vmap_pages = kmalloc_array(pg_count, sizeof(*vmap_pages), GFP_KERNEL);
+ if (!vmap_pages)
+ return NULL;
+
+retry:
+ *array_len = pg_count / alloc_unit;
+ pages = kmalloc_array(*array_len, sizeof(*pages), GFP_KERNEL);
+ if (!pages)
+ goto cleanup;
+
+ for (i = 0; i < *array_len; i++) {
+ page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
+ get_order(alloc_unit << PAGE_SHIFT));
+ if (!page) {
+ /* Try allocating small pages if high order pages are not available. */
+ if (alloc_unit == 1) {
+ goto cleanup;
+ } else {
+ memset(vmap_pages, 0,
+ sizeof(*vmap_pages) * vmap_page_index);
+ vmap_page_index = 0;
+
+ for (j = 0; j < i; j++)
+ __free_pages(pages[j], get_order(alloc_unit << PAGE_SHIFT));
+
+ kfree(pages);
+ alloc_unit = 1;
+ goto retry;
+ }
+ }
+
+ pages[i] = page;
+ for (j = 0; j < alloc_unit; j++)
+ vmap_pages[vmap_page_index++] = page++;
+ }
+
+ vaddr = vmap(vmap_pages, vmap_page_index, VM_MAP, PAGE_KERNEL);
+ kfree(vmap_pages);
+
+ *pages_array = pages;
+ return vaddr;
+
+cleanup:
+ for (j = 0; j < i; j++)
+ __free_pages(pages[i], alloc_unit);
+
+ kfree(pages);
+ kfree(vmap_pages);
+ return NULL;
+}
+
+static void *netvsc_map_pages(struct page **pages, int count, int alloc_unit)
+{
+ int pg_count = count * alloc_unit;
+ struct page *page;
+ unsigned long *pfns;
+ int pfn_index = 0;
+ void *vaddr;
+ int i, j;
+
+ if (!pages)
+ return NULL;
+
+ pfns = kcalloc(pg_count, sizeof(*pfns), GFP_KERNEL);
+ if (!pfns)
+ return NULL;
+
+ for (i = 0; i < count; i++) {
+ page = pages[i];
+ if (!page) {
+ pr_warn("page %d is not available.\n", i);
+ kfree(pfns);
+ return NULL;
+ }
+
+ for (j = 0; j < alloc_unit; j++) {
+ pfns[pfn_index++] = page_to_pfn(page++) +
+ (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
+ }
+ }
+
+ vaddr = vmap_pfn(pfns, pg_count, PAGE_KERNEL_IO);
+ kfree(pfns);
+ return vaddr;
+}
+
static int netvsc_init_buf(struct hv_device *device,
struct netvsc_device *net_device,
const struct netvsc_device_info *device_info)
@@ -337,7 +462,7 @@ static int netvsc_init_buf(struct hv_device *device,
struct nvsp_1_message_send_receive_buffer_complete *resp;
struct net_device *ndev = hv_get_drvdata(device);
struct nvsp_message *init_packet;
- unsigned int buf_size;
+ unsigned int buf_size, alloc_unit;
size_t map_words;
int i, ret = 0;
@@ -350,7 +475,14 @@ static int netvsc_init_buf(struct hv_device *device,
buf_size = min_t(unsigned int, buf_size,
NETVSC_RECEIVE_BUFFER_SIZE_LEGACY);
- net_device->recv_buf = vzalloc(buf_size);
+ if (hv_isolation_type_snp())
+ net_device->recv_buf =
+ netvsc_alloc_pages(&net_device->recv_pages,
+ &net_device->recv_page_count,
+ buf_size);
+ else
+ net_device->recv_buf = vzalloc(buf_size);
+
if (!net_device->recv_buf) {
netdev_err(ndev,
"unable to allocate receive buffer of size %u\n",
@@ -375,6 +507,27 @@ static int netvsc_init_buf(struct hv_device *device,
goto cleanup;
}
+ if (hv_isolation_type_snp()) {
+ alloc_unit = (buf_size / net_device->recv_page_count)
+ >> PAGE_SHIFT;
+
+ /* Unmap the previous virtual address and map the pages in the extra
+ * address space (above the shared GPA boundary) in Isolation VMs.
+ */
+ vunmap(net_device->recv_buf);
+ net_device->recv_buf =
+ netvsc_map_pages(net_device->recv_pages,
+ net_device->recv_page_count,
+ alloc_unit);
+ if (!net_device->recv_buf) {
+ netdev_err(ndev,
+ "unable to map receive buffer of size %u\n",
+ buf_size);
+ ret = -ENOMEM;
+ goto cleanup;
+ }
+ }
+
/* Notify the NetVsp of the gpadl handle */
init_packet = &net_device->channel_init_pkt;
memset(init_packet, 0, sizeof(struct nvsp_message));
@@ -456,13 +609,21 @@ static int netvsc_init_buf(struct hv_device *device,
buf_size = device_info->send_sections * device_info->send_section_size;
buf_size = round_up(buf_size, PAGE_SIZE);
- net_device->send_buf = vzalloc(buf_size);
+ if (hv_isolation_type_snp())
+ net_device->send_buf =
+ netvsc_alloc_pages(&net_device->send_pages,
+ &net_device->send_page_count,
+ buf_size);
+ else
+ net_device->send_buf = vzalloc(buf_size);
+
if (!net_device->send_buf) {
netdev_err(ndev, "unable to allocate send buffer of size %u\n",
buf_size);
ret = -ENOMEM;
goto cleanup;
}
+
net_device->send_buf_size = buf_size;
/* Establish the gpadl handle for this buffer on this
@@ -478,6 +639,27 @@ static int netvsc_init_buf(struct hv_device *device,
goto cleanup;
}
+ if (hv_isolation_type_snp()) {
+ alloc_unit = (buf_size / net_device->send_page_count)
+ >> PAGE_SHIFT;
+
+ /* Unmap the previous virtual address and map the pages in the extra
+ * address space (above the shared GPA boundary) in Isolation VMs.
+ */
+ vunmap(net_device->send_buf);
+ net_device->send_buf =
+ netvsc_map_pages(net_device->send_pages,
+ net_device->send_page_count,
+ alloc_unit);
+ if (!net_device->send_buf) {
+ netdev_err(ndev,
+ "unable to map send buffer of size %u\n",
+ buf_size);
+ ret = -ENOMEM;
+ goto cleanup;
+ }
+ }
+
/* Notify the NetVsp of the gpadl handle */
init_packet = &net_device->channel_init_pkt;
memset(init_packet, 0, sizeof(struct nvsp_message));
@@ -768,7 +950,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
/* Notify the layer above us */
if (likely(skb)) {
- const struct hv_netvsc_packet *packet
+ struct hv_netvsc_packet *packet
= (struct hv_netvsc_packet *)skb->cb;
u32 send_index = packet->send_buf_index;
struct netvsc_stats *tx_stats;
@@ -784,6 +966,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
tx_stats->bytes += packet->total_bytes;
u64_stats_update_end(&tx_stats->syncp);
+ netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
napi_consume_skb(skb, budget);
}
@@ -948,6 +1131,87 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
memset(dest, 0, padding);
}
+void netvsc_dma_unmap(struct hv_device *hv_dev,
+ struct hv_netvsc_packet *packet)
+{
+ u32 page_count = packet->cp_partial ?
+ packet->page_buf_cnt - packet->rmsg_pgcnt :
+ packet->page_buf_cnt;
+ int i;
+
+ if (!hv_is_isolation_supported())
+ return;
+
+ if (!packet->dma_range)
+ return;
+
+ for (i = 0; i < page_count; i++)
+ dma_unmap_single(&hv_dev->device, packet->dma_range[i].dma,
+ packet->dma_range[i].mapping_size,
+ DMA_TO_DEVICE);
+
+ kfree(packet->dma_range);
+}
+
+/* netvsc_dma_map - Map swiotlb bounce buffer with data page of
+ * packet sent by vmbus_sendpacket_pagebuffer() in the Isolation
+ * VM.
+ *
+ * In an Isolation VM, the netvsc send buffer has been marked visible
+ * to the host, so data copied into the send buffer doesn't need a
+ * bounce buffer. The data pages handled by vmbus_sendpacket_pagebuffer()
+ * may not be copied to the send buffer, so those pages need to be
+ * mapped with the swiotlb bounce buffer; netvsc_dma_map() does that.
+ * The pfns in the struct hv_page_buffer need to be converted to the
+ * bounce buffer's pfns. The loop here is necessary because the
+ * entries in the page buffer array are not necessarily full
+ * pages of data. Each entry in the array has a separate offset and
+ * len that may be non-zero, even for entries in the middle of the
+ * array. And the entries are not physically contiguous. So each
+ * entry must be individually mapped rather than as a contiguous unit,
+ * which is why dma_map_sg() is not used here.
+ */
+static int netvsc_dma_map(struct hv_device *hv_dev,
+ struct hv_netvsc_packet *packet,
+ struct hv_page_buffer *pb)
+{
+ u32 page_count = packet->cp_partial ?
+ packet->page_buf_cnt - packet->rmsg_pgcnt :
+ packet->page_buf_cnt;
+ dma_addr_t dma;
+ int i;
+
+ if (!hv_is_isolation_supported())
+ return 0;
+
+ packet->dma_range = kcalloc(page_count,
+ sizeof(*packet->dma_range),
+ GFP_KERNEL);
+ if (!packet->dma_range)
+ return -ENOMEM;
+
+ for (i = 0; i < page_count; i++) {
+ char *src = phys_to_virt((pb[i].pfn << HV_HYP_PAGE_SHIFT)
+ + pb[i].offset);
+ u32 len = pb[i].len;
+
+ dma = dma_map_single(&hv_dev->device, src, len,
+ DMA_TO_DEVICE);
+ if (dma_mapping_error(&hv_dev->device, dma)) {
+ kfree(packet->dma_range);
+ return -ENOMEM;
+ }
+
+ packet->dma_range[i].dma = dma;
+ packet->dma_range[i].mapping_size = len;
+ pb[i].pfn = dma >> HV_HYP_PAGE_SHIFT;
+ pb[i].offset = offset_in_hvpage(dma);
+ pb[i].len = len;
+ }
+
+ return 0;
+}
+
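The per-entry rewrite in netvsc_dma_map() can be sketched on its own. This is a simplified stand-in (the struct and macro names below are stand-ins for struct hv_page_buffer, struct hv_dma_range, and the Hyper-V page constants, not the kernel definitions): after dma_map_single() returns a bounce-buffer address, the page-buffer entry's pfn and offset are rewritten to point at the bounce buffer.

```c
#include <assert.h>
#include <stdint.h>

#define HV_HYP_PAGE_SHIFT 12               /* Hyper-V pages are always 4 KiB */
#define HV_HYP_PAGE_SIZE  (1UL << HV_HYP_PAGE_SHIFT)

/* Simplified stand-ins for struct hv_page_buffer and struct hv_dma_range. */
struct pb_entry  { uint64_t pfn; uint32_t offset; uint32_t len; };
struct dma_range { uint64_t dma; uint32_t mapping_size; };

/*
 * Rewrite one page-buffer entry so it refers to the swiotlb bounce
 * buffer instead of the original (encrypted) guest page, mirroring the
 * per-entry loop in netvsc_dma_map(). 'dma' is the address returned by
 * dma_map_single() for this entry.
 */
static void rewrite_pb_entry(struct pb_entry *pb, struct dma_range *range,
			     uint64_t dma, uint32_t len)
{
	range->dma = dma;                          /* saved for later unmap  */
	range->mapping_size = len;

	pb->pfn = dma >> HV_HYP_PAGE_SHIFT;        /* bounce buffer pfn      */
	pb->offset = dma & (HV_HYP_PAGE_SIZE - 1); /* offset_in_hvpage(dma)  */
	pb->len = len;
}
```

The saved dma/mapping_size pair is what netvsc_dma_unmap() later feeds back to dma_unmap_single().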
static inline int netvsc_send_pkt(
struct hv_device *device,
struct hv_netvsc_packet *packet,
@@ -988,14 +1252,24 @@ static inline int netvsc_send_pkt(
trace_nvsp_send_pkt(ndev, out_channel, rpkt);
+ packet->dma_range = NULL;
if (packet->page_buf_cnt) {
if (packet->cp_partial)
pb += packet->rmsg_pgcnt;
+ ret = netvsc_dma_map(ndev_ctx->device_ctx, packet, pb);
+ if (ret) {
+ ret = -EAGAIN;
+ goto exit;
+ }
+
ret = vmbus_sendpacket_pagebuffer(out_channel,
pb, packet->page_buf_cnt,
&nvmsg, sizeof(nvmsg),
req_id);
+
+ if (ret)
+ netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
} else {
ret = vmbus_sendpacket(out_channel,
&nvmsg, sizeof(nvmsg),
@@ -1003,6 +1277,7 @@ static inline int netvsc_send_pkt(
VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
}
+exit:
if (ret == 0) {
atomic_inc_return(&nvchan->queue_sends);
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 382bebc2420d..c3dc884b31e3 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2577,6 +2577,7 @@ static int netvsc_probe(struct hv_device *dev,
list_add(&net_device_ctx->list, &netvsc_dev_list);
rtnl_unlock();
+ dma_set_min_align_mask(&dev->device, HV_HYP_PAGE_SIZE - 1);
netvsc_devinfo_put(device_info);
return 0;
diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index f6c9c2a670f9..448fcc325ed7 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -361,6 +361,8 @@ static void rndis_filter_receive_response(struct net_device *ndev,
}
}
+ netvsc_dma_unmap(((struct net_device_context *)
+ netdev_priv(ndev))->device_ctx, &request->pkt);
complete(&request->wait_event);
} else {
netdev_err(ndev,
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index c94c534a944e..81e58dd582dc 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1597,6 +1597,11 @@ struct hyperv_service_callback {
void (*callback)(void *context);
};
+struct hv_dma_range {
+ dma_addr_t dma;
+ u32 mapping_size;
+};
+
#define MAX_SRV_VER 0x7ffffff
extern bool vmbus_prep_negotiate_resp(struct icmsg_hdr *icmsghdrp, u8 *buf, u32 buflen,
const int *fw_version, int fw_vercnt,
--
2.25.1
> -----Original Message-----
> From: Tianyu Lan <[email protected]>
> Sent: Tuesday, September 14, 2021 9:39 AM
> Subject: [PATCH V5 12/12] net: netvsc: Add Isolation VM support for
> netvsc driver
>
> From: Tianyu Lan <[email protected]>
>
> In an Isolation VM, all memory shared with the host must be marked
> visible to the host via hvcall. vmbus_establish_gpadl() has already
> done that for the netvsc rx/tx ring buffers. The page buffers used by
> vmbus_sendpacket_pagebuffer() still need to be handled. Use the DMA
> API to map/unmap this memory when sending/receiving packets, which
> returns the Hyper-V swiotlb bounce buffer dma address. The swiotlb
> bounce buffer was marked visible to the host during boot.
>
> Allocate the rx/tx ring buffers via alloc_pages() in an Isolation VM
> and map those pages via vmap(). After calling vmbus_establish_gpadl(),
> which marks these pages visible to the host, unmap the pages to
> release the virtual addresses mapped to physical addresses below
> shared_gpa_boundary, and remap them into the extra address space via
> vmap_pfn().
>
> Signed-off-by: Tianyu Lan <[email protected]>
> ---
> Change since v4:
> * Allocate rx/tx ring buffer via alloc_pages() in Isolation VM
> * Map pages after calling vmbus_establish_gpadl().
> * set dma_set_min_align_mask for netvsc driver.
>
> Change since v3:
> * Add comment to explain why not to use dma_map_sg()
> * Fix some error handle.
> ---
Reviewed-by: Haiyang Zhang <[email protected]>
Thank you!
From: Tianyu Lan <[email protected]> Sent: Tuesday, September 14, 2021 6:39 AM
>
> The monitor pages in the CHANNELMSG_INITIATE_CONTACT msg are shared
> with the host in an Isolation VM, so it's necessary to use a hvcall
> to make them visible to the host. In an Isolation VM with AMD SEV-SNP,
> the access address should be in the extra address space above the
> shared gpa boundary. So remap these pages at the extra address
> (pa + shared_gpa_boundary).
>
> Introduce monitor_pages_original[] in struct vmbus_connection to
> store the monitor page virtual addresses returned by hv_alloc_hyperv_
> zeroed_page(), and free the monitor pages via monitor_pages_original
> in vmbus_disconnect(). monitor_pages[] is used to access the monitor
> pages and is initialized equal to monitor_pages_original. In an
> Isolation VM, monitor_pages[] is overridden with the va of the extra
> address. Introduce monitor_pages_pa[] to store the monitor pages'
> physical addresses and use it to populate the pa in the initiate msg.
>
> Signed-off-by: Tianyu Lan <[email protected]>
> ---
> Change since v4:
> * Introduce monitor_pages_pa[] to store monitor pages' physical
> address and use it to populate pa in the initiate msg.
> * Move the code mapping monitor pages into the extra address
> space into vmbus_connect().
>
> Change since v3:
> * Rename monitor_pages_va with monitor_pages_original
> * free monitor page via monitor_pages_original and
> monitor_pages is used to access monitor page.
>
> Change since v1:
> * Not remap monitor pages in the non-SNP isolation VM.
> ---
> drivers/hv/connection.c | 90 ++++++++++++++++++++++++++++++++++++---
> drivers/hv/hyperv_vmbus.h | 2 +
> 2 files changed, 86 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index 8820ae68f20f..edd8f7dd169f 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -19,6 +19,8 @@
> #include <linux/vmalloc.h>
> #include <linux/hyperv.h>
> #include <linux/export.h>
> +#include <linux/io.h>
> +#include <linux/set_memory.h>
> #include <asm/mshyperv.h>
>
> #include "hyperv_vmbus.h"
> @@ -102,8 +104,9 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version)
> vmbus_connection.msg_conn_id = VMBUS_MESSAGE_CONNECTION_ID;
> }
>
> - msg->monitor_page1 = virt_to_phys(vmbus_connection.monitor_pages[0]);
> - msg->monitor_page2 = virt_to_phys(vmbus_connection.monitor_pages[1]);
> + msg->monitor_page1 = vmbus_connection.monitor_pages_pa[0];
> + msg->monitor_page2 = vmbus_connection.monitor_pages_pa[1];
> +
> msg->target_vcpu = hv_cpu_number_to_vp_number(VMBUS_CONNECT_CPU);
>
> /*
> @@ -216,6 +219,65 @@ int vmbus_connect(void)
> goto cleanup;
> }
>
> + vmbus_connection.monitor_pages_original[0]
> + = vmbus_connection.monitor_pages[0];
> + vmbus_connection.monitor_pages_original[1]
> + = vmbus_connection.monitor_pages[1];
> + vmbus_connection.monitor_pages_pa[0]
> + = virt_to_phys(vmbus_connection.monitor_pages[0]);
> + vmbus_connection.monitor_pages_pa[1]
> + = virt_to_phys(vmbus_connection.monitor_pages[1]);
> +
> + if (hv_is_isolation_supported()) {
> + vmbus_connection.monitor_pages_pa[0] +=
> + ms_hyperv.shared_gpa_boundary;
> + vmbus_connection.monitor_pages_pa[1] +=
> + ms_hyperv.shared_gpa_boundary;
> +
> + ret = set_memory_decrypted((unsigned long)
> + vmbus_connection.monitor_pages[0],
> + 1);
> + ret |= set_memory_decrypted((unsigned long)
> + vmbus_connection.monitor_pages[1],
> + 1);
> + if (ret)
> + goto cleanup;
> +
> + /*
> + * Isolation VM with AMD SNP needs to access monitor page via
> + * address space above shared gpa boundary.
> + */
> + if (hv_isolation_type_snp()) {
> + vmbus_connection.monitor_pages[0]
> + = memremap(vmbus_connection.monitor_pages_pa[0],
> + HV_HYP_PAGE_SIZE,
> + MEMREMAP_WB);
> + if (!vmbus_connection.monitor_pages[0]) {
> + ret = -ENOMEM;
> + goto cleanup;
> + }
> +
> + vmbus_connection.monitor_pages[1]
> + = memremap(vmbus_connection.monitor_pages_pa[1],
> + HV_HYP_PAGE_SIZE,
> + MEMREMAP_WB);
> + if (!vmbus_connection.monitor_pages[1]) {
> + ret = -ENOMEM;
> + goto cleanup;
> + }
> + }
> +
> + /*
> + * The "set memory host visibility" hvcall smears the memory,
> + * so zero the monitor pages here.
> + */
> + memset(vmbus_connection.monitor_pages[0], 0x00,
> + HV_HYP_PAGE_SIZE);
> + memset(vmbus_connection.monitor_pages[1], 0x00,
> + HV_HYP_PAGE_SIZE);
> +
> + }
> +
This all looks good. To me, it's a lot clearer to have all the mapping
and encryption/decryption handled in one place.
> msginfo = kzalloc(sizeof(*msginfo) +
> sizeof(struct vmbus_channel_initiate_contact),
> GFP_KERNEL);
> @@ -303,10 +365,26 @@ void vmbus_disconnect(void)
> vmbus_connection.int_page = NULL;
> }
>
> - hv_free_hyperv_page((unsigned long)vmbus_connection.monitor_pages[0]);
> - hv_free_hyperv_page((unsigned long)vmbus_connection.monitor_pages[1]);
> - vmbus_connection.monitor_pages[0] = NULL;
> - vmbus_connection.monitor_pages[1] = NULL;
> + if (hv_is_isolation_supported()) {
> + memunmap(vmbus_connection.monitor_pages[0]);
> + memunmap(vmbus_connection.monitor_pages[1]);
> +
> + set_memory_encrypted((unsigned long)
> + vmbus_connection.monitor_pages_original[0],
> + 1);
> + set_memory_encrypted((unsigned long)
> + vmbus_connection.monitor_pages_original[1],
> + 1);
> + }
> +
> + hv_free_hyperv_page((unsigned long)
> + vmbus_connection.monitor_pages_original[0]);
> + hv_free_hyperv_page((unsigned long)
> + vmbus_connection.monitor_pages_original[1]);
> + vmbus_connection.monitor_pages_original[0] =
> + vmbus_connection.monitor_pages[0] = NULL;
> + vmbus_connection.monitor_pages_original[1] =
> + vmbus_connection.monitor_pages[1] = NULL;
> }
>
> /*
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index 42f3d9d123a1..560cba916d1d 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -240,6 +240,8 @@ struct vmbus_connection {
> * is child->parent notification
> */
> struct hv_monitor_page *monitor_pages[2];
> + void *monitor_pages_original[2];
> + unsigned long monitor_pages_pa[2];
The type of this field really should be phys_addr_t. In addition to
just making semantic sense, then it will match the return type from
virt_to_phys() and the input arg to memremap() since resource_size_t
is typedef'ed as phys_addr_t.
> struct list_head chn_msg_list;
> spinlock_t channelmsg_lock;
>
> --
> 2.25.1
From: Tianyu Lan <[email protected]> Sent: Tuesday, September 14, 2021 6:39 AM
>
> In an Isolation VM with AMD SEV, the bounce buffer must be accessed
> via an extra address space above shared_gpa_boundary (e.g. a 39-bit
> address line) reported by the Hyper-V ISOLATION_CONFIG CPUID leaf.
> The access physical address is the original physical address +
> shared_gpa_boundary. In the AMD SEV-SNP spec, shared_gpa_boundary is
> called the virtual top of memory (vTOM). Memory addresses below
> vTOM are automatically treated as private, while memory above
> vTOM is treated as shared.
>
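The vTOM rule described above is simple arithmetic. This is a minimal standalone sketch under the assumption of a 39-bit boundary (the constant below is illustrative, not a value read from CPUID):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical shared_gpa_boundary for a 39-bit vTOM, as would be
 * reported by the Hyper-V ISOLATION_CONFIG CPUID leaf on such a VM.
 */
#define SHARED_GPA_BOUNDARY (1ULL << 39)

/* Decrypted (shared) alias of a guest physical address. */
static uint64_t shared_alias(uint64_t pa)
{
	return pa + SHARED_GPA_BOUNDARY;
}

/* Addresses below vTOM are private (encrypted); above it, shared. */
static int is_shared(uint64_t pa)
{
	return pa >= SHARED_GPA_BOUNDARY;
}
```

This is the same computation the monitor-page and swiotlb patches in this series apply before calling memremap() on the shared alias.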
> Expose swiotlb_unencrypted_base so platforms can set the unencrypted
> memory base offset. Call memremap() to map the bounce buffer in the
> swiotlb code, store the mapped address, and use that address to copy
> data from/to the swiotlb bounce buffer.
>
> Signed-off-by: Tianyu Lan <[email protected]>
> ---
> Change since v4:
> * Expose swiotlb_unencrypted_base to set unencrypted memory
> offset.
> * Use memremap() to map bounce buffer if swiotlb_unencrypted_
> base is set.
>
> Change since v1:
> * Make swiotlb_init_io_tlb_mem() return error code and return
> error when dma_map_decrypted() fails.
> ---
> include/linux/swiotlb.h | 6 ++++++
> kernel/dma/swiotlb.c | 41 +++++++++++++++++++++++++++++++++++------
> 2 files changed, 41 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index b0cb2a9973f4..4998ed44ae3d 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -72,6 +72,9 @@ extern enum swiotlb_force swiotlb_force;
> * @end: The end address of the swiotlb memory pool. Used to do a quick
> * range check to see if the memory was in fact allocated by this
> * API.
> + * @vaddr: The vaddr of the swiotlb memory pool. The swiotlb memory
> + * pool may be remapped in the memory-encrypted case; this stores the
> + * virtual address to use for bounce buffer operations.
> * @nslabs: The number of IO TLB blocks (in groups of 64) between @start and
> * @end. For default swiotlb, this is command line adjustable via
> * setup_io_tlb_npages.
> @@ -91,6 +94,7 @@ extern enum swiotlb_force swiotlb_force;
> struct io_tlb_mem {
> phys_addr_t start;
> phys_addr_t end;
> + void *vaddr;
> unsigned long nslabs;
> unsigned long used;
> unsigned int index;
> @@ -185,4 +189,6 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
> }
> #endif /* CONFIG_DMA_RESTRICTED_POOL */
>
> +extern phys_addr_t swiotlb_unencrypted_base;
> +
> #endif /* __LINUX_SWIOTLB_H */
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 87c40517e822..9e30cc4bd872 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -50,6 +50,7 @@
> #include <asm/io.h>
> #include <asm/dma.h>
>
> +#include <linux/io.h>
> #include <linux/init.h>
> #include <linux/memblock.h>
> #include <linux/iommu-helper.h>
> @@ -72,6 +73,8 @@ enum swiotlb_force swiotlb_force;
>
> struct io_tlb_mem io_tlb_default_mem;
>
> +phys_addr_t swiotlb_unencrypted_base;
> +
> /*
> * Max segment that we can provide which (if pages are contingous) will
> * not be bounced (unless SWIOTLB_FORCE is set).
> @@ -175,7 +178,7 @@ void __init swiotlb_update_mem_attributes(void)
> memset(vaddr, 0, bytes);
> }
>
> -static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
> +static int swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
> unsigned long nslabs, bool late_alloc)
> {
> void *vaddr = phys_to_virt(start);
> @@ -196,13 +199,34 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
> mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
> mem->slots[i].alloc_size = 0;
> }
> +
> + if (set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT))
> + return -EFAULT;
> +
> + /*
> + * Map memory in the unencrypted physical address space when requested
> + * (e.g. for Hyper-V AMD SEV-SNP Isolation VMs).
> + */
> + if (swiotlb_unencrypted_base) {
> + phys_addr_t paddr = __pa(vaddr) + swiotlb_unencrypted_base;
Nit: Use "start" instead of "__pa(vaddr)" since "start" is already the needed
physical address.
> +
> + vaddr = memremap(paddr, bytes, MEMREMAP_WB);
> + if (!vaddr) {
> + pr_err("Failed to map the unencrypted memory.\n");
> + return -ENOMEM;
> + }
> + }
> +
> memset(vaddr, 0, bytes);
> + mem->vaddr = vaddr;
> + return 0;
> }
>
> int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
> {
> struct io_tlb_mem *mem = &io_tlb_default_mem;
> size_t alloc_size;
> + int ret;
>
> if (swiotlb_force == SWIOTLB_NO_FORCE)
> return 0;
> @@ -217,7 +241,11 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
> panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
> __func__, alloc_size, PAGE_SIZE);
>
> - swiotlb_init_io_tlb_mem(mem, __pa(tlb), nslabs, false);
> + ret = swiotlb_init_io_tlb_mem(mem, __pa(tlb), nslabs, false);
> + if (ret) {
> + memblock_free(__pa(mem), alloc_size);
> + return ret;
> + }
>
> if (verbose)
> swiotlb_print_info();
> @@ -304,7 +332,7 @@ int
> swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
> {
> struct io_tlb_mem *mem = &io_tlb_default_mem;
> - unsigned long bytes = nslabs << IO_TLB_SHIFT;
> + int ret;
>
> if (swiotlb_force == SWIOTLB_NO_FORCE)
> return 0;
> @@ -318,8 +346,9 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
> if (!mem->slots)
> return -ENOMEM;
>
> - set_memory_decrypted((unsigned long)tlb, bytes >> PAGE_SHIFT);
> - swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true);
> + ret = swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true);
> + if (ret)
Before returning the error, free the pages obtained from the earlier call
to __get_free_pages()?
> + return ret;
>
> swiotlb_print_info();
> swiotlb_set_max_segment(mem->nslabs << IO_TLB_SHIFT);
> @@ -371,7 +400,7 @@ static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size
> phys_addr_t orig_addr = mem->slots[index].orig_addr;
> size_t alloc_size = mem->slots[index].alloc_size;
> unsigned long pfn = PFN_DOWN(orig_addr);
> - unsigned char *vaddr = phys_to_virt(tlb_addr);
> + unsigned char *vaddr = mem->vaddr + tlb_addr - mem->start;
> unsigned int tlb_offset, orig_addr_offset;
>
> if (orig_addr == INVALID_PHYS_ADDR)
> --
> 2.25.1
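The swiotlb_bounce() change at the end of this patch (vaddr = mem->vaddr + tlb_addr - mem->start) is worth spelling out: once swiotlb_unencrypted_base is set, mem->vaddr no longer equals phys_to_virt(mem->start), so the CPU address must be derived from the remapped base plus the offset into the pool. A minimal standalone sketch (struct fields are stand-ins for the real struct io_tlb_mem):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for the struct io_tlb_mem fields used here. */
struct io_tlb_mem_sketch {
	uint64_t start;   /* physical base of the swiotlb pool    */
	char    *vaddr;   /* possibly-remapped virtual base       */
};

/*
 * CPU address for a bounce slot at physical address tlb_addr:
 * remapped virtual base plus the offset into the pool, matching
 * "mem->vaddr + tlb_addr - mem->start" in swiotlb_bounce().
 */
static char *bounce_vaddr(struct io_tlb_mem_sketch *mem, uint64_t tlb_addr)
{
	return mem->vaddr + (tlb_addr - mem->start);
}
```

In the non-remapped case mem->vaddr is phys_to_virt(mem->start) and this reduces to the old phys_to_virt(tlb_addr), so the change is safe for existing platforms.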
From: Tianyu Lan <[email protected]> Sent: Tuesday, September 14, 2021 6:39 AM
>
> Mark vmbus ring buffer visible with set_memory_decrypted() when
> establish gpadl handle.
>
> Signed-off-by: Tianyu Lan <[email protected]>
> ---
> Change since v4:
> * Change gpadl handle in netvsc and uio driver from u32 to
> struct vmbus_gpadl.
> * Change vmbus_establish_gpadl()'s gpadl_handle parameter
> to vmbus_gpadl data structure.
>
> Change since v3:
> * Change vmbus_teardown_gpadl() parameter and put gpadl handle,
> buffer and buffer size in the struct vmbus_gpadl.
> ---
> drivers/hv/channel.c | 54 ++++++++++++++++++++++++---------
> drivers/net/hyperv/hyperv_net.h | 5 +--
> drivers/net/hyperv/netvsc.c | 17 ++++++-----
> drivers/uio/uio_hv_generic.c | 20 ++++++------
> include/linux/hyperv.h | 12 ++++++--
> 5 files changed, 71 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
> index f3761c73b074..cf419eb1de77 100644
> --- a/drivers/hv/channel.c
> +++ b/drivers/hv/channel.c
> @@ -17,6 +17,7 @@
> #include <linux/hyperv.h>
> #include <linux/uio.h>
> #include <linux/interrupt.h>
> +#include <linux/set_memory.h>
> #include <asm/page.h>
> #include <asm/mshyperv.h>
>
> @@ -456,7 +457,7 @@ static int create_gpadl_header(enum hv_gpadl_type type, void *kbuffer,
> static int __vmbus_establish_gpadl(struct vmbus_channel *channel,
> enum hv_gpadl_type type, void *kbuffer,
> u32 size, u32 send_offset,
> - u32 *gpadl_handle)
> + struct vmbus_gpadl *gpadl)
> {
> struct vmbus_channel_gpadl_header *gpadlmsg;
> struct vmbus_channel_gpadl_body *gpadl_body;
> @@ -474,6 +475,15 @@ static int __vmbus_establish_gpadl(struct vmbus_channel *channel,
> if (ret)
> return ret;
>
> + ret = set_memory_decrypted((unsigned long)kbuffer,
> + HVPFN_UP(size));
This should be PFN_UP, not HVPFN_UP. The numpages parameter to
set_memory_decrypted() is in guest size pages, not Hyper-V size pages.
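The distinction matters whenever the guest page size differs from Hyper-V's fixed 4 KiB page. A small sketch (the shifts are illustrative; 16 is chosen to model a hypothetical 64 KiB-page guest, and the macros are simplified stand-ins for the kernel's PFN_UP/HVPFN_UP):

```c
#include <assert.h>

#define HV_HYP_PAGE_SHIFT 12	/* Hyper-V page size is fixed at 4 KiB    */
#define GUEST_PAGE_SHIFT  16	/* e.g. a guest built with 64 KiB pages   */

/* Round a byte count up to whole Hyper-V pages. */
#define HVPFN_UP(x) (((x) + (1UL << HV_HYP_PAGE_SHIFT) - 1) >> HV_HYP_PAGE_SHIFT)
/* Round a byte count up to whole guest pages. */
#define PFN_UP(x)   (((x) + (1UL << GUEST_PAGE_SHIFT) - 1) >> GUEST_PAGE_SHIFT)
```

For a 128 KiB buffer these give 32 vs 2, so passing the HVPFN_UP count to an API that expects guest-size pages would change the visibility of far more memory than intended.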
> + if (ret) {
> + dev_warn(&channel->device_obj->device,
> + "Failed to set host visibility for new GPADL %d.\n",
> + ret);
> + return ret;
> + }
> +
> init_completion(&msginfo->waitevent);
> msginfo->waiting_channel = channel;
>
> @@ -537,7 +547,10 @@ static int __vmbus_establish_gpadl(struct vmbus_channel *channel,
> }
>
> /* At this point, we received the gpadl created msg */
> - *gpadl_handle = gpadlmsg->gpadl;
> + gpadl->gpadl_handle = gpadlmsg->gpadl;
> + gpadl->buffer = kbuffer;
> + gpadl->size = size;
> +
>
> cleanup:
> spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
> @@ -549,6 +562,11 @@ static int __vmbus_establish_gpadl(struct vmbus_channel *channel,
> }
>
> kfree(msginfo);
> +
> + if (ret)
> + set_memory_encrypted((unsigned long)kbuffer,
> + HVPFN_UP(size));
Should be PFN_UP as noted on the previous call to set_memory_decrypted().
> +
> return ret;
> }
>
> @@ -561,10 +579,10 @@ static int __vmbus_establish_gpadl(struct vmbus_channel *channel,
> * @gpadl_handle: some funky thing
> */
> int vmbus_establish_gpadl(struct vmbus_channel *channel, void *kbuffer,
> - u32 size, u32 *gpadl_handle)
> + u32 size, struct vmbus_gpadl *gpadl)
> {
> return __vmbus_establish_gpadl(channel, HV_GPADL_BUFFER, kbuffer, size,
> - 0U, gpadl_handle);
> + 0U, gpadl);
> }
> EXPORT_SYMBOL_GPL(vmbus_establish_gpadl);
>
> @@ -639,6 +657,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
> struct vmbus_channel_open_channel *open_msg;
> struct vmbus_channel_msginfo *open_info = NULL;
> struct page *page = newchannel->ringbuffer_page;
> + struct vmbus_gpadl gpadl;
I think this local variable was needed in a previous version of the patch, but
is now unused and should be deleted.
> u32 send_pages, recv_pages;
> unsigned long flags;
> int err;
> @@ -675,7 +694,7 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
> goto error_clean_ring;
>
> /* Establish the gpadl for the ring buffer */
> - newchannel->ringbuffer_gpadlhandle = 0;
> + newchannel->ringbuffer_gpadlhandle.gpadl_handle = 0;
>
> err = __vmbus_establish_gpadl(newchannel, HV_GPADL_RING,
> page_address(newchannel->ringbuffer_page),
> @@ -701,7 +720,8 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
> open_msg->header.msgtype = CHANNELMSG_OPENCHANNEL;
> open_msg->openid = newchannel->offermsg.child_relid;
> open_msg->child_relid = newchannel->offermsg.child_relid;
> - open_msg->ringbuffer_gpadlhandle = newchannel->ringbuffer_gpadlhandle;
> + open_msg->ringbuffer_gpadlhandle
> + = newchannel->ringbuffer_gpadlhandle.gpadl_handle;
> /*
> * The unit of ->downstream_ringbuffer_pageoffset is HV_HYP_PAGE and
> * the unit of ->ringbuffer_send_offset (i.e. send_pages) is PAGE, so
> @@ -759,8 +779,8 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
> error_free_info:
> kfree(open_info);
> error_free_gpadl:
> - vmbus_teardown_gpadl(newchannel, newchannel->ringbuffer_gpadlhandle);
> - newchannel->ringbuffer_gpadlhandle = 0;
> + vmbus_teardown_gpadl(newchannel, &newchannel->ringbuffer_gpadlhandle);
> + newchannel->ringbuffer_gpadlhandle.gpadl_handle = 0;
My previous comments had suggested letting vmbus_teardown_gpadl() set the
gpadl_handle to zero, avoiding the need for all the callers to set it to zero.
Did that not work for some reason? Just curious ....
> error_clean_ring:
> hv_ringbuffer_cleanup(&newchannel->outbound);
> hv_ringbuffer_cleanup(&newchannel->inbound);
> @@ -806,7 +826,7 @@ EXPORT_SYMBOL_GPL(vmbus_open);
> /*
> * vmbus_teardown_gpadl -Teardown the specified GPADL handle
> */
> -int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
> +int vmbus_teardown_gpadl(struct vmbus_channel *channel, struct vmbus_gpadl *gpadl)
> {
> struct vmbus_channel_gpadl_teardown *msg;
> struct vmbus_channel_msginfo *info;
> @@ -825,7 +845,7 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
>
> msg->header.msgtype = CHANNELMSG_GPADL_TEARDOWN;
> msg->child_relid = channel->offermsg.child_relid;
> - msg->gpadl = gpadl_handle;
> + msg->gpadl = gpadl->gpadl_handle;
>
> spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
> list_add_tail(&info->msglistentry,
> @@ -859,6 +879,12 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
> spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
>
> kfree(info);
> +
> + ret = set_memory_encrypted((unsigned long)gpadl->buffer,
> + HVPFN_UP(gpadl->size));
PFN_UP here as well.
> + if (ret)
> + pr_warn("Fail to set mem host visibility in GPADL teardown %d.\n", ret);
> +
> return ret;
> }
> EXPORT_SYMBOL_GPL(vmbus_teardown_gpadl);
> @@ -896,6 +922,7 @@ void vmbus_reset_channel_cb(struct vmbus_channel *channel)
> static int vmbus_close_internal(struct vmbus_channel *channel)
> {
> struct vmbus_channel_close_channel *msg;
> + struct vmbus_gpadl gpadl;
I think this local variable was needed in a previous version of the patch, but
is now unused and should be deleted.
> int ret;
>
> vmbus_reset_channel_cb(channel);
> @@ -933,9 +960,8 @@ static int vmbus_close_internal(struct vmbus_channel *channel)
> }
>
> /* Tear down the gpadl for the channel's ring buffer */
> - else if (channel->ringbuffer_gpadlhandle) {
> - ret = vmbus_teardown_gpadl(channel,
> - channel->ringbuffer_gpadlhandle);
> + else if (channel->ringbuffer_gpadlhandle.gpadl_handle) {
> + ret = vmbus_teardown_gpadl(channel, &channel->ringbuffer_gpadlhandle);
> if (ret) {
> pr_err("Close failed: teardown gpadl return %d\n", ret);
> /*
> @@ -944,7 +970,7 @@ static int vmbus_close_internal(struct vmbus_channel *channel)
> */
> }
>
> - channel->ringbuffer_gpadlhandle = 0;
> + channel->ringbuffer_gpadlhandle.gpadl_handle = 0;
> }
>
> if (!ret)
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index bc48855dff10..315278a7cf88 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -1075,14 +1075,15 @@ struct netvsc_device {
> /* Receive buffer allocated by us but manages by NetVSP */
> void *recv_buf;
> u32 recv_buf_size; /* allocated bytes */
> - u32 recv_buf_gpadl_handle;
> + struct vmbus_gpadl recv_buf_gpadl_handle;
> u32 recv_section_cnt;
> u32 recv_section_size;
> u32 recv_completion_cnt;
>
> /* Send buffer allocated by us */
> void *send_buf;
> - u32 send_buf_gpadl_handle;
> + u32 send_buf_size;
> + struct vmbus_gpadl send_buf_gpadl_handle;
> u32 send_section_cnt;
> u32 send_section_size;
> unsigned long *send_section_map;
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 7bd935412853..1f87e570ed2b 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -278,9 +278,9 @@ static void netvsc_teardown_recv_gpadl(struct hv_device *device,
> {
> int ret;
>
> - if (net_device->recv_buf_gpadl_handle) {
> + if (net_device->recv_buf_gpadl_handle.gpadl_handle) {
> ret = vmbus_teardown_gpadl(device->channel,
> - net_device->recv_buf_gpadl_handle);
> + &net_device->recv_buf_gpadl_handle);
>
> /* If we failed here, we might as well return and have a leak
> * rather than continue and a bugchk
> @@ -290,7 +290,7 @@ static void netvsc_teardown_recv_gpadl(struct hv_device *device,
> "unable to teardown receive buffer's gpadl\n");
> return;
> }
> - net_device->recv_buf_gpadl_handle = 0;
> + net_device->recv_buf_gpadl_handle.gpadl_handle = 0;
> }
> }
>
> @@ -300,9 +300,9 @@ static void netvsc_teardown_send_gpadl(struct hv_device *device,
> {
> int ret;
>
> - if (net_device->send_buf_gpadl_handle) {
> + if (net_device->send_buf_gpadl_handle.gpadl_handle) {
> ret = vmbus_teardown_gpadl(device->channel,
> - net_device->send_buf_gpadl_handle);
> + &net_device->send_buf_gpadl_handle);
>
> /* If we failed here, we might as well return and have a leak
> * rather than continue and a bugchk
> @@ -312,7 +312,7 @@ static void netvsc_teardown_send_gpadl(struct hv_device *device,
> "unable to teardown send buffer's gpadl\n");
> return;
> }
> - net_device->send_buf_gpadl_handle = 0;
> + net_device->send_buf_gpadl_handle.gpadl_handle = 0;
> }
> }
>
> @@ -380,7 +380,7 @@ static int netvsc_init_buf(struct hv_device *device,
> memset(init_packet, 0, sizeof(struct nvsp_message));
> init_packet->hdr.msg_type = NVSP_MSG1_TYPE_SEND_RECV_BUF;
> init_packet->msg.v1_msg.send_recv_buf.
> - gpadl_handle = net_device->recv_buf_gpadl_handle;
> + gpadl_handle = net_device->recv_buf_gpadl_handle.gpadl_handle;
> init_packet->msg.v1_msg.
> send_recv_buf.id = NETVSC_RECEIVE_BUFFER_ID;
>
> @@ -463,6 +463,7 @@ static int netvsc_init_buf(struct hv_device *device,
> ret = -ENOMEM;
> goto cleanup;
> }
> + net_device->send_buf_size = buf_size;
>
> /* Establish the gpadl handle for this buffer on this
> * channel. Note: This call uses the vmbus connection rather
> @@ -482,7 +483,7 @@ static int netvsc_init_buf(struct hv_device *device,
> memset(init_packet, 0, sizeof(struct nvsp_message));
> init_packet->hdr.msg_type = NVSP_MSG1_TYPE_SEND_SEND_BUF;
> init_packet->msg.v1_msg.send_send_buf.gpadl_handle =
> - net_device->send_buf_gpadl_handle;
> + net_device->send_buf_gpadl_handle.gpadl_handle;
> init_packet->msg.v1_msg.send_send_buf.id = NETVSC_SEND_BUFFER_ID;
>
> trace_nvsp_send(ndev, init_packet);
> diff --git a/drivers/uio/uio_hv_generic.c b/drivers/uio/uio_hv_generic.c
> index 652fe2547587..548243dcd895 100644
> --- a/drivers/uio/uio_hv_generic.c
> +++ b/drivers/uio/uio_hv_generic.c
> @@ -58,11 +58,11 @@ struct hv_uio_private_data {
> atomic_t refcnt;
>
> void *recv_buf;
> - u32 recv_gpadl;
> + struct vmbus_gpadl recv_gpadl;
> char recv_name[32]; /* "recv_4294967295" */
>
> void *send_buf;
> - u32 send_gpadl;
> + struct vmbus_gpadl send_gpadl;
> char send_name[32];
> };
>
> @@ -179,15 +179,15 @@ hv_uio_new_channel(struct vmbus_channel *new_sc)
> static void
> hv_uio_cleanup(struct hv_device *dev, struct hv_uio_private_data *pdata)
> {
> - if (pdata->send_gpadl) {
> - vmbus_teardown_gpadl(dev->channel, pdata->send_gpadl);
> - pdata->send_gpadl = 0;
> + if (pdata->send_gpadl.gpadl_handle) {
> + vmbus_teardown_gpadl(dev->channel, &pdata->send_gpadl);
> + pdata->send_gpadl.gpadl_handle = 0;
> vfree(pdata->send_buf);
> }
>
> - if (pdata->recv_gpadl) {
> - vmbus_teardown_gpadl(dev->channel, pdata->recv_gpadl);
> - pdata->recv_gpadl = 0;
> + if (pdata->recv_gpadl.gpadl_handle) {
> + vmbus_teardown_gpadl(dev->channel, &pdata->recv_gpadl);
> + pdata->recv_gpadl.gpadl_handle = 0;
> vfree(pdata->recv_buf);
> }
> }
> @@ -303,7 +303,7 @@ hv_uio_probe(struct hv_device *dev,
>
> /* put Global Physical Address Label in name */
> snprintf(pdata->recv_name, sizeof(pdata->recv_name),
> - "recv:%u", pdata->recv_gpadl);
> + "recv:%u", pdata->recv_gpadl.gpadl_handle);
> pdata->info.mem[RECV_BUF_MAP].name = pdata->recv_name;
> pdata->info.mem[RECV_BUF_MAP].addr
> = (uintptr_t)pdata->recv_buf;
> @@ -324,7 +324,7 @@ hv_uio_probe(struct hv_device *dev,
> }
>
> snprintf(pdata->send_name, sizeof(pdata->send_name),
> - "send:%u", pdata->send_gpadl);
> + "send:%u", pdata->send_gpadl.gpadl_handle);
> pdata->info.mem[SEND_BUF_MAP].name = pdata->send_name;
> pdata->info.mem[SEND_BUF_MAP].addr
> = (uintptr_t)pdata->send_buf;
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index ddc8713ce57b..a9e0bc3b1511 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -803,6 +803,12 @@ struct vmbus_device {
>
> #define VMBUS_DEFAULT_MAX_PKT_SIZE 4096
>
> +struct vmbus_gpadl {
> + u32 gpadl_handle;
> + u32 size;
> + void *buffer;
> +};
> +
> struct vmbus_channel {
> struct list_head listentry;
>
> @@ -822,7 +828,7 @@ struct vmbus_channel {
> bool rescind_ref; /* got rescind msg, got channel reference */
> struct completion rescind_event;
>
> - u32 ringbuffer_gpadlhandle;
> + struct vmbus_gpadl ringbuffer_gpadlhandle;
>
> /* Allocated memory for ring buffer */
> struct page *ringbuffer_page;
> @@ -1192,10 +1198,10 @@ extern int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel,
> extern int vmbus_establish_gpadl(struct vmbus_channel *channel,
> void *kbuffer,
> u32 size,
> - u32 *gpadl_handle);
> + struct vmbus_gpadl *gpadl);
>
> extern int vmbus_teardown_gpadl(struct vmbus_channel *channel,
> - u32 gpadl_handle);
> + struct vmbus_gpadl *gpadl);
>
> void vmbus_reset_channel_cb(struct vmbus_channel *channel);
>
> --
> 2.25.1
From: Tianyu Lan <[email protected]> Sent: Tuesday, September 14, 2021 6:39 AM
>
> Hyper-V provides the GHCB protocol for writing Synthetic Interrupt
> Controller MSR registers in an Isolation VM with AMD SEV-SNP;
> these registers are emulated by the hypervisor directly. Hyper-V
> requires that SINTx MSR registers be written twice: first via the
> GHCB page to communicate with the hypervisor, and then with a
> wrmsr instruction to talk to the paravisor which runs in VMPL0.
> The guest OS ID MSR also needs to be set via the GHCB page.
>
> Signed-off-by: Tianyu Lan <[email protected]>
> ---
> Change since v4:
> * Remove hv_get_simp(), hv_get_siefp() hv_get_synint_*()
> helper function. Move the logic into hv_get/set_register().
>
> Change since v3:
> * Pass old_msg_type to hv_signal_eom() as parameter.
> * Use HV_REGISTER_* macros instead of HV_X64_MSR_*
> * Add hv_isolation_type_snp() weak function.
> * Add macros to set synic registers in ARM code.
>
> Change since v1:
> * Introduce sev_es_ghcb_hv_call_simple() and share code
> between SEV and Hyper-V code.
>
> Fix for hyperv: Add Write/Read MSR registers via ghcb page
> ---
> arch/x86/hyperv/hv_init.c | 36 +++--------
> arch/x86/hyperv/ivm.c | 103 ++++++++++++++++++++++++++++++++
> arch/x86/include/asm/mshyperv.h | 56 ++++++++++++-----
> arch/x86/include/asm/sev.h | 6 ++
> arch/x86/kernel/sev-shared.c | 63 +++++++++++--------
> drivers/hv/hv.c | 77 +++++++++++++++++++-----
> drivers/hv/hv_common.c | 6 ++
> include/asm-generic/mshyperv.h | 2 +
> 8 files changed, 266 insertions(+), 83 deletions(-)
>
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index d57df6825527..a16a83e46a30 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -37,7 +37,7 @@ EXPORT_SYMBOL_GPL(hv_current_partition_id);
> void *hv_hypercall_pg;
> EXPORT_SYMBOL_GPL(hv_hypercall_pg);
>
> -void __percpu **hv_ghcb_pg;
> +union hv_ghcb __percpu **hv_ghcb_pg;
>
> /* Storage to save the hypercall page temporarily for hibernation */
> static void *hv_hypercall_pg_saved;
> @@ -406,7 +406,7 @@ void __init hyperv_init(void)
> }
>
> if (hv_isolation_type_snp()) {
> - hv_ghcb_pg = alloc_percpu(void *);
> + hv_ghcb_pg = alloc_percpu(union hv_ghcb *);
> if (!hv_ghcb_pg)
> goto free_vp_assist_page;
> }
> @@ -424,6 +424,9 @@ void __init hyperv_init(void)
> guest_id = generate_guest_id(0, LINUX_VERSION_CODE, 0);
> wrmsrl(HV_X64_MSR_GUEST_OS_ID, guest_id);
>
> + /* Hyper-V requires to write guest os id via ghcb in SNP IVM. */
> + hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, guest_id);
> +
> hv_hypercall_pg = __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START,
> VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_ROX,
> VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
> @@ -501,6 +504,7 @@ void __init hyperv_init(void)
>
> clean_guest_os_id:
> wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> + hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
> cpuhp_remove_state(cpuhp);
> free_ghcb_page:
> free_percpu(hv_ghcb_pg);
> @@ -522,6 +526,7 @@ void hyperv_cleanup(void)
>
> /* Reset our OS id */
> wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> + hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
>
> /*
> * Reset hypercall page reference before reset the page,
> @@ -592,30 +597,3 @@ bool hv_is_hyperv_initialized(void)
> return hypercall_msr.enable;
> }
> EXPORT_SYMBOL_GPL(hv_is_hyperv_initialized);
> -
> -enum hv_isolation_type hv_get_isolation_type(void)
> -{
> - if (!(ms_hyperv.priv_high & HV_ISOLATION))
> - return HV_ISOLATION_TYPE_NONE;
> - return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b);
> -}
> -EXPORT_SYMBOL_GPL(hv_get_isolation_type);
> -
> -bool hv_is_isolation_supported(void)
> -{
> - if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
> - return false;
> -
> - if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
> - return false;
> -
> - return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
> -}
> -
> -DEFINE_STATIC_KEY_FALSE(isolation_type_snp);
> -
> -bool hv_isolation_type_snp(void)
> -{
> - return static_branch_unlikely(&isolation_type_snp);
> -}
> -EXPORT_SYMBOL_GPL(hv_isolation_type_snp);
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 79e7fb83472a..5439723446c9 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -6,12 +6,115 @@
> * Tianyu Lan <[email protected]>
> */
>
> +#include <linux/types.h>
> +#include <linux/bitfield.h>
> #include <linux/hyperv.h>
> #include <linux/types.h>
> #include <linux/bitfield.h>
> #include <linux/slab.h>
> +#include <asm/svm.h>
> +#include <asm/sev.h>
> #include <asm/io.h>
> #include <asm/mshyperv.h>
> +#include <asm/hypervisor.h>
> +
> +union hv_ghcb {
> + struct ghcb ghcb;
> +} __packed __aligned(HV_HYP_PAGE_SIZE);
> +
> +void hv_ghcb_msr_write(u64 msr, u64 value)
> +{
> + union hv_ghcb *hv_ghcb;
> + void **ghcb_base;
> + unsigned long flags;
> +
> + if (!hv_ghcb_pg)
> + return;
> +
> + WARN_ON(in_nmi());
> +
> + local_irq_save(flags);
> + ghcb_base = (void **)this_cpu_ptr(hv_ghcb_pg);
> + hv_ghcb = (union hv_ghcb *)*ghcb_base;
> + if (!hv_ghcb) {
> + local_irq_restore(flags);
> + return;
> + }
> +
> + ghcb_set_rcx(&hv_ghcb->ghcb, msr);
> + ghcb_set_rax(&hv_ghcb->ghcb, lower_32_bits(value));
> + ghcb_set_rdx(&hv_ghcb->ghcb, upper_32_bits(value));
> +
> + if (sev_es_ghcb_hv_call_simple(&hv_ghcb->ghcb, SVM_EXIT_MSR, 1, 0))
> + pr_warn("Fail to write msr via ghcb %llx.\n", msr);
> +
> + local_irq_restore(flags);
> +}
> +
> +void hv_ghcb_msr_read(u64 msr, u64 *value)
> +{
> + union hv_ghcb *hv_ghcb;
> + void **ghcb_base;
> + unsigned long flags;
> +
> + /* Check size of union hv_ghcb here. */
> + BUILD_BUG_ON(sizeof(union hv_ghcb) != HV_HYP_PAGE_SIZE);
> +
> + if (!hv_ghcb_pg)
> + return;
> +
> + WARN_ON(in_nmi());
> +
> + local_irq_save(flags);
> + ghcb_base = (void **)this_cpu_ptr(hv_ghcb_pg);
> + hv_ghcb = (union hv_ghcb *)*ghcb_base;
> + if (!hv_ghcb) {
> + local_irq_restore(flags);
> + return;
> + }
> +
> + ghcb_set_rcx(&hv_ghcb->ghcb, msr);
> + if (sev_es_ghcb_hv_call_simple(&hv_ghcb->ghcb, SVM_EXIT_MSR, 0, 0))
> + pr_warn("Fail to read msr via ghcb %llx.\n", msr);
> + else
> + *value = (u64)lower_32_bits(hv_ghcb->ghcb.save.rax)
> + | ((u64)lower_32_bits(hv_ghcb->ghcb.save.rdx) << 32);
> + local_irq_restore(flags);
> +}
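One observation for readers following along: the read path above reassembles the 64-bit MSR value from the low 32 bits of the GHCB save area's rax and rdx, mirroring the edx:eax rdmsr convention. That recombination can be modeled in isolation (a standalone sketch with placeholder values, not tied to any real MSR):

```c
#include <assert.h>
#include <stdint.h>

/* Reassemble a 64-bit MSR value from the rax/rdx GHCB save-area
 * registers, keeping only the low 32 bits of each, per the rdmsr
 * convention (value = edx:eax). */
static uint64_t msr_from_rax_rdx(uint64_t rax, uint64_t rdx)
{
	return (uint64_t)(uint32_t)rax | ((uint64_t)(uint32_t)rdx << 32);
}
```

The casts matter: the upper halves of rax/rdx are not defined to be zero after the VMGEXIT, which is why the patch masks with lower_32_bits().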
> +
> +enum hv_isolation_type hv_get_isolation_type(void)
> +{
> + if (!(ms_hyperv.priv_high & HV_ISOLATION))
> + return HV_ISOLATION_TYPE_NONE;
> + return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b);
> +}
> +EXPORT_SYMBOL_GPL(hv_get_isolation_type);
> +
> +/*
> + * hv_is_isolation_supported - Check system runs in the Hyper-V
> + * isolation VM.
> + */
> +bool hv_is_isolation_supported(void)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
> + return false;
> +
> + if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
> + return false;
> +
> + return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
> +}
> +
> +DEFINE_STATIC_KEY_FALSE(isolation_type_snp);
> +
> +/*
> + * hv_isolation_type_snp - Check system runs in the AMD SEV-SNP based
> + * isolation VM.
> + */
> +bool hv_isolation_type_snp(void)
> +{
> + return static_branch_unlikely(&isolation_type_snp);
> +}
>
> /*
> * hv_mark_gpa_visibility - Set pages visible to host via hvcall.
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index ede440f9a1e2..165423e8b67a 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -11,25 +11,14 @@
> #include <asm/paravirt.h>
> #include <asm/mshyperv.h>
>
> +union hv_ghcb;
> +
> DECLARE_STATIC_KEY_FALSE(isolation_type_snp);
>
> typedef int (*hyperv_fill_flush_list_func)(
> struct hv_guest_mapping_flush_list *flush,
> void *data);
>
> -static inline void hv_set_register(unsigned int reg, u64 value)
> -{
> - wrmsrl(reg, value);
> -}
> -
> -static inline u64 hv_get_register(unsigned int reg)
> -{
> - u64 value;
> -
> - rdmsrl(reg, value);
> - return value;
> -}
> -
> #define hv_get_raw_timer() rdtsc_ordered()
>
> void hyperv_vector_handler(struct pt_regs *regs);
> @@ -41,7 +30,7 @@ extern void *hv_hypercall_pg;
>
> extern u64 hv_current_partition_id;
>
> -extern void __percpu **hv_ghcb_pg;
> +extern union hv_ghcb __percpu **hv_ghcb_pg;
>
> int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
> int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
> @@ -193,6 +182,8 @@ int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
> struct hv_interrupt_entry *entry);
> int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
> int hv_set_mem_host_visibility(unsigned long addr, int numpages, bool visible);
> +void hv_ghcb_msr_write(u64 msr, u64 value);
> +void hv_ghcb_msr_read(u64 msr, u64 *value);
> #else /* CONFIG_HYPERV */
> static inline void hyperv_init(void) {}
> static inline void hyperv_setup_mmu_ops(void) {}
> @@ -209,9 +200,46 @@ static inline int hyperv_flush_guest_mapping_range(u64 as,
> {
> return -1;
> }
> +
> +static inline void hv_ghcb_msr_write(u64 msr, u64 value) {}
> +static inline void hv_ghcb_msr_read(u64 msr, u64 *value) {}
> #endif /* CONFIG_HYPERV */
>
> +static inline void hv_set_register(unsigned int reg, u64 value);
I'm not seeing why this declaration is needed.
>
> #include <asm-generic/mshyperv.h>
>
> +static inline bool hv_is_synic_reg(unsigned int reg)
> +{
> + if ((reg >= HV_REGISTER_SCONTROL) &&
> + (reg <= HV_REGISTER_SINT15))
> + return true;
> + return false;
> +}
> +
> +static inline u64 hv_get_register(unsigned int reg)
> +{
> + u64 value;
> +
> + if (hv_is_synic_reg(reg) && hv_isolation_type_snp())
> + hv_ghcb_msr_read(reg, &value);
> + else
> + rdmsrl(reg, value);
> + return value;
> +}
> +
> +static inline void hv_set_register(unsigned int reg, u64 value)
> +{
> + if (hv_is_synic_reg(reg) && hv_isolation_type_snp()) {
> + hv_ghcb_msr_write(reg, value);
> +
> + /* Write proxy bit via wrmsl instruction */
> + if (reg >= HV_REGISTER_SINT0 &&
> + reg <= HV_REGISTER_SINT15)
> + wrmsrl(reg, value | 1 << 20);
> + } else {
> + wrmsrl(reg, value);
> + }
> +}
> +
This all looks OK to me, except that it would really be nice if the
#include of asm-generic/mshyperv.h stays last in the file. I think the
problem is needing a declaration for hv_isolation_type_snp(), right?
And it is added into asm-generic/mshyperv.h at the very end of this
patch.
The alternative would be to put hv_get_register() and
hv_set_register() in a .c file rather than as static inline. They get
called in quite a few places, and arguably are now fairly large for
being static inline, in my judgment. But I guess I'm OK either way.
In previous versions, the EOM register was being handled
differently (GHCB vs. MSR write) for timer messages vs. other messages.
That distinction is no longer being made. Did you learn something new
from the Hyper-V team about this? Just want to make sure nothing
was inadvertently dropped.
> #endif
> diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
> index fa5cd05d3b5b..60bfdbd141b1 100644
> --- a/arch/x86/include/asm/sev.h
> +++ b/arch/x86/include/asm/sev.h
> @@ -81,12 +81,18 @@ static __always_inline void sev_es_nmi_complete(void)
> __sev_es_nmi_complete();
> }
> extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
> +extern enum es_result sev_es_ghcb_hv_call_simple(struct ghcb *ghcb,
> + u64 exit_code, u64 exit_info_1,
> + u64 exit_info_2);
> #else
> static inline void sev_es_ist_enter(struct pt_regs *regs) { }
> static inline void sev_es_ist_exit(void) { }
> static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
> static inline void sev_es_nmi_complete(void) { }
> static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
> +static inline enum es_result sev_es_ghcb_hv_call_simple(struct ghcb *ghcb,
> + u64 exit_code, u64 exit_info_1,
> + u64 exit_info_2) { return ES_VMM_ERROR; }
> #endif
>
> #endif
> diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
> index 9f90f460a28c..dd7f37de640b 100644
> --- a/arch/x86/kernel/sev-shared.c
> +++ b/arch/x86/kernel/sev-shared.c
> @@ -94,10 +94,9 @@ static void vc_finish_insn(struct es_em_ctxt *ctxt)
> ctxt->regs->ip += ctxt->insn.length;
> }
>
> -static enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
> - struct es_em_ctxt *ctxt,
> - u64 exit_code, u64 exit_info_1,
> - u64 exit_info_2)
> +enum es_result sev_es_ghcb_hv_call_simple(struct ghcb *ghcb,
> + u64 exit_code, u64 exit_info_1,
> + u64 exit_info_2)
> {
> enum es_result ret;
>
> @@ -109,29 +108,45 @@ static enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
> ghcb_set_sw_exit_info_1(ghcb, exit_info_1);
> ghcb_set_sw_exit_info_2(ghcb, exit_info_2);
>
> - sev_es_wr_ghcb_msr(__pa(ghcb));
> VMGEXIT();
>
> - if ((ghcb->save.sw_exit_info_1 & 0xffffffff) == 1) {
> - u64 info = ghcb->save.sw_exit_info_2;
> - unsigned long v;
> -
> - info = ghcb->save.sw_exit_info_2;
> - v = info & SVM_EVTINJ_VEC_MASK;
> -
> - /* Check if exception information from hypervisor is sane. */
> - if ((info & SVM_EVTINJ_VALID) &&
> - ((v == X86_TRAP_GP) || (v == X86_TRAP_UD)) &&
> - ((info & SVM_EVTINJ_TYPE_MASK) == SVM_EVTINJ_TYPE_EXEPT)) {
> - ctxt->fi.vector = v;
> - if (info & SVM_EVTINJ_VALID_ERR)
> - ctxt->fi.error_code = info >> 32;
> - ret = ES_EXCEPTION;
> - } else {
> - ret = ES_VMM_ERROR;
> - }
> - } else {
> + if ((ghcb->save.sw_exit_info_1 & 0xffffffff) == 1)
> + ret = ES_VMM_ERROR;
> + else
> ret = ES_OK;
> +
> + return ret;
> +}
> +
> +static enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
> + struct es_em_ctxt *ctxt,
> + u64 exit_code, u64 exit_info_1,
> + u64 exit_info_2)
> +{
> + unsigned long v;
> + enum es_result ret;
> + u64 info;
> +
> + sev_es_wr_ghcb_msr(__pa(ghcb));
> +
> + ret = sev_es_ghcb_hv_call_simple(ghcb, exit_code, exit_info_1,
> + exit_info_2);
> + if (ret == ES_OK)
> + return ret;
> +
> + info = ghcb->save.sw_exit_info_2;
> + v = info & SVM_EVTINJ_VEC_MASK;
> +
> + /* Check if exception information from hypervisor is sane. */
> + if ((info & SVM_EVTINJ_VALID) &&
> + ((v == X86_TRAP_GP) || (v == X86_TRAP_UD)) &&
> + ((info & SVM_EVTINJ_TYPE_MASK) == SVM_EVTINJ_TYPE_EXEPT)) {
> + ctxt->fi.vector = v;
> + if (info & SVM_EVTINJ_VALID_ERR)
> + ctxt->fi.error_code = info >> 32;
> + ret = ES_EXCEPTION;
> + } else {
> + ret = ES_VMM_ERROR;
> }
>
> return ret;
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index e83507f49676..dee1a96bc535 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -8,6 +8,7 @@
> */
> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
> +#include <linux/io.h>
> #include <linux/kernel.h>
> #include <linux/mm.h>
> #include <linux/slab.h>
> @@ -136,17 +137,24 @@ int hv_synic_alloc(void)
> tasklet_init(&hv_cpu->msg_dpc,
> vmbus_on_msg_dpc, (unsigned long) hv_cpu);
>
> - hv_cpu->synic_message_page =
> - (void *)get_zeroed_page(GFP_ATOMIC);
> - if (hv_cpu->synic_message_page == NULL) {
> - pr_err("Unable to allocate SYNIC message page\n");
> - goto err;
> - }
> + /*
> + * Synic message and event pages are allocated by paravisor.
> + * Skip these pages allocation here.
> + */
> + if (!hv_isolation_type_snp()) {
> + hv_cpu->synic_message_page =
> + (void *)get_zeroed_page(GFP_ATOMIC);
> + if (hv_cpu->synic_message_page == NULL) {
> + pr_err("Unable to allocate SYNIC message page\n");
> + goto err;
> + }
>
> - hv_cpu->synic_event_page = (void *)get_zeroed_page(GFP_ATOMIC);
> - if (hv_cpu->synic_event_page == NULL) {
> - pr_err("Unable to allocate SYNIC event page\n");
> - goto err;
> + hv_cpu->synic_event_page =
> + (void *)get_zeroed_page(GFP_ATOMIC);
> + if (hv_cpu->synic_event_page == NULL) {
> + pr_err("Unable to allocate SYNIC event page\n");
> + goto err;
> + }
> }
>
> hv_cpu->post_msg_page = (void *)get_zeroed_page(GFP_ATOMIC);
> @@ -201,16 +209,35 @@ void hv_synic_enable_regs(unsigned int cpu)
> /* Setup the Synic's message page */
> simp.as_uint64 = hv_get_register(HV_REGISTER_SIMP);
> simp.simp_enabled = 1;
> - simp.base_simp_gpa = virt_to_phys(hv_cpu->synic_message_page)
> - >> HV_HYP_PAGE_SHIFT;
> +
> + if (hv_isolation_type_snp()) {
> + hv_cpu->synic_message_page
> + = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
> + HV_HYP_PAGE_SIZE, MEMREMAP_WB);
> + if (!hv_cpu->synic_message_page)
> + pr_err("Fail to map syinc message page.\n");
> + } else {
> + simp.base_simp_gpa = virt_to_phys(hv_cpu->synic_message_page)
> + >> HV_HYP_PAGE_SHIFT;
> + }
>
> hv_set_register(HV_REGISTER_SIMP, simp.as_uint64);
>
> /* Setup the Synic's event page */
> siefp.as_uint64 = hv_get_register(HV_REGISTER_SIEFP);
> siefp.siefp_enabled = 1;
> - siefp.base_siefp_gpa = virt_to_phys(hv_cpu->synic_event_page)
> - >> HV_HYP_PAGE_SHIFT;
> +
> + if (hv_isolation_type_snp()) {
> + hv_cpu->synic_event_page =
> + memremap(siefp.base_siefp_gpa << HV_HYP_PAGE_SHIFT,
> + HV_HYP_PAGE_SIZE, MEMREMAP_WB);
> +
> + if (!hv_cpu->synic_event_page)
> + pr_err("Fail to map syinc event page.\n");
> + } else {
> + siefp.base_siefp_gpa = virt_to_phys(hv_cpu->synic_event_page)
> + >> HV_HYP_PAGE_SHIFT;
> + }
>
> hv_set_register(HV_REGISTER_SIEFP, siefp.as_uint64);
>
> @@ -257,30 +284,48 @@ int hv_synic_init(unsigned int cpu)
> */
> void hv_synic_disable_regs(unsigned int cpu)
> {
> + struct hv_per_cpu_context *hv_cpu
> + = per_cpu_ptr(hv_context.cpu_context, cpu);
> union hv_synic_sint shared_sint;
> union hv_synic_simp simp;
> union hv_synic_siefp siefp;
> union hv_synic_scontrol sctrl;
>
> +
Spurious blank line?
> shared_sint.as_uint64 = hv_get_register(HV_REGISTER_SINT0 +
> VMBUS_MESSAGE_SINT);
>
> shared_sint.masked = 1;
>
> +
> +
Spurious blank lines?
> /* Need to correctly cleanup in the case of SMP!!! */
> /* Disable the interrupt */
> hv_set_register(HV_REGISTER_SINT0 + VMBUS_MESSAGE_SINT,
> shared_sint.as_uint64);
>
> simp.as_uint64 = hv_get_register(HV_REGISTER_SIMP);
> + /*
> + * In Isolation VM, sim and sief pages are allocated by
> + * paravisor. These pages also will be used by kdump
> + * kernel. So just reset enable bit here and keep page
> + * addresses.
> + */
> simp.simp_enabled = 0;
> - simp.base_simp_gpa = 0;
> + if (hv_isolation_type_snp())
> + memunmap(hv_cpu->synic_message_page);
> + else
> + simp.base_simp_gpa = 0;
>
> hv_set_register(HV_REGISTER_SIMP, simp.as_uint64);
>
> siefp.as_uint64 = hv_get_register(HV_REGISTER_SIEFP);
> siefp.siefp_enabled = 0;
> - siefp.base_siefp_gpa = 0;
> +
> + if (hv_isolation_type_snp())
> + memunmap(hv_cpu->synic_event_page);
> + else
> + siefp.base_siefp_gpa = 0;
>
> hv_set_register(HV_REGISTER_SIEFP, siefp.as_uint64);
>
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index c0d9048a4112..1fc82d237161 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -249,6 +249,12 @@ bool __weak hv_is_isolation_supported(void)
> }
> EXPORT_SYMBOL_GPL(hv_is_isolation_supported);
>
> +bool __weak hv_isolation_type_snp(void)
> +{
> + return false;
> +}
> +EXPORT_SYMBOL_GPL(hv_isolation_type_snp);
> +
> void __weak hv_setup_vmbus_handler(void (*handler)(void))
> {
> }
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index cb529c85c0ad..94750bafd4cc 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -24,6 +24,7 @@
> #include <linux/cpumask.h>
> #include <linux/nmi.h>
> #include <asm/ptrace.h>
> +#include <asm/mshyperv.h>
This #include should not be done. The architecture specific version
of mshyperv.h #includes the asm-generic version, not the other
way around.
In any case, I'm not seeing that this #include is needed.
> #include <asm/hyperv-tlfs.h>
>
> struct ms_hyperv_info {
> @@ -54,6 +55,7 @@ extern void __percpu **hyperv_pcpu_output_arg;
>
> extern u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
> extern u64 hv_do_fast_hypercall8(u16 control, u64 input8);
> +extern bool hv_isolation_type_snp(void);
>
> /* Helper functions that provide a consistent pattern for checking Hyper-V hypercall status. */
> static inline int hv_result(u64 status)
> --
> 2.25.1
From: Tianyu Lan <[email protected]> Sent: Tuesday, September 14, 2021 6:39 AM
>
> In an Isolation VM, all memory shared with the host needs to be
> marked visible to the host via hvcall. vmbus_establish_gpadl() has
> already done this for the storvsc rx/tx ring buffers. The page buffer
> used by vmbus_sendpacket_mpb_desc() still needs to be handled. Use the
> DMA API (scsi_dma_map/unmap) to map this memory while sending and
> receiving packets, and return the swiotlb bounce buffer dma address.
> In an Isolation VM, the swiotlb bounce buffer is marked visible to the
> host and swiotlb force mode is enabled.
>
> Set the device's dma min align mask to HV_HYP_PAGE_SIZE - 1 in order
> to keep the original data offset in the bounce buffer.
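The min align mask property relied on here can be illustrated with a small model: swiotlb preserves the original address's bits covered by the mask, so the data's offset within a Hyper-V page is the same in the bounce buffer as in the original buffer. This is a simplified sketch of that invariant, not the actual swiotlb slot allocator:

```c
#include <assert.h>
#include <stdint.h>

#define HV_HYP_PAGE_SIZE 4096ULL
#define HV_HYP_PAGE_MASK (HV_HYP_PAGE_SIZE - 1)

/* Model of slot placement with min_align_mask = HV_HYP_PAGE_SIZE - 1:
 * the slot base is Hyper-V-page aligned, and the original address's
 * offset within the mask is preserved in the bounce address. */
static uint64_t bounce_addr(uint64_t slot_base, uint64_t orig_addr)
{
	return (slot_base & ~HV_HYP_PAGE_MASK) |
	       (orig_addr & HV_HYP_PAGE_MASK);
}
```

Preserving the intra-page offset is what lets the PFN-array math later in the patch stay correct when the DMA address is a bounce address.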
>
> Signed-off-by: Tianyu Lan <[email protected]>
> ---
> Change since v4:
> * Use scsi_dma_map/unmap() instead of dma_map/unmap_sg()
> * Add deleted comments back.
> * Fix erroneous calculation of hvpfns_to_add
>
> Change since v3:
> * Replace dma_map_page() with dma_map_sg()
> * Use for_each_sg() to populate payload->range.pfn_array.
> * Remove storvsc_dma_map macro
> ---
> drivers/hv/vmbus_drv.c | 1 +
> drivers/scsi/storvsc_drv.c | 24 +++++++++++++++---------
> include/linux/hyperv.h | 1 +
> 3 files changed, 17 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index b0be287e9a32..9c53f823cde1 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2121,6 +2121,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
> hv_debug_add_dev_dir(child_device_obj);
>
> child_device_obj->device.dma_mask = &vmbus_dma_mask;
> + child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
> return 0;
>
> err_kset_unregister:
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index ebbbc1299c62..d10b450bcf0c 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -21,6 +21,8 @@
> #include <linux/device.h>
> #include <linux/hyperv.h>
> #include <linux/blkdev.h>
> +#include <linux/dma-mapping.h>
> +
> #include <scsi/scsi.h>
> #include <scsi/scsi_cmnd.h>
> #include <scsi/scsi_host.h>
> @@ -1322,6 +1324,7 @@ static void storvsc_on_channel_callback(void *context)
> continue;
> }
> request = (struct storvsc_cmd_request *)scsi_cmd_priv(scmnd);
> + scsi_dma_unmap(scmnd);
> }
>
> storvsc_on_receive(stor_device, packet, request);
> @@ -1735,7 +1738,6 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
> struct hv_host_device *host_dev = shost_priv(host);
> struct hv_device *dev = host_dev->dev;
> struct storvsc_cmd_request *cmd_request = scsi_cmd_priv(scmnd);
> - int i;
> struct scatterlist *sgl;
> unsigned int sg_count;
> struct vmscsi_request *vm_srb;
> @@ -1817,10 +1819,11 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
> payload_sz = sizeof(cmd_request->mpb);
>
> if (sg_count) {
> - unsigned int hvpgoff, hvpfns_to_add;
> unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
> unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
> - u64 hvpfn;
> + struct scatterlist *sg;
> + unsigned long hvpfn, hvpfns_to_add;
> + int j, i = 0;
>
> if (hvpg_count > MAX_PAGE_BUFFER_COUNT) {
>
> @@ -1834,8 +1837,11 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
> payload->range.len = length;
> payload->range.offset = offset_in_hvpg;
>
> + sg_count = scsi_dma_map(scmnd);
> + if (sg_count < 0)
> + return SCSI_MLQUEUE_DEVICE_BUSY;
>
> - for (i = 0; sgl != NULL; sgl = sg_next(sgl)) {
> + for_each_sg(sgl, sg, sg_count, j) {
> /*
> * Init values for the current sgl entry. hvpgoff
> * and hvpfns_to_add are in units of Hyper-V size
Nit: The above comment is now out-of-date because hvpgoff has
been removed.
> @@ -1845,10 +1851,9 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
> * even on other than the first sgl entry, provided
> * they are a multiple of PAGE_SIZE.
> */
> - hvpgoff = HVPFN_DOWN(sgl->offset);
> - hvpfn = page_to_hvpfn(sg_page(sgl)) + hvpgoff;
> - hvpfns_to_add = HVPFN_UP(sgl->offset + sgl->length) -
> - hvpgoff;
> + hvpfn = HVPFN_DOWN(sg_dma_address(sg));
> + hvpfns_to_add = HVPFN_UP(sg_dma_address(sg) +
> + sg_dma_len(sg)) - hvpfn;
Good. This looks correct now.
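To make that concrete, the per-entry math can be checked in isolation: for a DMA region [addr, addr + len), the number of Hyper-V-size pages touched is HVPFN_UP(addr + len) - HVPFN_DOWN(addr), which handles a non-page-aligned start correctly. A standalone sketch (assuming HV_HYP_PAGE_SIZE of 4096):

```c
#include <assert.h>
#include <stdint.h>

#define HV_HYP_PAGE_SHIFT 12
#define HV_HYP_PAGE_SIZE  (1ULL << HV_HYP_PAGE_SHIFT)

/* Round an address down/up to a Hyper-V page frame number. */
static uint64_t hvpfn_down(uint64_t addr)
{
	return addr >> HV_HYP_PAGE_SHIFT;
}

static uint64_t hvpfn_up(uint64_t addr)
{
	return (addr + HV_HYP_PAGE_SIZE - 1) >> HV_HYP_PAGE_SHIFT;
}

/* Number of Hyper-V-size pages a DMA region [addr, addr + len)
 * touches, matching the hvpfns_to_add calculation in the patch. */
static uint64_t hvpfns_to_add(uint64_t dma_addr, uint64_t dma_len)
{
	return hvpfn_up(dma_addr + dma_len) - hvpfn_down(dma_addr);
}
```

For example, a 0x2000-byte region starting at 0x1234 spans pages 1 through 3, so three PFNs get added to the array.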
>
> /*
> * Fill the next portion of the PFN array with
> @@ -1858,7 +1863,7 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
> * the PFN array is filled.
> */
> while (hvpfns_to_add--)
> - payload->range.pfn_array[i++] = hvpfn++;
> + payload->range.pfn_array[i++] = hvpfn++;
> }
> }
>
> @@ -2002,6 +2007,7 @@ static int storvsc_probe(struct hv_device *device,
> stor_device->vmscsi_size_delta = sizeof(struct vmscsi_win8_extension);
> spin_lock_init(&stor_device->lock);
> hv_set_drvdata(device, stor_device);
> + dma_set_min_align_mask(&device->device, HV_HYP_PAGE_SIZE - 1);
>
> stor_device->port_number = host->host_no;
> ret = storvsc_connect_to_vsp(device, storvsc_ringbuffer_size, is_fc);
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index bb1a1519b93a..c94c534a944e 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1274,6 +1274,7 @@ struct hv_device {
>
> struct vmbus_channel *channel;
> struct kset *channels_kset;
> + struct device_dma_parameters dma_parms;
>
> /* place holder to keep track of the dir for hv device in debugfs */
> struct dentry *debug_dir;
> --
> 2.25.1
From: Tianyu Lan <[email protected]> Sent: Tuesday, September 14, 2021 6:39 AM
>
> Hyper-V Isolation VMs require bounce buffer support to copy
> data from/to encrypted memory, so enable swiotlb force
> mode to use the swiotlb bounce buffer for DMA transactions.
>
> In an Isolation VM with AMD SEV, the bounce buffer needs to be
> accessed via an extra address space above shared_gpa_boundary
> (e.g. the 39-bit address line) reported by the Hyper-V CPUID
> ISOLATION_CONFIG leaf. The access physical address is the original
> physical address + shared_gpa_boundary. In the AMD SEV-SNP spec,
> shared_gpa_boundary is called the virtual top of memory (vTOM).
> Memory addresses below vTOM are automatically treated as private,
> while memory above vTOM is treated as shared.
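The address arithmetic described here is just an addition of the CPUID-reported boundary. A toy model of the vTOM split (using 1 << 39 purely as an example boundary; the real value comes from the ISOLATION_CONFIG CPUID leaf):

```c
#include <assert.h>
#include <stdint.h>

/* Example shared_gpa_boundary value for illustration; the real
 * boundary is reported by the Hyper-V ISOLATION_CONFIG CPUID leaf. */
#define SHARED_GPA_BOUNDARY (1ULL << 39)

/* Address the guest must use to access a decrypted (shared) page:
 * the original GPA plus the vTOM boundary. */
static uint64_t shared_gpa(uint64_t gpa)
{
	return gpa + SHARED_GPA_BOUNDARY;
}

/* Addresses below vTOM are private; at or above vTOM, shared. */
static int is_shared(uint64_t addr)
{
	return addr >= SHARED_GPA_BOUNDARY;
}
```

This is exactly the mapping that swiotlb_unencrypted_base hands to the swiotlb code below.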
>
> Hyper-V initializes the swiotlb bounce buffer, and the default
> swiotlb needs to be disabled. pci_swiotlb_detect_override() and
> pci_swiotlb_detect_4gb() enable the default one. To override
> the setting, hyperv_swiotlb_detect() needs to run before these
> detect functions. Make pci_xen_swiotlb_init() depend on
> hyperv_swiotlb_detect() to keep this order.
>
> The swiotlb bounce buffer code calls set_memory_decrypted()
> to mark the bounce buffer visible to the host and maps it in the
> extra address space via memremap(). Populate the shared_gpa_boundary
> (vTOM) via the swiotlb_unencrypted_base variable.
>
> The map function memremap() can't be used in the early
> hyperv_iommu_swiotlb_init(), so initialize the swiotlb bounce
> buffer in hyperv_iommu_swiotlb_later_init().
>
> Signed-off-by: Tianyu Lan <[email protected]>
> ---
> Change since v4:
> * Use swiotlb_unencrypted_base variable to pass shared_gpa_
> boundary and map bounce buffer inside swiotlb code.
>
> Change since v3:
> * Get hyperv bounce buffer size via the default swiotlb
> bounce buffer size function and keep the default size the
> same as in an AMD SEV VM.
> ---
> arch/x86/include/asm/mshyperv.h | 2 ++
> arch/x86/mm/mem_encrypt.c | 3 +-
> arch/x86/xen/pci-swiotlb-xen.c | 3 +-
> drivers/hv/vmbus_drv.c | 3 ++
> drivers/iommu/hyperv-iommu.c | 60 +++++++++++++++++++++++++++++++++
> include/linux/hyperv.h | 1 +
> 6 files changed, 70 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 165423e8b67a..2d22f29f90c9 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -182,6 +182,8 @@ int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
> struct hv_interrupt_entry *entry);
> int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
> int hv_set_mem_host_visibility(unsigned long addr, int numpages, bool visible);
> +void *hv_map_memory(void *addr, unsigned long size);
> +void hv_unmap_memory(void *addr);
Aren't these two declarations now spurious?
> void hv_ghcb_msr_write(u64 msr, u64 value);
> void hv_ghcb_msr_read(u64 msr, u64 *value);
> #else /* CONFIG_HYPERV */
> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index ff08dc463634..e2db0b8ed938 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -30,6 +30,7 @@
> #include <asm/processor-flags.h>
> #include <asm/msr.h>
> #include <asm/cmdline.h>
> +#include <asm/mshyperv.h>
>
> #include "mm_internal.h"
>
> @@ -202,7 +203,7 @@ void __init sev_setup_arch(void)
> phys_addr_t total_mem = memblock_phys_mem_size();
> unsigned long size;
>
> - if (!sev_active())
> + if (!sev_active() && !hv_is_isolation_supported())
> return;
>
> /*
> diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
> index 54f9aa7e8457..43bd031aa332 100644
> --- a/arch/x86/xen/pci-swiotlb-xen.c
> +++ b/arch/x86/xen/pci-swiotlb-xen.c
> @@ -4,6 +4,7 @@
>
> #include <linux/dma-map-ops.h>
> #include <linux/pci.h>
> +#include <linux/hyperv.h>
> #include <xen/swiotlb-xen.h>
>
> #include <asm/xen/hypervisor.h>
> @@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
> EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
>
> IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
> - NULL,
> + hyperv_swiotlb_detect,
> pci_xen_swiotlb_init,
> NULL);
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 392c1ac4f819..b0be287e9a32 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -23,6 +23,7 @@
> #include <linux/cpu.h>
> #include <linux/sched/task_stack.h>
>
> +#include <linux/dma-map-ops.h>
> #include <linux/delay.h>
> #include <linux/notifier.h>
> #include <linux/panic_notifier.h>
> @@ -2078,6 +2079,7 @@ struct hv_device *vmbus_device_create(const guid_t *type,
> return child_device_obj;
> }
>
> +static u64 vmbus_dma_mask = DMA_BIT_MASK(64);
> /*
> * vmbus_device_register - Register the child device
> */
> @@ -2118,6 +2120,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
> }
> hv_debug_add_dev_dir(child_device_obj);
>
> + child_device_obj->device.dma_mask = &vmbus_dma_mask;
> return 0;
>
> err_kset_unregister:
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> index e285a220c913..a8ac2239de0f 100644
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -13,14 +13,22 @@
> #include <linux/irq.h>
> #include <linux/iommu.h>
> #include <linux/module.h>
> +#include <linux/hyperv.h>
> +#include <linux/io.h>
>
> #include <asm/apic.h>
> #include <asm/cpu.h>
> #include <asm/hw_irq.h>
> #include <asm/io_apic.h>
> +#include <asm/iommu.h>
> +#include <asm/iommu_table.h>
> #include <asm/irq_remapping.h>
> #include <asm/hypervisor.h>
> #include <asm/mshyperv.h>
> +#include <asm/swiotlb.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/dma-direct.h>
> +#include <linux/set_memory.h>
>
> #include "irq_remapping.h"
>
> @@ -36,6 +44,9 @@
> static cpumask_t ioapic_max_cpumask = { CPU_BITS_NONE };
> static struct irq_domain *ioapic_ir_domain;
>
> +static unsigned long hyperv_io_tlb_size;
> +static void *hyperv_io_tlb_start;
> +
> static int hyperv_ir_set_affinity(struct irq_data *data,
> const struct cpumask *mask, bool force)
> {
> @@ -337,4 +348,53 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
> .free = hyperv_root_irq_remapping_free,
> };
>
> +static void __init hyperv_iommu_swiotlb_init(void)
> +{
> + /*
> + * Allocate Hyper-V swiotlb bounce buffer at early place
> + * to reserve large contiguous memory.
> + */
> + hyperv_io_tlb_size = swiotlb_size_or_default();
> + hyperv_io_tlb_start = memblock_alloc(
> + hyperv_io_tlb_size, PAGE_SIZE);
> +
> + if (!hyperv_io_tlb_start) {
> + pr_warn("Fail to allocate Hyper-V swiotlb buffer.\n");
> + return;
> + }
> +}
> +
> +int __init hyperv_swiotlb_detect(void)
> +{
> + if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
> + return 0;
> +
> + if (!hv_is_isolation_supported())
> + return 0;
> +
> + /*
> + * Enable swiotlb force mode in Isolation VM to
> + * use swiotlb bounce buffer for dma transaction.
> + */
> + swiotlb_unencrypted_base = ms_hyperv.shared_gpa_boundary;
> + swiotlb_force = SWIOTLB_FORCE;
> + return 1;
> +}
> +
> +static void __init hyperv_iommu_swiotlb_later_init(void)
> +{
> + /*
> + * Swiotlb bounce buffer needs to be mapped in extra address
> + * space. Map function doesn't work in the early place and so
> + * call swiotlb_late_init_with_tbl() here.
> + */
> + if (swiotlb_late_init_with_tbl(hyperv_io_tlb_start,
> + hyperv_io_tlb_size >> IO_TLB_SHIFT))
> + panic("Fail to initialize hyperv swiotlb.\n");
> +}
> +
> +IOMMU_INIT_FINISH(hyperv_swiotlb_detect,
> + NULL, hyperv_iommu_swiotlb_init,
> + hyperv_iommu_swiotlb_later_init);
> +
> #endif
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index a9e0bc3b1511..bb1a1519b93a 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1739,6 +1739,7 @@ int hyperv_write_cfg_blk(struct pci_dev *dev, void *buf, unsigned int len,
> int hyperv_reg_block_invalidate(struct pci_dev *dev, void *context,
> void (*block_invalidate)(void *context,
> u64 block_mask));
> +int __init hyperv_swiotlb_detect(void);
>
> struct hyperv_pci_block_ops {
> int (*read_block)(struct pci_dev *dev, void *buf, unsigned int buf_len,
> --
> 2.25.1
From: Tianyu Lan <[email protected]> Sent: Tuesday, September 14, 2021 6:39 AM
>
> In Isolation VM, all shared memory with the host needs to be marked visible
> to the host via hvcall. vmbus_establish_gpadl() has already done it for the
> netvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_
> pagebuffer() still needs to be handled. Use the DMA API to map/unmap
> this memory when sending/receiving packets, and the Hyper-V swiotlb
> bounce buffer dma address will be returned. The swiotlb bounce buffer
> has been marked visible to the host during boot up.
>
> Allocate the rx/tx ring buffers via alloc_pages() in Isolation VM and map
> these pages via vmap(). After calling vmbus_establish_gpadl(), which
> marks these pages visible to the host, unmap the pages to release the
> virtual addresses mapped to physical addresses below shared_gpa_boundary,
> and map them in the extra address space via vmap_pfn().
>
> Signed-off-by: Tianyu Lan <[email protected]>
> ---
> Change since v4:
> * Allocate rx/tx ring buffer via alloc_pages() in Isolation VM
> * Map pages after calling vmbus_establish_gpadl().
> * set dma_set_min_align_mask for netvsc driver.
>
> Change since v3:
> * Add comment to explain why not to use dma_map_sg()
> * Fix some error handle.
> ---
> drivers/net/hyperv/hyperv_net.h | 7 +
> drivers/net/hyperv/netvsc.c | 287 +++++++++++++++++++++++++++++-
> drivers/net/hyperv/netvsc_drv.c | 1 +
> drivers/net/hyperv/rndis_filter.c | 2 +
> include/linux/hyperv.h | 5 +
> 5 files changed, 296 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 315278a7cf88..87e8c74398a5 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -164,6 +164,7 @@ struct hv_netvsc_packet {
> u32 total_bytes;
> u32 send_buf_index;
> u32 total_data_buflen;
> + struct hv_dma_range *dma_range;
> };
>
> #define NETVSC_HASH_KEYLEN 40
> @@ -1074,6 +1075,8 @@ struct netvsc_device {
>
> /* Receive buffer allocated by us but manages by NetVSP */
> void *recv_buf;
> + struct page **recv_pages;
> + u32 recv_page_count;
> u32 recv_buf_size; /* allocated bytes */
> struct vmbus_gpadl recv_buf_gpadl_handle;
> u32 recv_section_cnt;
> @@ -1082,6 +1085,8 @@ struct netvsc_device {
>
> /* Send buffer allocated by us */
> void *send_buf;
> + struct page **send_pages;
> + u32 send_page_count;
> u32 send_buf_size;
> struct vmbus_gpadl send_buf_gpadl_handle;
> u32 send_section_cnt;
> @@ -1731,4 +1736,6 @@ struct rndis_message {
> #define RETRY_US_HI 10000
> #define RETRY_MAX 2000 /* >10 sec */
>
> +void netvsc_dma_unmap(struct hv_device *hv_dev,
> + struct hv_netvsc_packet *packet);
> #endif /* _HYPERV_NET_H */
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 1f87e570ed2b..7d5254bf043e 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -20,6 +20,7 @@
> #include <linux/vmalloc.h>
> #include <linux/rtnetlink.h>
> #include <linux/prefetch.h>
> +#include <linux/gfp.h>
>
> #include <asm/sync_bitops.h>
> #include <asm/mshyperv.h>
> @@ -150,11 +151,33 @@ static void free_netvsc_device(struct rcu_head *head)
> {
> struct netvsc_device *nvdev
> = container_of(head, struct netvsc_device, rcu);
> + unsigned int alloc_unit;
> int i;
>
> kfree(nvdev->extension);
> - vfree(nvdev->recv_buf);
> - vfree(nvdev->send_buf);
> +
> + if (nvdev->recv_pages) {
> + alloc_unit = (nvdev->recv_buf_size /
> + nvdev->recv_page_count) >> PAGE_SHIFT;
> +
> + vunmap(nvdev->recv_buf);
> + for (i = 0; i < nvdev->recv_page_count; i++)
> + __free_pages(nvdev->recv_pages[i], alloc_unit);
> + } else {
> + vfree(nvdev->recv_buf);
> + }
> +
> + if (nvdev->send_pages) {
> + alloc_unit = (nvdev->send_buf_size /
> + nvdev->send_page_count) >> PAGE_SHIFT;
> +
> + vunmap(nvdev->send_buf);
> + for (i = 0; i < nvdev->send_page_count; i++)
> + __free_pages(nvdev->send_pages[i], alloc_unit);
> + } else {
> + vfree(nvdev->send_buf);
> + }
> +
> kfree(nvdev->send_section_map);
>
> for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
> @@ -330,6 +353,108 @@ int netvsc_alloc_recv_comp_ring(struct netvsc_device *net_device, u32 q_idx)
> return nvchan->mrc.slots ? 0 : -ENOMEM;
> }
>
> +void *netvsc_alloc_pages(struct page ***pages_array, unsigned int *array_len,
> + unsigned long size)
> +{
> + struct page *page, **pages, **vmap_pages;
> + unsigned long pg_count = size >> PAGE_SHIFT;
> + int alloc_unit = MAX_ORDER_NR_PAGES;
> + int i, j, vmap_page_index = 0;
> + void *vaddr;
> +
> + if (pg_count < alloc_unit)
> + alloc_unit = 1;
> +
> + /* vmap() accepts page array with PAGE_SIZE as unit while try to
> + * allocate high order pages here in order to save page array space.
> + * vmap_pages[] is used as input parameter of vmap(). pages[] is to
> + * store allocated pages and map them later.
> + */
> + vmap_pages = kmalloc_array(pg_count, sizeof(*vmap_pages), GFP_KERNEL);
> + if (!vmap_pages)
> + return NULL;
> +
> +retry:
> + *array_len = pg_count / alloc_unit;
> + pages = kmalloc_array(*array_len, sizeof(*pages), GFP_KERNEL);
> + if (!pages)
> + goto cleanup;
> +
> + for (i = 0; i < *array_len; i++) {
> + page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
> + get_order(alloc_unit << PAGE_SHIFT));
> + if (!page) {
> + /* Try allocating small pages if high order pages are not available. */
> + if (alloc_unit == 1) {
> + goto cleanup;
> + } else {
The "else" clause isn't really needed because of the goto cleanup above. Then
the indentation of the code below could be reduced by one level.
> + memset(vmap_pages, 0,
> + sizeof(*vmap_pages) * vmap_page_index);
> + vmap_page_index = 0;
> +
> + for (j = 0; j < i; j++)
> + __free_pages(pages[j], alloc_unit);
> +
> + kfree(pages);
> + alloc_unit = 1;
This is the case where a large enough contiguous physical memory chunk could
not be found. But rather than dropping all the way down to single pages,
would it make sense to try something smaller, but not 1? For example,
cut the alloc_unit in half and try again. But I'm not sure of all the implications.
> + goto retry;
> + }
> + }
> +
> + pages[i] = page;
> + for (j = 0; j < alloc_unit; j++)
> + vmap_pages[vmap_page_index++] = page++;
> + }
> +
> + vaddr = vmap(vmap_pages, vmap_page_index, VM_MAP, PAGE_KERNEL);
> + kfree(vmap_pages);
> +
> + *pages_array = pages;
> + return vaddr;
> +
> +cleanup:
> + for (j = 0; j < i; j++)
> + __free_pages(pages[i], alloc_unit);
> +
> + kfree(pages);
> + kfree(vmap_pages);
> + return NULL;
> +}
> +
> +static void *netvsc_map_pages(struct page **pages, int count, int alloc_unit)
> +{
> + int pg_count = count * alloc_unit;
> + struct page *page;
> + unsigned long *pfns;
> + int pfn_index = 0;
> + void *vaddr;
> + int i, j;
> +
> + if (!pages)
> + return NULL;
> +
> + pfns = kcalloc(pg_count, sizeof(*pfns), GFP_KERNEL);
> + if (!pfns)
> + return NULL;
> +
> + for (i = 0; i < count; i++) {
> + page = pages[i];
> + if (!page) {
> + pr_warn("page is not available %d.\n", i);
> + return NULL;
> + }
> +
> + for (j = 0; j < alloc_unit; j++) {
> + pfns[pfn_index++] = page_to_pfn(page++) +
> + (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
> + }
> + }
> +
> + vaddr = vmap_pfn(pfns, pg_count, PAGE_KERNEL_IO);
> + kfree(pfns);
> + return vaddr;
> +}
> +
I think you are proposing this approach to allocating memory for the send
and receive buffers so that you can avoid having two virtual mappings for
the memory, per comments from Christoph Hellwig. But overall, the approach
seems a bit complex and I wonder if it is worth it. If allocating large contiguous
chunks of physical memory is successful, then there is some memory savings
in that the data structures needed to keep track of the physical pages are
smaller than the equivalent page tables might be. But if you have to revert
to allocating individual pages, then the memory savings is reduced.
Ultimately, the list of actual PFNs has to be kept somewhere. Another approach
would be to do the reverse of what hv_map_memory() from the v4 patch
series does. I.e., you could do virt_to_phys() on each virtual address that
maps above VTOM, and subtract out the shared_gpa_boundary to get the
list of actual PFNs that need to be freed. This way you don't have two copies
of the list of PFNs -- one with and one without the shared_gpa_boundary added.
But it comes at the cost of additional code so that may not be a great idea.
I think what you have here works, and I don't have a clearly better solution
at the moment except perhaps to revert to the v4 solution and just have two
virtual mappings. I'll keep thinking about it. Maybe Christoph has other
thoughts.
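To make the round trip concrete, here is a minimal user-space sketch of the PFN arithmetic under discussion. The boundary value and function names are made up for illustration; the real boundary comes from ms_hyperv.shared_gpa_boundary reported by Hyper-V CPUID:

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Hypothetical vTOM value for illustration only. */
static const uint64_t shared_gpa_boundary = 1ULL << 39;

/* Build the alias PFN above the shared GPA boundary, as
 * netvsc_map_pages() does before calling vmap_pfn(). */
static uint64_t pfn_to_shared(uint64_t pfn)
{
	return pfn + (shared_gpa_boundary >> PAGE_SHIFT);
}

/* The reverse mapping described above: subtract the boundary
 * back out to recover the real PFN that must be freed. */
static uint64_t shared_to_pfn(uint64_t shared_pfn)
{
	return shared_pfn - (shared_gpa_boundary >> PAGE_SHIFT);
}
```

The subtraction is trivially the inverse of the addition, so keeping only one PFN list is arithmetically possible; the hard part is recovering a physical address from a vmap_pfn() virtual address in the first place.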
> static int netvsc_init_buf(struct hv_device *device,
> struct netvsc_device *net_device,
> const struct netvsc_device_info *device_info)
> @@ -337,7 +462,7 @@ static int netvsc_init_buf(struct hv_device *device,
> struct nvsp_1_message_send_receive_buffer_complete *resp;
> struct net_device *ndev = hv_get_drvdata(device);
> struct nvsp_message *init_packet;
> - unsigned int buf_size;
> + unsigned int buf_size, alloc_unit;
> size_t map_words;
> int i, ret = 0;
>
> @@ -350,7 +475,14 @@ static int netvsc_init_buf(struct hv_device *device,
> buf_size = min_t(unsigned int, buf_size,
> NETVSC_RECEIVE_BUFFER_SIZE_LEGACY);
>
> - net_device->recv_buf = vzalloc(buf_size);
> + if (hv_isolation_type_snp())
> + net_device->recv_buf =
> + netvsc_alloc_pages(&net_device->recv_pages,
> + &net_device->recv_page_count,
> + buf_size);
> + else
> + net_device->recv_buf = vzalloc(buf_size);
> +
I wonder if it is necessary to have two different code paths here. The
allocating and freeing of the send and receive buffers is not perf
sensitive, and it seems like netvsc_alloc_pages() could be used
regardless of whether SNP Isolation is in effect. To my thinking,
one code path is better than two code paths unless there's a
compelling reason to have two.
> if (!net_device->recv_buf) {
> netdev_err(ndev,
> "unable to allocate receive buffer of size %u\n",
> @@ -375,6 +507,27 @@ static int netvsc_init_buf(struct hv_device *device,
> goto cleanup;
> }
>
> + if (hv_isolation_type_snp()) {
> + alloc_unit = (buf_size / net_device->recv_page_count)
> + >> PAGE_SHIFT;
> +
> + /* Unmap previous virtual address and map pages in the extra
> + * address space(above shared gpa boundary) in Isolation VM.
> + */
> + vunmap(net_device->recv_buf);
> + net_device->recv_buf =
> + netvsc_map_pages(net_device->recv_pages,
> + net_device->recv_page_count,
> + alloc_unit);
> + if (!net_device->recv_buf) {
> + netdev_err(ndev,
> + "unable to allocate receive buffer of size %u\n",
> + buf_size);
> + ret = -ENOMEM;
> + goto cleanup;
> + }
> + }
> +
> /* Notify the NetVsp of the gpadl handle */
> init_packet = &net_device->channel_init_pkt;
> memset(init_packet, 0, sizeof(struct nvsp_message));
> @@ -456,13 +609,21 @@ static int netvsc_init_buf(struct hv_device *device,
> buf_size = device_info->send_sections * device_info->send_section_size;
> buf_size = round_up(buf_size, PAGE_SIZE);
>
> - net_device->send_buf = vzalloc(buf_size);
> + if (hv_isolation_type_snp())
> + net_device->send_buf =
> + netvsc_alloc_pages(&net_device->send_pages,
> + &net_device->send_page_count,
> + buf_size);
> + else
> + net_device->send_buf = vzalloc(buf_size);
> +
> if (!net_device->send_buf) {
> netdev_err(ndev, "unable to allocate send buffer of size %u\n",
> buf_size);
> ret = -ENOMEM;
> goto cleanup;
> }
> +
> net_device->send_buf_size = buf_size;
>
> /* Establish the gpadl handle for this buffer on this
> @@ -478,6 +639,27 @@ static int netvsc_init_buf(struct hv_device *device,
> goto cleanup;
> }
>
> + if (hv_isolation_type_snp()) {
> + alloc_unit = (buf_size / net_device->send_page_count)
> + >> PAGE_SHIFT;
> +
> + /* Unmap previous virtual address and map pages in the extra
> + * address space(above shared gpa boundary) in Isolation VM.
> + */
> + vunmap(net_device->send_buf);
> + net_device->send_buf =
> + netvsc_map_pages(net_device->send_pages,
> + net_device->send_page_count,
> + alloc_unit);
> + if (!net_device->send_buf) {
> + netdev_err(ndev,
> + "unable to allocate receive buffer of size %u\n",
> + buf_size);
> + ret = -ENOMEM;
> + goto cleanup;
> + }
> + }
> +
> /* Notify the NetVsp of the gpadl handle */
> init_packet = &net_device->channel_init_pkt;
> memset(init_packet, 0, sizeof(struct nvsp_message));
> @@ -768,7 +950,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
>
> /* Notify the layer above us */
> if (likely(skb)) {
> - const struct hv_netvsc_packet *packet
> + struct hv_netvsc_packet *packet
> = (struct hv_netvsc_packet *)skb->cb;
> u32 send_index = packet->send_buf_index;
> struct netvsc_stats *tx_stats;
> @@ -784,6 +966,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
> tx_stats->bytes += packet->total_bytes;
> u64_stats_update_end(&tx_stats->syncp);
>
> + netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
> napi_consume_skb(skb, budget);
> }
>
> @@ -948,6 +1131,87 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
> memset(dest, 0, padding);
> }
>
> +void netvsc_dma_unmap(struct hv_device *hv_dev,
> + struct hv_netvsc_packet *packet)
> +{
> + u32 page_count = packet->cp_partial ?
> + packet->page_buf_cnt - packet->rmsg_pgcnt :
> + packet->page_buf_cnt;
> + int i;
> +
> + if (!hv_is_isolation_supported())
> + return;
> +
> + if (!packet->dma_range)
> + return;
> +
> + for (i = 0; i < page_count; i++)
> + dma_unmap_single(&hv_dev->device, packet->dma_range[i].dma,
> + packet->dma_range[i].mapping_size,
> + DMA_TO_DEVICE);
> +
> + kfree(packet->dma_range);
> +}
> +
> +/* netvsc_dma_map - Map swiotlb bounce buffer with data page of
> + * packet sent by vmbus_sendpacket_pagebuffer() in the Isolation
> + * VM.
> + *
> + * In isolation VM, netvsc send buffer has been marked visible to
> + * host and so the data copied to send buffer doesn't need to use
> + * bounce buffer. The data pages handled by vmbus_sendpacket_pagebuffer()
> + * may not be copied to send buffer and so these pages need to be
> + * mapped with swiotlb bounce buffer. netvsc_dma_map() is to do
> + * that. The pfns in the struct hv_page_buffer need to be converted
> + * to bounce buffer's pfn. The loop here is necessary because the
> + * entries in the page buffer array are not necessarily full
> + * pages of data. Each entry in the array has a separate offset and
> + * len that may be non-zero, even for entries in the middle of the
> + * array. And the entries are not physically contiguous. So each
> + * entry must be individually mapped rather than as a contiguous unit.
> + * So not use dma_map_sg() here.
> + */
> +static int netvsc_dma_map(struct hv_device *hv_dev,
> + struct hv_netvsc_packet *packet,
> + struct hv_page_buffer *pb)
> +{
> + u32 page_count = packet->cp_partial ?
> + packet->page_buf_cnt - packet->rmsg_pgcnt :
> + packet->page_buf_cnt;
> + dma_addr_t dma;
> + int i;
> +
> + if (!hv_is_isolation_supported())
> + return 0;
> +
> + packet->dma_range = kcalloc(page_count,
> + sizeof(*packet->dma_range),
> + GFP_KERNEL);
> + if (!packet->dma_range)
> + return -ENOMEM;
> +
> + for (i = 0; i < page_count; i++) {
> + char *src = phys_to_virt((pb[i].pfn << HV_HYP_PAGE_SHIFT)
> + + pb[i].offset);
> + u32 len = pb[i].len;
> +
> + dma = dma_map_single(&hv_dev->device, src, len,
> + DMA_TO_DEVICE);
> + if (dma_mapping_error(&hv_dev->device, dma)) {
> + kfree(packet->dma_range);
> + return -ENOMEM;
> + }
> +
> + packet->dma_range[i].dma = dma;
> + packet->dma_range[i].mapping_size = len;
> + pb[i].pfn = dma >> HV_HYP_PAGE_SHIFT;
> + pb[i].offset = offset_in_hvpage(dma);
With the DMA min align mask now being set, the offset within
the Hyper-V page won't be changed by dma_map_single(). So I
think the above statement can be removed.
> + pb[i].len = len;
A few lines above, the value of "len" is set from pb[i].len. Neither
"len" nor "i" is changed in the loop, so this statement can also be
removed.
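Since the driver now calls dma_set_min_align_mask(&dev->device, HV_HYP_PAGE_SIZE - 1), swiotlb must pick a bounce slot that preserves the low bits of the original address up to that mask. A toy model of the invariant (not the real swiotlb allocator; slot_base is a hypothetical, suitably aligned bounce slot start):

```c
#include <stdint.h>

#define HV_HYP_PAGE_SHIFT 12
#define HV_HYP_PAGE_SIZE  (1ULL << HV_HYP_PAGE_SHIFT)

/* Toy model: the DMA address keeps the same low bits as the
 * original physical address, up to min_align_mask. */
static uint64_t bounce_map(uint64_t slot_base, uint64_t orig_phys,
			   uint64_t min_align_mask)
{
	return (slot_base & ~min_align_mask) | (orig_phys & min_align_mask);
}
```

With the mask set to HV_HYP_PAGE_SIZE - 1, offset_in_hvpage(dma) equals the original offset, which is why the pb[i].offset update is redundant.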
> + }
> +
> + return 0;
> +}
> +
> static inline int netvsc_send_pkt(
> struct hv_device *device,
> struct hv_netvsc_packet *packet,
> @@ -988,14 +1252,24 @@ static inline int netvsc_send_pkt(
>
> trace_nvsp_send_pkt(ndev, out_channel, rpkt);
>
> + packet->dma_range = NULL;
> if (packet->page_buf_cnt) {
> if (packet->cp_partial)
> pb += packet->rmsg_pgcnt;
>
> + ret = netvsc_dma_map(ndev_ctx->device_ctx, packet, pb);
> + if (ret) {
> + ret = -EAGAIN;
> + goto exit;
> + }
> +
> ret = vmbus_sendpacket_pagebuffer(out_channel,
> pb, packet->page_buf_cnt,
> &nvmsg, sizeof(nvmsg),
> req_id);
> +
> + if (ret)
> + netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
> } else {
> ret = vmbus_sendpacket(out_channel,
> &nvmsg, sizeof(nvmsg),
> @@ -1003,6 +1277,7 @@ static inline int netvsc_send_pkt(
> VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
> }
>
> +exit:
> if (ret == 0) {
> atomic_inc_return(&nvchan->queue_sends);
>
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index 382bebc2420d..c3dc884b31e3 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -2577,6 +2577,7 @@ static int netvsc_probe(struct hv_device *dev,
> list_add(&net_device_ctx->list, &netvsc_dev_list);
> rtnl_unlock();
>
> + dma_set_min_align_mask(&dev->device, HV_HYP_PAGE_SIZE - 1);
> netvsc_devinfo_put(device_info);
> return 0;
>
> diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
> index f6c9c2a670f9..448fcc325ed7 100644
> --- a/drivers/net/hyperv/rndis_filter.c
> +++ b/drivers/net/hyperv/rndis_filter.c
> @@ -361,6 +361,8 @@ static void rndis_filter_receive_response(struct net_device *ndev,
> }
> }
>
> + netvsc_dma_unmap(((struct net_device_context *)
> + netdev_priv(ndev))->device_ctx, &request->pkt);
> complete(&request->wait_event);
> } else {
> netdev_err(ndev,
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index c94c534a944e..81e58dd582dc 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1597,6 +1597,11 @@ struct hyperv_service_callback {
> void (*callback)(void *context);
> };
>
> +struct hv_dma_range {
> + dma_addr_t dma;
> + u32 mapping_size;
> +};
> +
> #define MAX_SRV_VER 0x7ffffff
> extern bool vmbus_prep_negotiate_resp(struct icmsg_hdr *icmsghdrp, u8 *buf, u32 buflen,
> const int *fw_version, int fw_vercnt,
> --
> 2.25.1
> -----Original Message-----
> From: Michael Kelley <[email protected]>
> Sent: Wednesday, September 15, 2021 12:22 PM
> To: Tianyu Lan <[email protected]>; KY Srinivasan <[email protected]>;
> > + memset(vmap_pages, 0,
> > + sizeof(*vmap_pages) * vmap_page_index);
> > + vmap_page_index = 0;
> > +
> > + for (j = 0; j < i; j++)
> > + __free_pages(pages[j], alloc_unit);
> > +
> > + kfree(pages);
> > + alloc_unit = 1;
>
> This is the case where a large enough contiguous physical memory chunk
> could not be found. But rather than dropping all the way down to single
> pages, would it make sense to try something smaller, but not 1? For
> example, cut the alloc_unit in half and try again. But I'm not sure of
> all the implications.
I had the same question. But perhaps gradually decrementing takes too much
time?
>
> > + goto retry;
> > + }
> > + }
> > +
> > + pages[i] = page;
> > + for (j = 0; j < alloc_unit; j++)
> > + vmap_pages[vmap_page_index++] = page++;
> > + }
> > +
> > + vaddr = vmap(vmap_pages, vmap_page_index, VM_MAP, PAGE_KERNEL);
> > + kfree(vmap_pages);
> > +
> > + *pages_array = pages;
> > + return vaddr;
> > +
> > +cleanup:
> > + for (j = 0; j < i; j++)
> > + __free_pages(pages[i], alloc_unit);
> > +
> > + kfree(pages);
> > + kfree(vmap_pages);
> > + return NULL;
> > +}
> > +
> > +static void *netvsc_map_pages(struct page **pages, int count, int
> > +alloc_unit) {
> > + int pg_count = count * alloc_unit;
> > + struct page *page;
> > + unsigned long *pfns;
> > + int pfn_index = 0;
> > + void *vaddr;
> > + int i, j;
> > +
> > + if (!pages)
> > + return NULL;
> > +
> > + pfns = kcalloc(pg_count, sizeof(*pfns), GFP_KERNEL);
> > + if (!pfns)
> > + return NULL;
> > +
> > + for (i = 0; i < count; i++) {
> > + page = pages[i];
> > + if (!page) {
> > + pr_warn("page is not available %d.\n", i);
> > + return NULL;
> > + }
> > +
> > + for (j = 0; j < alloc_unit; j++) {
> > + pfns[pfn_index++] = page_to_pfn(page++) +
> > + (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
> > + }
> > + }
> > +
> > + vaddr = vmap_pfn(pfns, pg_count, PAGE_KERNEL_IO);
> > + kfree(pfns);
> > + return vaddr;
> > +}
> > +
>
> I think you are proposing this approach to allocating memory for the
> send and receive buffers so that you can avoid having two virtual
> mappings for the memory, per comments from Christoph Hellwig. But
> overall, the approach seems a bit complex and I wonder if it is worth it.
> If allocating large contiguous chunks of physical memory is successful,
> then there is some memory savings in that the data structures needed to
> keep track of the physical pages is smaller than the equivalent page
> tables might be. But if you have to revert to allocating individual
> pages, then the memory savings is reduced.
>
> Ultimately, the list of actual PFNs has to be kept somewhere. Another
> approach would be to do the reverse of what hv_map_memory() from the v4
> patch series does. I.e., you could do virt_to_phys() on each virtual
> address that maps above VTOM, and subtract out the shared_gpa_boundary
> to get the
> list of actual PFNs that need to be freed. This way you don't have two
> copies
> of the list of PFNs -- one with and one without the shared_gpa_boundary
> added.
> But it comes at the cost of additional code so that may not be a great
> idea.
>
> I think what you have here works, and I don't have a clearly better
> solution at the moment except perhaps to revert to the v4 solution and
> just have two virtual mappings. I'll keep thinking about it. Maybe
> Christoph has other thoughts.
>
> > static int netvsc_init_buf(struct hv_device *device,
> > struct netvsc_device *net_device,
> > const struct netvsc_device_info *device_info) @@ -
> 337,7 +462,7
> > @@ static int netvsc_init_buf(struct hv_device *device,
> > struct nvsp_1_message_send_receive_buffer_complete *resp;
> > struct net_device *ndev = hv_get_drvdata(device);
> > struct nvsp_message *init_packet;
> > - unsigned int buf_size;
> > + unsigned int buf_size, alloc_unit;
> > size_t map_words;
> > int i, ret = 0;
> >
> > @@ -350,7 +475,14 @@ static int netvsc_init_buf(struct hv_device
> *device,
> > buf_size = min_t(unsigned int, buf_size,
> > NETVSC_RECEIVE_BUFFER_SIZE_LEGACY);
> >
> > - net_device->recv_buf = vzalloc(buf_size);
> > + if (hv_isolation_type_snp())
> > + net_device->recv_buf =
> > + netvsc_alloc_pages(&net_device->recv_pages,
> > + &net_device->recv_page_count,
> > + buf_size);
> > + else
> > + net_device->recv_buf = vzalloc(buf_size);
> > +
>
> I wonder if it is necessary to have two different code paths here. The
> allocating and freeing of the send and receive buffers is not perf
> sensitive, and it seems like netvsc_alloc_pages() could be used
> regardless of whether SNP Isolation is in effect. To my thinking, one
> code path is better than two code paths unless there's a compelling
> reason to have two.
I still prefer keeping the simple vzalloc for the non-isolated VMs, because
a simpler code path usually means more robustness.
I don't know how big the time difference between the two is, but in some
cases we really care about boot time.
Also in the multi-vPort case for MANA, we potentially support hundreds of
vPorts, and there will be the same number of synthetic NICs associated with
them. So even a small difference in initialization time may add up.
Thanks,
- Haiyang
On 9/15/2021 11:41 PM, Michael Kelley wrote:
>> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
>> index 42f3d9d123a1..560cba916d1d 100644
>> --- a/drivers/hv/hyperv_vmbus.h
>> +++ b/drivers/hv/hyperv_vmbus.h
>> @@ -240,6 +240,8 @@ struct vmbus_connection {
>> * is child->parent notification
>> */
>> struct hv_monitor_page *monitor_pages[2];
>> + void *monitor_pages_original[2];
>> + unsigned long monitor_pages_pa[2];
> The type of this field really should be phys_addr_t. In addition to
> making semantic sense, it will then match the return type of
> virt_to_phys() and the input arg to memremap(), since resource_size_t
> is typedef'ed as phys_addr_t.
>
OK. Will update in the next version.
Thanks.
On 9/15/2021 11:42 PM, Michael Kelley wrote:
>> @@ -196,13 +199,34 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
>> mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
>> mem->slots[i].alloc_size = 0;
>> }
>> +
>> + if (set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT))
>> + return -EFAULT;
>> +
>> + /*
>> + * Map memory in the unencrypted physical address space when requested
>> + * (e.g. for Hyper-V AMD SEV-SNP Isolation VMs).
>> + */
>> + if (swiotlb_unencrypted_base) {
>> + phys_addr_t paddr = __pa(vaddr) + swiotlb_unencrypted_base;
> Nit: Use "start" instead of "__pa(vaddr)" since "start" is already the needed
> physical address.
Yes, "start" should be used here.
>
>> @@ -304,7 +332,7 @@ int
>> swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
>> {
>> struct io_tlb_mem *mem = &io_tlb_default_mem;
>> - unsigned long bytes = nslabs << IO_TLB_SHIFT;
>> + int ret;
>>
>> if (swiotlb_force == SWIOTLB_NO_FORCE)
>> return 0;
>> @@ -318,8 +346,9 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
>> if (!mem->slots)
>> return -ENOMEM;
>>
>> - set_memory_decrypted((unsigned long)tlb, bytes >> PAGE_SHIFT);
>> - swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true);
>> + ret = swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true);
>> + if (ret)
> Before returning the error, free the pages obtained from the earlier call
> to __get_free_pages()?
>
Yes, will fix.
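The requested fix is the usual unwind-before-return pattern. A hedged user-space sketch (every name here is a stand-in, not the actual swiotlb code; failing_init models swiotlb_init_io_tlb_mem() returning an error):

```c
#include <stdlib.h>

static int pages_outstanding; /* stands in for pages from __get_free_pages() */

static void *fake_get_free_pages(void)
{
	pages_outstanding++;
	return malloc(64);
}

static void fake_free_pages(void *p)
{
	pages_outstanding--;
	free(p);
}

static int failing_init(void)
{
	return -12; /* models -ENOMEM */
}

/* Sketch of swiotlb_late_init_with_tbl() error handling: if the
 * init step fails, release the earlier allocation instead of
 * leaking it. */
static int late_init(int (*init)(void))
{
	void *slots = fake_get_free_pages();
	int ret = init();

	if (ret) {
		fake_free_pages(slots); /* the fix: free before returning */
		return ret;
	}
	fake_free_pages(slots); /* demo only; real code keeps the slots */
	return 0;
}
```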
Thanks.
On 9/16/2021 12:46 AM, Haiyang Zhang wrote:
>>> + memset(vmap_pages, 0,
>>> + sizeof(*vmap_pages) * vmap_page_index);
>>> + vmap_page_index = 0;
>>> +
>>> + for (j = 0; j < i; j++)
>>> + __free_pages(pages[j], alloc_unit);
>>> +
>>> + kfree(pages);
>>> + alloc_unit = 1;
>> This is the case where a large enough contiguous physical memory chunk
>> could not be found. But rather than dropping all the way down to single
>> pages, would it make sense to try something smaller, but not 1? For
>> example, cut the alloc_unit in half and try again. But I'm not sure of
>> all the implications.
> I had the same question. But probably gradually decrementing uses too much
> time?
>
This version is meant to propose the approach. We may optimize it to try
smaller sizes, down to a single page, if this is the right direction.
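As a sketch of the "cut the alloc_unit in half" idea: halving converges in at most log2(MAX_ORDER_NR_PAGES) steps, so the retry cost is bounded. Here pick_alloc_unit() and max_ok_unit are hypothetical stand-ins; real code would retry alloc_pages() at each size rather than know the limit up front:

```c
/* Halve the allocation unit until it reaches the (hypothetical)
 * largest size the allocator can satisfy, bottoming out at one page. */
static int pick_alloc_unit(int alloc_unit, int max_ok_unit)
{
	while (alloc_unit > 1 && alloc_unit > max_ok_unit)
		alloc_unit /= 2;
	return alloc_unit;
}
```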
On 9/16/2021 12:21 AM, Michael Kelley wrote:
> I think you are proposing this approach to allocating memory for the send
> and receive buffers so that you can avoid having two virtual mappings for
> the memory, per comments from Christoph Hellwig. But overall, the approach
> seems a bit complex and I wonder if it is worth it. If allocating large contiguous
> chunks of physical memory is successful, then there is some memory savings
> in that the data structures needed to keep track of the physical pages is
> smaller than the equivalent page tables might be. But if you have to revert
> to allocating individual pages, then the memory savings is reduced.
>
Yes, this version follows the idea from Christoph in the previous
discussion (https://lkml.org/lkml/2021/9/2/112).
This patch shows the implementation so we can check whether this is the
right direction.
> Ultimately, the list of actual PFNs has to be kept somewhere. Another approach
> would be to do the reverse of what hv_map_memory() from the v4 patch
> series does. I.e., you could do virt_to_phys() on each virtual address that
> maps above VTOM, and subtract out the shared_gpa_boundary to get the
> list of actual PFNs that need to be freed.
virt_to_phys() doesn't work for virtual addresses returned by
vmap()/vmap_pfn() (just as it doesn't work for a va returned by
vmalloc()). The pfns above vTOM don't have struct page backing, and
vmap_pfn() populates the pfns directly in the ptes (please see
vmap_pfn_apply()). So it's not easy to convert the va to a pa.
> This way you don't have two copies
> of the list of PFNs -- one with and one without the shared_gpa_boundary added.
> But it comes at the cost of additional code so that may not be a great idea.
>
> I think what you have here works, and I don't have a clearly better solution
> at the moment except perhaps to revert to the v4 solution and just have two
> virtual mappings. I'll keep thinking about it. Maybe Christoph has other
> thoughts.
Hi Christoph:
This patch follows your proposal from the previous discussion.
Could you have a look?
"use vmap_pfn as in the current series. But in that case I think
we should get rid of the other mapping created by vmalloc. I
thought a bit about finding a way to apply the offset in vmalloc
itself, but I think it would be too invasive to the normal fast
path. So the other sub-option would be to allocate the pages
manually (maybe even using high order allocations to reduce TLB
pressure) and then remap them" (https://lkml.org/lkml/2021/9/2/112)
Also, I merged your previous change for swiotlb into patch 9
“x86/Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM”.
Your previous change is at
http://git.infradead.org/users/hch/misc.git/commit/8248f295928aded3364a1e54a4e0022e93d3610c
Please have a look.
Thanks.
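On the swiotlb side, the patch relies on dma_set_min_align_mask(dev, HV_HYP_PAGE_SIZE - 1) making the bounce buffer preserve the offset within a Hyper-V page. A simplified model of that guarantee (bounce_map() is a made-up stand-in for the swiotlb slot selection, not the real implementation):

```c
#include <assert.h>

#define HV_HYP_PAGE_SIZE 4096UL

static unsigned long offset_in_hvpage(unsigned long addr)
{
	return addr & (HV_HYP_PAGE_SIZE - 1);
}

/* Hypothetical slot selection honoring min_align_mask: the returned
 * dma address keeps the low bits (intra-page offset) of the original
 * physical address, so offset_in_hvpage(dma) == offset_in_hvpage(phys). */
static unsigned long bounce_map(unsigned long phys, unsigned long slot_base)
{
	return (slot_base & ~(HV_HYP_PAGE_SIZE - 1)) + offset_in_hvpage(phys);
}
```

This offset-preservation property is what makes re-deriving pb[i].offset after dma_map_single() unnecessary in netvsc_dma_map().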
On 9/16/2021 12:21 AM, Michael Kelley wrote:
> From: Tianyu Lan <[email protected]> Sent: Tuesday, September 14, 2021 6:39 AM
>>
>> In Isolation VM, all memory shared with the host needs to be marked
>> visible to the host via hvcall. vmbus_establish_gpadl() has already
>> done this for the netvsc rx/tx ring buffers. The page buffers used by
>> vmbus_sendpacket_pagebuffer() still need to be handled. Use the DMA
>> API to map/unmap this memory when sending/receiving packets, and the
>> Hyper-V swiotlb bounce buffer dma address will be returned. The swiotlb
>> bounce buffer has been marked visible to the host during boot.
>>
>> Allocate the rx/tx ring buffers via alloc_pages() in Isolation VM and
>> map the pages via vmap(). After calling vmbus_establish_gpadl(), which
>> marks these pages visible to the host, unmap them to release the
>> virtual address mapped to the physical address below shared_gpa_boundary,
>> and remap them in the extra address space via vmap_pfn().
>>
>> Signed-off-by: Tianyu Lan <[email protected]>
>> ---
>> Change since v4:
>> * Allocate rx/tx ring buffer via alloc_pages() in Isolation VM
>> * Map pages after calling vmbus_establish_gpadl().
>> * set dma_set_min_align_mask for netvsc driver.
>>
>> Change since v3:
>> * Add comment to explain why not to use dma_map_sg()
>> * Fix some error handle.
>> ---
>> drivers/net/hyperv/hyperv_net.h | 7 +
>> drivers/net/hyperv/netvsc.c | 287 +++++++++++++++++++++++++++++-
>> drivers/net/hyperv/netvsc_drv.c | 1 +
>> drivers/net/hyperv/rndis_filter.c | 2 +
>> include/linux/hyperv.h | 5 +
>> 5 files changed, 296 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
>> index 315278a7cf88..87e8c74398a5 100644
>> --- a/drivers/net/hyperv/hyperv_net.h
>> +++ b/drivers/net/hyperv/hyperv_net.h
>> @@ -164,6 +164,7 @@ struct hv_netvsc_packet {
>> u32 total_bytes;
>> u32 send_buf_index;
>> u32 total_data_buflen;
>> + struct hv_dma_range *dma_range;
>> };
>>
>> #define NETVSC_HASH_KEYLEN 40
>> @@ -1074,6 +1075,8 @@ struct netvsc_device {
>>
>> /* Receive buffer allocated by us but manages by NetVSP */
>> void *recv_buf;
>> + struct page **recv_pages;
>> + u32 recv_page_count;
>> u32 recv_buf_size; /* allocated bytes */
>> struct vmbus_gpadl recv_buf_gpadl_handle;
>> u32 recv_section_cnt;
>> @@ -1082,6 +1085,8 @@ struct netvsc_device {
>>
>> /* Send buffer allocated by us */
>> void *send_buf;
>> + struct page **send_pages;
>> + u32 send_page_count;
>> u32 send_buf_size;
>> struct vmbus_gpadl send_buf_gpadl_handle;
>> u32 send_section_cnt;
>> @@ -1731,4 +1736,6 @@ struct rndis_message {
>> #define RETRY_US_HI 10000
>> #define RETRY_MAX 2000 /* >10 sec */
>>
>> +void netvsc_dma_unmap(struct hv_device *hv_dev,
>> + struct hv_netvsc_packet *packet);
>> #endif /* _HYPERV_NET_H */
>> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
>> index 1f87e570ed2b..7d5254bf043e 100644
>> --- a/drivers/net/hyperv/netvsc.c
>> +++ b/drivers/net/hyperv/netvsc.c
>> @@ -20,6 +20,7 @@
>> #include <linux/vmalloc.h>
>> #include <linux/rtnetlink.h>
>> #include <linux/prefetch.h>
>> +#include <linux/gfp.h>
>>
>> #include <asm/sync_bitops.h>
>> #include <asm/mshyperv.h>
>> @@ -150,11 +151,33 @@ static void free_netvsc_device(struct rcu_head *head)
>> {
>> struct netvsc_device *nvdev
>> = container_of(head, struct netvsc_device, rcu);
>> + unsigned int alloc_unit;
>> int i;
>>
>> kfree(nvdev->extension);
>> - vfree(nvdev->recv_buf);
>> - vfree(nvdev->send_buf);
>> +
>> + if (nvdev->recv_pages) {
>> + alloc_unit = (nvdev->recv_buf_size /
>> + nvdev->recv_page_count) >> PAGE_SHIFT;
>> +
>> + vunmap(nvdev->recv_buf);
>> + for (i = 0; i < nvdev->recv_page_count; i++)
>> + __free_pages(nvdev->recv_pages[i], alloc_unit);
>> + } else {
>> + vfree(nvdev->recv_buf);
>> + }
>> +
>> + if (nvdev->send_pages) {
>> + alloc_unit = (nvdev->send_buf_size /
>> + nvdev->send_page_count) >> PAGE_SHIFT;
>> +
>> + vunmap(nvdev->send_buf);
>> + for (i = 0; i < nvdev->send_page_count; i++)
>> + __free_pages(nvdev->send_pages[i], alloc_unit);
>> + } else {
>> + vfree(nvdev->send_buf);
>> + }
>> +
>> kfree(nvdev->send_section_map);
>>
>> for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
>> @@ -330,6 +353,108 @@ int netvsc_alloc_recv_comp_ring(struct netvsc_device *net_device, u32 q_idx)
>> return nvchan->mrc.slots ? 0 : -ENOMEM;
>> }
>>
>> +void *netvsc_alloc_pages(struct page ***pages_array, unsigned int *array_len,
>> + unsigned long size)
>> +{
>> + struct page *page, **pages, **vmap_pages;
>> + unsigned long pg_count = size >> PAGE_SHIFT;
>> + int alloc_unit = MAX_ORDER_NR_PAGES;
>> + int i, j, vmap_page_index = 0;
>> + void *vaddr;
>> +
>> + if (pg_count < alloc_unit)
>> + alloc_unit = 1;
>> +
>> + /* vmap() takes an array of single (PAGE_SIZE) pages, while high
>> + * order pages are allocated here to save page array space.
>> + * vmap_pages[] is the input to vmap(); pages[] stores the
>> + * allocated pages so they can be mapped later.
>> + */
>> + vmap_pages = kmalloc_array(pg_count, sizeof(*vmap_pages), GFP_KERNEL);
>> + if (!vmap_pages)
>> + return NULL;
>> +
>> +retry:
>> + *array_len = pg_count / alloc_unit;
>> + pages = kmalloc_array(*array_len, sizeof(*pages), GFP_KERNEL);
>> + if (!pages)
>> + goto cleanup;
>> +
>> + for (i = 0; i < *array_len; i++) {
>> + page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
>> + get_order(alloc_unit << PAGE_SHIFT));
>> + if (!page) {
>> + /* Try allocating small pages if high order pages are not available. */
>> + if (alloc_unit == 1) {
>> + goto cleanup;
>> + } else {
>
> The "else" clause isn't really needed because of the goto cleanup above. Then
> the indentation of the code below could be reduced by one level.
>
>> + memset(vmap_pages, 0,
>> + sizeof(*vmap_pages) * vmap_page_index);
>> + vmap_page_index = 0;
>> +
>> + for (j = 0; j < i; j++)
>> + __free_pages(pages[j], alloc_unit);
>> +
>> + kfree(pages);
>> + alloc_unit = 1;
>
> This is the case where a large enough contiguous physical memory chunk could
> not be found. But rather than dropping all the way down to single pages,
> would it make sense to try something smaller, but not 1? For example,
> cut the alloc_unit in half and try again. But I'm not sure of all the implications.
>
>> + goto retry;
>> + }
>> + }
>> +
>> + pages[i] = page;
>> + for (j = 0; j < alloc_unit; j++)
>> + vmap_pages[vmap_page_index++] = page++;
>> + }
>> +
>> + vaddr = vmap(vmap_pages, vmap_page_index, VM_MAP, PAGE_KERNEL);
>> + kfree(vmap_pages);
>> +
>> + *pages_array = pages;
>> + return vaddr;
>> +
>> +cleanup:
>> + for (j = 0; j < i; j++)
>> + __free_pages(pages[j], alloc_unit);
>> +
>> + kfree(pages);
>> + kfree(vmap_pages);
>> + return NULL;
>> +}
>> +
>> +static void *netvsc_map_pages(struct page **pages, int count, int alloc_unit)
>> +{
>> + int pg_count = count * alloc_unit;
>> + struct page *page;
>> + unsigned long *pfns;
>> + int pfn_index = 0;
>> + void *vaddr;
>> + int i, j;
>> +
>> + if (!pages)
>> + return NULL;
>> +
>> + pfns = kcalloc(pg_count, sizeof(*pfns), GFP_KERNEL);
>> + if (!pfns)
>> + return NULL;
>> +
>> + for (i = 0; i < count; i++) {
>> + page = pages[i];
>> + if (!page) {
>> + pr_warn("page is not available %d.\n", i);
>> + kfree(pfns);
>> + return NULL;
>> + }
>> +
>> + for (j = 0; j < alloc_unit; j++) {
>> + pfns[pfn_index++] = page_to_pfn(page++) +
>> + (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
>> + }
>> + }
>> +
>> + vaddr = vmap_pfn(pfns, pg_count, PAGE_KERNEL_IO);
>> + kfree(pfns);
>> + return vaddr;
>> +}
>> +
>
> I think you are proposing this approach to allocating memory for the send
> and receive buffers so that you can avoid having two virtual mappings for
> the memory, per comments from Christoph Hellwig. But overall, the approach
> seems a bit complex and I wonder if it is worth it. If allocating large contiguous
> chunks of physical memory is successful, then there is some memory savings
> in that the data structures needed to keep track of the physical pages is
> smaller than the equivalent page tables might be. But if you have to revert
> to allocating individual pages, then the memory savings is reduced.
>
> Ultimately, the list of actual PFNs has to be kept somewhere. Another approach
> would be to do the reverse of what hv_map_memory() from the v4 patch
> series does. I.e., you could do virt_to_phys() on each virtual address that
> maps above VTOM, and subtract out the shared_gpa_boundary to get the
> list of actual PFNs that need to be freed. This way you don't have two copies
> of the list of PFNs -- one with and one without the shared_gpa_boundary added.
> But it comes at the cost of additional code so that may not be a great idea.
>
> I think what you have here works, and I don't have a clearly better solution
> at the moment except perhaps to revert to the v4 solution and just have two
> virtual mappings. I'll keep thinking about it. Maybe Christoph has other
> thoughts.
>
>> static int netvsc_init_buf(struct hv_device *device,
>> struct netvsc_device *net_device,
>> const struct netvsc_device_info *device_info)
>> @@ -337,7 +462,7 @@ static int netvsc_init_buf(struct hv_device *device,
>> struct nvsp_1_message_send_receive_buffer_complete *resp;
>> struct net_device *ndev = hv_get_drvdata(device);
>> struct nvsp_message *init_packet;
>> - unsigned int buf_size;
>> + unsigned int buf_size, alloc_unit;
>> size_t map_words;
>> int i, ret = 0;
>>
>> @@ -350,7 +475,14 @@ static int netvsc_init_buf(struct hv_device *device,
>> buf_size = min_t(unsigned int, buf_size,
>> NETVSC_RECEIVE_BUFFER_SIZE_LEGACY);
>>
>> - net_device->recv_buf = vzalloc(buf_size);
>> + if (hv_isolation_type_snp())
>> + net_device->recv_buf =
>> + netvsc_alloc_pages(&net_device->recv_pages,
>> + &net_device->recv_page_count,
>> + buf_size);
>> + else
>> + net_device->recv_buf = vzalloc(buf_size);
>> +
>
> I wonder if it is necessary to have two different code paths here. The
> allocating and freeing of the send and receive buffers is not perf
> sensitive, and it seems like netvsc_alloc_pages() could be used
> regardless of whether SNP Isolation is in effect. To my thinking,
> one code path is better than two code paths unless there's a
> compelling reason to have two.
>
>> if (!net_device->recv_buf) {
>> netdev_err(ndev,
>> "unable to allocate receive buffer of size %u\n",
>> @@ -375,6 +507,27 @@ static int netvsc_init_buf(struct hv_device *device,
>> goto cleanup;
>> }
>>
>> + if (hv_isolation_type_snp()) {
>> + alloc_unit = (buf_size / net_device->recv_page_count)
>> + >> PAGE_SHIFT;
>> +
>> + /* Unmap previous virtual address and map pages in the extra
>> + * address space(above shared gpa boundary) in Isolation VM.
>> + */
>> + vunmap(net_device->recv_buf);
>> + net_device->recv_buf =
>> + netvsc_map_pages(net_device->recv_pages,
>> + net_device->recv_page_count,
>> + alloc_unit);
>> + if (!net_device->recv_buf) {
>> + netdev_err(ndev,
>> + "unable to allocate receive buffer of size %u\n",
>> + buf_size);
>> + ret = -ENOMEM;
>> + goto cleanup;
>> + }
>> + }
>> +
>> /* Notify the NetVsp of the gpadl handle */
>> init_packet = &net_device->channel_init_pkt;
>> memset(init_packet, 0, sizeof(struct nvsp_message));
>> @@ -456,13 +609,21 @@ static int netvsc_init_buf(struct hv_device *device,
>> buf_size = device_info->send_sections * device_info->send_section_size;
>> buf_size = round_up(buf_size, PAGE_SIZE);
>>
>> - net_device->send_buf = vzalloc(buf_size);
>> + if (hv_isolation_type_snp())
>> + net_device->send_buf =
>> + netvsc_alloc_pages(&net_device->send_pages,
>> + &net_device->send_page_count,
>> + buf_size);
>> + else
>> + net_device->send_buf = vzalloc(buf_size);
>> +
>> if (!net_device->send_buf) {
>> netdev_err(ndev, "unable to allocate send buffer of size %u\n",
>> buf_size);
>> ret = -ENOMEM;
>> goto cleanup;
>> }
>> +
>> net_device->send_buf_size = buf_size;
>>
>> /* Establish the gpadl handle for this buffer on this
>> @@ -478,6 +639,27 @@ static int netvsc_init_buf(struct hv_device *device,
>> goto cleanup;
>> }
>>
>> + if (hv_isolation_type_snp()) {
>> + alloc_unit = (buf_size / net_device->send_page_count)
>> + >> PAGE_SHIFT;
>> +
>> + /* Unmap previous virtual address and map pages in the extra
>> + * address space(above shared gpa boundary) in Isolation VM.
>> + */
>> + vunmap(net_device->send_buf);
>> + net_device->send_buf =
>> + netvsc_map_pages(net_device->send_pages,
>> + net_device->send_page_count,
>> + alloc_unit);
>> + if (!net_device->send_buf) {
>> + netdev_err(ndev,
>> + "unable to allocate send buffer of size %u\n",
>> + buf_size);
>> + ret = -ENOMEM;
>> + goto cleanup;
>> + }
>> + }
>> +
>> /* Notify the NetVsp of the gpadl handle */
>> init_packet = &net_device->channel_init_pkt;
>> memset(init_packet, 0, sizeof(struct nvsp_message));
>> @@ -768,7 +950,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
>>
>> /* Notify the layer above us */
>> if (likely(skb)) {
>> - const struct hv_netvsc_packet *packet
>> + struct hv_netvsc_packet *packet
>> = (struct hv_netvsc_packet *)skb->cb;
>> u32 send_index = packet->send_buf_index;
>> struct netvsc_stats *tx_stats;
>> @@ -784,6 +966,7 @@ static void netvsc_send_tx_complete(struct net_device *ndev,
>> tx_stats->bytes += packet->total_bytes;
>> u64_stats_update_end(&tx_stats->syncp);
>>
>> + netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
>> napi_consume_skb(skb, budget);
>> }
>>
>> @@ -948,6 +1131,87 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
>> memset(dest, 0, padding);
>> }
>>
>> +void netvsc_dma_unmap(struct hv_device *hv_dev,
>> + struct hv_netvsc_packet *packet)
>> +{
>> + u32 page_count = packet->cp_partial ?
>> + packet->page_buf_cnt - packet->rmsg_pgcnt :
>> + packet->page_buf_cnt;
>> + int i;
>> +
>> + if (!hv_is_isolation_supported())
>> + return;
>> +
>> + if (!packet->dma_range)
>> + return;
>> +
>> + for (i = 0; i < page_count; i++)
>> + dma_unmap_single(&hv_dev->device, packet->dma_range[i].dma,
>> + packet->dma_range[i].mapping_size,
>> + DMA_TO_DEVICE);
>> +
>> + kfree(packet->dma_range);
>> +}
>> +
>> +/* netvsc_dma_map - Map swiotlb bounce buffer with data page of
>> + * packet sent by vmbus_sendpacket_pagebuffer() in the Isolation
>> + * VM.
>> + *
>> + * In isolation VM, netvsc send buffer has been marked visible to
>> + * host and so the data copied to send buffer doesn't need to use
>> + * bounce buffer. The data pages handled by vmbus_sendpacket_pagebuffer()
>> + * may not be copied to send buffer and so these pages need to be
>> + * mapped with swiotlb bounce buffer. netvsc_dma_map() is to do
>> + * that. The pfns in the struct hv_page_buffer need to be converted
>> + * to bounce buffer's pfn. The loop here is necessary because the
>> + * entries in the page buffer array are not necessarily full
>> + * pages of data. Each entry in the array has a separate offset and
>> + * len that may be non-zero, even for entries in the middle of the
>> + * array. And the entries are not physically contiguous. So each
>> + * entry must be individually mapped rather than as a contiguous unit.
>> + * So dma_map_sg() is not used here.
>> + */
>> +static int netvsc_dma_map(struct hv_device *hv_dev,
>> + struct hv_netvsc_packet *packet,
>> + struct hv_page_buffer *pb)
>> +{
>> + u32 page_count = packet->cp_partial ?
>> + packet->page_buf_cnt - packet->rmsg_pgcnt :
>> + packet->page_buf_cnt;
>> + dma_addr_t dma;
>> + int i;
>> +
>> + if (!hv_is_isolation_supported())
>> + return 0;
>> +
>> + packet->dma_range = kcalloc(page_count,
>> + sizeof(*packet->dma_range),
>> + GFP_KERNEL);
>> + if (!packet->dma_range)
>> + return -ENOMEM;
>> +
>> + for (i = 0; i < page_count; i++) {
>> + char *src = phys_to_virt((pb[i].pfn << HV_HYP_PAGE_SHIFT)
>> + + pb[i].offset);
>> + u32 len = pb[i].len;
>> +
>> + dma = dma_map_single(&hv_dev->device, src, len,
>> + DMA_TO_DEVICE);
>> + if (dma_mapping_error(&hv_dev->device, dma)) {
>> + kfree(packet->dma_range);
>> + return -ENOMEM;
>> + }
>> +
>> + packet->dma_range[i].dma = dma;
>> + packet->dma_range[i].mapping_size = len;
>> + pb[i].pfn = dma >> HV_HYP_PAGE_SHIFT;
>> + pb[i].offset = offset_in_hvpage(dma);
>
> With the DMA min align mask now being set, the offset within
> the Hyper-V page won't be changed by dma_map_single(). So I
> think the above statement can be removed.
>
>> + pb[i].len = len;
>
> A few lines above, the value of "len" is set from pb[i].len. Neither
> "len" nor "i" is changed in the loop, so this statement can also be
> removed.
>
>> + }
>> +
>> + return 0;
>> +}
>> +
>> static inline int netvsc_send_pkt(
>> struct hv_device *device,
>> struct hv_netvsc_packet *packet,
>> @@ -988,14 +1252,24 @@ static inline int netvsc_send_pkt(
>>
>> trace_nvsp_send_pkt(ndev, out_channel, rpkt);
>>
>> + packet->dma_range = NULL;
>> if (packet->page_buf_cnt) {
>> if (packet->cp_partial)
>> pb += packet->rmsg_pgcnt;
>>
>> + ret = netvsc_dma_map(ndev_ctx->device_ctx, packet, pb);
>> + if (ret) {
>> + ret = -EAGAIN;
>> + goto exit;
>> + }
>> +
>> ret = vmbus_sendpacket_pagebuffer(out_channel,
>> pb, packet->page_buf_cnt,
>> &nvmsg, sizeof(nvmsg),
>> req_id);
>> +
>> + if (ret)
>> + netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
>> } else {
>> ret = vmbus_sendpacket(out_channel,
>> &nvmsg, sizeof(nvmsg),
>> @@ -1003,6 +1277,7 @@ static inline int netvsc_send_pkt(
>> VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
>> }
>>
>> +exit:
>> if (ret == 0) {
>> atomic_inc_return(&nvchan->queue_sends);
>>
>> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
>> index 382bebc2420d..c3dc884b31e3 100644
>> --- a/drivers/net/hyperv/netvsc_drv.c
>> +++ b/drivers/net/hyperv/netvsc_drv.c
>> @@ -2577,6 +2577,7 @@ static int netvsc_probe(struct hv_device *dev,
>> list_add(&net_device_ctx->list, &netvsc_dev_list);
>> rtnl_unlock();
>>
>> + dma_set_min_align_mask(&dev->device, HV_HYP_PAGE_SIZE - 1);
>> netvsc_devinfo_put(device_info);
>> return 0;
>>
>> diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
>> index f6c9c2a670f9..448fcc325ed7 100644
>> --- a/drivers/net/hyperv/rndis_filter.c
>> +++ b/drivers/net/hyperv/rndis_filter.c
>> @@ -361,6 +361,8 @@ static void rndis_filter_receive_response(struct net_device *ndev,
>> }
>> }
>>
>> + netvsc_dma_unmap(((struct net_device_context *)
>> + netdev_priv(ndev))->device_ctx, &request->pkt);
>> complete(&request->wait_event);
>> } else {
>> netdev_err(ndev,
>> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
>> index c94c534a944e..81e58dd582dc 100644
>> --- a/include/linux/hyperv.h
>> +++ b/include/linux/hyperv.h
>> @@ -1597,6 +1597,11 @@ struct hyperv_service_callback {
>> void (*callback)(void *context);
>> };
>>
>> +struct hv_dma_range {
>> + dma_addr_t dma;
>> + u32 mapping_size;
>> +};
>> +
>> #define MAX_SRV_VER 0x7ffffff
>> extern bool vmbus_prep_negotiate_resp(struct icmsg_hdr *icmsghdrp, u8 *buf, u32 buflen,
>> const int *fw_version, int fw_vercnt,
>> --
>> 2.25.1
>
Hi Christoph:
Gentle ping. The swiotlb and shared memory mapping changes in this
patchset need your review. Could you have a look?
Thanks.
On 9/22/2021 6:34 PM, Tianyu Lan wrote:
> Hi Christoph:
> This patch follows your purposal in the previous discussion.
> Could you have a look?
> "use vmap_pfn as in the current series. But in that case I think
> we should get rid of the other mapping created by vmalloc. I
> though a bit about finding a way to apply the offset in vmalloc
> itself, but I think it would be too invasive to the normal fast
> path. So the other sub-option would be to allocate the pages
> manually (maybe even using high order allocations to reduce TLB
> pressure) and then remap them(https://lkml.org/lkml/2021/9/2/112)
>
> Otherwise, I merge your previous change for swiotlb into patch 9
> “x86/Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM”
> You previous change
> link.(http://git.infradead.org/users/hch/misc.git/commit/8248f295928aded3364a1e54a4e0022e93d3610c)
> Please have a look.
>
>
> Thanks.
>
>
> On 9/16/2021 12:21 AM, Michael Kelley wrote:
>> From: Tianyu Lan <[email protected]> Sent: Tuesday, September 14,
>> 2021 6:39 AM
>>>
>>> In Isolation VM, all shared memory with host needs to mark visible
>>> to host via hvcall. vmbus_establish_gpadl() has already done it for
>>> netvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_
>>> pagebuffer() stills need to be handled. Use DMA API to map/umap
>>> these memory during sending/receiving packet and Hyper-V swiotlb
>>> bounce buffer dma address will be returned. The swiotlb bounce buffer
>>> has been masked to be visible to host during boot up.
>>>
>>> Allocate rx/tx ring buffer via alloc_pages() in Isolation VM and map
>>> these pages via vmap(). After calling vmbus_establish_gpadl() which
>>> marks these pages visible to host, unmap these pages to release the
>>> virtual address mapped with physical address below shared_gpa_boundary
>>> and map them in the extra address space via vmap_pfn().
>>>
>>> Signed-off-by: Tianyu Lan <[email protected]>
>>> ---
>>> Change since v4:
>>> * Allocate rx/tx ring buffer via alloc_pages() in Isolation VM
>>> * Map pages after calling vmbus_establish_gpadl().
>>> * set dma_set_min_align_mask for netvsc driver.
>>>
>>> Change since v3:
>>> * Add comment to explain why not to use dma_map_sg()
>>> * Fix some error handle.
>>> ---
>>> drivers/net/hyperv/hyperv_net.h | 7 +
>>> drivers/net/hyperv/netvsc.c | 287 +++++++++++++++++++++++++++++-
>>> drivers/net/hyperv/netvsc_drv.c | 1 +
>>> drivers/net/hyperv/rndis_filter.c | 2 +
>>> include/linux/hyperv.h | 5 +
>>> 5 files changed, 296 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/net/hyperv/hyperv_net.h
>>> b/drivers/net/hyperv/hyperv_net.h
>>> index 315278a7cf88..87e8c74398a5 100644
>>> --- a/drivers/net/hyperv/hyperv_net.h
>>> +++ b/drivers/net/hyperv/hyperv_net.h
>>> @@ -164,6 +164,7 @@ struct hv_netvsc_packet {
>>> u32 total_bytes;
>>> u32 send_buf_index;
>>> u32 total_data_buflen;
>>> + struct hv_dma_range *dma_range;
>>> };
>>>
>>> #define NETVSC_HASH_KEYLEN 40
>>> @@ -1074,6 +1075,8 @@ struct netvsc_device {
>>>
>>> /* Receive buffer allocated by us but manages by NetVSP */
>>> void *recv_buf;
>>> + struct page **recv_pages;
>>> + u32 recv_page_count;
>>> u32 recv_buf_size; /* allocated bytes */
>>> struct vmbus_gpadl recv_buf_gpadl_handle;
>>> u32 recv_section_cnt;
>>> @@ -1082,6 +1085,8 @@ struct netvsc_device {
>>>
>>> /* Send buffer allocated by us */
>>> void *send_buf;
>>> + struct page **send_pages;
>>> + u32 send_page_count;
>>> u32 send_buf_size;
>>> struct vmbus_gpadl send_buf_gpadl_handle;
>>> u32 send_section_cnt;
>>> @@ -1731,4 +1736,6 @@ struct rndis_message {
>>> #define RETRY_US_HI 10000
>>> #define RETRY_MAX 2000 /* >10 sec */
>>>
>>> +void netvsc_dma_unmap(struct hv_device *hv_dev,
>>> + struct hv_netvsc_packet *packet);
>>> #endif /* _HYPERV_NET_H */
>>> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
>>> index 1f87e570ed2b..7d5254bf043e 100644
>>> --- a/drivers/net/hyperv/netvsc.c
>>> +++ b/drivers/net/hyperv/netvsc.c
>>> @@ -20,6 +20,7 @@
>>> #include <linux/vmalloc.h>
>>> #include <linux/rtnetlink.h>
>>> #include <linux/prefetch.h>
>>> +#include <linux/gfp.h>
>>>
>>> #include <asm/sync_bitops.h>
>>> #include <asm/mshyperv.h>
>>> @@ -150,11 +151,33 @@ static void free_netvsc_device(struct rcu_head
>>> *head)
>>> {
>>> struct netvsc_device *nvdev
>>> = container_of(head, struct netvsc_device, rcu);
>>> + unsigned int alloc_unit;
>>> int i;
>>>
>>> kfree(nvdev->extension);
>>> - vfree(nvdev->recv_buf);
>>> - vfree(nvdev->send_buf);
>>> +
>>> + if (nvdev->recv_pages) {
>>> + alloc_unit = (nvdev->recv_buf_size /
>>> + nvdev->recv_page_count) >> PAGE_SHIFT;
>>> +
>>> + vunmap(nvdev->recv_buf);
>>> + for (i = 0; i < nvdev->recv_page_count; i++)
>>> + __free_pages(nvdev->recv_pages[i], alloc_unit);
>>> + } else {
>>> + vfree(nvdev->recv_buf);
>>> + }
>>> +
>>> + if (nvdev->send_pages) {
>>> + alloc_unit = (nvdev->send_buf_size /
>>> + nvdev->send_page_count) >> PAGE_SHIFT;
>>> +
>>> + vunmap(nvdev->send_buf);
>>> + for (i = 0; i < nvdev->send_page_count; i++)
>>> + __free_pages(nvdev->send_pages[i], alloc_unit);
>>> + } else {
>>> + vfree(nvdev->send_buf);
>>> + }
>>> +
>>> kfree(nvdev->send_section_map);
>>>
>>> for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
>>> @@ -330,6 +353,108 @@ int netvsc_alloc_recv_comp_ring(struct
>>> netvsc_device *net_device, u32 q_idx)
>>> return nvchan->mrc.slots ? 0 : -ENOMEM;
>>> }
>>>
>>> +void *netvsc_alloc_pages(struct page ***pages_array, unsigned int
>>> *array_len,
>>> + unsigned long size)
>>> +{
>>> + struct page *page, **pages, **vmap_pages;
>>> + unsigned long pg_count = size >> PAGE_SHIFT;
>>> + int alloc_unit = MAX_ORDER_NR_PAGES;
>>> + int i, j, vmap_page_index = 0;
>>> + void *vaddr;
>>> +
>>> + if (pg_count < alloc_unit)
>>> + alloc_unit = 1;
>>> +
>>> + /* vmap() accepts page array with PAGE_SIZE as unit while try to
>>> + * allocate high order pages here in order to save page array
>>> space.
>>> + * vmap_pages[] is used as input parameter of vmap(). pages[] is to
>>> + * store allocated pages and map them later.
>>> + */
>>> + vmap_pages = kmalloc_array(pg_count, sizeof(*vmap_pages),
>>> GFP_KERNEL);
>>> + if (!vmap_pages)
>>> + return NULL;
>>> +
>>> +retry:
>>> + *array_len = pg_count / alloc_unit;
>>> + pages = kmalloc_array(*array_len, sizeof(*pages), GFP_KERNEL);
>>> + if (!pages)
>>> + goto cleanup;
>>> +
>>> + for (i = 0; i < *array_len; i++) {
>>> + page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
>>> + get_order(alloc_unit << PAGE_SHIFT));
>>> + if (!page) {
>>> + /* Try allocating small pages if high order pages are
>>> not available. */
>>> + if (alloc_unit == 1) {
>>> + goto cleanup;
>>> + } else {
>>
>> The "else" clause isn't really needed because of the goto cleanup
>> above. Then
>> the indentation of the code below could be reduced by one level.
>>
>>> + memset(vmap_pages, 0,
>>> + sizeof(*vmap_pages) * vmap_page_index);
>>> + vmap_page_index = 0;
>>> +
>>> + for (j = 0; j < i; j++)
>>> + __free_pages(pages[j], alloc_unit);
>>> +
>>> + kfree(pages);
>>> + alloc_unit = 1;
>>
>> This is the case where a large enough contiguous physical memory chunk
>> could
>> not be found. But rather than dropping all the way down to single pages,
>> would it make sense to try something smaller, but not 1? For example,
>> cut the alloc_unit in half and try again. But I'm not sure of all the
>> implications.
>>
>>> + goto retry;
>>> + }
>>> + }
>>> +
>>> + pages[i] = page;
>>> + for (j = 0; j < alloc_unit; j++)
>>> + vmap_pages[vmap_page_index++] = page++;
>>> + }
>>> +
>>> + vaddr = vmap(vmap_pages, vmap_page_index, VM_MAP, PAGE_KERNEL);
>>> + kfree(vmap_pages);
>>> +
>>> + *pages_array = pages;
>>> + return vaddr;
>>> +
>>> +cleanup:
>>> + for (j = 0; j < i; j++)
>>> + __free_pages(pages[i], alloc_unit);
>>> +
>>> + kfree(pages);
>>> + kfree(vmap_pages);
>>> + return NULL;
>>> +}
>>> +
>>> +static void *netvsc_map_pages(struct page **pages, int count, int
>>> alloc_unit)
>>> +{
>>> + int pg_count = count * alloc_unit;
>>> + struct page *page;
>>> + unsigned long *pfns;
>>> + int pfn_index = 0;
>>> + void *vaddr;
>>> + int i, j;
>>> +
>>> + if (!pages)
>>> + return NULL;
>>> +
>>> + pfns = kcalloc(pg_count, sizeof(*pfns), GFP_KERNEL);
>>> + if (!pfns)
>>> + return NULL;
>>> +
>>> + for (i = 0; i < count; i++) {
>>> + page = pages[i];
>>> + if (!page) {
>>> + pr_warn("page is not available %d.\n", i);
>>> + return NULL;
>>> + }
>>> +
>>> + for (j = 0; j < alloc_unit; j++) {
>>> + pfns[pfn_index++] = page_to_pfn(page++) +
>>> + (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
>>> + }
>>> + }
>>> +
>>> + vaddr = vmap_pfn(pfns, pg_count, PAGE_KERNEL_IO);
>>> + kfree(pfns);
>>> + return vaddr;
>>> +}
>>> +
>>
>> I think you are proposing this approach to allocating memory for the send
>> and receive buffers so that you can avoid having two virtual mappings for
>> the memory, per comments from Christop Hellwig. But overall, the
>> approach
>> seems a bit complex and I wonder if it is worth it. If allocating
>> large contiguous
>> chunks of physical memory is successful, then there is some memory
>> savings
>> in that the data structures needed to keep track of the physical pages is
>> smaller than the equivalent page tables might be. But if you have to
>> revert
>> to allocating individual pages, then the memory savings is reduced.
>>
>> Ultimately, the list of actual PFNs has to be kept somewhere. Another
>> approach
>> would be to do the reverse of what hv_map_memory() from the v4 patch
>> series does. I.e., you could do virt_to_phys() on each virtual
>> address that
>> maps above VTOM, and subtract out the shared_gpa_boundary to get the
>> list of actual PFNs that need to be freed. This way you don't have
>> two copies
>> of the list of PFNs -- one with and one without the
>> shared_gpa_boundary added.
>> But it comes at the cost of additional code so that may not be a great
>> idea.
>>
>> I think what you have here works, and I don't have a clearly better
>> solution
>> at the moment except perhaps to revert to the v4 solution and just
>> have two
>> virtual mappings. I'll keep thinking about it. Maybe Christop has other
>> thoughts.
>>
>>> static int netvsc_init_buf(struct hv_device *device,
>>> struct netvsc_device *net_device,
>>> const struct netvsc_device_info *device_info)
>>> @@ -337,7 +462,7 @@ static int netvsc_init_buf(struct hv_device *device,
>>> struct nvsp_1_message_send_receive_buffer_complete *resp;
>>> struct net_device *ndev = hv_get_drvdata(device);
>>> struct nvsp_message *init_packet;
>>> - unsigned int buf_size;
>>> + unsigned int buf_size, alloc_unit;
>>> size_t map_words;
>>> int i, ret = 0;
>>>
>>> @@ -350,7 +475,14 @@ static int netvsc_init_buf(struct hv_device
>>> *device,
>>> buf_size = min_t(unsigned int, buf_size,
>>> NETVSC_RECEIVE_BUFFER_SIZE_LEGACY);
>>>
>>> - net_device->recv_buf = vzalloc(buf_size);
>>> + if (hv_isolation_type_snp())
>>> + net_device->recv_buf =
>>> + netvsc_alloc_pages(&net_device->recv_pages,
>>> + &net_device->recv_page_count,
>>> + buf_size);
>>> + else
>>> + net_device->recv_buf = vzalloc(buf_size);
>>> +
>>
>> I wonder if it is necessary to have two different code paths here. The
>> allocating and freeing of the send and receive buffers is not perf
>> sensitive, and it seems like netvsc_alloc_pages() could be used
>> regardless of whether SNP Isolation is in effect. To my thinking,
>> one code path is better than two code paths unless there's a
>> compelling reason to have two.
>>
>>> if (!net_device->recv_buf) {
>>> netdev_err(ndev,
>>> "unable to allocate receive buffer of size %u\n",
>>> @@ -375,6 +507,27 @@ static int netvsc_init_buf(struct hv_device
>>> *device,
>>> goto cleanup;
>>> }
>>>
>>> + if (hv_isolation_type_snp()) {
>>> + alloc_unit = (buf_size / net_device->recv_page_count)
>>> + >> PAGE_SHIFT;
>>> +
>>> + /* Unmap previous virtual address and map pages in the extra
>>> + * address space (above shared gpa boundary) in Isolation VM.
>>> + */
>>> + vunmap(net_device->recv_buf);
>>> + net_device->recv_buf =
>>> + netvsc_map_pages(net_device->recv_pages,
>>> + net_device->recv_page_count,
>>> + alloc_unit);
>>> + if (!net_device->recv_buf) {
>>> + netdev_err(ndev,
>>> + "unable to allocate receive buffer of size %u\n",
>>> + buf_size);
>>> + ret = -ENOMEM;
>>> + goto cleanup;
>>> + }
>>> + }
>>> +
>>> /* Notify the NetVsp of the gpadl handle */
>>> init_packet = &net_device->channel_init_pkt;
>>> memset(init_packet, 0, sizeof(struct nvsp_message));
>>> @@ -456,13 +609,21 @@ static int netvsc_init_buf(struct hv_device
>>> *device,
>>> buf_size = device_info->send_sections *
>>> device_info->send_section_size;
>>> buf_size = round_up(buf_size, PAGE_SIZE);
>>>
>>> - net_device->send_buf = vzalloc(buf_size);
>>> + if (hv_isolation_type_snp())
>>> + net_device->send_buf =
>>> + netvsc_alloc_pages(&net_device->send_pages,
>>> + &net_device->send_page_count,
>>> + buf_size);
>>> + else
>>> + net_device->send_buf = vzalloc(buf_size);
>>> +
>>> if (!net_device->send_buf) {
>>> netdev_err(ndev, "unable to allocate send buffer of size
>>> %u\n",
>>> buf_size);
>>> ret = -ENOMEM;
>>> goto cleanup;
>>> }
>>> +
>>> net_device->send_buf_size = buf_size;
>>>
>>> /* Establish the gpadl handle for this buffer on this
>>> @@ -478,6 +639,27 @@ static int netvsc_init_buf(struct hv_device
>>> *device,
>>> goto cleanup;
>>> }
>>>
>>> + if (hv_isolation_type_snp()) {
>>> + alloc_unit = (buf_size / net_device->send_page_count)
>>> + >> PAGE_SHIFT;
>>> +
>>> + /* Unmap previous virtual address and map pages in the extra
>>> + * address space (above shared gpa boundary) in Isolation VM.
>>> + */
>>> + vunmap(net_device->send_buf);
>>> + net_device->send_buf =
>>> + netvsc_map_pages(net_device->send_pages,
>>> + net_device->send_page_count,
>>> + alloc_unit);
>>> + if (!net_device->send_buf) {
>>> + netdev_err(ndev,
>>> + "unable to allocate send buffer of size %u\n",
>>> + buf_size);
>>> + ret = -ENOMEM;
>>> + goto cleanup;
>>> + }
>>> + }
>>> +
>>> /* Notify the NetVsp of the gpadl handle */
>>> init_packet = &net_device->channel_init_pkt;
>>> memset(init_packet, 0, sizeof(struct nvsp_message));
>>> @@ -768,7 +950,7 @@ static void netvsc_send_tx_complete(struct
>>> net_device *ndev,
>>>
>>> /* Notify the layer above us */
>>> if (likely(skb)) {
>>> - const struct hv_netvsc_packet *packet
>>> + struct hv_netvsc_packet *packet
>>> = (struct hv_netvsc_packet *)skb->cb;
>>> u32 send_index = packet->send_buf_index;
>>> struct netvsc_stats *tx_stats;
>>> @@ -784,6 +966,7 @@ static void netvsc_send_tx_complete(struct
>>> net_device *ndev,
>>> tx_stats->bytes += packet->total_bytes;
>>> u64_stats_update_end(&tx_stats->syncp);
>>>
>>> + netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
>>> napi_consume_skb(skb, budget);
>>> }
>>>
>>> @@ -948,6 +1131,87 @@ static void netvsc_copy_to_send_buf(struct
>>> netvsc_device *net_device,
>>> memset(dest, 0, padding);
>>> }
>>>
>>> +void netvsc_dma_unmap(struct hv_device *hv_dev,
>>> + struct hv_netvsc_packet *packet)
>>> +{
>>> + u32 page_count = packet->cp_partial ?
>>> + packet->page_buf_cnt - packet->rmsg_pgcnt :
>>> + packet->page_buf_cnt;
>>> + int i;
>>> +
>>> + if (!hv_is_isolation_supported())
>>> + return;
>>> +
>>> + if (!packet->dma_range)
>>> + return;
>>> +
>>> + for (i = 0; i < page_count; i++)
>>> + dma_unmap_single(&hv_dev->device, packet->dma_range[i].dma,
>>> + packet->dma_range[i].mapping_size,
>>> + DMA_TO_DEVICE);
>>> +
>>> + kfree(packet->dma_range);
>>> +}
>>> +
>>> +/* netvsc_dma_map - Map swiotlb bounce buffer with data page of
>>> + * packet sent by vmbus_sendpacket_pagebuffer() in the Isolation
>>> + * VM.
>>> + *
>>> + * In isolation VM, netvsc send buffer has been marked visible to
>>> + * host and so the data copied to send buffer doesn't need to use
>>> + * bounce buffer. The data pages handled by
>>> vmbus_sendpacket_pagebuffer()
>>> + * may not be copied to send buffer and so these pages need to be
>>> + * mapped with the swiotlb bounce buffer, which netvsc_dma_map()
>>> + * does. The pfns in the struct hv_page_buffer need to be converted
>>> + * to the bounce buffer's pfns. The loop here is necessary because the
>>> + * entries in the page buffer array are not necessarily full
>>> + * pages of data. Each entry in the array has a separate offset and
>>> + * len that may be non-zero, even for entries in the middle of the
>>> + * array. And the entries are not physically contiguous. So each
>>> + * entry must be individually mapped rather than as a contiguous unit,
>>> + * which is why dma_map_sg() is not used here.
>>> + */
>>> +static int netvsc_dma_map(struct hv_device *hv_dev,
>>> + struct hv_netvsc_packet *packet,
>>> + struct hv_page_buffer *pb)
>>> +{
>>> + u32 page_count = packet->cp_partial ?
>>> + packet->page_buf_cnt - packet->rmsg_pgcnt :
>>> + packet->page_buf_cnt;
>>> + dma_addr_t dma;
>>> + int i;
>>> +
>>> + if (!hv_is_isolation_supported())
>>> + return 0;
>>> +
>>> + packet->dma_range = kcalloc(page_count,
>>> + sizeof(*packet->dma_range),
>>> + GFP_KERNEL);
>>> + if (!packet->dma_range)
>>> + return -ENOMEM;
>>> +
>>> + for (i = 0; i < page_count; i++) {
>>> + char *src = phys_to_virt((pb[i].pfn << HV_HYP_PAGE_SHIFT)
>>> + + pb[i].offset);
>>> + u32 len = pb[i].len;
>>> +
>>> + dma = dma_map_single(&hv_dev->device, src, len,
>>> + DMA_TO_DEVICE);
>>> + if (dma_mapping_error(&hv_dev->device, dma)) {
>>> + kfree(packet->dma_range);
>>> + return -ENOMEM;
>>> + }
>>> +
>>> + packet->dma_range[i].dma = dma;
>>> + packet->dma_range[i].mapping_size = len;
>>> + pb[i].pfn = dma >> HV_HYP_PAGE_SHIFT;
>>> + pb[i].offset = offset_in_hvpage(dma);
>>
>> With the DMA min align mask now being set, the offset within
>> the Hyper-V page won't be changed by dma_map_single(). So I
>> think the above statement can be removed.
>>
>>> + pb[i].len = len;
>>
>> A few lines above, the value of "len" is set from pb[i].len. Neither
>> "len" nor "i" is changed in the loop, so this statement can also be
>> removed.
>>
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> static inline int netvsc_send_pkt(
>>> struct hv_device *device,
>>> struct hv_netvsc_packet *packet,
>>> @@ -988,14 +1252,24 @@ static inline int netvsc_send_pkt(
>>>
>>> trace_nvsp_send_pkt(ndev, out_channel, rpkt);
>>>
>>> + packet->dma_range = NULL;
>>> if (packet->page_buf_cnt) {
>>> if (packet->cp_partial)
>>> pb += packet->rmsg_pgcnt;
>>>
>>> + ret = netvsc_dma_map(ndev_ctx->device_ctx, packet, pb);
>>> + if (ret) {
>>> + ret = -EAGAIN;
>>> + goto exit;
>>> + }
>>> +
>>> ret = vmbus_sendpacket_pagebuffer(out_channel,
>>> pb, packet->page_buf_cnt,
>>> &nvmsg, sizeof(nvmsg),
>>> req_id);
>>> +
>>> + if (ret)
>>> + netvsc_dma_unmap(ndev_ctx->device_ctx, packet);
>>> } else {
>>> ret = vmbus_sendpacket(out_channel,
>>> &nvmsg, sizeof(nvmsg),
>>> @@ -1003,6 +1277,7 @@ static inline int netvsc_send_pkt(
>>> VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
>>> }
>>>
>>> +exit:
>>> if (ret == 0) {
>>> atomic_inc_return(&nvchan->queue_sends);
>>>
>>> diff --git a/drivers/net/hyperv/netvsc_drv.c
>>> b/drivers/net/hyperv/netvsc_drv.c
>>> index 382bebc2420d..c3dc884b31e3 100644
>>> --- a/drivers/net/hyperv/netvsc_drv.c
>>> +++ b/drivers/net/hyperv/netvsc_drv.c
>>> @@ -2577,6 +2577,7 @@ static int netvsc_probe(struct hv_device *dev,
>>> list_add(&net_device_ctx->list, &netvsc_dev_list);
>>> rtnl_unlock();
>>>
>>> + dma_set_min_align_mask(&dev->device, HV_HYP_PAGE_SIZE - 1);
>>> netvsc_devinfo_put(device_info);
>>> return 0;
>>>
>>> diff --git a/drivers/net/hyperv/rndis_filter.c
>>> b/drivers/net/hyperv/rndis_filter.c
>>> index f6c9c2a670f9..448fcc325ed7 100644
>>> --- a/drivers/net/hyperv/rndis_filter.c
>>> +++ b/drivers/net/hyperv/rndis_filter.c
>>> @@ -361,6 +361,8 @@ static void rndis_filter_receive_response(struct
>>> net_device *ndev,
>>> }
>>> }
>>>
>>> + netvsc_dma_unmap(((struct net_device_context *)
>>> + netdev_priv(ndev))->device_ctx, &request->pkt);
>>> complete(&request->wait_event);
>>> } else {
>>> netdev_err(ndev,
>>> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
>>> index c94c534a944e..81e58dd582dc 100644
>>> --- a/include/linux/hyperv.h
>>> +++ b/include/linux/hyperv.h
>>> @@ -1597,6 +1597,11 @@ struct hyperv_service_callback {
>>> void (*callback)(void *context);
>>> };
>>>
>>> +struct hv_dma_range {
>>> + dma_addr_t dma;
>>> + u32 mapping_size;
>>> +};
>>> +
>>> #define MAX_SRV_VER 0x7ffffff
>>> extern bool vmbus_prep_negotiate_resp(struct icmsg_hdr *icmsghdrp,
>>> u8 *buf, u32 buflen,
>>> const int *fw_version, int fw_vercnt,
>>> --
>>> 2.25.1
>>
On Mon, Sep 27, 2021 at 10:26:43PM +0800, Tianyu Lan wrote:
> Hi Christoph:
> Gentle ping. The swiotlb and shared memory mapping changes in this
> patchset need your review. Could you have a look?
I'm a little too busy for a review of such a huge patchset right now.
That being said here are my comments from a very quick review:
- the bare memremap usage in swiotlb looks strange and I'd
definitively expect a well documented wrapper.
- given that we can now hand out swiotlb memory for coherent mappings
we need to carefully audit what happens when this memremaped
memory gets mmaped or used through dma_get_sgtable
- the netvsc changes I'm not happy with at all. A large part of it
is that the driver already has a bad structure, but this series
is making it significantly worse. We'll need to find a way
to use the proper dma mapping abstractions here. One option
if you want to stick to the double vmapped buffer would be something
like using dma_alloc_noncontiguous plus a variant of
dma_vmap_noncontiguous that takes the shared_gpa_boundary into
account.
On 9/28/2021 1:39 PM, Christoph Hellwig wrote:
> On Mon, Sep 27, 2021 at 10:26:43PM +0800, Tianyu Lan wrote:
>> Hi Christoph:
>> Gentle ping. The swiotlb and shared memory mapping changes in this
>> patchset need your review. Could you have a look?
> I'm a little too busy for a review of such a huge patchset right now.
> That being said here are my comments from a very quick review:
Hi Christoph:
Thanks for your comments. Most patches in the series are Hyper-V
changes. I will split the patchset to make it easier to review.
>
> - the bare memremap usage in swiotlb looks strange and I'd
> definitively expect a well documented wrapper.
OK. Should the wrapper be in the DMA code? How about the
dma_map_decrypted() introduced in v4?
https://lkml.org/lkml/2021/8/27/605
> - given that we can now hand out swiotlb memory for coherent mappings
> we need to carefully audit what happens when this memremaped
> memory gets mmaped or used through dma_get_sgtable
OK. I will check that.
> - the netvsc changes I'm not happy with at all. A large part of it
> is that the driver already has a bad structure, but this series
> is making it significantly worse. We'll need to find a way
> to use the proper dma mapping abstractions here. One option
> if you want to stick to the double vmapped buffer would be something
> like using dma_alloc_noncontiguous plus a variant of
> dma_vmap_noncontiguous that takes the shared_gpa_boundary into
> account.
>
OK. I will do that.
On Tue, Sep 28, 2021 at 05:23:31PM +0800, Tianyu Lan wrote:
>>
>> - the bare memremap usage in swiotlb looks strange and I'd
>> definitively expect a well documented wrapper.
>
> OK. Should the wrapper be in the DMA code? How about the
> dma_map_decrypted() introduced in v4?
As mentioned, the name is a pretty bad choice as it touches the dma_map*
namespace, which it is not related to. I suspect just a little helper
in the swiotlb code that explains how it is used might be enough for now.