Hi all,
Here we propose this patch series to make Linux run as the root partition [0]
on Microsoft Hypervisor [1]. There will be a subsequent patch series to provide a
device node (/dev/mshv) such that userspace programs can create and run virtual
machines. We've also ported Cloud Hypervisor [3] over and have been able to
boot a Linux guest with Virtio devices since late July.
Being an RFC series, this implements only the absolutely necessary
components to get things running. I will break down this series a bit.
A large portion of this series consists of patches that augment hyperv-tlfs.h.
They should be rather uncontroversial and can be applied right away.
A few key things other than the changes to hyperv-tlfs.h:
1. Linux needs to set up existing Hyper-V facilities differently.
2. Linux needs to make a few hypercalls to bring up APs.
3. Interrupts are remapped by IOMMU, which is controlled by the hypervisor.
Linux needs to make hypercalls to map and unmap interrupts. This is
done by introducing a new MSI irqdomain and new irqchips.
This series is now based on 5.10-rc1. Thanks to tglx's overhaul of
the MSI code, our implementation of the MSI irq domain is shorter.
Comments and suggestions are welcome.
Thanks,
Wei.
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Changes since v2:
1. Address more comments from Vitaly.
2. Fix and test 32bit build.
Changes since v1:
1. Simplify MSI IRQ domain implementation.
2. Address Vitaly's comments.
Wei Liu (17):
asm-generic/hyperv: change HV_CPU_POWER_MANAGEMENT to
HV_CPU_MANAGEMENT
x86/hyperv: detect if Linux is the root partition
Drivers: hv: vmbus: skip VMBus initialization if Linux is root
iommu/hyperv: don't setup IRQ remapping when running as root
clocksource/hyperv: use MSR-based access if running as root
x86/hyperv: allocate output arg pages if required
x86/hyperv: extract partition ID from Microsoft Hypervisor if
necessary
x86/hyperv: handling hypercall page setup for root
x86/hyperv: provide a bunch of helper functions
x86/hyperv: implement and use hv_smp_prepare_cpus
asm-generic/hyperv: update hv_msi_entry
asm-generic/hyperv: update hv_interrupt_entry
asm-generic/hyperv: introduce hv_device_id and auxiliary structures
asm-generic/hyperv: import data structures for mapping device
interrupts
x86/hyperv: implement an MSI domain for root partition
x86/ioapic: export a few functions and data structures via io_apic.h
x86/hyperv: handle IO-APIC when running as root
arch/x86/hyperv/Makefile | 4 +-
arch/x86/hyperv/hv_init.c | 121 +++++-
arch/x86/hyperv/hv_proc.c | 215 +++++++++++
arch/x86/hyperv/irqdomain.c | 556 ++++++++++++++++++++++++++++
arch/x86/include/asm/hyperv-tlfs.h | 23 ++
arch/x86/include/asm/io_apic.h | 21 ++
arch/x86/include/asm/mshyperv.h | 13 +-
arch/x86/kernel/apic/io_apic.c | 28 +-
arch/x86/kernel/cpu/mshyperv.c | 49 +++
drivers/clocksource/hyperv_timer.c | 3 +
drivers/hv/vmbus_drv.c | 3 +
drivers/iommu/hyperv-iommu.c | 3 +-
drivers/pci/controller/pci-hyperv.c | 2 +-
include/asm-generic/hyperv-tlfs.h | 254 ++++++++++++-
14 files changed, 1257 insertions(+), 38 deletions(-)
create mode 100644 arch/x86/hyperv/hv_proc.c
create mode 100644 arch/x86/hyperv/irqdomain.c
--
2.20.1
For now we can use the CPU management privilege flag to detect whether
Linux is running as the root partition. Stash the result in
hv_root_partition to be used later.
Put in a bunch of defines for future use when we want to have more
fine-grained detection.
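As a rough illustration of the check described above, here is a minimal
user-space sketch. It assumes HV_CPU_MANAGEMENT is bit 12 of
HYPERV_CPUID_FEATURES.EBX (please verify against hyperv-tlfs.h), and
hv_ebx_indicates_root() is a hypothetical helper, not something this patch
adds.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: HV_CPU_MANAGEMENT is bit 12 of HYPERV_CPUID_FEATURES.EBX.
 * The authoritative value lives in include/asm-generic/hyperv-tlfs.h.
 */
#define HV_CPU_MANAGEMENT (UINT32_C(1) << 12)

/*
 * Hypothetical helper: takes the EBX value returned by CPUID leaf
 * 0x40000003 and reports whether the CPU management privilege is set.
 */
bool hv_ebx_indicates_root(uint32_t features_ebx)
{
	return (features_ebx & HV_CPU_MANAGEMENT) != 0;
}
```

In the patch itself the EBX value comes from cpuid_ebx(HYPERV_CPUID_FEATURES)
and the result is stored in hv_root_partition.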
Signed-off-by: Wei Liu <[email protected]>
---
v3: move hv_root_partition to mshyperv.c
---
arch/x86/include/asm/hyperv-tlfs.h | 10 ++++++++++
arch/x86/include/asm/mshyperv.h | 2 ++
arch/x86/kernel/cpu/mshyperv.c | 20 ++++++++++++++++++++
3 files changed, 32 insertions(+)
diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 0ed20e8bba9e..41b628b9fb15 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -21,6 +21,7 @@
#define HYPERV_CPUID_FEATURES 0x40000003
#define HYPERV_CPUID_ENLIGHTMENT_INFO 0x40000004
#define HYPERV_CPUID_IMPLEMENT_LIMITS 0x40000005
+#define HYPERV_CPUID_CPU_MANAGEMENT_FEATURES 0x40000007
#define HYPERV_CPUID_NESTED_FEATURES 0x4000000A
#define HYPERV_HYPERVISOR_PRESENT_BIT 0x80000000
@@ -103,6 +104,15 @@
/* Recommend using enlightened VMCS */
#define HV_X64_ENLIGHTENED_VMCS_RECOMMENDED BIT(14)
+/*
+ * CPU management features identification.
+ * These are HYPERV_CPUID_CPU_MANAGEMENT_FEATURES.EAX bits.
+ */
+#define HV_X64_START_LOGICAL_PROCESSOR BIT(0)
+#define HV_X64_CREATE_ROOT_VIRTUAL_PROCESSOR BIT(1)
+#define HV_X64_PERFORMANCE_COUNTER_SYNC BIT(2)
+#define HV_X64_RESERVED_IDENTITY_BIT BIT(31)
+
/*
* Virtual processor will never share a physical core with another virtual
* processor, except for virtual processors that are reported as sibling SMT
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index ffc289992d1b..ac2b0d110f03 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -237,6 +237,8 @@ int hyperv_fill_flush_guest_mapping_list(
struct hv_guest_mapping_flush_list *flush,
u64 start_gfn, u64 end_gfn);
+extern bool hv_root_partition;
+
#ifdef CONFIG_X86_64
void hv_apic_init(void);
void __init hv_init_spinlocks(void);
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 05ef1f4550cb..f0b8c702c858 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -32,6 +32,10 @@
#include <asm/nmi.h>
#include <clocksource/hyperv_timer.h>
+/* Is Linux running as the root partition? */
+bool hv_root_partition;
+EXPORT_SYMBOL_GPL(hv_root_partition);
+
struct ms_hyperv_info ms_hyperv;
EXPORT_SYMBOL_GPL(ms_hyperv);
@@ -237,6 +241,22 @@ static void __init ms_hyperv_init_platform(void)
pr_debug("Hyper-V: max %u virtual processors, %u logical processors\n",
ms_hyperv.max_vp_index, ms_hyperv.max_lp_index);
+ /*
+ * Check CPU management privilege.
+ *
+ * To mirror what Windows does we should extract CPU management
+ * features and use the ReservedIdentityBit to detect if Linux is the
+ * root partition. But that requires negotiating CPU management
+ * interface (a process to be finalized).
+ *
+ * For now, use the privilege flag as the indicator for running as
+ * root.
+ */
+ if (cpuid_ebx(HYPERV_CPUID_FEATURES) & HV_CPU_MANAGEMENT) {
+ hv_root_partition = true;
+ pr_info("Hyper-V: running as root partition\n");
+ }
+
/*
* Extract host information.
*/
--
2.20.1
We will soon need to access fields inside the MSI address and MSI data
fields. Introduce hv_msi_address_register and hv_msi_data_register.
Fix up one user of hv_msi_entry in mshyperv.h.
No functional change expected.
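To make the new field layout easier to review, the sketch below decodes the
same fields with shifts and masks. It assumes the compiler allocates
bitfields starting from the least significant bit (as GCC does on x86); the
msi_* helper names are hypothetical and only for illustration.

```c
#include <stdint.h>

/*
 * Bit positions mirror union hv_msi_address_register:
 *   [1:0] reserved1, [2] destination_mode, [3] redirection_hint,
 *   [11:4] reserved2, [19:12] destination_id, [31:20] msi_base
 */
uint32_t msi_addr_destination_mode(uint32_t addr) { return (addr >> 2) & 0x1; }
uint32_t msi_addr_redirection_hint(uint32_t addr) { return (addr >> 3) & 0x1; }
uint32_t msi_addr_destination_id(uint32_t addr)   { return (addr >> 12) & 0xff; }
uint32_t msi_addr_base(uint32_t addr)             { return (addr >> 20) & 0xfff; }

/*
 * Bit positions mirror union hv_msi_data_register:
 *   [7:0] vector, [10:8] delivery_mode, [13:11] reserved1,
 *   [14] level_assert, [15] trigger_mode, [31:16] reserved2
 */
uint32_t msi_data_vector(uint32_t data)        { return data & 0xff; }
uint32_t msi_data_delivery_mode(uint32_t data) { return (data >> 8) & 0x7; }
uint32_t msi_data_level_assert(uint32_t data)  { return (data >> 14) & 0x1; }
uint32_t msi_data_trigger_mode(uint32_t data)  { return (data >> 15) & 0x1; }
```

For example, an address word of 0xFEE05004 would decode to msi_base 0xFEE,
destination_id 5 and destination_mode 1 under these assumptions.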
Signed-off-by: Wei Liu <[email protected]>
---
arch/x86/include/asm/mshyperv.h | 4 ++--
include/asm-generic/hyperv-tlfs.h | 28 ++++++++++++++++++++++++++--
2 files changed, 28 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 4e590a167160..cbee72550a12 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -257,8 +257,8 @@ static inline void hv_apic_init(void) {}
static inline void hv_set_msi_entry_from_desc(union hv_msi_entry *msi_entry,
struct msi_desc *msi_desc)
{
- msi_entry->address = msi_desc->msg.address_lo;
- msi_entry->data = msi_desc->msg.data;
+ msi_entry->address.as_uint32 = msi_desc->msg.address_lo;
+ msi_entry->data.as_uint32 = msi_desc->msg.data;
}
#else /* CONFIG_HYPERV */
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index ec53570102f0..7e103be42799 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -480,12 +480,36 @@ struct hv_create_vp {
u64 flags;
} __packed;
+union hv_msi_address_register {
+ u32 as_uint32;
+ struct {
+ u32 reserved1:2;
+ u32 destination_mode:1;
+ u32 redirection_hint:1;
+ u32 reserved2:8;
+ u32 destination_id:8;
+ u32 msi_base:12;
+ };
+} __packed;
+
+union hv_msi_data_register {
+ u32 as_uint32;
+ struct {
+ u32 vector:8;
+ u32 delivery_mode:3;
+ u32 reserved1:3;
+ u32 level_assert:1;
+ u32 trigger_mode:1;
+ u32 reserved2:16;
+ };
+} __packed;
+
/* HvRetargetDeviceInterrupt hypercall */
union hv_msi_entry {
u64 as_uint64;
struct {
- u32 address;
- u32 data;
+ union hv_msi_address_register address;
+ union hv_msi_data_register data;
} __packed;
};
--
2.20.1
We will soon use the same structure to handle IO-APIC interrupts as
well. Introduce an enum to identify the source and a data structure for
IO-APIC RTE.
While at it, update pci-hyperv.c to use the enum.
No functional change.
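The idea is a source tag that selects how the 64-bit payload is interpreted.
A simplified user-space sketch follows; struct interrupt_entry and the rte_*
helpers are hypothetical stand-ins, and the bit positions assume the
hv_ioapic_rte bitfields are allocated from the least significant bit.

```c
#include <stdint.h>

enum interrupt_source {
	INTERRUPT_SOURCE_MSI = 1,	/* MSI and MSI-X */
	INTERRUPT_SOURCE_IOAPIC,	/* implicitly 2 */
};

/* Hypothetical stand-in for struct hv_interrupt_entry. */
struct interrupt_entry {
	uint32_t source;	/* one of enum interrupt_source */
	uint32_t reserved1;
	uint64_t payload;	/* MSI entry or IO-APIC RTE, selected by source */
};

/* The two 32-bit halves, as exposed by low_uint32 and high_uint32. */
uint32_t rte_low(uint64_t rte)  { return (uint32_t)rte; }
uint32_t rte_high(uint64_t rte) { return (uint32_t)(rte >> 32); }

/* A few RTE fields: vector [7:0], trigger_mode [15], destination_id [63:56]. */
uint32_t rte_vector(uint64_t rte)         { return (uint32_t)(rte & 0xff); }
uint32_t rte_trigger_mode(uint64_t rte)   { return (uint32_t)((rte >> 15) & 0x1); }
uint32_t rte_destination_id(uint64_t rte) { return (uint32_t)((rte >> 56) & 0xff); }
```

A consumer would first look at source and only then decode payload as either
an MSI entry or an IO-APIC RTE.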
Signed-off-by: Wei Liu <[email protected]>
Acked-by: Rob Herring <[email protected]>
---
drivers/pci/controller/pci-hyperv.c | 2 +-
include/asm-generic/hyperv-tlfs.h | 36 +++++++++++++++++++++++++++--
2 files changed, 35 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 03ed5cb1c4b2..59edc0bf00fe 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1216,7 +1216,7 @@ static void hv_irq_unmask(struct irq_data *data)
params = &hbus->retarget_msi_interrupt_params;
memset(params, 0, sizeof(*params));
params->partition_id = HV_PARTITION_ID_SELF;
- params->int_entry.source = 1; /* MSI(-X) */
+ params->int_entry.source = HV_INTERRUPT_SOURCE_MSI;
hv_set_msi_entry_from_desc(&params->int_entry.msi_entry, msi_desc);
params->device_id = (hbus->hdev->dev_instance.b[5] << 24) |
(hbus->hdev->dev_instance.b[4] << 16) |
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 7e103be42799..8423bf53c237 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -480,6 +480,11 @@ struct hv_create_vp {
u64 flags;
} __packed;
+enum hv_interrupt_source {
+ HV_INTERRUPT_SOURCE_MSI = 1, /* MSI and MSI-X */
+ HV_INTERRUPT_SOURCE_IOAPIC,
+};
+
union hv_msi_address_register {
u32 as_uint32;
struct {
@@ -513,10 +518,37 @@ union hv_msi_entry {
} __packed;
};
+union hv_ioapic_rte {
+ u64 as_uint64;
+
+ struct {
+ u32 vector:8;
+ u32 delivery_mode:3;
+ u32 destination_mode:1;
+ u32 delivery_status:1;
+ u32 interrupt_polarity:1;
+ u32 remote_irr:1;
+ u32 trigger_mode:1;
+ u32 interrupt_mask:1;
+ u32 reserved1:15;
+
+ u32 reserved2:24;
+ u32 destination_id:8;
+ };
+
+ struct {
+ u32 low_uint32;
+ u32 high_uint32;
+ };
+} __packed;
+
struct hv_interrupt_entry {
- u32 source; /* 1 for MSI(-X) */
+ u32 source;
u32 reserved1;
- union hv_msi_entry msi_entry;
+ union {
+ union hv_msi_entry msi_entry;
+ union hv_ioapic_rte ioapic_rte;
+ };
} __packed;
/*
--
2.20.1
When Linux is running as the root partition, the hypercall page will
have already been set up by Hyper-V. Copy the content over to the
allocated page.
Add checks to hv_suspend & co to bail early because they are not
supported in this setup yet.
Signed-off-by: Lillian Grassin-Drake <[email protected]>
Signed-off-by: Sunil Muthuswamy <[email protected]>
Signed-off-by: Nuno Das Neves <[email protected]>
Co-Developed-by: Lillian Grassin-Drake <[email protected]>
Co-Developed-by: Sunil Muthuswamy <[email protected]>
Co-Developed-by: Nuno Das Neves <[email protected]>
Signed-off-by: Wei Liu <[email protected]>
---
v3:
1. Use HV_HYP_PAGE_SIZE.
2. Add checks to hv_suspend & co.
---
arch/x86/hyperv/hv_init.c | 37 ++++++++++++++++++++++++++++++++++---
1 file changed, 34 insertions(+), 3 deletions(-)
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index fc9941bd8653..ad8e77859b32 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -25,6 +25,7 @@
#include <linux/cpuhotplug.h>
#include <linux/syscore_ops.h>
#include <clocksource/hyperv_timer.h>
+#include <linux/highmem.h>
u64 hv_current_partition_id = ~0ull;
EXPORT_SYMBOL_GPL(hv_current_partition_id);
@@ -283,6 +284,9 @@ static int hv_suspend(void)
union hv_x64_msr_hypercall_contents hypercall_msr;
int ret;
+ if (hv_root_partition)
+ return -EPERM;
+
/*
* Reset the hypercall page as it is going to be invalidated
* accross hibernation. Setting hv_hypercall_pg to NULL ensures
@@ -433,8 +437,35 @@ void __init hyperv_init(void)
rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
hypercall_msr.enable = 1;
- hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
- wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
+
+ if (hv_root_partition) {
+ struct page *pg;
+ void *src, *dst;
+
+ /*
+ * For the root partition, the hypervisor will set up its
+ * hypercall page. The hypervisor guarantees it will not show
+ * up in the root's address space. The root can't change the
+ * location of the hypercall page.
+ *
+ * Order is important here. We must enable the hypercall page
+ * so it is populated with code, then copy the code to an
+ * executable page.
+ */
+ wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
+
+ pg = vmalloc_to_page(hv_hypercall_pg);
+ dst = kmap(pg);
+ src = memremap(hypercall_msr.guest_physical_address << PAGE_SHIFT, PAGE_SIZE,
+ MEMREMAP_WB);
+ BUG_ON(!(src && dst));
+ memcpy(dst, src, HV_HYP_PAGE_SIZE);
+ memunmap(src);
+ kunmap(pg);
+ } else {
+ hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
+ wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
+ }
/*
* Ignore any errors in setting up stimer clockevents
@@ -577,6 +608,6 @@ EXPORT_SYMBOL_GPL(hv_is_hyperv_initialized);
bool hv_is_hibernation_supported(void)
{
- return acpi_sleep_state_supported(ACPI_STATE_S4);
+ return !hv_root_partition && acpi_sleep_state_supported(ACPI_STATE_S4);
}
EXPORT_SYMBOL_GPL(hv_is_hibernation_supported);
--
2.20.1
Just like MSI/MSI-X, IO-APIC interrupts are remapped by Microsoft
Hypervisor when Linux runs as the root partition. Implement an IRQ chip
to handle mapping and unmapping of IO-APIC interrupts.
Use custom functions for mapping and unmapping ACPI GSIs. They will
issue Microsoft Hypervisor specific hypercalls on top of the native
routines.
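The hooking pattern used here (save the native function pointer, then install
a wrapper that calls it and does extra work) can be sketched in user space as
below. All names are hypothetical stand-ins; in particular register_gsi_hook
plays the role of __acpi_register_gsi and the counter stands in for the
irqchip setup done after the native call.

```c
#include <stdint.h>

/* Hypothetical stand-in for the native ACPI GSI registration routine. */
static int native_register(uint32_t gsi)
{
	return (int)gsi + 100;	/* pretend this is the allocated IRQ number */
}

/* Plays the role of __acpi_register_gsi: callers always go through it. */
int (*register_gsi_hook)(uint32_t gsi) = native_register;

/* Saved copy of the native routine, filled in when the hooks are installed. */
int (*saved_native_register)(uint32_t gsi);

/* Counts how often the wrapper did its extra, hypervisor-specific work. */
int hypervisor_setup_calls;

/* Wrapper: run the native routine first, then do the extra setup. */
int hv_register(uint32_t gsi)
{
	int irq = saved_native_register(gsi);

	if (irq < 0)
		return irq;

	hypervisor_setup_calls++;	/* stands in for setting chip and handler */
	return irq;
}

/* Mirrors the hv_root_partition branch in hv_pci_init(). */
void install_root_hooks(void)
{
	saved_native_register = register_gsi_hook;
	register_gsi_hook = hv_register;
}
```

The saved pointer must be set before the hook is replaced, otherwise the
wrapper would end up calling itself.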
Signed-off-by: Sunil Muthuswamy <[email protected]>
Co-Developed-by: Sunil Muthuswamy <[email protected]>
Signed-off-by: Wei Liu <[email protected]>
---
arch/x86/hyperv/hv_init.c | 13 +++
arch/x86/hyperv/irqdomain.c | 226 ++++++++++++++++++++++++++++++++++++
2 files changed, 239 insertions(+)
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index b58f958439a2..1962c5c609cf 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -263,10 +263,23 @@ static int hv_cpu_die(unsigned int cpu)
return 0;
}
+int hv_acpi_register_gsi(struct device *dev, u32 gsi, int trigger, int polarity);
+void hv_acpi_unregister_gsi(u32 gsi);
+
+extern int (*native_acpi_register_gsi)(struct device *dev, u32 gsi, int trigger, int polarity);
+extern void (*native_acpi_unregister_gsi)(u32 gsi);
+
static int __init hv_pci_init(void)
{
int gen2vm = efi_enabled(EFI_BOOT);
+ if (hv_root_partition) {
+ native_acpi_register_gsi = __acpi_register_gsi;
+ native_acpi_unregister_gsi = __acpi_unregister_gsi;
+ __acpi_register_gsi = hv_acpi_register_gsi;
+ __acpi_unregister_gsi = hv_acpi_unregister_gsi;
+ }
+
/*
* For Generation-2 VM, we exit from pci_arch_init() by returning 0.
* The purpose is to suppress the harmless warning:
diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
index 80109e3cbf8f..297f74e86af8 100644
--- a/arch/x86/hyperv/irqdomain.c
+++ b/arch/x86/hyperv/irqdomain.c
@@ -9,6 +9,8 @@
#include <linux/pci.h>
#include <linux/irq.h>
#include <asm/mshyperv.h>
+#include <asm/apic.h>
+#include <asm/io_apic.h>
struct rid_data {
struct pci_dev *bridge;
@@ -328,3 +330,227 @@ struct irq_domain * __init hv_create_pci_msi_domain(void)
return d;
}
+/* Copied from io_apic.c */
+union entry_union {
+ struct { u32 w1, w2; };
+ struct IO_APIC_route_entry entry;
+};
+
+static int hv_unmap_ioapic_interrupt(int gsi)
+{
+ union hv_device_id device_id;
+ int ioapic, ioapic_id;
+ u8 ioapic_pin;
+ struct IO_APIC_route_entry ire;
+ union entry_union eu;
+ struct hv_interrupt_entry entry;
+
+ ioapic = mp_find_ioapic(gsi);
+ ioapic_pin = mp_find_ioapic_pin(ioapic, gsi);
+ ioapic_id = mpc_ioapic_id(ioapic);
+ ire = ioapic_read_entry(ioapic, ioapic_pin);
+
+ eu.entry = ire;
+
+ /*
+ * Polarity may have been set by us, but Hyper-V expects the exact same
+ * entry. See the mapping routine.
+ */
+ eu.entry.polarity = 0;
+
+ memset(&entry, 0, sizeof(entry));
+ entry.source = HV_INTERRUPT_SOURCE_IOAPIC;
+ entry.ioapic_rte.low_uint32 = eu.w1;
+ entry.ioapic_rte.high_uint32 = eu.w2;
+
+ device_id.as_uint64 = 0;
+ device_id.device_type = HV_DEVICE_TYPE_IOAPIC;
+ device_id.ioapic.ioapic_id = (u8)ioapic_id;
+
+ return hv_unmap_interrupt(device_id.as_uint64, &entry) & HV_HYPERCALL_RESULT_MASK;
+}
+
+static int hv_map_ioapic_interrupt(int ioapic_id, int trigger, int vcpu, int vector,
+ struct hv_interrupt_entry *out_entry)
+{
+ unsigned long flags;
+ struct hv_input_map_device_interrupt *input;
+ struct hv_output_map_device_interrupt *output;
+ union hv_device_id device_id;
+ struct hv_device_interrupt_descriptor *intr_desc;
+ u16 status;
+
+ device_id.as_uint64 = 0;
+ device_id.device_type = HV_DEVICE_TYPE_IOAPIC;
+ device_id.ioapic.ioapic_id = (u8)ioapic_id;
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+ memset(input, 0, sizeof(*input));
+ intr_desc = &input->interrupt_descriptor;
+ input->partition_id = hv_current_partition_id;
+ input->device_id = device_id.as_uint64;
+ intr_desc->interrupt_type = HV_X64_INTERRUPT_TYPE_FIXED;
+ intr_desc->target.vector = vector;
+ intr_desc->vector_count = 1;
+
+ if (trigger)
+ intr_desc->trigger_mode = HV_INTERRUPT_TRIGGER_MODE_LEVEL;
+ else
+ intr_desc->trigger_mode = HV_INTERRUPT_TRIGGER_MODE_EDGE;
+
+ __set_bit(vcpu, (unsigned long *)&intr_desc->target.vp_mask);
+
+ status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_INTERRUPT, 0, 0, input, output) &
+ HV_HYPERCALL_RESULT_MASK;
+ local_irq_restore(flags);
+
+ *out_entry = output->interrupt_entry;
+
+ return status;
+}
+
+static unsigned int hv_ioapic_startup_irq(struct irq_data *data)
+{
+ u16 status;
+ struct IO_APIC_route_entry ire;
+ u32 vector;
+ struct irq_cfg *cfg;
+ int ioapic;
+ u8 ioapic_pin;
+ int ioapic_id;
+ int gsi;
+ union entry_union eu;
+ struct cpumask *affinity;
+ int cpu, vcpu;
+ struct hv_interrupt_entry entry;
+ struct mp_chip_data *mp_data = data->chip_data;
+
+ gsi = data->irq;
+ cfg = irqd_cfg(data);
+ affinity = irq_data_get_effective_affinity_mask(data);
+ cpu = cpumask_first_and(affinity, cpu_online_mask);
+ vcpu = hv_cpu_number_to_vp_number(cpu);
+
+ vector = cfg->vector;
+
+ ioapic = mp_find_ioapic(gsi);
+ ioapic_pin = mp_find_ioapic_pin(ioapic, gsi);
+ ioapic_id = mpc_ioapic_id(ioapic);
+ ire = ioapic_read_entry(ioapic, ioapic_pin);
+
+ /*
+ * Always try unmapping. We do not have visibility into whether
+ * an IO-APIC has been mapped or not. We can't use chip_data because it
+ * already points to mp_data.
+ *
+ * We don't use retarget interrupt hypercalls here because Hyper-V
+ * doesn't allow root to change the vector or specify VPs outside of
+ * the set that is initially used during mapping.
+ */
+ status = hv_unmap_ioapic_interrupt(gsi);
+
+ if (!(status == HV_STATUS_SUCCESS || status == HV_STATUS_INVALID_PARAMETER)) {
+ pr_debug("%s: unexpected unmap status %d\n", __func__, status);
+ return -EINVAL;
+ }
+
+ status = hv_map_ioapic_interrupt(ioapic_id, ire.trigger, vcpu, vector, &entry);
+
+ if (status != HV_STATUS_SUCCESS) {
+ pr_err("%s: map hypercall failed, status %d\n", __func__, status);
+ return -EINVAL;
+ }
+
+ /* Update the entry in mp_chip_data. It is used in other places. */
+ mp_data->entry = *(struct IO_APIC_route_entry *)&entry.ioapic_rte;
+
+ /* Sync polarity -- Hyper-V's returned polarity is always 0... */
+ mp_data->entry.polarity = ire.polarity;
+
+ eu.w1 = entry.ioapic_rte.low_uint32;
+ eu.w2 = entry.ioapic_rte.high_uint32;
+ ioapic_write_entry(ioapic, ioapic_pin, eu.entry);
+
+ return 0;
+}
+
+static void hv_ioapic_mask_irq(struct irq_data *data)
+{
+ mask_ioapic_irq(data);
+}
+
+static void hv_ioapic_unmask_irq(struct irq_data *data)
+{
+ unmask_ioapic_irq(data);
+}
+
+static int hv_ioapic_set_affinity(struct irq_data *data,
+ const struct cpumask *mask, bool force)
+{
+ /*
+ * We only update the affinity mask here. Programming the hardware is
+ * done in irq_startup.
+ */
+ return ioapic_set_affinity(data, mask, force);
+}
+
+void hv_ioapic_ack_level(struct irq_data *irq_data)
+{
+ /*
+ * Per email exchange with Hyper-V team, all that is needed is a write
+ * to LAPIC's EOI register. They don't support directed EOI to IO-APIC.
+ * Hyper-V handles it for us.
+ */
+ apic_ack_irq(irq_data);
+}
+
+struct irq_chip hv_ioapic_chip __read_mostly = {
+ .name = "HV-IO-APIC",
+ .irq_startup = hv_ioapic_startup_irq,
+ .irq_mask = hv_ioapic_mask_irq,
+ .irq_unmask = hv_ioapic_unmask_irq,
+ .irq_ack = irq_chip_ack_parent,
+ .irq_eoi = hv_ioapic_ack_level,
+ .irq_set_affinity = hv_ioapic_set_affinity,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .irq_get_irqchip_state = ioapic_irq_get_chip_state,
+ .flags = IRQCHIP_SKIP_SET_WAKE,
+};
+
+
+int (*native_acpi_register_gsi)(struct device *dev, u32 gsi, int trigger, int polarity);
+void (*native_acpi_unregister_gsi)(u32 gsi);
+
+int hv_acpi_register_gsi(struct device *dev, u32 gsi, int trigger, int polarity)
+{
+ int irq = gsi;
+
+#ifdef CONFIG_X86_IO_APIC
+ irq = native_acpi_register_gsi(dev, gsi, trigger, polarity);
+ if (irq < 0) {
+ pr_err("native_acpi_register_gsi failed %d\n", irq);
+ return irq;
+ }
+
+ if (trigger) {
+ irq_set_status_flags(irq, IRQ_LEVEL);
+ irq_set_chip_and_handler_name(irq, &hv_ioapic_chip,
+ handle_fasteoi_irq, "ioapic-fasteoi");
+ } else {
+ irq_clear_status_flags(irq, IRQ_LEVEL);
+ irq_set_chip_and_handler_name(irq, &hv_ioapic_chip,
+ handle_edge_irq, "ioapic-edge");
+ }
+#endif
+ return irq;
+}
+
+void hv_acpi_unregister_gsi(u32 gsi)
+{
+#ifdef CONFIG_X86_IO_APIC
+ (void)hv_unmap_ioapic_interrupt(gsi);
+ native_acpi_unregister_gsi(gsi);
+#endif
+}
--
2.20.1
We are about to implement an irqchip for IO-APIC when Linux runs as root
on Microsoft Hypervisor. At the same time we would like to reuse
existing code as much as possible.
Move mp_chip_data to io_apic.h and make a few helper functions
non-static.
No functional change.
Signed-off-by: Wei Liu <[email protected]>
---
arch/x86/include/asm/io_apic.h | 21 +++++++++++++++++++++
arch/x86/kernel/apic/io_apic.c | 28 +++++++++-------------------
2 files changed, 30 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
index a1a26f6d3aa4..1375983a6028 100644
--- a/arch/x86/include/asm/io_apic.h
+++ b/arch/x86/include/asm/io_apic.h
@@ -152,6 +152,15 @@ extern unsigned long io_apic_irqs;
#define io_apic_assign_pci_irqs \
(mp_irq_entries && !skip_ioapic_setup && io_apic_irqs)
+struct mp_chip_data {
+ struct list_head irq_2_pin;
+ struct IO_APIC_route_entry entry;
+ int trigger;
+ int polarity;
+ u32 count;
+ bool isa_irq;
+};
+
struct irq_cfg;
extern void ioapic_insert_resources(void);
extern int arch_early_ioapic_init(void);
@@ -195,6 +204,18 @@ extern void clear_IO_APIC(void);
extern void restore_boot_irq_mode(void);
extern int IO_APIC_get_PCI_irq_vector(int bus, int devfn, int pin);
extern void print_IO_APICs(void);
+
+struct irq_data;
+extern struct IO_APIC_route_entry ioapic_read_entry(int apic, int pin);
+extern void ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e);
+extern void mask_ioapic_irq(struct irq_data *irq_data);
+extern void unmask_ioapic_irq(struct irq_data *irq_data);
+extern int ioapic_set_affinity(struct irq_data *irq_data, const struct cpumask *mask, bool force);
+extern struct irq_domain *mp_ioapic_irqdomain(int ioapic);
+enum irqchip_irq_state;
+extern int ioapic_irq_get_chip_state(struct irq_data *irqd,
+ enum irqchip_irq_state which,
+ bool *state);
#else /* !CONFIG_X86_IO_APIC */
#define IO_APIC_IRQ(x) 0
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 7b3c7e0d4a09..23047f98b5e4 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -88,15 +88,6 @@ struct irq_pin_list {
int apic, pin;
};
-struct mp_chip_data {
- struct list_head irq_2_pin;
- struct IO_APIC_route_entry entry;
- int trigger;
- int polarity;
- u32 count;
- bool isa_irq;
-};
-
struct mp_ioapic_gsi {
u32 gsi_base;
u32 gsi_end;
@@ -154,7 +145,7 @@ static inline bool mp_is_legacy_irq(int irq)
return irq >= 0 && irq < nr_legacy_irqs();
}
-static inline struct irq_domain *mp_ioapic_irqdomain(int ioapic)
+struct irq_domain *mp_ioapic_irqdomain(int ioapic)
{
return ioapics[ioapic].irqdomain;
}
@@ -301,7 +292,7 @@ static struct IO_APIC_route_entry __ioapic_read_entry(int apic, int pin)
return eu.entry;
}
-static struct IO_APIC_route_entry ioapic_read_entry(int apic, int pin)
+struct IO_APIC_route_entry ioapic_read_entry(int apic, int pin)
{
union entry_union eu;
unsigned long flags;
@@ -328,7 +319,7 @@ static void __ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e
io_apic_write(apic, 0x10 + 2*pin, eu.w1);
}
-static void ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e)
+void ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e)
{
unsigned long flags;
@@ -453,7 +444,7 @@ static void io_apic_sync(struct irq_pin_list *entry)
readl(&io_apic->data);
}
-static void mask_ioapic_irq(struct irq_data *irq_data)
+void mask_ioapic_irq(struct irq_data *irq_data)
{
struct mp_chip_data *data = irq_data->chip_data;
unsigned long flags;
@@ -468,7 +459,7 @@ static void __unmask_ioapic(struct mp_chip_data *data)
io_apic_modify_irq(data, ~IO_APIC_REDIR_MASKED, 0, NULL);
}
-static void unmask_ioapic_irq(struct irq_data *irq_data)
+void unmask_ioapic_irq(struct irq_data *irq_data)
{
struct mp_chip_data *data = irq_data->chip_data;
unsigned long flags;
@@ -1868,8 +1859,7 @@ static void ioapic_configure_entry(struct irq_data *irqd)
__ioapic_write_entry(entry->apic, entry->pin, mpd->entry);
}
-static int ioapic_set_affinity(struct irq_data *irq_data,
- const struct cpumask *mask, bool force)
+int ioapic_set_affinity(struct irq_data *irq_data, const struct cpumask *mask, bool force)
{
struct irq_data *parent = irq_data->parent_data;
unsigned long flags;
@@ -1898,9 +1888,9 @@ static int ioapic_set_affinity(struct irq_data *irq_data,
*
* Verify that the corresponding Remote-IRR bits are clear.
*/
-static int ioapic_irq_get_chip_state(struct irq_data *irqd,
- enum irqchip_irq_state which,
- bool *state)
+int ioapic_irq_get_chip_state(struct irq_data *irqd,
+ enum irqchip_irq_state which,
+ bool *state)
{
struct mp_chip_data *mcd = irqd->chip_data;
struct IO_APIC_route_entry rentry;
--
2.20.1
These helper functions are used to deposit pages into Microsoft Hypervisor
and bring up logical and virtual processors.
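For reviewers, here is a small user-space sketch of how the deposit helper
splits a page count into power-of-two chunks. It assumes every allocation
succeeds at the desired order, so the fall-back to smaller orders is not
modelled. deposit_chunk_orders() is a hypothetical name, and
__builtin_clz() is a GCC/Clang builtin.

```c
#include <stdint.h>

#define DEPOSIT_MAX_ORDER 8	/* same cap as HV_DEPOSIT_MAX_ORDER */

/*
 * Fill orders[] with the allocation order of each chunk and return the
 * number of chunks written (at most max_chunks).
 *
 * Assumption: each allocation succeeds at the order asked for.
 */
int deposit_chunk_orders(uint32_t num_pages, int *orders, int max_chunks)
{
	int order = DEPOSIT_MAX_ORDER;
	int n = 0;

	while (num_pages && n < max_chunks) {
		/* Largest power of two that still fits in num_pages. */
		int desired_order = 31 - __builtin_clz(num_pages);

		/* Never go above the previous order or the cap. */
		if (desired_order < order)
			order = desired_order;

		orders[n++] = order;
		num_pages -= UINT32_C(1) << order;
	}

	return n;
}
```

Under these assumptions a request for 90 pages becomes chunks of 64, 16, 8
and 2 pages (orders 6, 4, 3 and 1), and the chunk sizes add up to exactly the
requested count.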
Signed-off-by: Lillian Grassin-Drake <[email protected]>
Signed-off-by: Sunil Muthuswamy <[email protected]>
Signed-off-by: Nuno Das Neves <[email protected]>
Co-Developed-by: Lillian Grassin-Drake <[email protected]>
Co-Developed-by: Sunil Muthuswamy <[email protected]>
Co-Developed-by: Nuno Das Neves <[email protected]>
Signed-off-by: Wei Liu <[email protected]>
---
v3:
1. Add __packed to structures.
2. Drop unnecessary exports.
v2:
1. Adapt to hypervisor side changes
2. Address Vitaly's comments
---
arch/x86/hyperv/Makefile | 2 +-
arch/x86/hyperv/hv_proc.c | 215 ++++++++++++++++++++++++++++++
arch/x86/include/asm/mshyperv.h | 4 +
include/asm-generic/hyperv-tlfs.h | 67 ++++++++++
4 files changed, 287 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/hyperv/hv_proc.c
diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
index 89b1f74d3225..565358020921 100644
--- a/arch/x86/hyperv/Makefile
+++ b/arch/x86/hyperv/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
obj-y := hv_init.o mmu.o nested.o
-obj-$(CONFIG_X86_64) += hv_apic.o
+obj-$(CONFIG_X86_64) += hv_apic.o hv_proc.o
ifdef CONFIG_X86_64
obj-$(CONFIG_PARAVIRT_SPINLOCKS) += hv_spinlock.o
diff --git a/arch/x86/hyperv/hv_proc.c b/arch/x86/hyperv/hv_proc.c
new file mode 100644
index 000000000000..212692b15fa8
--- /dev/null
+++ b/arch/x86/hyperv/hv_proc.c
@@ -0,0 +1,215 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/version.h>
+#include <linux/vmalloc.h>
+#include <linux/mm.h>
+#include <linux/clockchips.h>
+#include <linux/acpi.h>
+#include <linux/hyperv.h>
+#include <linux/slab.h>
+#include <linux/cpuhotplug.h>
+#include <linux/minmax.h>
+#include <asm/hypervisor.h>
+#include <asm/mshyperv.h>
+#include <asm/apic.h>
+
+#include <asm/trace/hyperv.h>
+
+#define HV_DEPOSIT_MAX_ORDER (8)
+#define HV_DEPOSIT_MAX (1 << HV_DEPOSIT_MAX_ORDER)
+
+/*
+ * Deposits exact number of pages
+ * Must be called with interrupts enabled
+ * Max 256 pages
+ */
+int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
+{
+ struct page **pages;
+ int *counts;
+ int num_allocations;
+ int i, j, page_count;
+ int order;
+ int desired_order;
+ u16 status;
+ int ret;
+ u64 base_pfn;
+ struct hv_deposit_memory *input_page;
+ unsigned long flags;
+
+ if (num_pages > HV_DEPOSIT_MAX)
+ return -E2BIG;
+ if (!num_pages)
+ return 0;
+
+ /* One buffer for page pointers and counts */
+ pages = page_address(alloc_page(GFP_KERNEL));
+ if (!pages)
+ return -ENOMEM;
+
+ counts = kcalloc(HV_DEPOSIT_MAX, sizeof(int), GFP_KERNEL);
+ if (!counts) {
+ free_page((unsigned long)pages);
+ return -ENOMEM;
+ }
+
+ /* Allocate all the pages before disabling interrupts */
+ num_allocations = 0;
+ i = 0;
+ order = HV_DEPOSIT_MAX_ORDER;
+
+ while (num_pages) {
+ /* Find highest order we can actually allocate */
+ desired_order = 31 - __builtin_clz(num_pages);
+ order = min(desired_order, order);
+ do {
+ pages[i] = alloc_pages_node(node, GFP_KERNEL, order);
+ if (!pages[i]) {
+ if (!order) {
+ ret = -ENOMEM;
+ goto err_free_allocations;
+ }
+ --order;
+ }
+ } while (!pages[i]);
+
+ split_page(pages[i], order);
+ counts[i] = 1 << order;
+ num_pages -= counts[i];
+ i++;
+ num_allocations++;
+ }
+
+ local_irq_save(flags);
+
+ input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
+
+ input_page->partition_id = partition_id;
+
+ /* Populate gpa_page_list - these will fit on the input page */
+ for (i = 0, page_count = 0; i < num_allocations; ++i) {
+ base_pfn = page_to_pfn(pages[i]);
+ for (j = 0; j < counts[i]; ++j, ++page_count)
+ input_page->gpa_page_list[page_count] = base_pfn + j;
+ }
+ status = hv_do_rep_hypercall(HVCALL_DEPOSIT_MEMORY,
+ page_count, 0, input_page,
+ NULL) & HV_HYPERCALL_RESULT_MASK;
+ local_irq_restore(flags);
+
+ if (status != HV_STATUS_SUCCESS) {
+ pr_err("Failed to deposit pages: %d\n", status);
+ ret = status;
+ goto err_free_allocations;
+ }
+
+ ret = 0;
+ goto free_buf;
+
+err_free_allocations:
+ for (i = 0; i < num_allocations; ++i) {
+ base_pfn = page_to_pfn(pages[i]);
+ for (j = 0; j < counts[i]; ++j)
+ __free_page(pfn_to_page(base_pfn + j));
+ }
+
+free_buf:
+ free_page((unsigned long)pages);
+ kfree(counts);
+ return ret;
+}
+
+int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
+{
+ struct hv_add_logical_processor_in *input;
+ struct hv_add_logical_processor_out *output;
+ int status;
+ unsigned long flags;
+ int ret = 0;
+
+ /*
+ * When adding a logical processor, the hypervisor may return
+ * HV_STATUS_INSUFFICIENT_MEMORY. When that happens, we deposit more
+ * pages and retry.
+ */
+ do {
+ local_irq_save(flags);
+
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ /* We don't do anything with the output right now */
+ output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+ input->lp_index = lp_index;
+ input->apic_id = apic_id;
+ input->flags = 0;
+ input->proximity_domain_info.domain_id = node_to_pxm(node);
+ input->proximity_domain_info.flags.reserved = 0;
+ input->proximity_domain_info.flags.proximity_info_valid = 1;
+ input->proximity_domain_info.flags.proximity_preferred = 1;
+ status = hv_do_hypercall(HVCALL_ADD_LOGICAL_PROCESSOR,
+ input, output);
+ local_irq_restore(flags);
+
+ if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
+ if (status != HV_STATUS_SUCCESS) {
+ pr_err("%s: cpu %u apic ID %u, %d\n", __func__,
+ lp_index, apic_id, status);
+ ret = status;
+ }
+ break;
+ }
+ ret = hv_call_deposit_pages(node, hv_current_partition_id, 1);
+ } while (!ret);
+
+ return ret;
+}
+
+int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
+{
+ struct hv_create_vp *input;
+ u16 status;
+ unsigned long irq_flags;
+ int ret = 0;
+
+ /* Root VPs don't seem to need pages deposited */
+ if (partition_id != hv_current_partition_id) {
+ ret = hv_call_deposit_pages(node, partition_id, 90);
+ if (ret)
+ return ret;
+ }
+
+ do {
+ local_irq_save(irq_flags);
+
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+
+ input->partition_id = partition_id;
+ input->vp_index = vp_index;
+ input->flags = flags;
+ input->subnode_type = HvSubnodeAny;
+ if (node != NUMA_NO_NODE) {
+ input->proximity_domain_info.domain_id = node_to_pxm(node);
+ input->proximity_domain_info.flags.reserved = 0;
+ input->proximity_domain_info.flags.proximity_info_valid = 1;
+ input->proximity_domain_info.flags.proximity_preferred = 1;
+ } else {
+ input->proximity_domain_info.as_uint64 = 0;
+ }
+ status = hv_do_hypercall(HVCALL_CREATE_VP, input, NULL);
+ local_irq_restore(irq_flags);
+
+ if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
+ if (status != HV_STATUS_SUCCESS) {
+ pr_err("%s: vcpu %u, lp %u, %d\n", __func__,
+ vp_index, flags, status);
+ ret = status;
+ }
+ break;
+ }
+ ret = hv_call_deposit_pages(node, partition_id, 1);
+
+ } while (!ret);
+
+ return ret;
+}
+
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 67f5d35a73d3..4e590a167160 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -80,6 +80,10 @@ extern void __percpu **hyperv_pcpu_output_arg;
extern u64 hv_current_partition_id;
+int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
+int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id);
+int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
+
static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
{
u64 input_address = input ? virt_to_phys(input) : 0;
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 87b1a79b19eb..ec53570102f0 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -142,6 +142,8 @@ struct ms_hyperv_tsc_page {
#define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX 0x0014
#define HVCALL_SEND_IPI_EX 0x0015
#define HVCALL_GET_PARTITION_ID 0x0046
+#define HVCALL_DEPOSIT_MEMORY 0x0048
+#define HVCALL_CREATE_VP 0x004e
#define HVCALL_GET_VP_REGISTERS 0x0050
#define HVCALL_SET_VP_REGISTERS 0x0051
#define HVCALL_POST_MESSAGE 0x005c
@@ -149,6 +151,7 @@ struct ms_hyperv_tsc_page {
#define HVCALL_POST_DEBUG_DATA 0x0069
#define HVCALL_RETRIEVE_DEBUG_DATA 0x006a
#define HVCALL_RESET_DEBUG_SESSION 0x006b
+#define HVCALL_ADD_LOGICAL_PROCESSOR 0x0076
#define HVCALL_RETARGET_INTERRUPT 0x007e
#define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af
#define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0
@@ -413,6 +416,70 @@ struct hv_get_partition_id {
u64 partition_id;
} __packed;
+/* HvDepositMemory hypercall */
+struct hv_deposit_memory {
+ u64 partition_id;
+ u64 gpa_page_list[];
+} __packed;
+
+struct hv_proximity_domain_flags {
+ u32 proximity_preferred : 1;
+ u32 reserved : 30;
+ u32 proximity_info_valid : 1;
+} __packed;
+
+/* Not a union in Windows but useful for zeroing */
+union hv_proximity_domain_info {
+ struct {
+ u32 domain_id;
+ struct hv_proximity_domain_flags flags;
+ };
+ u64 as_uint64;
+} __packed;
+
+struct hv_lp_startup_status {
+ u64 hv_status;
+ u64 substatus1;
+ u64 substatus2;
+ u64 substatus3;
+ u64 substatus4;
+ u64 substatus5;
+ u64 substatus6;
+} __packed;
+
+/* HvAddLogicalProcessor hypercall */
+struct hv_add_logical_processor_in {
+ u32 lp_index;
+ u32 apic_id;
+ union hv_proximity_domain_info proximity_domain_info;
+ u64 flags;
+};
+
+struct hv_add_logical_processor_out {
+ struct hv_lp_startup_status startup_status;
+} __packed;
+
+enum HV_SUBNODE_TYPE
+{
+ HvSubnodeAny = 0,
+ HvSubnodeSocket,
+ HvSubnodeAmdNode,
+ HvSubnodeL3,
+ HvSubnodeCount,
+ HvSubnodeInvalid = -1
+};
+
+/* HvCreateVp hypercall */
+struct hv_create_vp {
+ u64 partition_id;
+ u32 vp_index;
+ u8 padding[3];
+ u8 subnode_type;
+ u64 subnode_id;
+ union hv_proximity_domain_info proximity_domain_info;
+ u64 flags;
+} __packed;
+
/* HvRetargetDeviceInterrupt hypercall */
union hv_msi_entry {
u64 as_uint64;
--
2.20.1
When Linux runs as the root partition, the TSC page requires a different
setup. Linux also has access to the MSR-based clocksource, so simply
disable the TSC page clocksource when Linux is the root partition.
Signed-off-by: Wei Liu <[email protected]>
Acked-by: Daniel Lezcano <[email protected]>
---
drivers/clocksource/hyperv_timer.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index ba04cb381cd3..269a691bd2c4 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -426,6 +426,9 @@ static bool __init hv_init_tsc_clocksource(void)
if (!(ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE))
return false;
+ if (hv_root_partition)
+ return false;
+
hv_read_reference_counter = read_hv_clock_tsc;
phys_addr = virt_to_phys(hv_get_tsc_page());
--
2.20.1
When Linux runs as the root partition, it will need to make hypercalls
which return data from the hypervisor.
Allocate pages for storing results when Linux runs as the root
partition.
Co-developed-by: Lillian Grassin-Drake <[email protected]>
Signed-off-by: Lillian Grassin-Drake <[email protected]>
Signed-off-by: Wei Liu <[email protected]>
---
v3: Fix hv_cpu_die to use free_pages.
v2: Address Vitaly's comments
---
arch/x86/hyperv/hv_init.c | 35 ++++++++++++++++++++++++++++-----
arch/x86/include/asm/mshyperv.h | 1 +
2 files changed, 31 insertions(+), 5 deletions(-)
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e04d90af4c27..6f4cb40e53fe 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -41,6 +41,9 @@ EXPORT_SYMBOL_GPL(hv_vp_assist_page);
void __percpu **hyperv_pcpu_input_arg;
EXPORT_SYMBOL_GPL(hyperv_pcpu_input_arg);
+void __percpu **hyperv_pcpu_output_arg;
+EXPORT_SYMBOL_GPL(hyperv_pcpu_output_arg);
+
u32 hv_max_vp_index;
EXPORT_SYMBOL_GPL(hv_max_vp_index);
@@ -73,12 +76,19 @@ static int hv_cpu_init(unsigned int cpu)
void **input_arg;
struct page *pg;
- input_arg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
/* hv_cpu_init() can be called with IRQs disabled from hv_resume() */
- pg = alloc_page(irqs_disabled() ? GFP_ATOMIC : GFP_KERNEL);
+ pg = alloc_pages(irqs_disabled() ? GFP_ATOMIC : GFP_KERNEL, hv_root_partition ? 1 : 0);
if (unlikely(!pg))
return -ENOMEM;
+
+ input_arg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
*input_arg = page_address(pg);
+ if (hv_root_partition) {
+ void **output_arg;
+
+ output_arg = (void **)this_cpu_ptr(hyperv_pcpu_output_arg);
+ *output_arg = page_address(pg + 1);
+ }
hv_get_vp_index(msr_vp_index);
@@ -205,14 +215,23 @@ static int hv_cpu_die(unsigned int cpu)
unsigned int new_cpu;
unsigned long flags;
void **input_arg;
- void *input_pg = NULL;
+ void *pg;
local_irq_save(flags);
input_arg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
- input_pg = *input_arg;
+ pg = *input_arg;
*input_arg = NULL;
+
+ if (hv_root_partition) {
+ void **output_arg;
+
+ output_arg = (void **)this_cpu_ptr(hyperv_pcpu_output_arg);
+ *output_arg = NULL;
+ }
+
local_irq_restore(flags);
- free_page((unsigned long)input_pg);
+
+ free_pages((unsigned long)pg, hv_root_partition ? 1 : 0);
if (hv_vp_assist_page && hv_vp_assist_page[cpu])
wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);
@@ -346,6 +365,12 @@ void __init hyperv_init(void)
BUG_ON(hyperv_pcpu_input_arg == NULL);
+ /* Allocate the per-CPU state for output arg for root */
+ if (hv_root_partition) {
+ hyperv_pcpu_output_arg = alloc_percpu(void *);
+ BUG_ON(hyperv_pcpu_output_arg == NULL);
+ }
+
/* Allocate percpu VP index */
hv_vp_index = kmalloc_array(num_possible_cpus(), sizeof(*hv_vp_index),
GFP_KERNEL);
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index ac2b0d110f03..62d9390f1ddf 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -76,6 +76,7 @@ static inline void hv_disable_stimer0_percpu_irq(int irq) {}
#if IS_ENABLED(CONFIG_HYPERV)
extern void *hv_hypercall_pg;
extern void __percpu **hyperv_pcpu_input_arg;
+extern void __percpu **hyperv_pcpu_output_arg;
static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
{
--
2.20.1
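A note on the allocation scheme in the patch above: for the root partition,
hv_cpu_init() takes one order-1 allocation (two contiguous pages) and uses
the second page as the output argument page, which is why hv_cpu_die() only
needs a single free_pages() call on the input pointer. The following is a
userspace sketch of that layout, assuming a 4096-byte page. The names
sk_arg_pages, sk_arg_pages_alloc and sk_arg_pages_free are hypothetical, and
aligned_alloc() stands in for alloc_pages().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SK_PAGE_SIZE 4096u /* assumed page size */

/* Hypothetical per-CPU pair of hypercall argument pages. */
struct sk_arg_pages {
	void *input;  /* first page of the block */
	void *output; /* second page, or NULL when not root */
};

/* One contiguous block: 2 pages for root, 1 page otherwise. */
static int sk_arg_pages_alloc(struct sk_arg_pages *p, int is_root)
{
	size_t npages = is_root ? 2 : 1;
	void *base;

	assert(p != NULL);

	base = aligned_alloc(SK_PAGE_SIZE, npages * SK_PAGE_SIZE);
	if (!base)
		return -1;

	p->input = base;
	p->output = is_root ? (char *)base + SK_PAGE_SIZE : NULL;
	return 0;
}

/* The output page is part of the same block, so one free releases both. */
static void sk_arg_pages_free(struct sk_arg_pages *p)
{
	free(p->input);
	p->input = NULL;
	p->output = NULL;
}
```

Keeping both pages in one block means the output pointer never has to be
freed on its own; it only has to be cleared.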
Co-developed-by: Sunil Muthuswamy <[email protected]>
Signed-off-by: Sunil Muthuswamy <[email protected]>
Signed-off-by: Wei Liu <[email protected]>
---
arch/x86/include/asm/hyperv-tlfs.h | 13 +++++++++++
include/asm-generic/hyperv-tlfs.h | 36 ++++++++++++++++++++++++++++++
2 files changed, 49 insertions(+)
diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 41b628b9fb15..592c75e51e0f 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -526,6 +526,19 @@ struct hv_partition_assist_pg {
u32 tlb_lock_count;
};
+enum hv_interrupt_type {
+ HV_X64_INTERRUPT_TYPE_FIXED = 0x0000,
+ HV_X64_INTERRUPT_TYPE_LOWESTPRIORITY = 0x0001,
+ HV_X64_INTERRUPT_TYPE_SMI = 0x0002,
+ HV_X64_INTERRUPT_TYPE_REMOTEREAD = 0x0003,
+ HV_X64_INTERRUPT_TYPE_NMI = 0x0004,
+ HV_X64_INTERRUPT_TYPE_INIT = 0x0005,
+ HV_X64_INTERRUPT_TYPE_SIPI = 0x0006,
+ HV_X64_INTERRUPT_TYPE_EXTINT = 0x0007,
+ HV_X64_INTERRUPT_TYPE_LOCALINT0 = 0x0008,
+ HV_X64_INTERRUPT_TYPE_LOCALINT1 = 0x0009,
+ HV_X64_INTERRUPT_TYPE_MAXIMUM = 0x000A,
+};
#include <asm-generic/hyperv-tlfs.h>
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 42ff1326c6bd..07efe0131fe3 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -152,6 +152,8 @@ struct ms_hyperv_tsc_page {
#define HVCALL_RETRIEVE_DEBUG_DATA 0x006a
#define HVCALL_RESET_DEBUG_SESSION 0x006b
#define HVCALL_ADD_LOGICAL_PROCESSOR 0x0076
+#define HVCALL_MAP_DEVICE_INTERRUPT 0x007c
+#define HVCALL_UNMAP_DEVICE_INTERRUPT 0x007d
#define HVCALL_RETARGET_INTERRUPT 0x007e
#define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af
#define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0
@@ -702,4 +704,38 @@ union hv_device_id {
} acpi;
} __packed;
+enum hv_interrupt_trigger_mode {
+ HV_INTERRUPT_TRIGGER_MODE_EDGE = 0,
+ HV_INTERRUPT_TRIGGER_MODE_LEVEL = 1,
+};
+
+struct hv_device_interrupt_descriptor {
+ u32 interrupt_type;
+ u32 trigger_mode;
+ u32 vector_count;
+ u32 reserved;
+ struct hv_device_interrupt_target target;
+} __packed;
+
+struct hv_input_map_device_interrupt {
+ u64 partition_id;
+ u64 device_id;
+ u64 flags;
+ struct hv_interrupt_entry logical_interrupt_entry;
+ struct hv_device_interrupt_descriptor interrupt_descriptor;
+} __packed;
+
+struct hv_output_map_device_interrupt {
+ struct hv_interrupt_entry interrupt_entry;
+} __packed;
+
+struct hv_input_unmap_device_interrupt {
+ u64 partition_id;
+ u64 device_id;
+ struct hv_interrupt_entry interrupt_entry;
+} __packed;
+
+#define HV_SOURCE_SHADOW_NONE 0x0
+#define HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE 0x1
+
#endif
--
2.20.1
We will need the partition ID for executing some hypercalls later.
Signed-off-by: Lillian Grassin-Drake <[email protected]>
Co-Developed-by: Sunil Muthuswamy <[email protected]>
Signed-off-by: Wei Liu <[email protected]>
---
v3:
1. Make hv_get_partition_id static.
2. Change code structure a bit.
---
arch/x86/hyperv/hv_init.c | 27 +++++++++++++++++++++++++++
arch/x86/include/asm/mshyperv.h | 2 ++
include/asm-generic/hyperv-tlfs.h | 6 ++++++
3 files changed, 35 insertions(+)
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 6f4cb40e53fe..fc9941bd8653 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -26,6 +26,9 @@
#include <linux/syscore_ops.h>
#include <clocksource/hyperv_timer.h>
+u64 hv_current_partition_id = ~0ull;
+EXPORT_SYMBOL_GPL(hv_current_partition_id);
+
void *hv_hypercall_pg;
EXPORT_SYMBOL_GPL(hv_hypercall_pg);
@@ -331,6 +334,25 @@ static struct syscore_ops hv_syscore_ops = {
.resume = hv_resume,
};
+static void __init hv_get_partition_id(void)
+{
+ struct hv_get_partition_id *output_page;
+ u16 status;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
+ status = hv_do_hypercall(HVCALL_GET_PARTITION_ID, NULL, output_page) &
+ HV_HYPERCALL_RESULT_MASK;
+ if (status != HV_STATUS_SUCCESS) {
+ /* No point in proceeding if this failed */
+ pr_err("Failed to get partition ID: %d\n", status);
+ BUG();
+ }
+ hv_current_partition_id = output_page->partition_id;
+ local_irq_restore(flags);
+}
+
/*
* This function is to be invoked early in the boot sequence after the
* hypervisor has been detected.
@@ -426,6 +448,11 @@ void __init hyperv_init(void)
register_syscore_ops(&hv_syscore_ops);
+ if (cpuid_ebx(HYPERV_CPUID_FEATURES) & HV_ACCESS_PARTITION_ID)
+ hv_get_partition_id();
+
+ BUG_ON(hv_root_partition && hv_current_partition_id == ~0ull);
+
return;
remove_cpuhp_state:
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 62d9390f1ddf..67f5d35a73d3 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -78,6 +78,8 @@ extern void *hv_hypercall_pg;
extern void __percpu **hyperv_pcpu_input_arg;
extern void __percpu **hyperv_pcpu_output_arg;
+extern u64 hv_current_partition_id;
+
static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
{
u64 input_address = input ? virt_to_phys(input) : 0;
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index e6903589a82a..87b1a79b19eb 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -141,6 +141,7 @@ struct ms_hyperv_tsc_page {
#define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX 0x0013
#define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX 0x0014
#define HVCALL_SEND_IPI_EX 0x0015
+#define HVCALL_GET_PARTITION_ID 0x0046
#define HVCALL_GET_VP_REGISTERS 0x0050
#define HVCALL_SET_VP_REGISTERS 0x0051
#define HVCALL_POST_MESSAGE 0x005c
@@ -407,6 +408,11 @@ struct hv_tlb_flush_ex {
u64 gva_list[];
} __packed;
+/* HvGetPartitionId hypercall (output only) */
+struct hv_get_partition_id {
+ u64 partition_id;
+} __packed;
+
/* HvRetargetDeviceInterrupt hypercall */
union hv_msi_entry {
u64 as_uint64;
--
2.20.1
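A note on the status handling in the patch above: hv_get_partition_id()
masks the 64-bit hypercall return value with HV_HYPERCALL_RESULT_MASK before
comparing it against HV_STATUS_SUCCESS. The sketch below shows that decoding
in userspace. The field positions (status in bits 15:0, completed rep count
in bits 43:32) reflect my reading of the TLFS and should be treated as
assumptions; the sk_* names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout of the hypercall result value. */
#define SK_RESULT_MASK     0xffffULL
#define SK_REP_COMP_OFFSET 32
#define SK_REP_COMP_MASK   (0xfffULL << SK_REP_COMP_OFFSET)

/* Status code: the low 16 bits of the result. */
static inline uint16_t sk_result_status(uint64_t result)
{
	return (uint16_t)(result & SK_RESULT_MASK);
}

/* For rep hypercalls: how many repetitions completed. */
static inline uint16_t sk_result_reps_completed(uint64_t result)
{
	return (uint16_t)((result & SK_REP_COMP_MASK) >> SK_REP_COMP_OFFSET);
}
```

Comparing the unmasked 64-bit value against a status constant would give
the wrong answer whenever the upper bits are non-zero, for example after a
partially completed rep hypercall.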
Microsoft Hypervisor requires the root partition to make a few
hypercalls to set up application processors before they can be used.
Co-developed-by: Lillian Grassin-Drake <[email protected]>
Signed-off-by: Lillian Grassin-Drake <[email protected]>
Co-developed-by: Sunil Muthuswamy <[email protected]>
Signed-off-by: Sunil Muthuswamy <[email protected]>
Signed-off-by: Wei Liu <[email protected]>
---
CPU hotplug and unplug are not yet supported in this setup, so those
paths remain untouched.
v3: Always call native SMP preparation function.
---
arch/x86/kernel/cpu/mshyperv.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index f0b8c702c858..956007d2bf0d 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -31,6 +31,7 @@
#include <asm/reboot.h>
#include <asm/nmi.h>
#include <clocksource/hyperv_timer.h>
+#include <asm/numa.h>
/* Is Linux running as the root partition? */
bool hv_root_partition;
@@ -212,6 +213,32 @@ static void __init hv_smp_prepare_boot_cpu(void)
hv_init_spinlocks();
#endif
}
+
+static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
+{
+#ifdef CONFIG_X86_64
+ int i;
+ int ret;
+#endif
+
+ native_smp_prepare_cpus(max_cpus);
+
+#ifdef CONFIG_X86_64
+ for_each_present_cpu(i) {
+ if (i == 0)
+ continue;
+ ret = hv_call_add_logical_proc(numa_cpu_node(i), i, cpu_physical_id(i));
+ BUG_ON(ret);
+ }
+
+ for_each_present_cpu(i) {
+ if (i == 0)
+ continue;
+ ret = hv_call_create_vp(numa_cpu_node(i), hv_current_partition_id, i, i);
+ BUG_ON(ret);
+ }
+#endif
+}
#endif
static void __init ms_hyperv_init_platform(void)
@@ -368,6 +395,8 @@ static void __init ms_hyperv_init_platform(void)
# ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = hv_smp_prepare_boot_cpu;
+ if (hv_root_partition)
+ smp_ops.smp_prepare_cpus = hv_smp_prepare_cpus;
# endif
/*
--
2.20.1
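A note on the ordering in the patch above: hv_smp_prepare_cpus() makes two
passes over the present CPUs, skipping CPU 0 in both. The first pass adds
every logical processor; only then does the second pass create the virtual
processors. The following userspace sketch records that order for a given
present mask. The sk_* names and the bitmask representation are hypothetical
and exist only for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

enum sk_step { SK_ADD_LP, SK_CREATE_VP };

struct sk_event {
	enum sk_step step;
	unsigned int cpu;
};

/*
 * Record the bring-up order for the CPUs set in present_mask.
 * Returns the number of events written to log.
 */
static size_t sk_plan_ap_bringup(unsigned long present_mask,
				 unsigned int nr_cpus,
				 struct sk_event *log, size_t cap)
{
	size_t n = 0;
	unsigned int cpu;
	int pass;

	assert(nr_cpus <= sizeof(unsigned long) * CHAR_BIT);

	/* Pass 0: add all logical processors. Pass 1: create all VPs. */
	for (pass = 0; pass < 2; pass++) {
		for (cpu = 1; cpu < nr_cpus; cpu++) { /* CPU 0 is the BSP */
			if (!(present_mask & (1UL << cpu)))
				continue;
			assert(n < cap);
			log[n].step = pass == 0 ? SK_ADD_LP : SK_CREATE_VP;
			log[n].cpu = cpu;
			n++;
		}
	}

	return n;
}
```

With CPUs 0, 1 and 3 present, the recorded order is add-LP 1, add-LP 3,
create-VP 1, create-VP 3.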
We will need to identify the device we want Microsoft Hypervisor to
manipulate. Introduce the data structures for that purpose.
They will be used in a later patch.
Co-developed-by: Sunil Muthuswamy <[email protected]>
Signed-off-by: Sunil Muthuswamy <[email protected]>
Signed-off-by: Wei Liu <[email protected]>
---
include/asm-generic/hyperv-tlfs.h | 79 +++++++++++++++++++++++++++++++
1 file changed, 79 insertions(+)
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 8423bf53c237..42ff1326c6bd 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -623,4 +623,83 @@ struct hv_set_vp_registers_input {
} element[];
} __packed;
+enum hv_device_type {
+ HV_DEVICE_TYPE_LOGICAL = 0,
+ HV_DEVICE_TYPE_PCI = 1,
+ HV_DEVICE_TYPE_IOAPIC = 2,
+ HV_DEVICE_TYPE_ACPI = 3,
+};
+
+typedef u16 hv_pci_rid;
+typedef u16 hv_pci_segment;
+typedef u64 hv_logical_device_id;
+union hv_pci_bdf {
+ u16 as_uint16;
+
+ struct {
+ u8 function:3;
+ u8 device:5;
+ u8 bus;
+ };
+} __packed;
+
+union hv_pci_bus_range {
+ u16 as_uint16;
+
+ struct {
+ u8 subordinate_bus;
+ u8 secondary_bus;
+ };
+} __packed;
+
+union hv_device_id {
+ u64 as_uint64;
+
+ struct {
+ u64 :62;
+ u64 device_type:2;
+ };
+
+ /* HV_DEVICE_TYPE_LOGICAL */
+ struct {
+ u64 id:62;
+ u64 device_type:2;
+ } logical;
+
+ /* HV_DEVICE_TYPE_PCI */
+ struct {
+ union {
+ hv_pci_rid rid;
+ union hv_pci_bdf bdf;
+ };
+
+ hv_pci_segment segment;
+ union hv_pci_bus_range shadow_bus_range;
+
+ u16 phantom_function_bits:2;
+ u16 source_shadow:1;
+
+ u16 rsvdz0:11;
+ u16 device_type:2;
+ } pci;
+
+ /* HV_DEVICE_TYPE_IOAPIC */
+ struct {
+ u8 ioapic_id;
+ u8 rsvdz0;
+ u16 rsvdz1;
+ u16 rsvdz2;
+
+ u16 rsvdz3:14;
+ u16 device_type:2;
+ } ioapic;
+
+ /* HV_DEVICE_TYPE_ACPI */
+ struct {
+ u32 input_mapping_base;
+ u32 input_mapping_count:30;
+ u32 device_type:2;
+ } acpi;
+} __packed;
+
#endif
--
2.20.1
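A note on union hv_device_id from the patch above: the PCI variant packs the
requester ID, segment, shadow bus range and a few flag bits into a single
u64. The sketch below builds the same value with explicit shifts instead of
bitfields. The bit positions assume the usual little-endian layout in which
the compiler allocates bitfields starting from the least significant bit;
that is an assumption about the ABI, not something the patch states. The
sk_* names are hypothetical.

```c
#include <stdint.h>

#define SK_DEVICE_TYPE_PCI 1u /* mirrors HV_DEVICE_TYPE_PCI */

/* Requester ID: bus in bits 15:8, device in bits 7:3, function in 2:0. */
static uint16_t sk_pci_rid(uint8_t bus, uint8_t dev, uint8_t fn)
{
	return (uint16_t)(((unsigned int)bus << 8) |
			  ((dev & 0x1fu) << 3) | (fn & 0x7u));
}

/*
 * Assumed bit layout of the PCI device ID:
 *   15:0   rid
 *   31:16  segment
 *   39:32  subordinate bus (shadow range)
 *   47:40  secondary bus (shadow range)
 *   49:48  phantom function bits (left as 0 here)
 *   50     source shadow (1 = bridge bus range)
 *   63:62  device type
 */
static uint64_t sk_pci_device_id(uint16_t segment, uint16_t rid,
				 int use_bus_range,
				 uint8_t secondary_bus,
				 uint8_t subordinate_bus)
{
	uint64_t id = 0;

	id |= (uint64_t)rid;
	id |= (uint64_t)segment << 16;
	if (use_bus_range) {
		id |= (uint64_t)subordinate_bus << 32;
		id |= (uint64_t)secondary_bus << 40;
		id |= 1ULL << 50;
	}
	id |= (uint64_t)SK_DEVICE_TYPE_PCI << 62;

	return id;
}
```

For example, bus 1, device 2, function 3 in segment 0 without a shadow bus
range gives a requester ID of 0x0113 and a device ID of 0x4000000000000113
under the layout assumed here.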
When Linux runs as the root partition there is no VMBus, nor any of the
other infrastructure that hv_acpi_init sets up, so skip the initialization.
Signed-off-by: Wei Liu <[email protected]>
---
v3: Return 0 instead of -ENODEV.
---
drivers/hv/vmbus_drv.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 4fad3e6745e5..23f5bce8f242 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2612,6 +2612,9 @@ static int __init hv_acpi_init(void)
if (!hv_is_hyperv_initialized())
return -ENODEV;
+ if (hv_root_partition)
+ return 0;
+
init_completion(&probe_event);
/*
--
2.20.1
On Tue, 2020-11-24 at 17:07 +0000, Wei Liu wrote:
> We will soon use the same structure to handle IO-APIC interrupts as
> well. Introduce an enum to identify the source and a data structure for
> IO-APIC RTE.
>
> While at it, update pci-hyperv.c to use the enum.
>
> No functional change.
>
> Signed-off-by: Wei Liu <[email protected]>
> Acked-by: Rob Herring <[email protected]>
The I/OAPIC is just a device for generating MSIs.
Can you check if this renders your patch obsolete:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/apic&id=5d5a97133887b2dfd8e2ad0347c3a02cc7aaa0cb
This makes the name match Hyper-V TLFS.
Signed-off-by: Wei Liu <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
---
include/asm-generic/hyperv-tlfs.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index e73a11850055..e6903589a82a 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -88,7 +88,7 @@
#define HV_CONNECT_PORT BIT(7)
#define HV_ACCESS_STATS BIT(8)
#define HV_DEBUGGING BIT(11)
-#define HV_CPU_POWER_MANAGEMENT BIT(12)
+#define HV_CPU_MANAGEMENT BIT(12)
/*
--
2.20.1
When Linux runs as the root partition on Microsoft Hypervisor, its
interrupts are remapped. Linux will need to explicitly map and unmap
interrupts for hardware.
Implement an MSI domain that issues the correct hypercalls, and install
it as the default MSI irq domain.
Co-developed-by: Sunil Muthuswamy <[email protected]>
Signed-off-by: Sunil Muthuswamy <[email protected]>
Signed-off-by: Wei Liu <[email protected]>
---
v3: build irqdomain.o for 32bit as well
v2: This patch is simplified due to upstream changes.
---
arch/x86/hyperv/Makefile | 2 +-
arch/x86/hyperv/hv_init.c | 9 +
arch/x86/hyperv/irqdomain.c | 330 ++++++++++++++++++++++++++++++++++++
3 files changed, 340 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/hyperv/irqdomain.c
diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
index 565358020921..48e2c51464e8 100644
--- a/arch/x86/hyperv/Makefile
+++ b/arch/x86/hyperv/Makefile
@@ -1,5 +1,5 @@
# SPDX-License-Identifier: GPL-2.0-only
-obj-y := hv_init.o mmu.o nested.o
+obj-y := hv_init.o mmu.o nested.o irqdomain.o
obj-$(CONFIG_X86_64) += hv_apic.o hv_proc.o
ifdef CONFIG_X86_64
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index ad8e77859b32..b58f958439a2 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -357,6 +357,8 @@ static void __init hv_get_partition_id(void)
local_irq_restore(flags);
}
+extern struct irq_domain *hv_create_pci_msi_domain(void);
+
/*
* This function is to be invoked early in the boot sequence after the
* hypervisor has been detected.
@@ -484,6 +486,13 @@ void __init hyperv_init(void)
BUG_ON(hv_root_partition && hv_current_partition_id == ~0ull);
+ /*
+ * If we're running as root, we want to create our own PCI MSI domain.
+ * We can't set this in hv_pci_init because that would be too late.
+ */
+ if (hv_root_partition)
+ x86_init.irqs.create_pci_msi_domain = hv_create_pci_msi_domain;
+
return;
remove_cpuhp_state:
diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
new file mode 100644
index 000000000000..80109e3cbf8f
--- /dev/null
+++ b/arch/x86/hyperv/irqdomain.c
@@ -0,0 +1,330 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// Irqdomain for Linux to run as the root partition on Microsoft Hypervisor.
+//
+// Authors:
+// Sunil Muthuswamy <[email protected]>
+// Wei Liu <[email protected]>
+
+#include <linux/pci.h>
+#include <linux/irq.h>
+#include <asm/mshyperv.h>
+
+struct rid_data {
+ struct pci_dev *bridge;
+ u32 rid;
+};
+
+static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
+{
+ struct rid_data *rd = data;
+ u8 bus = PCI_BUS_NUM(rd->rid);
+
+ if (pdev->bus->number != bus || PCI_BUS_NUM(alias) != bus) {
+ rd->bridge = pdev;
+ rd->rid = alias;
+ }
+
+ return 0;
+}
+
+static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
+{
+ union hv_device_id dev_id;
+ struct rid_data data = {
+ .bridge = NULL,
+ .rid = PCI_DEVID(dev->bus->number, dev->devfn)
+ };
+
+ pci_for_each_dma_alias(dev, get_rid_cb, &data);
+
+ dev_id.as_uint64 = 0;
+ dev_id.device_type = HV_DEVICE_TYPE_PCI;
+ dev_id.pci.segment = pci_domain_nr(dev->bus);
+
+ dev_id.pci.bdf.bus = PCI_BUS_NUM(data.rid);
+ dev_id.pci.bdf.device = PCI_SLOT(data.rid);
+ dev_id.pci.bdf.function = PCI_FUNC(data.rid);
+ dev_id.pci.source_shadow = HV_SOURCE_SHADOW_NONE;
+
+ if (data.bridge) {
+ int pos;
+
+ /*
+ * Microsoft Hypervisor requires a bus range when the bridge is
+ * running in PCI-X mode.
+ *
+ * To distinguish conventional vs PCI-X bridge, we can check
+ * the bridge's PCI-X Secondary Status Register, Secondary Bus
+ * Mode and Frequency bits. See PCI Express to PCI/PCI-X Bridge
+ * Specification Revision 1.0 5.2.2.1.3.
+ *
+ * Value zero means it is in conventional mode, otherwise it is
+ * in PCI-X mode.
+ */
+
+ pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
+ if (pos) {
+ u16 status;
+
+ pci_read_config_word(data.bridge, pos +
+ PCI_X_BRIDGE_SSTATUS, &status);
+
+ if (status & PCI_X_SSTATUS_FREQ) {
+ /* Non-zero, PCI-X mode */
+ u8 sec_bus, sub_bus;
+
+ dev_id.pci.source_shadow = HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE;
+
+ pci_read_config_byte(data.bridge, PCI_SECONDARY_BUS, &sec_bus);
+ dev_id.pci.shadow_bus_range.secondary_bus = sec_bus;
+ pci_read_config_byte(data.bridge, PCI_SUBORDINATE_BUS, &sub_bus);
+ dev_id.pci.shadow_bus_range.subordinate_bus = sub_bus;
+ }
+ }
+ }
+
+ return dev_id;
+}
+
+static int hv_map_msi_interrupt(struct pci_dev *dev, int vcpu, int vector,
+ struct hv_interrupt_entry *entry)
+{
+ struct hv_input_map_device_interrupt *input;
+ struct hv_output_map_device_interrupt *output;
+ struct hv_device_interrupt_descriptor *intr_desc;
+ unsigned long flags;
+ u16 status;
+
+ local_irq_save(flags);
+
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+ intr_desc = &input->interrupt_descriptor;
+ memset(input, 0, sizeof(*input));
+ input->partition_id = hv_current_partition_id;
+ input->device_id = hv_build_pci_dev_id(dev).as_uint64;
+ intr_desc->interrupt_type = HV_X64_INTERRUPT_TYPE_FIXED;
+ intr_desc->trigger_mode = HV_INTERRUPT_TRIGGER_MODE_EDGE;
+ intr_desc->vector_count = 1;
+ intr_desc->target.vector = vector;
+ __set_bit(vcpu, (unsigned long *)&intr_desc->target.vp_mask);
+
+ status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_INTERRUPT, 0, 0, input, output) &
+ HV_HYPERCALL_RESULT_MASK;
+ *entry = output->interrupt_entry;
+
+ local_irq_restore(flags);
+
+ if (status != HV_STATUS_SUCCESS)
+ pr_err("%s: hypercall failed, status %d\n", __func__, status);
+
+ return status;
+}
+
+static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi_msg *msg)
+{
+ /* High address is always 0 */
+ msg->address_hi = 0;
+ msg->address_lo = entry->msi_entry.address.as_uint32;
+ msg->data = entry->msi_entry.data.as_uint32;
+}
+
+static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry);
+static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
+{
+ struct msi_desc *msidesc;
+ struct pci_dev *dev;
+ struct hv_interrupt_entry out_entry, *stored_entry;
+ struct irq_cfg *cfg = irqd_cfg(data);
+ struct cpumask *affinity;
+ int cpu, vcpu;
+ u16 status;
+
+ msidesc = irq_data_get_msi_desc(data);
+ dev = msi_desc_to_pci_dev(msidesc);
+
+ if (!cfg) {
+ pr_debug("%s: cfg is NULL\n", __func__);
+ return;
+ }
+
+ affinity = irq_data_get_effective_affinity_mask(data);
+ cpu = cpumask_first_and(affinity, cpu_online_mask);
+ vcpu = hv_cpu_number_to_vp_number(cpu);
+
+ if (data->chip_data) {
+ /*
+ * This interrupt is already mapped. Let's unmap first.
+ *
+ * We don't use retarget interrupt hypercalls here because
+ * Microsoft Hypervisor doesn't allow root to change the vector
+ * or specify VPs outside of the set that is initially used
+ * during mapping.
+ */
+ stored_entry = data->chip_data;
+ data->chip_data = NULL;
+
+ status = hv_unmap_msi_interrupt(dev, stored_entry);
+
+ kfree(stored_entry);
+
+ if (status != HV_STATUS_SUCCESS) {
+ pr_debug("%s: failed to unmap, status %d\n", __func__, status);
+ return;
+ }
+ }
+
+ stored_entry = kzalloc(sizeof(*stored_entry), GFP_ATOMIC);
+ if (!stored_entry) {
+ pr_debug("%s: failed to allocate chip data\n", __func__);
+ return;
+ }
+
+ status = hv_map_msi_interrupt(dev, vcpu, cfg->vector, &out_entry);
+ if (status != HV_STATUS_SUCCESS) {
+ kfree(stored_entry);
+ return;
+ }
+
+ *stored_entry = out_entry;
+ data->chip_data = stored_entry;
+ entry_to_msi_msg(&out_entry, msg);
+
+ return;
+}
+
+static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *old_entry)
+{
+ unsigned long flags;
+ struct hv_input_unmap_device_interrupt *input;
+ struct hv_interrupt_entry *intr_entry;
+ u16 status;
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+
+ memset(input, 0, sizeof(*input));
+ intr_entry = &input->interrupt_entry;
+ input->partition_id = hv_current_partition_id;
+ input->device_id = id;
+ *intr_entry = *old_entry;
+
+ status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_INTERRUPT, 0, 0, input, NULL) &
+ HV_HYPERCALL_RESULT_MASK;
+ local_irq_restore(flags);
+
+ return status;
+}
+
+static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry)
+{
+ return hv_unmap_interrupt(hv_build_pci_dev_id(dev).as_uint64, old_entry)
+ & HV_HYPERCALL_RESULT_MASK;
+}
+
+static void hv_teardown_msi_irq_common(struct pci_dev *dev, struct msi_desc *msidesc, int irq)
+{
+ u16 status;
+ struct hv_interrupt_entry old_entry;
+ struct irq_desc *desc;
+ struct irq_data *data;
+ struct msi_msg msg;
+
+ desc = irq_to_desc(irq);
+ if (!desc) {
+ pr_debug("%s: no irq desc\n", __func__);
+ return;
+ }
+
+ data = &desc->irq_data;
+ if (!data) {
+ pr_debug("%s: no irq data\n", __func__);
+ return;
+ }
+
+ if (!data->chip_data) {
+ pr_debug("%s: no chip data!\n", __func__);
+ return;
+ }
+
+ old_entry = *(struct hv_interrupt_entry *)data->chip_data;
+ entry_to_msi_msg(&old_entry, &msg);
+
+ kfree(data->chip_data);
+ data->chip_data = NULL;
+
+ status = hv_unmap_msi_interrupt(dev, &old_entry);
+
+ if (status != HV_STATUS_SUCCESS) {
+ pr_err("%s: hypercall failed, status %d\n", __func__, status);
+ return;
+ }
+}
+
+static void hv_msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
+{
+ int i;
+ struct msi_desc *entry;
+ struct pci_dev *pdev;
+
+ if (WARN_ON_ONCE(!dev_is_pci(dev)))
+ return;
+
+ pdev = to_pci_dev(dev);
+
+ for_each_pci_msi_entry(entry, pdev) {
+ if (entry->irq) {
+ for (i = 0; i < entry->nvec_used; i++) {
+ hv_teardown_msi_irq_common(pdev, entry, entry->irq + i);
+ irq_domain_free_irqs(entry->irq + i, 1);
+ }
+ }
+ }
+}
+
+/*
+ * IRQ Chip for MSI PCI/PCI-X/PCI-Express Devices,
+ * which implement the MSI or MSI-X Capability Structure.
+ */
+static struct irq_chip hv_pci_msi_controller = {
+ .name = "HV-PCI-MSI",
+ .irq_unmask = pci_msi_unmask_irq,
+ .irq_mask = pci_msi_mask_irq,
+ .irq_ack = irq_chip_ack_parent,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .irq_compose_msi_msg = hv_irq_compose_msi_msg,
+ .irq_set_affinity = msi_domain_set_affinity,
+ .flags = IRQCHIP_SKIP_SET_WAKE,
+};
+
+static struct msi_domain_ops pci_msi_domain_ops = {
+ .domain_free_irqs = hv_msi_domain_free_irqs,
+ .msi_prepare = pci_msi_prepare,
+};
+
+static struct msi_domain_info hv_pci_msi_domain_info = {
+ .flags = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
+ MSI_FLAG_PCI_MSIX,
+ .ops = &pci_msi_domain_ops,
+ .chip = &hv_pci_msi_controller,
+ .handler = handle_edge_irq,
+ .handler_name = "edge",
+};
+
+struct irq_domain * __init hv_create_pci_msi_domain(void)
+{
+ struct irq_domain *d = NULL;
+ struct fwnode_handle *fn;
+
+ fn = irq_domain_alloc_named_fwnode("HV-PCI-MSI");
+ if (fn)
+ d = pci_msi_create_irq_domain(fn, &hv_pci_msi_domain_info, x86_vector_domain);
+
+ /* No point in going further if we can't get an irq domain */
+ BUG_ON(!d);
+
+ return d;
+}
+
--
2.20.1
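A note on hv_irq_compose_msi_msg() from the patch above: because the
hypervisor does not let root retarget an existing mapping to a new vector or
VP, the function unmaps any existing entry stored in chip_data before
mapping a fresh one, and teardown unmaps whatever is left. The following
userspace sketch models only that bookkeeping. All sk_* names are
hypothetical, and the counters stand in for the map and unmap hypercalls.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct hv_interrupt_entry. */
struct sk_entry {
	uint32_t vcpu;
	uint32_t vector;
};

/* Hypothetical stand-in for the per-IRQ chip_data pointer. */
struct sk_irq {
	struct sk_entry *chip_data;
};

/* Counts the map and unmap calls the hypervisor would have seen. */
struct sk_hv_stats {
	unsigned int maps;
	unsigned int unmaps;
};

/* Unmap the old entry (if any), then map and store a new one. */
static int sk_compose(struct sk_irq *irq, struct sk_hv_stats *hv,
		      uint32_t vcpu, uint32_t vector)
{
	struct sk_entry *e;

	assert(irq != NULL && hv != NULL);

	if (irq->chip_data) {
		e = irq->chip_data;
		irq->chip_data = NULL;
		hv->unmaps++; /* stands in for the unmap hypercall */
		free(e);
	}

	e = calloc(1, sizeof(*e));
	if (!e)
		return -1;

	e->vcpu = vcpu;
	e->vector = vector;
	hv->maps++; /* stands in for the map hypercall */
	irq->chip_data = e;
	return 0;
}

/* Teardown: unmap and free whatever mapping is still stored. */
static void sk_teardown(struct sk_irq *irq, struct sk_hv_stats *hv)
{
	if (!irq->chip_data)
		return;
	hv->unmaps++;
	free(irq->chip_data);
	irq->chip_data = NULL;
}
```

After two compose calls and one teardown, every map has a matching unmap, so
no stale mapping is left behind in this model.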
Hi Wei,
I love your patch! Perhaps something to improve:
[auto build test WARNING on tip/x86/core]
[also build test WARNING on asm-generic/master iommu/next tip/timers/core pci/next linus/master v5.10-rc5]
[cannot apply to next-20201124]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Wei-Liu/Introducing-Linux-root-partition-support-for-Microsoft-Hypervisor/20201125-011026
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 238c91115cd05c71447ea071624a4c9fe661f970
config: i386-randconfig-a015-20201124 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
# https://github.com/0day-ci/linux/commit/ae7533bcd9667c0f23b545d941d3c68460f91ea2
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Wei-Liu/Introducing-Linux-root-partition-support-for-Microsoft-Hypervisor/20201125-011026
git checkout ae7533bcd9667c0f23b545d941d3c68460f91ea2
# save the attached .config to linux build tree
make W=1 ARCH=i386
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
arch/x86/hyperv/irqdomain.c: In function 'hv_irq_compose_msi_msg':
arch/x86/hyperv/irqdomain.c:146:8: error: implicit declaration of function 'msi_desc_to_pci_dev'; did you mean 'msi_desc_to_dev'? [-Werror=implicit-function-declaration]
146 | dev = msi_desc_to_pci_dev(msidesc);
| ^~~~~~~~~~~~~~~~~~~
| msi_desc_to_dev
>> arch/x86/hyperv/irqdomain.c:146:6: warning: assignment to 'struct pci_dev *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
146 | dev = msi_desc_to_pci_dev(msidesc);
| ^
arch/x86/hyperv/irqdomain.c: In function 'hv_msi_domain_free_irqs':
arch/x86/hyperv/irqdomain.c:277:2: error: implicit declaration of function 'for_each_pci_msi_entry'; did you mean 'for_each_msi_entry'? [-Werror=implicit-function-declaration]
277 | for_each_pci_msi_entry(entry, pdev) {
| ^~~~~~~~~~~~~~~~~~~~~~
| for_each_msi_entry
arch/x86/hyperv/irqdomain.c:277:37: error: expected ';' before '{' token
277 | for_each_pci_msi_entry(entry, pdev) {
| ^~
| ;
arch/x86/hyperv/irqdomain.c:268:6: warning: unused variable 'i' [-Wunused-variable]
268 | int i;
| ^
arch/x86/hyperv/irqdomain.c: At top level:
arch/x86/hyperv/irqdomain.c:298:22: error: 'msi_domain_set_affinity' undeclared here (not in a function); did you mean 'irq_can_set_affinity'?
298 | .irq_set_affinity = msi_domain_set_affinity,
| ^~~~~~~~~~~~~~~~~~~~~~~
| irq_can_set_affinity
arch/x86/hyperv/irqdomain.c:302:15: error: variable 'pci_msi_domain_ops' has initializer but incomplete type
302 | static struct msi_domain_ops pci_msi_domain_ops = {
| ^~~~~~~~~~~~~~
arch/x86/hyperv/irqdomain.c:303:3: error: 'struct msi_domain_ops' has no member named 'domain_free_irqs'
303 | .domain_free_irqs = hv_msi_domain_free_irqs,
| ^~~~~~~~~~~~~~~~
>> arch/x86/hyperv/irqdomain.c:303:22: warning: excess elements in struct initializer
303 | .domain_free_irqs = hv_msi_domain_free_irqs,
| ^~~~~~~~~~~~~~~~~~~~~~~
arch/x86/hyperv/irqdomain.c:303:22: note: (near initialization for 'pci_msi_domain_ops')
arch/x86/hyperv/irqdomain.c:304:3: error: 'struct msi_domain_ops' has no member named 'msi_prepare'
304 | .msi_prepare = pci_msi_prepare,
| ^~~~~~~~~~~
arch/x86/hyperv/irqdomain.c:304:18: error: 'pci_msi_prepare' undeclared here (not in a function)
304 | .msi_prepare = pci_msi_prepare,
| ^~~~~~~~~~~~~~~
arch/x86/hyperv/irqdomain.c:304:18: warning: excess elements in struct initializer
arch/x86/hyperv/irqdomain.c:304:18: note: (near initialization for 'pci_msi_domain_ops')
arch/x86/hyperv/irqdomain.c:307:15: error: variable 'hv_pci_msi_domain_info' has initializer but incomplete type
307 | static struct msi_domain_info hv_pci_msi_domain_info = {
| ^~~~~~~~~~~~~~~
arch/x86/hyperv/irqdomain.c:308:3: error: 'struct msi_domain_info' has no member named 'flags'
308 | .flags = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
| ^~~~~
arch/x86/hyperv/irqdomain.c:308:12: error: 'MSI_FLAG_USE_DEF_DOM_OPS' undeclared here (not in a function)
308 | .flags = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
| ^~~~~~~~~~~~~~~~~~~~~~~~
arch/x86/hyperv/irqdomain.c:308:39: error: 'MSI_FLAG_USE_DEF_CHIP_OPS' undeclared here (not in a function)
308 | .flags = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
| ^~~~~~~~~~~~~~~~~~~~~~~~~
arch/x86/hyperv/irqdomain.c:309:6: error: 'MSI_FLAG_PCI_MSIX' undeclared here (not in a function)
309 | MSI_FLAG_PCI_MSIX,
| ^~~~~~~~~~~~~~~~~
arch/x86/hyperv/irqdomain.c:308:12: warning: excess elements in struct initializer
308 | .flags = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
| ^~~~~~~~~~~~~~~~~~~~~~~~
arch/x86/hyperv/irqdomain.c:308:12: note: (near initialization for 'hv_pci_msi_domain_info')
arch/x86/hyperv/irqdomain.c:310:3: error: 'struct msi_domain_info' has no member named 'ops'
310 | .ops = &pci_msi_domain_ops,
| ^~~
arch/x86/hyperv/irqdomain.c:310:10: warning: excess elements in struct initializer
310 | .ops = &pci_msi_domain_ops,
| ^
arch/x86/hyperv/irqdomain.c:310:10: note: (near initialization for 'hv_pci_msi_domain_info')
arch/x86/hyperv/irqdomain.c:311:3: error: 'struct msi_domain_info' has no member named 'chip'
311 | .chip = &hv_pci_msi_controller,
| ^~~~
arch/x86/hyperv/irqdomain.c:311:11: warning: excess elements in struct initializer
311 | .chip = &hv_pci_msi_controller,
| ^
arch/x86/hyperv/irqdomain.c:311:11: note: (near initialization for 'hv_pci_msi_domain_info')
arch/x86/hyperv/irqdomain.c:312:3: error: 'struct msi_domain_info' has no member named 'handler'
312 | .handler = handle_edge_irq,
| ^~~~~~~
arch/x86/hyperv/irqdomain.c:312:13: warning: excess elements in struct initializer
312 | .handler = handle_edge_irq,
| ^~~~~~~~~~~~~~~
arch/x86/hyperv/irqdomain.c:312:13: note: (near initialization for 'hv_pci_msi_domain_info')
arch/x86/hyperv/irqdomain.c:313:3: error: 'struct msi_domain_info' has no member named 'handler_name'
313 | .handler_name = "edge",
| ^~~~~~~~~~~~
arch/x86/hyperv/irqdomain.c:313:18: warning: excess elements in struct initializer
313 | .handler_name = "edge",
| ^~~~~~
arch/x86/hyperv/irqdomain.c:313:18: note: (near initialization for 'hv_pci_msi_domain_info')
>> arch/x86/hyperv/irqdomain.c:316:28: warning: no previous prototype for 'hv_create_pci_msi_domain' [-Wmissing-prototypes]
316 | struct irq_domain * __init hv_create_pci_msi_domain(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~
arch/x86/hyperv/irqdomain.c: In function 'hv_create_pci_msi_domain':
arch/x86/hyperv/irqdomain.c:321:7: error: implicit declaration of function 'irq_domain_alloc_named_fwnode' [-Werror=implicit-function-declaration]
321 | fn = irq_domain_alloc_named_fwnode("HV-PCI-MSI");
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> arch/x86/hyperv/irqdomain.c:321:5: warning: assignment to 'struct fwnode_handle *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
321 | fn = irq_domain_alloc_named_fwnode("HV-PCI-MSI");
| ^
arch/x86/hyperv/irqdomain.c:323:7: error: implicit declaration of function 'pci_msi_create_irq_domain'; did you mean 'pci_msi_get_device_domain'? [-Werror=implicit-function-declaration]
323 | d = pci_msi_create_irq_domain(fn, &hv_pci_msi_domain_info, x86_vector_domain);
| ^~~~~~~~~~~~~~~~~~~~~~~~~
| pci_msi_get_device_domain
arch/x86/hyperv/irqdomain.c:323:62: error: 'x86_vector_domain' undeclared (first use in this function)
323 | d = pci_msi_create_irq_domain(fn, &hv_pci_msi_domain_info, x86_vector_domain);
| ^~~~~~~~~~~~~~~~~
arch/x86/hyperv/irqdomain.c:323:62: note: each undeclared identifier is reported only once for each function it appears in
arch/x86/hyperv/irqdomain.c: At top level:
arch/x86/hyperv/irqdomain.c:302:30: error: storage size of 'pci_msi_domain_ops' isn't known
302 | static struct msi_domain_ops pci_msi_domain_ops = {
| ^~~~~~~~~~~~~~~~~~
arch/x86/hyperv/irqdomain.c:307:31: error: storage size of 'hv_pci_msi_domain_info' isn't known
307 | static struct msi_domain_info hv_pci_msi_domain_info = {
| ^~~~~~~~~~~~~~~~~~~~~~
arch/x86/hyperv/irqdomain.c:227:13: warning: 'hv_teardown_msi_irq_common' defined but not used [-Wunused-function]
227 | static void hv_teardown_msi_irq_common(struct pci_dev *dev, struct msi_desc *msidesc, int irq)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +146 arch/x86/hyperv/irqdomain.c
133
134 static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry);
135 static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
136 {
137 struct msi_desc *msidesc;
138 struct pci_dev *dev;
139 struct hv_interrupt_entry out_entry, *stored_entry;
140 struct irq_cfg *cfg = irqd_cfg(data);
141 struct cpumask *affinity;
142 int cpu, vcpu;
143 u16 status;
144
145 msidesc = irq_data_get_msi_desc(data);
> 146 dev = msi_desc_to_pci_dev(msidesc);
147
148 if (!cfg) {
149 pr_debug("%s: cfg is NULL", __func__);
150 return;
151 }
152
153 affinity = irq_data_get_effective_affinity_mask(data);
154 cpu = cpumask_first_and(affinity, cpu_online_mask);
155 vcpu = hv_cpu_number_to_vp_number(cpu);
156
157 if (data->chip_data) {
158 /*
159 * This interrupt is already mapped. Let's unmap first.
160 *
161 * We don't use retarget interrupt hypercalls here because
162 * Microsoft Hypervisor doesn't allow root to change the vector
163 * or specify VPs outside of the set that is initially used
164 * during mapping.
165 */
166 stored_entry = data->chip_data;
167 data->chip_data = NULL;
168
169 status = hv_unmap_msi_interrupt(dev, stored_entry);
170
171 kfree(stored_entry);
172
173 if (status != HV_STATUS_SUCCESS) {
174 pr_debug("%s: failed to unmap, status %d", __func__, status);
175 return;
176 }
177 }
178
179 stored_entry = kzalloc(sizeof(*stored_entry), GFP_ATOMIC);
180 if (!stored_entry) {
181 pr_debug("%s: failed to allocate chip data\n", __func__);
182 return;
183 }
184
185 status = hv_map_msi_interrupt(dev, vcpu, cfg->vector, &out_entry);
186 if (status != HV_STATUS_SUCCESS) {
187 kfree(stored_entry);
188 return;
189 }
190
191 *stored_entry = out_entry;
192 data->chip_data = stored_entry;
193 entry_to_msi_msg(&out_entry, msg);
194
195 return;
196 }
197
198 static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *old_entry)
199 {
200 unsigned long flags;
201 struct hv_input_unmap_device_interrupt *input;
202 struct hv_interrupt_entry *intr_entry;
203 u16 status;
204
205 local_irq_save(flags);
206 input = *this_cpu_ptr(hyperv_pcpu_input_arg);
207
208 memset(input, 0, sizeof(*input));
209 intr_entry = &input->interrupt_entry;
210 input->partition_id = hv_current_partition_id;
211 input->device_id = id;
212 *intr_entry = *old_entry;
213
214 status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_INTERRUPT, 0, 0, input, NULL) &
215 HV_HYPERCALL_RESULT_MASK;
216 local_irq_restore(flags);
217
218 return status;
219 }
220
221 static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry)
222 {
223 return hv_unmap_interrupt(hv_build_pci_dev_id(dev).as_uint64, old_entry)
224 & HV_HYPERCALL_RESULT_MASK;
225 }
226
227 static void hv_teardown_msi_irq_common(struct pci_dev *dev, struct msi_desc *msidesc, int irq)
228 {
229 u16 status;
230 struct hv_interrupt_entry old_entry;
231 struct irq_desc *desc;
232 struct irq_data *data;
233 struct msi_msg msg;
234
235 desc = irq_to_desc(irq);
236 if (!desc) {
237 pr_debug("%s: no irq desc\n", __func__);
238 return;
239 }
240
241 data = &desc->irq_data;
242 if (!data) {
243 pr_debug("%s: no irq data\n", __func__);
244 return;
245 }
246
247 if (!data->chip_data) {
248 pr_debug("%s: no chip data\n!", __func__);
249 return;
250 }
251
252 old_entry = *(struct hv_interrupt_entry *)data->chip_data;
253 entry_to_msi_msg(&old_entry, &msg);
254
255 kfree(data->chip_data);
256 data->chip_data = NULL;
257
258 status = hv_unmap_msi_interrupt(dev, &old_entry);
259
260 if (status != HV_STATUS_SUCCESS) {
261 pr_err("%s: hypercall failed, status %d\n", __func__, status);
262 return;
263 }
264 }
265
266 static void hv_msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
267 {
268 int i;
269 struct msi_desc *entry;
270 struct pci_dev *pdev;
271
272 if (WARN_ON_ONCE(!dev_is_pci(dev)))
273 return;
274
275 pdev = to_pci_dev(dev);
276
277 for_each_pci_msi_entry(entry, pdev) {
278 if (entry->irq) {
279 for (i = 0; i < entry->nvec_used; i++) {
280 hv_teardown_msi_irq_common(pdev, entry, entry->irq + i);
281 irq_domain_free_irqs(entry->irq + i, 1);
282 }
283 }
284 }
285 }
286
287 /*
288 * IRQ Chip for MSI PCI/PCI-X/PCI-Express Devices,
289 * which implement the MSI or MSI-X Capability Structure.
290 */
291 static struct irq_chip hv_pci_msi_controller = {
292 .name = "HV-PCI-MSI",
293 .irq_unmask = pci_msi_unmask_irq,
294 .irq_mask = pci_msi_mask_irq,
295 .irq_ack = irq_chip_ack_parent,
296 .irq_retrigger = irq_chip_retrigger_hierarchy,
297 .irq_compose_msi_msg = hv_irq_compose_msi_msg,
298 .irq_set_affinity = msi_domain_set_affinity,
299 .flags = IRQCHIP_SKIP_SET_WAKE,
300 };
301
302 static struct msi_domain_ops pci_msi_domain_ops = {
> 303 .domain_free_irqs = hv_msi_domain_free_irqs,
304 .msi_prepare = pci_msi_prepare,
305 };
306
307 static struct msi_domain_info hv_pci_msi_domain_info = {
308 .flags = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
309 MSI_FLAG_PCI_MSIX,
310 .ops = &pci_msi_domain_ops,
311 .chip = &hv_pci_msi_controller,
312 .handler = handle_edge_irq,
313 .handler_name = "edge",
314 };
315
> 316 struct irq_domain * __init hv_create_pci_msi_domain(void)
317 {
318 struct irq_domain *d = NULL;
319 struct fwnode_handle *fn;
320
> 321 fn = irq_domain_alloc_named_fwnode("HV-PCI-MSI");
322 if (fn)
323 d = pci_msi_create_irq_domain(fn, &hv_pci_msi_domain_info, x86_vector_domain);
324
325 /* No point in going further if we can't get an irq domain */
326 BUG_ON(!d);
327
328 return d;
329 }
330
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
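The comment in hv_irq_compose_msi_msg() quoted above says that an already mapped interrupt is re-targeted by unmapping it and mapping it again, rather than with a retarget hypercall. The control flow can be sketched in plain userspace C as below. This is only an illustration under stated assumptions: struct intr_entry, struct irq_state, stub_map() and stub_unmap() are hypothetical stand-ins for struct hv_interrupt_entry, the irq_data chip_data pointer and the real hypercall wrappers, none of which are reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the hypervisor's interrupt entry. */
struct intr_entry {
	uint32_t vcpu;
	uint32_t vector;
};

/* Per-interrupt state; chip_data is NULL while nothing is mapped. */
struct irq_state {
	struct intr_entry *chip_data;
};

/* Counters so a caller can observe how often each stub ran. */
static int map_calls;
static int unmap_calls;

/* Stub for the "map" hypercall: records the target and succeeds. */
static int stub_map(uint32_t vcpu, uint32_t vector, struct intr_entry *out)
{
	map_calls++;
	out->vcpu = vcpu;
	out->vector = vector;
	return 0;
}

/* Stub for the "unmap" hypercall: always succeeds. */
static int stub_unmap(const struct intr_entry *old)
{
	(void)old;
	unmap_calls++;
	return 0;
}

/* Unmap any existing mapping first, then create a fresh one. */
int retarget_interrupt(struct irq_state *st, uint32_t vcpu, uint32_t vector)
{
	struct intr_entry *stored;
	struct intr_entry out;
	int status;

	assert(st != NULL);

	if (st->chip_data) {
		/* Detach the old entry before calling out, as the patch does. */
		stored = st->chip_data;
		st->chip_data = NULL;
		status = stub_unmap(stored);
		free(stored);
		if (status != 0)
			return status;
	}

	stored = calloc(1, sizeof(*stored));
	if (!stored)
		return -1;

	status = stub_map(vcpu, vector, &out);
	if (status != 0) {
		free(stored);
		return status;
	}

	/* Remember what was mapped so it can be unmapped later. */
	*stored = out;
	st->chip_data = stored;
	return 0;
}
```

Calling retarget_interrupt() twice on the same irq_state results in two map calls and one unmap call, and the stored entry always describes the most recent target.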
Hi Wei,
I love your patch! Perhaps something to improve:
[auto build test WARNING on tip/x86/core]
[also build test WARNING on asm-generic/master iommu/next tip/timers/core pci/next linus/master v5.10-rc5]
[cannot apply to next-20201124]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Wei-Liu/Introducing-Linux-root-partition-support-for-Microsoft-Hypervisor/20201125-011026
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 238c91115cd05c71447ea071624a4c9fe661f970
config: x86_64-randconfig-a003-20201125 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 77e98eaee2e8d4b9b297b66fda5b1e51e2a69999)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install x86_64 cross compiling tool for clang build
# apt-get install binutils-x86-64-linux-gnu
# https://github.com/0day-ci/linux/commit/ae7533bcd9667c0f23b545d941d3c68460f91ea2
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Wei-Liu/Introducing-Linux-root-partition-support-for-Microsoft-Hypervisor/20201125-011026
git checkout ae7533bcd9667c0f23b545d941d3c68460f91ea2
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
arch/x86/hyperv/irqdomain.c:303:3: error: field designator 'domain_free_irqs' does not refer to any field in type 'struct msi_domain_ops'
.domain_free_irqs = hv_msi_domain_free_irqs,
^
>> arch/x86/hyperv/irqdomain.c:316:28: warning: no previous prototype for function 'hv_create_pci_msi_domain' [-Wmissing-prototypes]
struct irq_domain * __init hv_create_pci_msi_domain(void)
^
arch/x86/hyperv/irqdomain.c:316:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
struct irq_domain * __init hv_create_pci_msi_domain(void)
^
static
1 warning and 1 error generated.
vim +/hv_create_pci_msi_domain +316 arch/x86/hyperv/irqdomain.c
301
302 static struct msi_domain_ops pci_msi_domain_ops = {
> 303 .domain_free_irqs = hv_msi_domain_free_irqs,
304 .msi_prepare = pci_msi_prepare,
305 };
306
307 static struct msi_domain_info hv_pci_msi_domain_info = {
308 .flags = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
309 MSI_FLAG_PCI_MSIX,
310 .ops = &pci_msi_domain_ops,
311 .chip = &hv_pci_msi_controller,
312 .handler = handle_edge_irq,
313 .handler_name = "edge",
314 };
315
> 316 struct irq_domain * __init hv_create_pci_msi_domain(void)
317 {
318 struct irq_domain *d = NULL;
319 struct fwnode_handle *fn;
320
321 fn = irq_domain_alloc_named_fwnode("HV-PCI-MSI");
322 if (fn)
323 d = pci_msi_create_irq_domain(fn, &hv_pci_msi_domain_info, x86_vector_domain);
324
325 /* No point in going further if we can't get an irq domain */
326 BUG_ON(!d);
327
328 return d;
329 }
330
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
Hi Wei,
I love your patch! Perhaps something to improve:
[auto build test WARNING on tip/x86/core]
[also build test WARNING on asm-generic/master iommu/next tip/timers/core pci/next linus/master v5.10-rc5]
[cannot apply to next-20201124]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Wei-Liu/Introducing-Linux-root-partition-support-for-Microsoft-Hypervisor/20201125-011026
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 238c91115cd05c71447ea071624a4c9fe661f970
config: x86_64-randconfig-a003-20201125 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 77e98eaee2e8d4b9b297b66fda5b1e51e2a69999)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install x86_64 cross compiling tool for clang build
# apt-get install binutils-x86-64-linux-gnu
# https://github.com/0day-ci/linux/commit/591ad2444b6b7d63ab24ce8f16a4e367085bbb5d
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Wei-Liu/Introducing-Linux-root-partition-support-for-Microsoft-Hypervisor/20201125-011026
git checkout 591ad2444b6b7d63ab24ce8f16a4e367085bbb5d
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
arch/x86/hyperv/irqdomain.c:305:3: error: field designator 'domain_free_irqs' does not refer to any field in type 'struct msi_domain_ops'
.domain_free_irqs = hv_msi_domain_free_irqs,
^
arch/x86/hyperv/irqdomain.c:318:28: warning: no previous prototype for function 'hv_create_pci_msi_domain' [-Wmissing-prototypes]
struct irq_domain * __init hv_create_pci_msi_domain(void)
^
arch/x86/hyperv/irqdomain.c:318:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
struct irq_domain * __init hv_create_pci_msi_domain(void)
^
static
>> arch/x86/hyperv/irqdomain.c:499:6: warning: no previous prototype for function 'hv_ioapic_ack_level' [-Wmissing-prototypes]
void hv_ioapic_ack_level(struct irq_data *irq_data)
^
arch/x86/hyperv/irqdomain.c:499:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void hv_ioapic_ack_level(struct irq_data *irq_data)
^
static
>> arch/x86/hyperv/irqdomain.c:526:5: warning: no previous prototype for function 'hv_acpi_register_gsi' [-Wmissing-prototypes]
int hv_acpi_register_gsi(struct device *dev, u32 gsi, int trigger, int polarity)
^
arch/x86/hyperv/irqdomain.c:526:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int hv_acpi_register_gsi(struct device *dev, u32 gsi, int trigger, int polarity)
^
static
>> arch/x86/hyperv/irqdomain.c:550:6: warning: no previous prototype for function 'hv_acpi_unregister_gsi' [-Wmissing-prototypes]
void hv_acpi_unregister_gsi(u32 gsi)
^
arch/x86/hyperv/irqdomain.c:550:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void hv_acpi_unregister_gsi(u32 gsi)
^
static
4 warnings and 1 error generated.
vim +/hv_ioapic_ack_level +499 arch/x86/hyperv/irqdomain.c
498
> 499 void hv_ioapic_ack_level(struct irq_data *irq_data)
500 {
501 /*
502 * Per an email exchange with the Hyper-V team, all that is needed is a
503 * write to the LAPIC's EOI register. They don't support directed EOI to
504 * the IO-APIC. Hyper-V handles it for us.
505 */
506 apic_ack_irq(irq_data);
507 }
508
509 struct irq_chip hv_ioapic_chip __read_mostly = {
510 .name = "HV-IO-APIC",
511 .irq_startup = hv_ioapic_startup_irq,
512 .irq_mask = hv_ioapic_mask_irq,
513 .irq_unmask = hv_ioapic_unmask_irq,
514 .irq_ack = irq_chip_ack_parent,
515 .irq_eoi = hv_ioapic_ack_level,
516 .irq_set_affinity = hv_ioapic_set_affinity,
517 .irq_retrigger = irq_chip_retrigger_hierarchy,
518 .irq_get_irqchip_state = ioapic_irq_get_chip_state,
519 .flags = IRQCHIP_SKIP_SET_WAKE,
520 };
521
522
523 int (*native_acpi_register_gsi)(struct device *dev, u32 gsi, int trigger, int polarity);
524 void (*native_acpi_unregister_gsi)(u32 gsi);
525
> 526 int hv_acpi_register_gsi(struct device *dev, u32 gsi, int trigger, int polarity)
527 {
528 int irq = gsi;
529
530 #ifdef CONFIG_X86_IO_APIC
531 irq = native_acpi_register_gsi(dev, gsi, trigger, polarity);
532 if (irq < 0) {
533 pr_err("native_acpi_register_gsi failed %d\n", irq);
534 return irq;
535 }
536
537 if (trigger) {
538 irq_set_status_flags(irq, IRQ_LEVEL);
539 irq_set_chip_and_handler_name(irq, &hv_ioapic_chip,
540 handle_fasteoi_irq, "ioapic-fasteoi");
541 } else {
542 irq_clear_status_flags(irq, IRQ_LEVEL);
543 irq_set_chip_and_handler_name(irq, &hv_ioapic_chip,
544 handle_edge_irq, "ioapic-edge");
545 }
546 #endif
547 return irq;
548 }
549
> 550 void hv_acpi_unregister_gsi(u32 gsi)
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
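For readers unfamiliar with the -Wmissing-prototypes warnings reported above: the compiler emits them when a non-static function is defined without a previously visible prototype. The two usual remedies are sketched below in standalone C. The widget_* names are hypothetical and are not part of the patch; which remedy suits hv_ioapic_ack_level() and friends depends on whether other files call them, which this thread does not settle.

```c
#include <assert.h>

/*
 * Remedy 1: a prototype, normally placed in a shared header
 * (a hypothetical widget.h), is visible before the definition.
 */
int widget_register(int id);

int widget_register(int id)
{
	/* Negative ids are rejected. */
	return id >= 0 ? id : -1;
}

/*
 * Remedy 2: a helper used only in this file is made static,
 * so no external prototype is expected.
 */
static int widget_clamp(int v)
{
	return v < 0 ? 0 : v;
}

int widget_register_clamped(int id);

int widget_register_clamped(int id)
{
	int clamped = widget_clamp(id);

	assert(clamped >= 0);
	return widget_register(clamped);
}
```

Built with -Wmissing-prototypes, this file should compile without that warning, because every external definition is preceded by a declaration and the file-local helper is static.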
On Wed, Nov 25, 2020 at 1:46 AM Wei Liu <[email protected]> wrote:
>
> We are about to implement an irqchip for IO-APIC when Linux runs as root
> on Microsoft Hypervisor. At the same time we would like to reuse
> existing code as much as possible.
>
> Move mp_chip_data to io_apic.h and make a few helper functions
> non-static.
> +struct mp_chip_data {
> + struct list_head irq_2_pin;
> + struct IO_APIC_route_entry entry;
> + int trigger;
> + int polarity;
> + u32 count;
> + bool isa_irq;
> +};
Since I see only this patch I am puzzled why you need to have this in
the header?
Maybe a couple of words in the commit message to elaborate?
--
With Best Regards,
Andy Shevchenko
On Tue, Nov 24, 2020 at 06:05:27PM +0000, David Woodhouse wrote:
> On Tue, 2020-11-24 at 17:07 +0000, Wei Liu wrote:
> > We will soon use the same structure to handle IO-APIC interrupts as
> > well. Introduce an enum to identify the source and a data structure for
> > IO-APIC RTE.
> >
> > While at it, update pci-hyperv.c to use the enum.
> >
> > No functional change.
> >
> > Signed-off-by: Wei Liu <[email protected]>
> > Acked-by: Rob Herring <[email protected]>
>
> The I/OAPIC is just a device for generating MSIs.
>
> Can you check if this renders your patch obsolete:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/apic&id=5d5a97133887b2dfd8e2ad0347c3a02cc7aaa0cb
David, thanks for your comment.
This patch merely copies the definitions from Microsoft Hypervisor. The
data structure is the exact one that is returned from the hypervisor.
The hypervisor doesn't return a pair of (addr,data). It translates
(addr,data) to IO-APIC RTE for the caller -- like what
ioapic_setup_msg_from_msi does in your patch.
I don't think your patch makes this patch obsolete.
Wei.
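To illustrate the kind of translation being discussed, the sketch below splits a compatibility-format x86 MSI (address, data) pair into the routing fields that an IO-APIC RTE also carries (destination, vector, delivery mode, trigger mode). The bit positions follow the architectural MSI format. struct msi_fields and msi_decode() are hypothetical names for this example; the layout of the entry actually returned by Microsoft Hypervisor is not shown in this thread and is not assumed here.

```c
#include <assert.h>
#include <stdint.h>

/* Routing fields carried by a compatibility-format x86 MSI message. */
struct msi_fields {
	uint8_t dest_id;       /* address bits 19:12 */
	uint8_t redir_hint;    /* address bit 3 */
	uint8_t dest_mode;     /* address bit 2: 0 = physical, 1 = logical */
	uint8_t vector;        /* data bits 7:0 */
	uint8_t delivery_mode; /* data bits 10:8 */
	uint8_t trigger_mode;  /* data bit 15: 0 = edge, 1 = level */
};

/* Returns 0 on success, -1 if the address is outside the 0xFEExxxxx window. */
int msi_decode(uint32_t addr_lo, uint32_t data, struct msi_fields *out)
{
	assert(out != 0);

	if ((addr_lo & 0xFFF00000u) != 0xFEE00000u)
		return -1;

	out->dest_id = (addr_lo >> 12) & 0xFF;
	out->redir_hint = (addr_lo >> 3) & 0x1;
	out->dest_mode = (addr_lo >> 2) & 0x1;
	out->vector = data & 0xFF;
	out->delivery_mode = (data >> 8) & 0x7;
	out->trigger_mode = (data >> 15) & 0x1;
	return 0;
}
```

For example, address 0xFEE01000 with data 0x0031 decodes to destination 1, vector 0x31, fixed delivery and edge trigger.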
On Wed, Nov 25, 2020 at 12:26:12PM +0200, Andy Shevchenko wrote:
> On Wed, Nov 25, 2020 at 1:46 AM Wei Liu <[email protected]> wrote:
> >
> > We are about to implement an irqchip for IO-APIC when Linux runs as root
> > on Microsoft Hypervisor. At the same time we would like to reuse
> > existing code as much as possible.
> >
> > Move mp_chip_data to io_apic.h and make a few helper functions
> > non-static.
>
> > +struct mp_chip_data {
> > + struct list_head irq_2_pin;
> > + struct IO_APIC_route_entry entry;
> > + int trigger;
> > + int polarity;
> > + u32 count;
> > + bool isa_irq;
> > +};
>
> Since I see only this patch I am puzzled why you need to have this in
> the header?
> Maybe a couple of words in the commit message to elaborate?
Andy, does the following answer your question?
"The chip_data stashed in IO-APIC's irq chip is mp_chip_data. The
implementation of Microsoft Hypervisor's IO-APIC irqdomain would like to
manipulate that data structure, so move it to io_apic.h as well."
If that's good enough, I can add it to the commit message.
Wei.
On Wed, Dec 02, 2020 at 02:11:07PM +0000, Wei Liu wrote:
> On Wed, Nov 25, 2020 at 12:26:12PM +0200, Andy Shevchenko wrote:
> > On Wed, Nov 25, 2020 at 1:46 AM Wei Liu <[email protected]> wrote:
> > >
> > > We are about to implement an irqchip for IO-APIC when Linux runs as root
> > > on Microsoft Hypervisor. At the same time we would like to reuse
> > > existing code as much as possible.
> > >
> > > Move mp_chip_data to io_apic.h and make a few helper functions
> > > non-static.
> >
> > > +struct mp_chip_data {
> > > + struct list_head irq_2_pin;
> > > + struct IO_APIC_route_entry entry;
> > > + int trigger;
> > > + int polarity;
> > > + u32 count;
> > > + bool isa_irq;
> > > +};
> >
> > Since I see only this patch I am puzzled why you need to have this in
> > the header?
> > Maybe a couple of words in the commit message to elaborate?
>
> Andy, does the following answer your question?
>
> "The chip_data stashed in IO-APIC's irq chip is mp_chip_data. The
> implementation of Microsoft Hypervisor's IO-APIC irqdomain would like to
> manipulate that data structure, so move it to io_apic.h as well."
At least it sheds some light, thanks.
> If that's good enough, I can add it to the commit message.
It's good for a start, but I think you will have to wait and see what Thomas
and other relevant people say.
--
With Best Regards,
Andy Shevchenko
On 24.11.20 18:07, Wei Liu wrote:
Hi,
> There will be a subsequent patch series to provide a
> device node (/dev/mshv) such that userspace programs can create and run virtual
> machines.
Any chance of using the already existing /dev/kvm interface ?
--mtx
--
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your
GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287
On Wed, Dec 02, 2020 at 08:51:38PM +0100, Enrico Weigelt, metux IT consult wrote:
> On 24.11.20 18:07, Wei Liu wrote:
>
> Hi,
>
> > There will be a subsequent patch series to provide a
> > device node (/dev/mshv) such that userspace programs can create and run virtual
> > machines.
>
> Any chance of using the already existing /dev/kvm interface ?
>
I don't follow. Do you mean reusing /dev/kvm but with a different set of
APIs underneath? I don't think that will work.
In any case, the first version of /dev/mshv was posted a few days ago
[0]. While we've chosen to follow closely KVM's model, Microsoft
Hypervisor has its own APIs.
Wei.
0: https://lore.kernel.org/linux-hyperv/1605918637-12192-1-git-send-email-nunodasneves@linux.microsoft.com/
On 03.12.20 00:22, Wei Liu wrote:
Hi,
> I don't follow. Do you mean reusing /dev/kvm but with a different set of
> APIs underneath? I don't think that will work.
My idea was using the same uapi for both hypervisors, so that we can use
the same userlands for both.
Are the semantics so different that we can't provide the same API?
> In any case, the first version of /dev/mshv was posted a few days ago
> [0]. While we've chosen to follow closely KVM's model, Microsoft
> Hypervisor has its own APIs.
I have to admit, I don't know much about hyperv - what are the main
differences (from userland perspective) between hyperv and kvm ?
--mtx
On Tue, Dec 15, 2020 at 04:25:03PM +0100, Enrico Weigelt, metux IT consult wrote:
> On 03.12.20 00:22, Wei Liu wrote:
>
> Hi,
>
> > I don't follow. Do you mean reusing /dev/kvm but with a different set of
> > APIs underneath? I don't think that will work.
>
> My idea was using the same uapi for both hypervisors, so that we can use
> the same userlands for both.
>
> Are the semantics so different that we can't provide the same API?
We can provide some similar APIs for ease of porting, but can't provide
1:1 mappings. By definition KVM and MSHV are two different things. There
is no goal to make one ABI / API compatible with the other.
>
> > In any case, the first version of /dev/mshv was posted a few days ago
> > [0]. While we've chosen to follow closely KVM's model, Microsoft
> > Hypervisor has its own APIs.
>
> I have to admit, I don't know much about hyperv - what are the main
> differences (from userland perspective) between hyperv and kvm ?
>
They have different architectures and hence different ways of dealing with
things. The differences will inevitably make their way to userland.
Without going into all the details, you can have a look at how Xen and KVM
differ architecturally. That will give you a pretty good idea of the
differences.
Wei.
On Tue, Nov 24, 2020 at 05:07:28PM +0000, Wei Liu wrote:
> This makes the name match Hyper-V TLFS.
>
> Signed-off-by: Wei Liu <[email protected]>
> Reviewed-by: Vitaly Kuznetsov <[email protected]>
This patch is trivially correct.
I will apply it to hyperv-next to reduce the length of this series.
Wei.
> ---
> include/asm-generic/hyperv-tlfs.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index e73a11850055..e6903589a82a 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -88,7 +88,7 @@
> #define HV_CONNECT_PORT BIT(7)
> #define HV_ACCESS_STATS BIT(8)
> #define HV_DEBUGGING BIT(11)
> -#define HV_CPU_POWER_MANAGEMENT BIT(12)
> +#define HV_CPU_MANAGEMENT BIT(12)
>
>
> /*
> --
> 2.20.1
>
On Tue, 2020-12-15 at 16:42 +0000, Wei Liu wrote:
> On Tue, Dec 15, 2020 at 04:25:03PM +0100, Enrico Weigelt, metux IT consult wrote:
> > On 03.12.20 00:22, Wei Liu wrote:
> >
> > Hi,
> >
> > > I don't follow. Do you mean reusing /dev/kvm but with a different set of
> > > APIs underneath? I don't think that will work.
> >
> > My idea was using the same uapi for both hypervisors, so that we can use
> > the same userlands for both.
> >
> > Are the semantics so different that we can't provide the same API?
>
> We can provide some similar APIs for ease of porting, but can't provide
> 1:1 mappings. By definition KVM and MSHV are two different things. There
> is no goal to make one ABI / API compatible with the other.
I'm not sure I understand.
KVM is the Linux userspace API for virtualisation. It is designed to be
versatile enough that it can support multiple implementations across
multiple architectures, including both AMD SVM and Intel VMX on x86.
Are you saying that KVM has *failed* to be versatile enough that this
can be "just another implementation"? What are the problems? Is it
unfixable?
On Tue, Feb 02, 2021 at 10:40:43AM +0000, David Woodhouse wrote:
> On Tue, 2020-12-15 at 16:42 +0000, Wei Liu wrote:
> > On Tue, Dec 15, 2020 at 04:25:03PM +0100, Enrico Weigelt, metux IT consult wrote:
> > > On 03.12.20 00:22, Wei Liu wrote:
> > >
> > > Hi,
> > >
> > > > I don't follow. Do you mean reusing /dev/kvm but with a different set of
> > > > APIs underneath? I don't think that will work.
> > >
> > > My idea was using the same uapi for both hypervisors, so that we can use
> > > the same userlands for both.
> > >
> > > Are the semantics so different that we can't provide the same API?
> >
> > We can provide some similar APIs for ease of porting, but can't provide
> > 1:1 mappings. By definition KVM and MSHV are two different things. There
> > is no goal to make one ABI / API compatible with the other.
>
> I'm not sure I understand.
>
> KVM is the Linux userspace API for virtualisation. It is designed to be
> versatile enough that it can support multiple implementations across
> multiple architectures, including both AMD SVM and Intel VMX on x86.
>
> Are you saying that KVM has *failed* to be versatile enough that this
> can be "just another implementation"? What are the problems? Is it
> unfixable?
The KVM APIs are good enough to cover guest life cycle management. To
make MSHV another implementation of the KVM APIs, we essentially need to
massage the data structures both ways.
There is also the aspect of controlling the hypervisor, which affects the
whole virtualization system. KVM APIs don't handle that. We would need
/dev/mshv for that purpose alone.
There is another aspect: Microsoft Hypervisor specific features and
enhancements, which aren't applicable to KVM. Features that make sense for
a specific type-1 hypervisor may not make sense for KVM (a type-2
hypervisor). We have no intention to pollute KVM APIs with those.
All in all, the latter two points make /dev/mshv a more viable route
in the long run.
Wei.