The Hyper-V root partition needs to map MSI interrupts differently in the
case of a nested setup, and this series addresses that case.
The following series implements...
---
Stanislav Kinsburskii (2):
x86/hyperv: Expose a helper to map PCI interrupts
PCI: hv: Deal with nested MSHV setup
arch/x86/hyperv/irqdomain.c | 40 +++++++++++++++++++++++++----------
arch/x86/include/asm/mshyperv.h | 2 ++
drivers/pci/controller/pci-hyperv.c | 11 +++++++++-
3 files changed, 40 insertions(+), 13 deletions(-)
From: Stanislav Kinsburskii <[email protected]>
This patch moves a part of the currently internal logic into the new
hv_map_msi_interrupt function and makes it a globally available helper,
which will be used to map PCI interrupts in the case of the root partition.
Signed-off-by: Stanislav Kinsburskii <[email protected]>
CC: "K. Y. Srinivasan" <[email protected]>
CC: Haiyang Zhang <[email protected]>
CC: Wei Liu <[email protected]>
CC: Dexuan Cui <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Borislav Petkov <[email protected]>
CC: Dave Hansen <[email protected]>
CC: [email protected]
CC: "H. Peter Anvin" <[email protected]>
CC: [email protected]
CC: [email protected]
---
arch/x86/hyperv/irqdomain.c | 40 +++++++++++++++++++++++++++------------
arch/x86/include/asm/mshyperv.h | 2 ++
2 files changed, 30 insertions(+), 12 deletions(-)
diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
index 42c70d28ef27..fd9c487726e3 100644
--- a/arch/x86/hyperv/irqdomain.c
+++ b/arch/x86/hyperv/irqdomain.c
@@ -169,13 +169,35 @@ static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
return dev_id;
}
-static int hv_map_msi_interrupt(struct pci_dev *dev, int cpu, int vector,
- struct hv_interrupt_entry *entry)
+/**
+ * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
+ * @data: Describes the IRQ
+ * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
+ *
+ * Map the IRQ in the hypervisor by issuing a MAP_DEVICE_INTERRUPT hypercall.
+ */
+int hv_map_msi_interrupt(struct irq_data *data,
+ struct hv_interrupt_entry *out_entry)
{
- union hv_device_id device_id = hv_build_pci_dev_id(dev);
+ struct msi_desc *msidesc;
+ struct pci_dev *dev;
+ union hv_device_id device_id;
+ struct hv_interrupt_entry dummy, *entry;
+ struct irq_cfg *cfg = irqd_cfg(data);
+ const cpumask_t *affinity;
+ int cpu, vector;
+
+ msidesc = irq_data_get_msi_desc(data);
+ dev = msi_desc_to_pci_dev(msidesc);
+ device_id = hv_build_pci_dev_id(dev);
+ affinity = irq_data_get_effective_affinity_mask(data);
+ cpu = cpumask_first_and(affinity, cpu_online_mask);
+ entry = out_entry ? out_entry : &dummy;
+ vector = cfg->vector;
return hv_map_interrupt(device_id, false, cpu, vector, entry);
}
+EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi_msg *msg)
{
@@ -190,10 +212,8 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
struct msi_desc *msidesc;
struct pci_dev *dev;
- struct hv_interrupt_entry out_entry, *stored_entry;
+ struct hv_interrupt_entry *stored_entry;
struct irq_cfg *cfg = irqd_cfg(data);
- const cpumask_t *affinity;
- int cpu;
u64 status;
msidesc = irq_data_get_msi_desc(data);
@@ -204,9 +224,6 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
return;
}
- affinity = irq_data_get_effective_affinity_mask(data);
- cpu = cpumask_first_and(affinity, cpu_online_mask);
-
if (data->chip_data) {
/*
* This interrupt is already mapped. Let's unmap first.
@@ -235,15 +252,14 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
return;
}
- status = hv_map_msi_interrupt(dev, cpu, cfg->vector, &out_entry);
+ status = hv_map_msi_interrupt(data, stored_entry);
if (status != HV_STATUS_SUCCESS) {
kfree(stored_entry);
return;
}
- *stored_entry = out_entry;
data->chip_data = stored_entry;
- entry_to_msi_msg(&out_entry, msg);
+ entry_to_msi_msg(data->chip_data, msg);
return;
}
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 4c4c0ec3b62e..aa0e83acacbd 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -203,6 +203,8 @@ static inline void hv_apic_init(void) {}
struct irq_domain *hv_create_pci_msi_domain(void);
+int hv_map_msi_interrupt(struct irq_data *data,
+ struct hv_interrupt_entry *out_entry);
int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
struct hv_interrupt_entry *entry);
int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
From: Stanislav Kinsburskii <[email protected]>
Running the Microsoft hypervisor nested (i.e., on top of another Microsoft
hypervisor) imposes a different requirement on the PCI-hyperv controller.
In this setup, the interrupt will first come to the nested (L1) hypervisor
from the hypervisor running on bare metal (L0), and then the L1 hypervisor
will deliver the interrupt to the appropriate CPU of the nested root
partition.
Thus, instead of issuing the RETARGET hypercall to the L0 hypervisor, the
MAP_DEVICE_INTERRUPT hypercall should be issued to the L1 hypervisor to
complete the interrupt setup.
Signed-off-by: Stanislav Kinsburskii <[email protected]>
CC: "K. Y. Srinivasan" <[email protected]>
CC: Haiyang Zhang <[email protected]>
CC: Wei Liu <[email protected]>
CC: Dexuan Cui <[email protected]>
CC: Lorenzo Pieralisi <[email protected]>
CC: "Krzysztof Wilczyński" <[email protected]>
CC: Rob Herring <[email protected]>
CC: Bjorn Helgaas <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
---
drivers/pci/controller/pci-hyperv.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index f33370b75628..61bee8babad4 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1570,7 +1570,16 @@ static void hv_irq_mask(struct irq_data *data)
static void hv_irq_unmask(struct irq_data *data)
{
- hv_arch_irq_unmask(data);
+ if (hv_nested && hv_root_partition)
+ /*
+ * In case of the nested root partition, the nested hypervisor
+ * is taking care of interrupt remapping and thus the
+ * MAP_DEVICE_INTERRUPT hypercall is required instead of the
+ * RETARGET_INTERRUPT one.
+ */
+ (void)hv_map_msi_interrupt(data, NULL);
+ else
+ hv_arch_irq_unmask(data);
if (data->parent_data->chip->irq_unmask)
irq_chip_unmask_parent(data);
On Thu, Apr 06 2023 at 09:33, Stanislav Kinsburskii wrote:
> This patch moves
https://www.kernel.org/doc/html/latest/process/submitting-patches.html#submittingpatches
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
> a part of currently internal logic into the new
> hv_map_msi_interrupt function and makes it globally available helper,
> which will be used to map PCI interrupts in case of root partition.
> -static int hv_map_msi_interrupt(struct pci_dev *dev, int cpu, int vector,
> - struct hv_interrupt_entry *entry)
> +/**
> + * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
So if you need to put "" on Map then maybe your function is
misnomed. Either it maps or it does not, right?
> + * @data: Describes the IRQ
> + * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
> + *
> + * Map the IRQ in the hypervisor by issuing a MAP_DEVICE_INTERRUPT hypercall.
> + */
> +int hv_map_msi_interrupt(struct irq_data *data,
> + struct hv_interrupt_entry *out_entry)
> {
> - union hv_device_id device_id = hv_build_pci_dev_id(dev);
> + struct msi_desc *msidesc;
> + struct pci_dev *dev;
> + union hv_device_id device_id;
> + struct hv_interrupt_entry dummy, *entry;
> + struct irq_cfg *cfg = irqd_cfg(data);
> + const cpumask_t *affinity;
> + int cpu, vector;
> +
> + msidesc = irq_data_get_msi_desc(data);
> + dev = msi_desc_to_pci_dev(msidesc);
> + device_id = hv_build_pci_dev_id(dev);
> + affinity = irq_data_get_effective_affinity_mask(data);
> + cpu = cpumask_first_and(affinity, cpu_online_mask);
The effective affinity mask of MSI interrupts consists only of online
CPUs, to be accurate: it has exactly one online CPU set.
But even if it would have only offline CPUs then the result would be:
cpu = nr_cpu_ids
which is definitely invalid. While a disabled vector targeted to an
offline CPU is not necessarily invalid.
Thanks,
tglx
On Thu, Apr 13, 2023 at 03:51:09PM +0200, Thomas Gleixner wrote:
> On Thu, Apr 06 2023 at 09:33, Stanislav Kinsburskii wrote:
> > This patch moves
>
> https://www.kernel.org/doc/html/latest/process/submitting-patches.html#submittingpatches
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
>
Thanks. I'll elaborate on the rationale in the next revision.
> > a part of currently internal logic into the new
> > hv_map_msi_interrupt function and makes it globally available helper,
> > which will be used to map PCI interrupts in case of root partition.
>
> > -static int hv_map_msi_interrupt(struct pci_dev *dev, int cpu, int vector,
> > - struct hv_interrupt_entry *entry)
> > +/**
> > + * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
>
> So if you need to put "" on Map then maybe your function is
> misnomed. Either it maps or it does not, right?
>
Thanks, I'll remove the quotation marks in the next update.
> > + * @data: Describes the IRQ
> > + * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
> > + *
> > + * Map the IRQ in the hypervisor by issuing a MAP_DEVICE_INTERRUPT hypercall.
> > + */
> > +int hv_map_msi_interrupt(struct irq_data *data,
> > + struct hv_interrupt_entry *out_entry)
> > {
> > - union hv_device_id device_id = hv_build_pci_dev_id(dev);
> > + struct msi_desc *msidesc;
> > + struct pci_dev *dev;
> > + union hv_device_id device_id;
> > + struct hv_interrupt_entry dummy, *entry;
> > + struct irq_cfg *cfg = irqd_cfg(data);
> > + const cpumask_t *affinity;
> > + int cpu, vector;
> > +
> > + msidesc = irq_data_get_msi_desc(data);
> > + dev = msi_desc_to_pci_dev(msidesc);
> > + device_id = hv_build_pci_dev_id(dev);
> > + affinity = irq_data_get_effective_affinity_mask(data);
> > + cpu = cpumask_first_and(affinity, cpu_online_mask);
>
> The effective affinity mask of MSI interrupts consists only of online
> CPUs, to be accurate: it has exactly one online CPU set.
>
> But even if it would have only offline CPUs then the result would be:
>
> cpu = nr_cpu_ids
>
> which is definitely invalid. While a disabled vector targeted to an
> offline CPU is not necessarily invalid.
>
Thank you for diving into this logic.
Although this patch only tosses the code and doens't make any functional
changes, I guess if the fix for the the cpu is found has is required, it
has to be in a separated patch.
Would you mind to elaborate more of the problem(s)?
Do you mean that the result of cpumask_first_and has to be checked for not
being >= nr_cpus_ids?
Or do you mean there is no need to check the affinity against
cpu_online_mask at all ans we can simply take any first bit from the
effective affinity mask?
Also, could ou elaborate more on the disabled vector target to an
offline CPU? Is there any use case for such scenario (in this case we
might want to support it)?
I guess the goal of this code is to make sure that hypervisor on't be
configured to deliver MSI to an online CPU.
Thanks,
Stanislav
> Thanks,
>
> tglx
On Wed, Apr 12, 2023 at 09:19:51AM -0700, Stanislav Kinsburskii wrote:
> On Thu, Apr 13, 2023 at 03:51:09PM +0200, Thomas Gleixner wrote:
> > On Thu, Apr 06 2023 at 09:33, Stanislav Kinsburskii wrote:
> > > This patch moves
> >
> > https://www.kernel.org/doc/html/latest/process/submitting-patches.html#submittingpatches
> > https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
> >
>
> Thanks. I'll elaborate on the rationale in the next revision.
>
> > > a part of currently internal logic into the new
> > > hv_map_msi_interrupt function and makes it globally available helper,
> > > which will be used to map PCI interrupts in case of root partition.
> >
> > > -static int hv_map_msi_interrupt(struct pci_dev *dev, int cpu, int vector,
> > > - struct hv_interrupt_entry *entry)
> > > +/**
> > > + * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
> >
> > So if you need to put "" on Map then maybe your function is
> > misnomed. Either it maps or it does not, right?
> >
>
> Thanks, I'll remove the quotation marks in the next update.
>
> > > + * @data: Describes the IRQ
> > > + * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
> > > + *
> > > + * Map the IRQ in the hypervisor by issuing a MAP_DEVICE_INTERRUPT hypercall.
> > > + */
> > > +int hv_map_msi_interrupt(struct irq_data *data,
> > > + struct hv_interrupt_entry *out_entry)
> > > {
> > > - union hv_device_id device_id = hv_build_pci_dev_id(dev);
> > > + struct msi_desc *msidesc;
> > > + struct pci_dev *dev;
> > > + union hv_device_id device_id;
> > > + struct hv_interrupt_entry dummy, *entry;
> > > + struct irq_cfg *cfg = irqd_cfg(data);
> > > + const cpumask_t *affinity;
> > > + int cpu, vector;
> > > +
> > > + msidesc = irq_data_get_msi_desc(data);
> > > + dev = msi_desc_to_pci_dev(msidesc);
> > > + device_id = hv_build_pci_dev_id(dev);
> > > + affinity = irq_data_get_effective_affinity_mask(data);
> > > + cpu = cpumask_first_and(affinity, cpu_online_mask);
> >
> > The effective affinity mask of MSI interrupts consists only of online
> > CPUs, to be accurate: it has exactly one online CPU set.
> >
> > But even if it would have only offline CPUs then the result would be:
> >
> > cpu = nr_cpu_ids
> >
> > which is definitely invalid. While a disabled vector targeted to an
> > offline CPU is not necessarily invalid.
> >
>
> Thank you for diving into this logic.
>
> Although this patch only tosses the code and doens't make any functional
> changes, I guess if the fix for the the cpu is found has is required, it
> has to be in a separated patch.
>
> Would you mind to elaborate more of the problem(s)?
> Do you mean that the result of cpumask_first_and has to be checked for not
> being >= nr_cpus_ids?
> Or do you mean there is no need to check the affinity against
> cpu_online_mask at all ans we can simply take any first bit from the
> effective affinity mask?
>
> Also, could ou elaborate more on the disabled vector target to an
> offline CPU? Is there any use case for such scenario (in this case we
> might want to support it)?
>
> I guess the goal of this code is to make sure that hypervisor on't be
> configured to deliver MSI to an online CPU.
>
I'm sorry, there were a lot of typos.
Let me try again:
Thank you for diving into this logic.
Although this patch only moves the code around and doesn't make any
functional changes, I guess that if a fix for the used cpu id is required,
it has to be in a separate patch.
Would you mind elaborating more on the problem(s)?
Do you mean that the result of cpumask_first_and has to be checked for not
being >= nr_cpu_ids?
Or do you mean that there is no need to check the irq affinity against
cpu_online_mask at all, and we can simply take the first bit from the
effective affinity mask?
Also, could you elaborate more on the disabled vector targeting an
offline CPU? Is there any use case for such a scenario (in which case we
might want to support it)?
I guess the goal of this code is to make sure that the hypervisor won't be
configured to deliver an MSI to an offline CPU.
Thanks,
Stanislav
> Thanks,
> Stanislav
>
> > Thanks,
> >
> > tglx
Stanislav!
On Wed, Apr 12 2023 at 09:36, Stanislav Kinsburskii wrote:
> On Wed, Apr 12, 2023 at 09:19:51AM -0700, Stanislav Kinsburskii wrote:
>> > > + affinity = irq_data_get_effective_affinity_mask(data);
>> > > + cpu = cpumask_first_and(affinity, cpu_online_mask);
>> >
>> > The effective affinity mask of MSI interrupts consists only of online
>> > CPUs, to be accurate: it has exactly one online CPU set.
>> >
>> > But even if it would have only offline CPUs then the result would be:
>> >
>> > cpu = nr_cpu_ids
>> >
>> > which is definitely invalid. While a disabled vector targeted to an
>> > offline CPU is not necessarily invalid.
>
> Although this patch only moves the code around and doesn't make any
> functional changes, I guess that if a fix for the used cpu id is
> required, it has to be in a separate patch.
Sure.
> Would you mind elaborating more on the problem(s)?
> Do you mean that the result of cpumask_first_and has to be checked for not
> being >= nr_cpu_ids?
> Or do you mean that there is no need to check the irq affinity against
> cpu_online_mask at all, and we can simply take the first bit from the
> effective affinity mask?
As of today the effective mask of MSI interrupts contains only online
CPUs. I don't see a reason for that to change.
> Also, could you elaborate more on the disabled vector targeting an
> offline CPU? Is there any use case for such a scenario (in which case we
> might want to support it)?
I'm not aware of one today. That was more a theoretical reasoning.
> I guess the goal of this code is to make sure that the hypervisor won't
> be configured to deliver an MSI to an offline CPU.
Correct, but if the interrupt _is_ masked at the MSI level then the
hypervisor must not deliver an interrupt at all.
The point is that it is valid to target a masked MSI entry to an offline
CPU under the assumption that the hardware/emulation respects the
masking. Whether that's a good idea or not is a different question.
The kernel as of today does not do that. It targets unused but
configured MSI[-x] entries towards MANAGED_IRQ_SHUTDOWN_VECTOR on CPU0
for various reasons, one of them being paranoia.
But in principle there is nothing wrong with that, and it should either
succeed or be rejected at the software level, and not expose a
completely invalid CPU number to the hypercall in the first place.
So if you want to be defensive, then keep the _and(), but then check the
result for being valid and emit something useful like a pr_warn_once()
instead of blindly handing the invalid result to the hypercall and then
have that reject it with some undecipherable error code.
Actually it would not necessarily reach the hypercall because before
that it dereferences cpumask_of(nr_cpu_ids) here:
nr_bank = cpumask_to_vpset(&(intr_desc->target.vp_set), cpumask_of(cpu));
and explode with a kernel pagefault. If not, it will read some random
adjacent data and try to create a vp_set from it. Neither of those is
anywhere close to correct.
Thanks,
tglx
On Fri, Apr 14, 2023 at 09:28:52AM +0200, Thomas Gleixner wrote:
> Stanislav!
>
> On Wed, Apr 12 2023 at 09:36, Stanislav Kinsburskii wrote:
> > On Wed, Apr 12, 2023 at 09:19:51AM -0700, Stanislav Kinsburskii wrote:
> >> > > + affinity = irq_data_get_effective_affinity_mask(data);
> >> > > + cpu = cpumask_first_and(affinity, cpu_online_mask);
> >> >
> >> > The effective affinity mask of MSI interrupts consists only of online
> >> > CPUs, to be accurate: it has exactly one online CPU set.
> >> >
> >> > But even if it would have only offline CPUs then the result would be:
> >> >
> >> > cpu = nr_cpu_ids
> >> >
> >> > which is definitely invalid. While a disabled vector targeted to an
> >> > offline CPU is not necessarily invalid.
> >
> > Although this patch only moves the code around and doesn't make any
> > functional changes, I guess that if a fix for the used cpu id is
> > required, it has to be in a separate patch.
>
> Correct, but if the interrupt _is_ masked at the MSI level then the
> hypervisor must not deliver an interrupt at all.
>
> The point is that it is valid to target a masked MSI entry to an offline
> CPU under the assumption that the hardware/emulation respects the
> masking. Whether that's a good idea or not is a different question.
>
> The kernel as of today does not do that. It targets unused but
> configured MSI[-x] entries towards MANAGED_IRQ_SHUTDOWN_VECTOR on CPU0
> for various reasons, one of them being paranoia.
>
> But in principle there is nothing wrong with that, and it should either
> succeed or be rejected at the software level, and not expose a
> completely invalid CPU number to the hypercall in the first place.
>
> So if you want to be defensive, then keep the _and(), but then check the
> result for being valid and emit something useful like a pr_warn_once()
> instead of blindly handing the invalid result to the hypercall and then
> have that reject it with some undecipherable error code.
>
> Actually it would not necessarily reach the hypercall because before
> that it dereferences cpumask_of(nr_cpu_ids) here:
>
> nr_bank = cpumask_to_vpset(&(intr_desc->target.vp_set), cpumask_of(cpu));
>
> and explode with a kernel pagefault. If not, it will read some random
> adjacent data and try to create a vp_set from it. Neither of those is
> anywhere close to correct.
>
Thank you Thomas.
I sent a patch to address the problems you highlighted:
"x86/hyperv: Fix IRQ effective cpu discovery for the interrupts unmasking"
I'll update this series after that patch is merged.
Thanks,
Stanislav
> Thanks,
>
> tglx