Jeffrey's 4 recent patches added Multi-MSI support to the pci-hyperv driver.
Unluckily, one of the patches, i.e., b4b77778ecc5, causes a regression to a
fio test for the Azure VM SKU Standard L64s v2 (64 AMD vCPUs, 8 NVMe drives):
when fio runs against all the 8 NVMe drives, it runs fine with a low io-depth
(e.g., 2 or 4); when fio runs with a high io-depth (e.g., 256), somehow
queue-29 of each NVMe drive suddenly no longer receives any interrupts, and
the NVMe core code has to abort the queue after a timeout of 30 seconds, and
then queue-29 starts to receive interrupts again for several seconds, and
later queue-29 no longer receives interrupts again, and this pattern repeats:
[ 223.891249] nvme nvme2: I/O 320 QID 29 timeout, aborting
[ 223.896231] nvme nvme0: I/O 320 QID 29 timeout, aborting
[ 223.898340] nvme nvme4: I/O 832 QID 29 timeout, aborting
[ 259.471309] nvme nvme2: I/O 320 QID 29 timeout, aborting
[ 259.476493] nvme nvme0: I/O 321 QID 29 timeout, aborting
[ 259.482967] nvme nvme0: I/O 322 QID 29 timeout, aborting
Some other symptoms are: the throughput of the NVMe drives drops due to
commit b4b77778ecc5. When the fio test is running, the kernel prints some
soft lock-up messages from time to time.
Commit b4b77778ecc5 itself looks good, and at the moment it's unclear where
the issue is. While the issue is being investigated, restore the old behavior
in hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
single-MSI and MSI-X. This is a stopgap for the above NVMe issue.
Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
Signed-off-by: Dexuan Cui <[email protected]>
Cc: Jeffrey Hugo <[email protected]>
Cc: Carl Vanderlip <[email protected]>
---
drivers/pci/controller/pci-hyperv.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index db814f7b93ba..65d0dab25deb 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1701,6 +1701,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
struct compose_comp_ctxt comp;
struct tran_int_desc *int_desc;
struct msi_desc *msi_desc;
+ bool multi_msi;
u8 vector, vector_count;
struct {
struct pci_packet pci_pkt;
@@ -1714,8 +1715,16 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
u32 size;
int ret;
- /* Reuse the previous allocation */
- if (data->chip_data) {
+ msi_desc = irq_data_get_msi_desc(data);
+ multi_msi = !msi_desc->pci.msi_attrib.is_msix &&
+ msi_desc->nvec_used > 1;
+ /*
+ * Reuse the previous allocation for Multi-MSI. This is required for
+ * Multi-MSI and is optional for single-MSI and MSI-X. Note: for now,
+ * don't reuse the previous allocation for MSI-X because this causes
+ * unreliable interrupt delivery for some NVMe devices.
+ */
+ if (data->chip_data && multi_msi) {
int_desc = data->chip_data;
msg->address_hi = int_desc->address >> 32;
msg->address_lo = int_desc->address & 0xffffffff;
@@ -1723,7 +1732,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
return;
}
- msi_desc = irq_data_get_msi_desc(data);
pdev = msi_desc_to_pci_dev(msi_desc);
dest = irq_data_get_effective_affinity_mask(data);
pbus = pdev->bus;
@@ -1733,11 +1741,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
if (!hpdev)
goto return_null_message;
+ /* Free any previous message that might have already been composed. */
+ if (data->chip_data && !multi_msi) {
+ int_desc = data->chip_data;
+ data->chip_data = NULL;
+ hv_int_desc_free(hpdev, int_desc);
+ }
+
int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC);
if (!int_desc)
goto drop_reference;
- if (!msi_desc->pci.msi_attrib.is_msix && msi_desc->nvec_used > 1) {
+ if (multi_msi) {
/*
* If this is not the first MSI of Multi MSI, we already have
* a mapping. Can exit early.
--
2.25.1
On 8/3/2022 8:51 PM, Dexuan Cui wrote:
> Jeffrey's 4 recent patches added Multi-MSI support to the pci-hyperv driver.
> Unluckily, one of the patches, i.e., b4b77778ecc5, causes a regression to a
> fio test for the Azure VM SKU Standard L64s v2 (64 AMD vCPUs, 8 NVMe drives):
>
> when fio runs against all the 8 NVMe drives, it runs fine with a low io-depth
> (e.g., 2 or 4); when fio runs with a high io-depth (e.g., 256), somehow
> queue-29 of each NVMe drive suddenly no longer receives any interrupts, and
> the NVMe core code has to abort the queue after a timeout of 30 seconds, and
> then queue-29 starts to receive interrupts again for several seconds, and
> later queue-29 no longer receives interrupts again, and this pattern repeats:
>
> [ 223.891249] nvme nvme2: I/O 320 QID 29 timeout, aborting
> [ 223.896231] nvme nvme0: I/O 320 QID 29 timeout, aborting
> [ 223.898340] nvme nvme4: I/O 832 QID 29 timeout, aborting
> [ 259.471309] nvme nvme2: I/O 320 QID 29 timeout, aborting
> [ 259.476493] nvme nvme0: I/O 321 QID 29 timeout, aborting
> [ 259.482967] nvme nvme0: I/O 322 QID 29 timeout, aborting
>
> Some other symptoms are: the throughput of the NVMe drives drops due to
> commit b4b77778ecc5. When the fio test is running, the kernel prints some
> soft lock-up messages from time to time.
>
> Commit b4b77778ecc5 itself looks good, and at the moment it's unclear where
> the issue is. While the issue is being investigated, restore the old behavior
> in hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
> single-MSI and MSI-X. This is a stopgap for the above NVMe issue.
>
> Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> Signed-off-by: Dexuan Cui <[email protected]>
> Cc: Jeffrey Hugo <[email protected]>
> Cc: Carl Vanderlip <[email protected]>
> ---
I'm sorry a regression has been discovered. Right now, the issue
doesn't make sense to me. I'd love to know what you find out.
This stopgap solution appears reasonable to me.
Reviewed-by: Jeffrey Hugo <[email protected]>
> From: Jeffrey Hugo <[email protected]>
> Sent: Thursday, August 4, 2022 7:22 AM
> ...
> > Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in
> compose_msi_msg()")
> > Signed-off-by: Dexuan Cui <[email protected]>
> > Cc: Jeffrey Hugo <[email protected]>
> > Cc: Carl Vanderlip <[email protected]>
> > ---
>
> I'm sorry a regression has been discovered. Right now, the issue
> doesn't make sense to me. I'd love to know what you find out.
>
> This stopgap solution appears reasonable to me.
>
> Reviewed-by: Jeffrey Hugo <[email protected]>
Hi Lorenzo, Bjorn and all,
Would you please take a look at the patch?
Thanks,
-- Dexuan
On Wed, Aug 10, 2022 at 09:51:23PM +0000, Dexuan Cui wrote:
> > From: Jeffrey Hugo <[email protected]>
> > Sent: Thursday, August 4, 2022 7:22 AM
> > ...
> > > Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in
> > compose_msi_msg()")
> > > Signed-off-by: Dexuan Cui <[email protected]>
> > > Cc: Jeffrey Hugo <[email protected]>
> > > Cc: Carl Vanderlip <[email protected]>
> > > ---
> >
> > I'm sorry a regression has been discovered. Right now, the issue
> > doesn't make sense to me. I'd love to know what you find out.
> >
> > This stopgap solution appears reasonable to me.
> >
> > Reviewed-by: Jeffrey Hugo <[email protected]>
>
> Hi Lorenzo, Bjorn and all,
> Would you please take a look at the patch?
I am not very happy with this patch, it is a temporary solution to
a potential problem (or reworded: we don't know what the problem is
but we are fixing it anyway - in a way that is potentially not
related to the bug at all).
If the commit you are fixing is the problem I'd rather revert it,
waiting to understand the problem and to rework the code accordingly.
I don't think b4b77778ecc5 is essential to Hyper-V code - it is a
rework, let's fix it and repost it when it is updated through the
debugging you are carrying out. In the meantime we can revert it
to fix the issue.
Thanks,
Lorenzo
On Wed, Aug 03, 2022 at 07:51:04PM -0700, Dexuan Cui wrote:
> Jeffrey's 4 recent patches added Multi-MSI support to the pci-hyperv driver.
> Unluckily, one of the patches, i.e., b4b77778ecc5, causes a regression to a
> fio test for the Azure VM SKU Standard L64s v2 (64 AMD vCPUs, 8 NVMe drives):
>
> when fio runs against all the 8 NVMe drives, it runs fine with a low io-depth
> (e.g., 2 or 4); when fio runs with a high io-depth (e.g., 256), somehow
> queue-29 of each NVMe drive suddenly no longer receives any interrupts, and
> the NVMe core code has to abort the queue after a timeout of 30 seconds, and
> then queue-29 starts to receive interrupts again for several seconds, and
> later queue-29 no longer receives interrupts again, and this pattern repeats:
>
> [ 223.891249] nvme nvme2: I/O 320 QID 29 timeout, aborting
> [ 223.896231] nvme nvme0: I/O 320 QID 29 timeout, aborting
> [ 223.898340] nvme nvme4: I/O 832 QID 29 timeout, aborting
> [ 259.471309] nvme nvme2: I/O 320 QID 29 timeout, aborting
> [ 259.476493] nvme nvme0: I/O 321 QID 29 timeout, aborting
> [ 259.482967] nvme nvme0: I/O 322 QID 29 timeout, aborting
>
> Some other symptoms are: the throughput of the NVMe drives drops due to
> commit b4b77778ecc5. When the fio test is running, the kernel prints some
> soft lock-up messages from time to time.
>
> Commit b4b77778ecc5 itself looks good, and at the moment it's unclear where
> the issue is. While the issue is being investigated, restore the old behavior
> in hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
> single-MSI and MSI-X. This is a stopgap for the above NVMe issue.
This has only observations with no explanations, and I don't see how
it will be useful to future readers of the git history.
I assume you bisected the problem to b4b77778ecc5? Can you just
revert that? A revert requires no more explanation than "this broke
something."
I guess this is a fine distinction, but I really don't like random
code changes that "seem to avoid a problem but we don't know how."
A revert at least has the advantage that we can cover our eyes and
pretend the commit never happened. This patch feels like future
readers will have to try to understand the code even though we
clearly don't understand why it makes a difference.
> Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> Signed-off-by: Dexuan Cui <[email protected]>
> Cc: Jeffrey Hugo <[email protected]>
> Cc: Carl Vanderlip <[email protected]>
> ---
> drivers/pci/controller/pci-hyperv.c | 23 +++++++++++++++++++----
> 1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index db814f7b93ba..65d0dab25deb 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1701,6 +1701,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> struct compose_comp_ctxt comp;
> struct tran_int_desc *int_desc;
> struct msi_desc *msi_desc;
> + bool multi_msi;
> u8 vector, vector_count;
> struct {
> struct pci_packet pci_pkt;
> @@ -1714,8 +1715,16 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> u32 size;
> int ret;
>
> - /* Reuse the previous allocation */
> - if (data->chip_data) {
> + msi_desc = irq_data_get_msi_desc(data);
> + multi_msi = !msi_desc->pci.msi_attrib.is_msix &&
> + msi_desc->nvec_used > 1;
> + /*
> + * Reuse the previous allocation for Multi-MSI. This is required for
> + * Multi-MSI and is optional for single-MSI and MSI-X. Note: for now,
> + * don't reuse the previous allocation for MSI-X because this causes
> + * unreliable interrupt delivery for some NVMe devices.
> + */
> + if (data->chip_data && multi_msi) {
> int_desc = data->chip_data;
> msg->address_hi = int_desc->address >> 32;
> msg->address_lo = int_desc->address & 0xffffffff;
> @@ -1723,7 +1732,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> return;
> }
>
> - msi_desc = irq_data_get_msi_desc(data);
> pdev = msi_desc_to_pci_dev(msi_desc);
> dest = irq_data_get_effective_affinity_mask(data);
> pbus = pdev->bus;
> @@ -1733,11 +1741,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> if (!hpdev)
> goto return_null_message;
>
> + /* Free any previous message that might have already been composed. */
> + if (data->chip_data && !multi_msi) {
> + int_desc = data->chip_data;
> + data->chip_data = NULL;
> + hv_int_desc_free(hpdev, int_desc);
> + }
> +
> int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC);
> if (!int_desc)
> goto drop_reference;
>
> - if (!msi_desc->pci.msi_attrib.is_msix && msi_desc->nvec_used > 1) {
> + if (multi_msi) {
> /*
> * If this is not the first MSI of Multi MSI, we already have
> * a mapping. Can exit early.
> --
> 2.25.1
>
> From: Bjorn Helgaas <[email protected]>
> Sent: Tuesday, August 16, 2022 8:51 AM
> To: Dexuan Cui <[email protected]>
>
> This has only observations with no explanations, and I don't see how
> it will be useful to future readers of the git history.
Please see the below.
> I assume you bisected the problem to b4b77778ecc5?
Yes.
> Can you just revert that? A revert requires no more explanation than
> "this broke something."
It's better to not revert b4b77778ecc5, which is required by Jeff's
Multi-MSI device, which doesn't seem to be affected by the interrupt
issue I described.
> I guess this is a fine distinction, but I really don't like random
> code changes that "seem to avoid a problem but we don't know how."
> A revert at least has the advantage that we can cover our eyes and
> pretend the commit never happened. This patch feels like future
> readers will have to try to understand the code even though we
> clearly don't understand why it makes a difference.
I just replied to Lorenzo's email with more details. FYI, this is the link
to my reply:
https://lwn.net/ml/linux-kernel/SA1PR21MB1335D08F987BBAE08EADF010BF6B9@SA1PR21MB1335.namprd21.prod.outlook.com/
I just felt the commit message might be too long if I had put all the
details there. :-) Can we add a Links: tag?
Thanks,
-- Dexuan
> From: Lorenzo Pieralisi <[email protected]>
> Sent: Tuesday, August 16, 2022 8:48 AM
> To: Dexuan Cui <[email protected]>
>
> On Wed, Aug 10, 2022 at 09:51:23PM +0000, Dexuan Cui wrote:
> > > From: Jeffrey Hugo <[email protected]>
> > > Sent: Thursday, August 4, 2022 7:22 AM
> > > ...
> > > > Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in
> > > compose_msi_msg()")
> > > > Signed-off-by: Dexuan Cui <[email protected]>
> > > > Cc: Jeffrey Hugo <[email protected]>
> > > > Cc: Carl Vanderlip <[email protected]>
> > > > ---
> > >
> > > I'm sorry a regression has been discovered. Right now, the issue
> > > doesn't make sense to me. I'd love to know what you find out.
> > >
> > > This stopgap solution appears reasonable to me.
> > >
> > > Reviewed-by: Jeffrey Hugo <[email protected]>
> >
> > Hi Lorenzo, Bjorn and all,
> > Would you please take a look at the patch?
>
> I am not very happy with this patch, it is a temporary solution to
> a potential problem (or reworded: we don't know what the problem is
> but we are fixing it anyway - in a way that is potentially not
> related to the bug at all).
Exactly. I understand your concern. We're still trying to figure out
the root cause. The problem is that so far the issue only repros with
that Azure VM SKU "L64s v2" (64 AMD vCPUs, 8 NVMe drives, 32 I/O
queues per NVMe device). Unluckily I can't find the same hardware
and software locally. I tried to repro the issue on two local Hyper-V
hosts with 4 NVMe devices (8 I/O queues per NVMe device) but failed
to repro the issue. It's very difficult to debug the issue on Azure (if
that's possible) because we don't have hypervisor debugger to examine
the hypervisor and the hardware (I'm still trying to figure out how
difficult it's to set up the debugger).
As I mentioned, b4b77778ecc5 itself looks good, but it changes the
behavior of compose_msi_msg(), that has been working for many years.
This small patch restores the old behavior for non-Multi-MSI, so it's
safe. Considering we may need quite a long time to get the root cause,
IMO this patch is a good temporary solution for non-Multi-MSI.
> If the commit you are fixing is the problem I'd rather revert it,
> waiting to understand the problem and to rework the code accordingly.
IMO we should not revert b4b77778ecc5, because Jeff's Multi-MSI-capable
PCI device needs this patch to work properly. It looks like Jeff's PCI device
doesn't suffer from the interrupt issue I described in the commit log.
BTW, Multi-MSI never worked in pci-hyperv before Jeff's 4 recent
patches. Jeff's PCI device is the first Multi-MSI PCI device we ever tested
with pci-hyperv.
> I don't think b4b77778ecc5 is essential to Hyper-V code - it is a
> rework, let's fix it and repost it when it is updated through the
> debugging you are carrying out. In the meantime we can revert it
> to fix the issue.
Thanks for sharing your thoughts. IMO b4b77778ecc5 is not a simple
rework. As I explained above, IMO we should not revert it.
Before b4b77778ecc5:
1. when a PCI device driver calls pci_alloc_irq_vectors(),
hv_compose_msi_msg() is called to create an interrupt mapping with
the dummy vector= 0xef (i.e., MANAGED_IRQ_SHUTDOWN_VECTOR);
2. when the PCI device driver calls request_irq(), hv_compose_msi_msg()
is called again with a real vector and cpu.
2.1 hv_compose_msi_msg() destroys the earlier interrupt mapping..
2.2 hv_compose_msi_msg() creates a new mapping with the real vector/cpu.
2.3 hv_arch_irq_unmask() uses a hypercall to update the mapping with the real vector/cpu.
(BTW, hv_arch_irq_unmask() is also called when irqbalance daemon is running).
Step 2.2 and 2.3 do seem duplicated, but my understanding is that this is
how Linux irq and PCI MSI code work.
With b4b77778ecc5, we no longer have step 2.1 and 2.2 (please note
the "return" in that commit). We never tested this code path before
b4b77778ecc5, and we supposed this should work before we found
the NVMe interrupt issue with the VM SKU "L64s v2".
The strangest thing is that the NVMe devices do work for low IO-depth,
though the read/write speed is a little bit lower (which is also strange);
With high IO-depth,suddenly queue-29 of each NVMe device stops
working -- after the NVMe devices are initialized, hv_compose_msi_msg()
and hv_arch_irq_unmask() are not called, meaning the MSI-X tables on
the NVMe devices are still the same, and the IOMMU's Interrupt
Remapping Table Entries should also be the same, so it's really weird
the interrupts on queue-29 are no longer happening...
My speculation is that there may be some kind of weird hypervisor bug,
but it's hard to prove it with no debugger. :-(
I appreciate your thoughts on my analysis.
To recap, my suggestion is that we use this patch as a temporary solution
and we continue the investigation, which unluckily may need a long period
of time.
Thanks,
Dexuan
On Tue, Aug 16, 2022 at 09:13:26PM +0000, Dexuan Cui wrote:
> > From: Bjorn Helgaas <[email protected]>
> > Sent: Tuesday, August 16, 2022 8:51 AM
> > To: Dexuan Cui <[email protected]>
> >
> > This has only observations with no explanations, and I don't see how
> > it will be useful to future readers of the git history.
> Please see the below.
>
> > I assume you bisected the problem to b4b77778ecc5?
> Yes.
>
> > Can you just revert that? A revert requires no more explanation than
> > "this broke something."
>
> It's better to not revert b4b77778ecc5, which is required by Jeff's
> Multi-MSI device, which doesn't seem to be affected by the interrupt
> issue I described.
You must debug it, there are no two ways about it.
We can't apply fixes on a hunch, more so given that I am not convinced
at all this patch is fixing anything, it is just papering over an
underlying bug that is still to be pinpointed.
> > I guess this is a fine distinction, but I really don't like random
> > code changes that "seem to avoid a problem but we don't know how."
> > A revert at least has the advantage that we can cover our eyes and
> > pretend the commit never happened. This patch feels like future
> > readers will have to try to understand the code even though we
> > clearly don't understand why it makes a difference.
>
> I just replied to Lorenzo's email with more details. FYI, this is the link
> to my reply:
> https://lwn.net/ml/linux-kernel/SA1PR21MB1335D08F987BBAE08EADF010BF6B9@SA1PR21MB1335.namprd21.prod.outlook.com/
>
> I just felt the commit message might be too long if I had put all the
> details there. :-) Can we add a Links: tag?
Commit logs must describe the issue you are fixing, thouroughly and
concisely. To start with "Jeffrey's 4 recent patches" is a very bad
start for a commit log, it means nothing, try to read your log as
someone who needs to understand the commit years down the line please.
Now, back to this patch: we are at -rc1, unless Bjorn is willing to
do so I am not inclined to apply this patch till next merge window
(and actually I am not inclined to merge it at all).
This gives you folks time to debug it (and it must be debugged), the
fact that it works for one multi-MSI device does not mean that the
bug isn't still there - I am worried that the issue is with
b4b77778ecc5 and the interaction with core MSI/IOMMU.
Thanks,
Lorenzo