2019-08-15 14:59:09

by Ben Luo

[permalink] [raw]
Subject: [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops

Currently, VFIO takes a lot of free-then-request-irq actions whenever
a VM (with device passthru via VFIO) sets irq affinity or mask/unmask
irq. Those actions only change the cookie data of irqaction or even
change nothing. The free-then-request-irq not only adds more latency,
but also increases the risk of losing interrupt, which may lead to a
VM hung forever in waiting for IO completion

This patchset solved the issue by:
Patch 2 introduces update_irq_devid to only update dev_id of irqaction
Patch 3 make use of update_irq_devid and optimize irq operations in VFIO

changes from v2:
- reformat to avoid quoted string split across lines and etc.

changes from v1:
- add Patch 1 to enhance error recovery etc. in free irq per tglx's comments
- enhance error recovery code and debugging info in update_irq_devid
- use __must_check in external referencing of update_irq_devid
- use EXPORT_SYMBOL_GPL for update_irq_devid
- reformat code of patch 3 for better readability

Ben Luo (3):
genirq: enhance error recovery code in free irq
genirq: introduce update_irq_devid()
vfio_pci: make use of update_irq_devid and optimize irq ops

drivers/vfio/pci/vfio_pci_intrs.c | 101 +++++++++++++++++++++-------------
include/linux/interrupt.h | 3 ++
kernel/irq/manage.c | 110 +++++++++++++++++++++++++++++++++-----
3 files changed, 164 insertions(+), 50 deletions(-)

--
1.8.3.1


2019-08-15 16:24:32

by Ben Luo

[permalink] [raw]
Subject: [PATCH v3 1/3] genirq: enhance error recovery code in free irq

Per Thomas Gleixner's comments:
1) free_irq/free_percpu_irq returns if called from IRQ context
2) move WARN out of the locked region and print out dev_id

Signed-off-by: Ben Luo <[email protected]>
---
kernel/irq/manage.c | 32 ++++++++++++++++++++------------
1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f17..6f9b20f 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1690,7 +1690,11 @@ static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id)
struct irqaction *action, **action_ptr;
unsigned long flags;

- WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
+ if (in_interrupt()) {
+ WARN(1, "Trying to free IRQ %d (dev_id %p) from IRQ context!\n",
+ irq, dev_id);
+ return NULL;
+ }

mutex_lock(&desc->request_mutex);
chip_bus_lock(desc);
@@ -1705,10 +1709,11 @@ static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id)
action = *action_ptr;

if (!action) {
- WARN(1, "Trying to free already-free IRQ %d\n", irq);
raw_spin_unlock_irqrestore(&desc->lock, flags);
chip_bus_sync_unlock(desc);
mutex_unlock(&desc->request_mutex);
+ WARN(1, "Trying to free already-free IRQ %d (dev_id %p)\n",
+ irq, dev_id);
return NULL;
}

@@ -2286,7 +2291,11 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
struct irqaction *action;
unsigned long flags;

- WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
+ if (in_interrupt()) {
+ WARN(1, "Trying to free IRQ %d (dev_id %p) from IRQ context!\n",
+ irq, dev_id);
+ return NULL;
+ }

if (!desc)
return NULL;
@@ -2295,14 +2304,17 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_

action = desc->action;
if (!action || action->percpu_dev_id != dev_id) {
- WARN(1, "Trying to free already-free IRQ %d\n", irq);
- goto bad;
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+ WARN(1, "Trying to free already-free IRQ (dev_id %p) %d\n",
+ irq, dev_id);
+ return NULL;
}

if (!cpumask_empty(desc->percpu_enabled)) {
- WARN(1, "percpu IRQ %d still enabled on CPU%d!\n",
- irq, cpumask_first(desc->percpu_enabled));
- goto bad;
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+ WARN(1, "percpu IRQ %d (dev_id %p) still enabled on CPU%d!\n",
+ irq, dev_id, cpumask_first(desc->percpu_enabled));
+ return NULL;
}

/* Found it - now remove it from the list of entries: */
@@ -2317,10 +2329,6 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
irq_chip_pm_put(&desc->irq_data);
module_put(desc->owner);
return action;
-
-bad:
- raw_spin_unlock_irqrestore(&desc->lock, flags);
- return NULL;
}

/**
--
1.8.3.1

2019-08-15 16:24:32

by Ben Luo

[permalink] [raw]
Subject: [PATCH v3 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops

When userspace (e.g. qemu) triggers a switch between KVM
irqfd and userspace eventfd, only dev_id of irq action
(i.e. the "trigger" in this patch's context) will be
changed, but a free-then-request-irq action is taken in
current code. And, irq affinity setting in VM will also
trigger a free-then-request-irq action, which actually
changes nothing, but only fires a producer re-registration
to update irte in case that posted-interrupt is in use.

This patch makes use of update_irq_devid() and optimize
both cases above, which reduces the risk of losing interrupt
and also cuts some overhead.

Signed-off-by: Ben Luo <[email protected]>
Reviewed-by: Liu Jiang <[email protected]>
Reviewed-by: Zou Nanhai <[email protected]>
Reviewed-by: Yunsheng Lin <[email protected]>
---
drivers/vfio/pci/vfio_pci_intrs.c | 101 ++++++++++++++++++++++++--------------
1 file changed, 63 insertions(+), 38 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 3fa3f72..b2654ba 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -284,70 +284,95 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
int vector, int fd, bool msix)
{
+ struct eventfd_ctx *trigger = NULL;
struct pci_dev *pdev = vdev->pdev;
- struct eventfd_ctx *trigger;
int irq, ret;

if (vector < 0 || vector >= vdev->num_ctx)
return -EINVAL;

+ if (fd >= 0) {
+ trigger = eventfd_ctx_fdget(fd);
+ if (IS_ERR(trigger))
+ return PTR_ERR(trigger);
+ }
+
irq = pci_irq_vector(pdev, vector);

if (vdev->ctx[vector].trigger) {
- free_irq(irq, vdev->ctx[vector].trigger);
- irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
- kfree(vdev->ctx[vector].name);
- eventfd_ctx_put(vdev->ctx[vector].trigger);
- vdev->ctx[vector].trigger = NULL;
+ if (vdev->ctx[vector].trigger == trigger) {
+ irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+ } else if (trigger) {
+ ret = update_irq_devid(irq,
+ vdev->ctx[vector].trigger, trigger);
+ if (unlikely(ret)) {
+ dev_info(&pdev->dev,
+ "update_irq_devid %d (token %p) fails: %d\n",
+ irq, vdev->ctx[vector].trigger, ret);
+ eventfd_ctx_put(trigger);
+ return ret;
+ }
+ irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+ eventfd_ctx_put(vdev->ctx[vector].trigger);
+ vdev->ctx[vector].producer.token = trigger;
+ vdev->ctx[vector].trigger = trigger;
+ } else {
+ free_irq(irq, vdev->ctx[vector].trigger);
+ irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+ kfree(vdev->ctx[vector].name);
+ eventfd_ctx_put(vdev->ctx[vector].trigger);
+ vdev->ctx[vector].trigger = NULL;
+ }
}

if (fd < 0)
return 0;

- vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
- msix ? "x" : "", vector,
- pci_name(pdev));
- if (!vdev->ctx[vector].name)
- return -ENOMEM;
+ if (!vdev->ctx[vector].trigger) {
+ vdev->ctx[vector].name = kasprintf(GFP_KERNEL,
+ "vfio-msi%s[%d](%s)",
+ msix ? "x" : "", vector,
+ pci_name(pdev));
+ if (!vdev->ctx[vector].name) {
+ eventfd_ctx_put(trigger);
+ return -ENOMEM;
+ }

- trigger = eventfd_ctx_fdget(fd);
- if (IS_ERR(trigger)) {
- kfree(vdev->ctx[vector].name);
- return PTR_ERR(trigger);
- }
+ /*
+ * The MSIx vector table resides in device memory which may be
+ * cleared via backdoor resets. We don't allow direct access to
+ * the vector table so even if a userspace driver attempts to
+ * save/restore around such a reset it would be unsuccessful.
+ * To avoid this, restore the cached value of the message prior
+ * to enabling.
+ */
+ if (msix) {
+ struct msi_msg msg;

- /*
- * The MSIx vector table resides in device memory which may be cleared
- * via backdoor resets. We don't allow direct access to the vector
- * table so even if a userspace driver attempts to save/restore around
- * such a reset it would be unsuccessful. To avoid this, restore the
- * cached value of the message prior to enabling.
- */
- if (msix) {
- struct msi_msg msg;
+ get_cached_msi_msg(irq, &msg);
+ pci_write_msi_msg(irq, &msg);
+ }

- get_cached_msi_msg(irq, &msg);
- pci_write_msi_msg(irq, &msg);
- }
+ ret = request_irq(irq, vfio_msihandler, 0,
+ vdev->ctx[vector].name, trigger);
+ if (ret) {
+ kfree(vdev->ctx[vector].name);
+ eventfd_ctx_put(trigger);
+ return ret;
+ }

- ret = request_irq(irq, vfio_msihandler, 0,
- vdev->ctx[vector].name, trigger);
- if (ret) {
- kfree(vdev->ctx[vector].name);
- eventfd_ctx_put(trigger);
- return ret;
+ vdev->ctx[vector].producer.token = trigger;
+ vdev->ctx[vector].producer.irq = irq;
+ vdev->ctx[vector].trigger = trigger;
}

- vdev->ctx[vector].producer.token = trigger;
- vdev->ctx[vector].producer.irq = irq;
+ /* always update irte for posted mode */
ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
if (unlikely(ret))
dev_info(&pdev->dev,
"irq bypass producer (token %p) registration fails: %d\n",
vdev->ctx[vector].producer.token, ret);

- vdev->ctx[vector].trigger = trigger;
-
return 0;
}

--
1.8.3.1

2019-08-15 16:25:33

by Ben Luo

[permalink] [raw]
Subject: [PATCH v3 2/3] genirq: introduce update_irq_devid()

Sometimes, only the dev_id field of irqaction need to be changed.
E.g. KVM VM with device passthru via VFIO may switch irq injection
path between KVM irqfd and userspace eventfd. These two paths
share the same irq num and handler for the same vector of a device,
only with different dev_id referencing to different fds' contexts.

So, instead of free/request irq, only update dev_id of irqaction.
This narrows the gap for setting up new irq (and irte, if has)
and also gains some performance benefit.

Signed-off-by: Ben Luo <[email protected]>
Reviewed-by: Liu Jiang <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
---
include/linux/interrupt.h | 3 ++
kernel/irq/manage.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 81 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 5b8328a..6060c5a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -172,6 +172,9 @@ struct irqaction {
request_percpu_nmi(unsigned int irq, irq_handler_t handler,
const char *devname, void __percpu *dev);

+extern int __must_check
+update_irq_devid(unsigned int irq, void *dev_id, void *new_dev_id);
+
extern const void *free_irq(unsigned int, void *);
extern void free_percpu_irq(unsigned int, void __percpu *);

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 6f9b20f..a76ef61 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -2064,6 +2064,84 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
EXPORT_SYMBOL(request_threaded_irq);

/**
+ * update_irq_devid - update irq dev_id to a new one
+ * @irq: Interrupt line to update
+ * @dev_id: A cookie to find the irqaction to update
+ * @new_dev_id: New cookie passed to the handler function
+ *
+ * Sometimes, only the cookie data need to be changed.
+ * Instead of free/request irq, only update dev_id here
+ * Not only to gain some performance benefit, but also
+ * reduce the risk of losing interrupt.
+ *
+ * E.g. irq affinity setting in a VM with VFIO passthru
+ * device is carried out in a free-then-request-irq way
+ * since lack of this very function. The free_irq()
+ * eventually triggers deactivation of IR domain, which
+ * will cleanup IRTE. There is a gap before request_irq()
+ * finally setup the IRTE again. In this gap, a in-flight
+ * irq buffering in hardware layer may trigger DMAR fault
+ * and be lost.
+ *
+ * On failure, it returns a negative value. On success,
+ * it returns 0
+ */
+int update_irq_devid(unsigned int irq, void *dev_id, void *new_dev_id)
+{
+ struct irq_desc *desc = irq_to_desc(irq);
+ struct irqaction *action, **action_ptr;
+ unsigned long flags;
+
+ if (in_interrupt()) {
+ WARN(1, "Trying to update IRQ %d (dev_id %p to %p) from IRQ context!\n",
+ irq, dev_id, new_dev_id);
+ return -EPERM;
+ }
+
+ if (!desc)
+ return -EINVAL;
+
+ chip_bus_lock(desc);
+ raw_spin_lock_irqsave(&desc->lock, flags);
+
+ /*
+ * There can be multiple actions per IRQ descriptor, find the right
+ * one based on the dev_id:
+ */
+ action_ptr = &desc->action;
+ for (;;) {
+ action = *action_ptr;
+
+ if (!action) {
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+ chip_bus_sync_unlock(desc);
+ WARN(1, "Trying to update already-free IRQ %d (dev_id %p to %p)\n",
+ irq, dev_id, new_dev_id);
+ return -ENXIO;
+ }
+
+ if (action->dev_id == dev_id) {
+ action->dev_id = new_dev_id;
+ break;
+ }
+ action_ptr = &action->next;
+ }
+
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+ chip_bus_sync_unlock(desc);
+
+ /*
+ * Make sure it's not being used on another CPU:
+ * There is a risk of UAF for old *dev_id, if it is
+ * freed in a short time after this func returns
+ */
+ synchronize_irq(irq);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(update_irq_devid);
+
+/**
* request_any_context_irq - allocate an interrupt line
* @irq: Interrupt line to allocate
* @handler: Function to be called when the IRQ occurs.
--
1.8.3.1

2019-08-15 16:47:45

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops

On Thu, 15 Aug 2019, Ben Luo wrote:
> if (vdev->ctx[vector].trigger) {
> - free_irq(irq, vdev->ctx[vector].trigger);
> - irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> - kfree(vdev->ctx[vector].name);
> - eventfd_ctx_put(vdev->ctx[vector].trigger);
> - vdev->ctx[vector].trigger = NULL;
> + if (vdev->ctx[vector].trigger == trigger) {
> + irq_bypass_unregister_producer(&vdev->ctx[vector].producer);

What's the point of unregistering the producer, just to register it again below?

> + } else if (trigger) {
> + ret = update_irq_devid(irq,
> + vdev->ctx[vector].trigger, trigger);
> + if (unlikely(ret)) {
> + dev_info(&pdev->dev,
> + "update_irq_devid %d (token %p) fails: %d\n",

s/fails/failed/

> + irq, vdev->ctx[vector].trigger, ret);
> + eventfd_ctx_put(trigger);
> + return ret;
> + }

This lacks any form of comment why this is correct. dev_id is updated and
the producer with the old token is still registered.

> + irq_bypass_unregister_producer(&vdev->ctx[vector].producer);

Now it's unregistered.

> + eventfd_ctx_put(vdev->ctx[vector].trigger);
> + vdev->ctx[vector].producer.token = trigger;

The token is updated and then it's newly registered below.

> + vdev->ctx[vector].trigger = trigger;
> - vdev->ctx[vector].producer.token = trigger;
> - vdev->ctx[vector].producer.irq = irq;
> + /* always update irte for posted mode */
> ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
> if (unlikely(ret))
> dev_info(&pdev->dev,
> "irq bypass producer (token %p) registration fails: %d\n",
> vdev->ctx[vector].producer.token, ret);

I know this code already existed, but again this looks inconsistent. If the
registration fails then a subsequent update will try to unregister a not
registered producer. Does not make any sense to me.

Thanks,

tglx

2019-08-19 20:54:28

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops

On Thu, 15 Aug 2019 21:02:58 +0800
Ben Luo <[email protected]> wrote:

> Currently, VFIO takes a lot of free-then-request-irq actions whenever
> a VM (with device passthru via VFIO) sets irq affinity or mask/unmask
> irq. Those actions only change the cookie data of irqaction or even
> change nothing. The free-then-request-irq not only adds more latency,
> but also increases the risk of losing interrupt, which may lead to a
> VM hung forever in waiting for IO completion

What guest environment is generating this? Typically I don't see that
Windows or Linux guests bounce the interrupt configuration much.
Thanks,

Alex

>
> This patchset solved the issue by:
> Patch 2 introduces update_irq_devid to only update dev_id of irqaction
> Patch 3 make use of update_irq_devid and optimize irq operations in VFIO
>
> changes from v2:
> - reformat to avoid quoted string split across lines and etc.
>
> changes from v1:
> - add Patch 1 to enhance error recovery etc. in free irq per tglx's comments
> - enhance error recovery code and debugging info in update_irq_devid
> - use __must_check in external referencing of update_irq_devid
> - use EXPORT_SYMBOL_GPL for update_irq_devid
> - reformat code of patch 3 for better readability
>
> Ben Luo (3):
> genirq: enhance error recovery code in free irq
> genirq: introduce update_irq_devid()
> vfio_pci: make use of update_irq_devid and optimize irq ops
>
> drivers/vfio/pci/vfio_pci_intrs.c | 101 +++++++++++++++++++++-------------
> include/linux/interrupt.h | 3 ++
> kernel/irq/manage.c | 110 +++++++++++++++++++++++++++++++++-----
> 3 files changed, 164 insertions(+), 50 deletions(-)
>

2019-08-20 04:05:18

by Ben Luo

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops


在 2019/8/20 上午4:51, Alex Williamson 写道:
> On Thu, 15 Aug 2019 21:02:58 +0800
> Ben Luo <[email protected]> wrote:
>
>> Currently, VFIO takes a lot of free-then-request-irq actions whenever
>> a VM (with device passthru via VFIO) sets irq affinity or mask/unmask
>> irq. Those actions only change the cookie data of irqaction or even
>> change nothing. The free-then-request-irq not only adds more latency,
>> but also increases the risk of losing interrupt, which may lead to a
>> VM hung forever in waiting for IO completion
> What guest environment is generating this? Typically I don't see that
> Windows or Linux guests bounce the interrupt configuration much.
> Thanks,
>
> Alex

By tracing centos5u8 on host, I found it keep masking and unmasking
interrupt like this:

[1566032533709879] index:28 irte_hi:000000010004a601
irte_lo:adb54bc000b98001
[1566032533711242] index:28 irte_hi:0000000000000000
irte_lo:0000000000000000
[1566032533711258] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533711269] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533711291] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533711321] index:28 irte_hi:0000000000000000
irte_lo:0000000000000000
[1566032533711340] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533711361] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533711376] index:28 irte_hi:000000010004a601
irte_lo:adb54bc000b98001
[1566032533713368] index:28 irte_hi:0000000000000000
irte_lo:0000000000000000
[1566032533713385] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533713396] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533713416] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533713447] index:28 irte_hi:0000000000000000
irte_lo:0000000000000000
[1566032533713464] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533713485] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533713499] index:28 irte_hi:000000010004a601
irte_lo:adb54bc000b98001
[1566032533718855] index:28 irte_hi:0000000000000000
irte_lo:0000000000000000
[1566032533718871] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533718882] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533718902] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533718932] index:28 irte_hi:0000000000000000
irte_lo:0000000000000000
[1566032533718949] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533718969] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533718984] index:28 irte_hi:000000010004a601
irte_lo:adb54bc000b98001
[1566032533719873] index:28 irte_hi:0000000000000000
irte_lo:0000000000000000
[1566032533719889] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533719900] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533719921] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533719954] index:28 irte_hi:0000000000000000
irte_lo:0000000000000000
[1566032533719971] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533719992] index:28 irte_hi:000000000004a601
irte_lo:00003fff00ac002d
[1566032533720007] index:28 irte_hi:000000010004a601
irte_lo:adb54bc000b98001

"[1566032533720007]" is timestamp in μs, so centos5u8 tiggers 30+ irte
modification within 10ms

Thanks,

    Ben

>> This patchset solved the issue by:
>> Patch 2 introduces update_irq_devid to only update dev_id of irqaction
>> Patch 3 make use of update_irq_devid and optimize irq operations in VFIO
>>
>> changes from v2:
>> - reformat to avoid quoted string split across lines and etc.
>>
>> changes from v1:
>> - add Patch 1 to enhance error recovery etc. in free irq per tglx's comments
>> - enhance error recovery code and debugging info in update_irq_devid
>> - use __must_check in external referencing of update_irq_devid
>> - use EXPORT_SYMBOL_GPL for update_irq_devid
>> - reformat code of patch 3 for better readability
>>
>> Ben Luo (3):
>> genirq: enhance error recovery code in free irq
>> genirq: introduce update_irq_devid()
>> vfio_pci: make use of update_irq_devid and optimize irq ops
>>
>> drivers/vfio/pci/vfio_pci_intrs.c | 101 +++++++++++++++++++++-------------
>> include/linux/interrupt.h | 3 ++
>> kernel/irq/manage.c | 110 +++++++++++++++++++++++++++++++++-----
>> 3 files changed, 164 insertions(+), 50 deletions(-)
>>

2019-08-20 15:23:33

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops

On Tue, 20 Aug 2019 12:03:50 +0800
luoben <[email protected]> wrote:

> 在 2019/8/20 上午4:51, Alex Williamson 写道:
> > On Thu, 15 Aug 2019 21:02:58 +0800
> > Ben Luo <[email protected]> wrote:
> >
> >> Currently, VFIO takes a lot of free-then-request-irq actions whenever
> >> a VM (with device passthru via VFIO) sets irq affinity or mask/unmask
> >> irq. Those actions only change the cookie data of irqaction or even
> >> change nothing. The free-then-request-irq not only adds more latency,
> >> but also increases the risk of losing interrupt, which may lead to a
> >> VM hung forever in waiting for IO completion
> > What guest environment is generating this? Typically I don't see that
> > Windows or Linux guests bounce the interrupt configuration much.
> > Thanks,
> >
> > Alex
>
> By tracing centos5u8 on host, I found it keep masking and unmasking
> interrupt like this:
>
> [1566032533709879] index:28 irte_hi:000000010004a601
> irte_lo:adb54bc000b98001
> [1566032533711242] index:28 irte_hi:0000000000000000
> irte_lo:0000000000000000
> [1566032533711258] index:28 irte_hi:000000000004a601
> irte_lo:00003fff00ac002d
> [1566032533711269] index:28 irte_hi:000000000004a601
> irte_lo:00003fff00ac002d
[snip]
> "[1566032533720007]" is timestamp in μs, so centos5u8 tiggers 30+ irte
> modification within 10ms

Ok, that matches my understanding that only very old guests behave in
this manner. It's a curious case to optimize as RHEL5 is in extended
life-cycle support, with regular maintenance releases ending 2+ years
ago. Thanks,

Alex

2019-08-20 15:29:13

by Liu, Jiang

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops



> On Aug 20, 2019, at 11:24 PM, luoben <[email protected]> wrote:
>
>
>
> 在 2019/8/16 上午12:45, Thomas Gleixner 写道:
>> On Thu, 15 Aug 2019, Ben Luo wrote:
>>
>>> if (vdev->ctx[vector].trigger) {
>>> - free_irq(irq, vdev->ctx[vector].trigger);
>>> - irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>> - kfree(vdev->ctx[vector].name);
>>> - eventfd_ctx_put(vdev->ctx[vector].trigger);
>>> - vdev->ctx[vector].trigger = NULL;
>>> + if (vdev->ctx[vector].trigger == trigger) {
>>> + irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>>
>> What's the point of unregistering the producer, just to register it again below?
> Whether producer token changed or not, irq_bypass_register_producer() is a way (seems the
>
> only way) to update IRTE by VFIO for posted interrupt.
>
> When posted interrupt is in use, the target IRTE will be used by IOMMU directly to find the
>
> target CPU for an interrupt posted to VM, since hypervisor is bypassed.
>
> irq_bypass_register_producer() will modify IRTE based on the information retrieved from KVM,
>
>
> 0xffffffff8150a920 : modify_irte+0x0/0x180 [kernel]
> 0xffffffff8150ab94 : intel_ir_set_vcpu_affinity+0xf4/0x150 [kernel]
> 0xffffffff81125f3c : irq_set_vcpu_affinity+0x5c/0xa0 [kernel]
> 0xffffffffa0550818 : vmx_update_pi_irte+0x178/0x290 [kvm_intel] // get pi_desc etc. from KVM
> 0xffffffffa052b789 : kvm_arch_irq_bypass_add_producer+0x39/0x50 [kvm_intel] (inexact)
> 0xffffffffa024a50b : __connect+0x7b/0xa0 [kvm] (inexact)
> 0xffffffffa024a6a8 : irq_bypass_register_producer+0x108/0x140 [kvm] (inexact)
> 0xffffffffa0338386 : vfio_msi_set_vector_signal+0x1b6/0x2c0 [vfio_pci] (inexact)
>>
>>> + } else if (trigger) {
>>> + ret = update_irq_devid(irq,
>>> + vdev->ctx[vector].trigger, trigger);
>>> + if (unlikely(ret)) {
>>> + dev_info(&pdev->dev,
>>> + "update_irq_devid %d (token %p) fails: %d\n",
>>>
>> s/fails/failed/
>>
>>
>>> + irq, vdev->ctx[vector].trigger, ret);
>>> + eventfd_ctx_put(trigger);
>>> + return ret;
>>> + }
>>>
>> This lacks any form of comment why this is correct. dev_id is updated and
>> the producer with the old token is still registered.
>>
> ok, I will add comment like below:
>
> For KVM-VFIO case, once producer and consumer connected successfully, interrupt from passthrough
>
> device will be directly delivered to VM instead of triggering interrupt process in HOST. If producer and
>
> consumer are disconnected, this interrupt process will fall back to remap mode, the handler vfio_msihandler()
>
> registered in corresponding irqaction will use the dev_id to find the right way to deliver the interrupt to VM.
>
> So, it is safe to update dev_id before unregistration of irq-bypass producer, i.e. switch back from posted
>
> mode to remap mode, since only in remap mode the 'dev_id' will be used by interrupt handler. To producer
>
> and consumer, dev_id is only a token for pairing them togather.
>>
>>> + irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>>
>> Now it's unregistered.
>>
>>
>>> + eventfd_ctx_put(vdev->ctx[vector].trigger);
>>> + vdev->ctx[vector].producer.token = trigger;
>>>
>> The token is updated and then it's newly registered below.
>>
>>
>>> + vdev->ctx[vector].trigger = trigger;
>>> - vdev->ctx[vector].producer.token = trigger;
>>> - vdev->ctx[vector].producer.irq = irq;
>>> + /* always update irte for posted mode */
>>> ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
>>> if (unlikely(ret))
>>> dev_info(&pdev->dev,
>>> "irq bypass producer (token %p) registration fails: %d\n",
>>> vdev->ctx[vector].producer.token, ret);
>>>
>> I know this code already existed, but again this looks inconsistent. If the
>> registration fails then a subsequent update will try to unregister a not
>> registered producer. Does not make any sense to me.
>>
> irq_bypass_register_producer() also fails on duplicated registration, so maybe an unconditional try to
>
> unregistration is a easy way to keep consistent.
>
> Maybe it's better to change these function names to irq_bypass_try_register_producer() and
>
> irq_bypass_try_unregister_producer() :)
>
>
Hi Ben,
The point is that we shouldn’t blindly try to register again if we fails to unregister a posted interrupt producer. By this way, the fast path (posted interrupt) may get disabled, but it’s safer than blindly ignoring the failure of ungistration.
Thanks,
Gerry
>
> Thanks,
>
> Ben

2019-08-20 15:53:42

by Ben Luo

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops


在 2019/8/20 下午11:22, Alex Williamson 写道:
> On Tue, 20 Aug 2019 12:03:50 +0800
> luoben <[email protected]> wrote:
>
>> 在 2019/8/20 上午4:51, Alex Williamson 写道:
>>> On Thu, 15 Aug 2019 21:02:58 +0800
>>> Ben Luo <[email protected]> wrote:
>>>
>>>> Currently, VFIO takes a lot of free-then-request-irq actions whenever
>>>> a VM (with device passthru via VFIO) sets irq affinity or mask/unmask
>>>> irq. Those actions only change the cookie data of irqaction or even
>>>> change nothing. The free-then-request-irq not only adds more latency,
>>>> but also increases the risk of losing interrupt, which may lead to a
>>>> VM hung forever in waiting for IO completion
>>> What guest environment is generating this? Typically I don't see that
>>> Windows or Linux guests bounce the interrupt configuration much.
>>> Thanks,
>>>
>>> Alex
>> By tracing centos5u8 on host, I found it keep masking and unmasking
>> interrupt like this:
>>
>> [1566032533709879] index:28 irte_hi:000000010004a601
>> irte_lo:adb54bc000b98001
>> [1566032533711242] index:28 irte_hi:0000000000000000
>> irte_lo:0000000000000000
>> [1566032533711258] index:28 irte_hi:000000000004a601
>> irte_lo:00003fff00ac002d
>> [1566032533711269] index:28 irte_hi:000000000004a601
>> irte_lo:00003fff00ac002d
> [snip]
>> "[1566032533720007]" is timestamp in μs, so centos5u8 tiggers 30+ irte
>> modification within 10ms
> Ok, that matches my understanding that only very old guests behave in
> this manner. It's a curious case to optimize as RHEL5 is in extended
> life-cycle support, with regular maintenance releases ending 2+ years
> ago. Thanks,
>
> Alex

But repeatedly set interrupt affinity in a new version guest can also
cause the problem.


Thanks,

    Ben

2019-08-22 15:22:14

by Ben Luo

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops


在 2019/8/20 下午11:27, Liu, Jiang 写道:
>
>> On Aug 20, 2019, at 11:24 PM, luoben <[email protected]> wrote:
>>
>>
>>
>> 在 2019/8/16 上午12:45, Thomas Gleixner 写道:
>>> On Thu, 15 Aug 2019, Ben Luo wrote:
>>>
>>>> if (vdev->ctx[vector].trigger) {
>>>> - free_irq(irq, vdev->ctx[vector].trigger);
>>>> - irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>>> - kfree(vdev->ctx[vector].name);
>>>> - eventfd_ctx_put(vdev->ctx[vector].trigger);
>>>> - vdev->ctx[vector].trigger = NULL;
>>>> + if (vdev->ctx[vector].trigger == trigger) {
>>>> + irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>>>
>>> What's the point of unregistering the producer, just to register it again below?
>> Whether producer token changed or not, irq_bypass_register_producer() is a way (seems the
>>
>> only way) to update IRTE by VFIO for posted interrupt.
>>
>> When posted interrupt is in use, the target IRTE will be used by IOMMU directly to find the
>>
>> target CPU for an interrupt posted to VM, since hypervisor is bypassed.
>>
>> irq_bypass_register_producer() will modify IRTE based on the information retrieved from KVM,
>>
>>
>> 0xffffffff8150a920 : modify_irte+0x0/0x180 [kernel]
>> 0xffffffff8150ab94 : intel_ir_set_vcpu_affinity+0xf4/0x150 [kernel]
>> 0xffffffff81125f3c : irq_set_vcpu_affinity+0x5c/0xa0 [kernel]
>> 0xffffffffa0550818 : vmx_update_pi_irte+0x178/0x290 [kvm_intel] // get pi_desc etc. from KVM
>> 0xffffffffa052b789 : kvm_arch_irq_bypass_add_producer+0x39/0x50 [kvm_intel] (inexact)
>> 0xffffffffa024a50b : __connect+0x7b/0xa0 [kvm] (inexact)
>> 0xffffffffa024a6a8 : irq_bypass_register_producer+0x108/0x140 [kvm] (inexact)
>> 0xffffffffa0338386 : vfio_msi_set_vector_signal+0x1b6/0x2c0 [vfio_pci] (inexact)
>>>> + } else if (trigger) {
>>>> + ret = update_irq_devid(irq,
>>>> + vdev->ctx[vector].trigger, trigger);
>>>> + if (unlikely(ret)) {
>>>> + dev_info(&pdev->dev,
>>>> + "update_irq_devid %d (token %p) fails: %d\n",
>>>>
>>> s/fails/failed/
>>>
>>>
>>>> + irq, vdev->ctx[vector].trigger, ret);
>>>> + eventfd_ctx_put(trigger);
>>>> + return ret;
>>>> + }
>>>>
>>> This lacks any form of comment why this is correct. dev_id is updated and
>>> the producer with the old token is still registered.
>>>
>> ok, I will add comment like below:
>>
>> For KVM-VFIO case, once producer and consumer connected successfully, interrupt from passthrough
>>
>> device will be directly delivered to VM instead of triggering interrupt process in HOST. If producer and
>>
>> consumer are disconnected, this interrupt process will fall back to remap mode, the handler vfio_msihandler()
>>
>> registered in corresponding irqaction will use the dev_id to find the right way to deliver the interrupt to VM.
>>
>> So, it is safe to update dev_id before unregistration of irq-bypass producer, i.e. switch back from posted
>>
>> mode to remap mode, since only in remap mode the 'dev_id' will be used by interrupt handler. To producer
>>
>> and consumer, dev_id is only a token for pairing them togather.
>>>> + irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>>>
>>> Now it's unregistered.
>>>
>>>
>>>> + eventfd_ctx_put(vdev->ctx[vector].trigger);
>>>> + vdev->ctx[vector].producer.token = trigger;
>>>>
>>> The token is updated and then it's newly registered below.
>>>
>>>
>>>> + vdev->ctx[vector].trigger = trigger;
>>>> - vdev->ctx[vector].producer.token = trigger;
>>>> - vdev->ctx[vector].producer.irq = irq;
>>>> + /* always update irte for posted mode */
>>>> ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
>>>> if (unlikely(ret))
>>>> dev_info(&pdev->dev,
>>>> "irq bypass producer (token %p) registration fails: %d\n",
>>>> vdev->ctx[vector].producer.token, ret);
>>>>
>>> I know this code already existed, but again this looks inconsistent. If the
>>> registration fails then a subsequent update will try to unregister a not
>>> registered producer. Does not make any sense to me.
>>>
>> irq_bypass_register_producer() also fails on duplicated registration, so maybe an unconditional try to
>>
>> unregistration is a easy way to keep consistent.
>>
>> Maybe it's better to change these function names to irq_bypass_try_register_producer() and
>>
>> irq_bypass_try_unregister_producer() :)
>>
>>
> Hi Ben,
> The point is that we shouldn’t blindly try to register again if we fails to unregister a posted interrupt producer. By this way, the fast path (posted interrupt) may get disabled, but it’s safer than blindly ignoring the failure of ungistration.
> Thanks,
> Gerry

This may need another patchset to enhance the irq_bypass
register/unregister functions first, at least to return some error code
when irq_bypass_try_unregister_producer() fails. I would like to have a
try later.

Thanks,

    Ben