2019-08-08 12:13:39

by Ben Luo

[permalink] [raw]
Subject: [PATCH 0/2] introduce update_irq_devid and optimize VFIO irq ops

Currently, VFIO takes a lot of free-then-request-irq actions whenever
a VM (with device passthru via VFIO) sets irq affinity or mask/unmask
irq. Those actions only change the cookie data of irqaction or even
change nothing. The free-then-request-irq not only add more latency,
but also increases the risk of losing interrupt, this may lead to a
VM hung forever in waiting IO completion

This patchset solved this issue by:
patch 1 introduces update_irq_devid to only update dev_id of irqaction
patch 2 make use of update_irq_devid and optimize irq operations in VFIO

Ben Luo (2):
genirq: introduce update_irq_devid()
vfio_pci: make use of update_irq_devid and optimize irq ops

drivers/vfio/pci/vfio_pci_intrs.c | 100 +++++++++++++++++++++++---------------
include/linux/interrupt.h | 3 ++
kernel/irq/manage.c | 74 ++++++++++++++++++++++++++++
3 files changed, 139 insertions(+), 38 deletions(-)

--
1.8.3.1


2019-08-12 08:53:14

by Ben Luo

[permalink] [raw]
Subject: [PATCH v2 0/3] introduce update_irq_devid and optimize VFIO irq ops

Currently, VFIO takes a lot of free-then-request-irq actions whenever
a VM (with device passthru via VFIO) sets irq affinity or mask/unmask
irq. Those actions only change the cookie data of irqaction or even
change nothing. The free-then-request-irq not only add more latency,
but also increases the risk of losing interrupt, this may lead to a
VM hung forever in waiting IO completion

This patchset solved this issue by:
Patch 2 introduces update_irq_devid to only update dev_id of irqaction
Patch 3 make use of update_irq_devid and optimize irq operations in VFIO

changes from v1:
- add Patch 1 to enhance error recovery etc. in free irq per tglx's comments
- enhance error recovery code and debugging info in update_irq_devid
- use __must_check in external referencing of update_irq_devid
- use EXPORT_SYMBOL_GPL for update_irq_devid
- reformat code of patch 3 for better readability

Ben Luo (3):
genirq: enhance error recovery code in free irq
genirq: introduce update_irq_devid()
vfio_pci: make use of update_irq_devid and optimize irq ops

drivers/vfio/pci/vfio_pci_intrs.c | 99 +++++++++++++++++++++-------------
include/linux/interrupt.h | 3 ++
kernel/irq/manage.c | 111 +++++++++++++++++++++++++++++++++-----
3 files changed, 163 insertions(+), 50 deletions(-)

--
1.8.3.1

2019-08-12 08:53:31

by Ben Luo

[permalink] [raw]
Subject: [PATCH v2 1/3] genirq: enhance error recovery code in free irq

Per Thomas Gleixner's comments:
1) free_irq/free_percpu_irq returns if called from IRQ context
2) move WARN out of the locked region and print out dev_id

Signed-off-by: Ben Luo <[email protected]>
---
kernel/irq/manage.c | 32 ++++++++++++++++++++------------
1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f17..b97ee5e 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1690,7 +1690,11 @@ static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id)
struct irqaction *action, **action_ptr;
unsigned long flags;

- WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
+ if (in_interrupt()) {
+ WARN(1, "Trying to free IRQ %d (dev_id %p) from IRQ context!\n",
+ irq, dev_id);
+ return NULL;
+ }

mutex_lock(&desc->request_mutex);
chip_bus_lock(desc);
@@ -1705,10 +1709,11 @@ static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id)
action = *action_ptr;

if (!action) {
- WARN(1, "Trying to free already-free IRQ %d\n", irq);
raw_spin_unlock_irqrestore(&desc->lock, flags);
chip_bus_sync_unlock(desc);
mutex_unlock(&desc->request_mutex);
+ WARN(1, "Trying to free already-free IRQ %d (dev_id %p)\n",
+ irq, dev_id);
return NULL;
}

@@ -2286,7 +2291,11 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
struct irqaction *action;
unsigned long flags;

- WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
+ if (in_interrupt()) {
+ WARN(1, "Trying to free IRQ %d (dev_id %p) from IRQ context!\n",
+ irq, dev_id);
+ return NULL;
+ }

if (!desc)
return NULL;
@@ -2295,14 +2304,17 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_

action = desc->action;
if (!action || action->percpu_dev_id != dev_id) {
- WARN(1, "Trying to free already-free IRQ %d\n", irq);
- goto bad;
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+ WARN(1, "Trying to free already-free IRQ (dev_id %p) %d\n",
+ irq, dev_id);
+ return NULL;
}

if (!cpumask_empty(desc->percpu_enabled)) {
- WARN(1, "percpu IRQ %d still enabled on CPU%d!\n",
- irq, cpumask_first(desc->percpu_enabled));
- goto bad;
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+ WARN(1, "percpu IRQ %d (dev_id %p) still enabled on CPU%d!\n",
+ irq, dev_id, cpumask_first(desc->percpu_enabled));
+ return NULL;
}

/* Found it - now remove it from the list of entries: */
@@ -2317,10 +2329,6 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
irq_chip_pm_put(&desc->irq_data);
module_put(desc->owner);
return action;
-
-bad:
- raw_spin_unlock_irqrestore(&desc->lock, flags);
- return NULL;
}

/**
--
1.8.3.1

2019-08-12 08:53:42

by Ben Luo

[permalink] [raw]
Subject: [PATCH v2 2/3] genirq: introduce update_irq_devid()

Sometimes, only the dev_id field of irqaction need to be changed.
E.g. KVM VM with device passthru via VFIO may switch irq injection
path between KVM irqfd and userspace eventfd. These two paths
share the same irq num and handler for the same vector of a device,
only with different dev_id referencing to different fds' contexts.

So, instead of free/request irq, only update dev_id of irqaction.
This narrows the gap for setting up new irq (and irte, if has)
and also gains some performance benefit.

Signed-off-by: Ben Luo <[email protected]>
Reviewed-by: Liu Jiang <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
---
include/linux/interrupt.h | 3 ++
kernel/irq/manage.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 82 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 5b8328a..6060c5a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -172,6 +172,9 @@ struct irqaction {
request_percpu_nmi(unsigned int irq, irq_handler_t handler,
const char *devname, void __percpu *dev);

+extern int __must_check
+update_irq_devid(unsigned int irq, void *dev_id, void *new_dev_id);
+
extern const void *free_irq(unsigned int, void *);
extern void free_percpu_irq(unsigned int, void __percpu *);

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index b97ee5e..4c34a23 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -2064,6 +2064,85 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
EXPORT_SYMBOL(request_threaded_irq);

/**
+ * update_irq_devid - update irq dev_id to a new one
+ * @irq: Interrupt line to update
+ * @dev_id: A cookie to find the irqaction to update
+ * @new_dev_id: New cookie passed to the handler function
+ *
+ * Sometimes, only the cookie data need to be changed.
+ * Instead of free/request irq, only update dev_id here
+ * Not only to gain some performance benefit, but also
+ * reduce the risk of losing interrupt.
+ *
+ * E.g. irq affinity setting in a VM with VFIO passthru
+ * device is carried out in a free-then-request-irq way
+ * since lack of this very function. The free_irq()
+ * eventually triggers deactivation of IR domain, which
+ * will cleanup IRTE. There is a gap before request_irq()
+ * finally setup the IRTE again. In this gap, a in-flight
+ * irq buffering in hardware layer may trigger DMAR fault
+ * and be lost.
+ *
+ * On failure, it returns a negative value. On success,
+ * it returns 0
+ */
+int update_irq_devid(unsigned int irq, void *dev_id, void *new_dev_id)
+{
+ struct irq_desc *desc = irq_to_desc(irq);
+ struct irqaction *action, **action_ptr;
+ unsigned long flags;
+
+ if (in_interrupt()) {
+ WARN(1, "Trying to update IRQ %d (dev_id %p to %p) \
+ from IRQ context!\n", irq, dev_id, new_dev_id);
+ return -EPERM;
+ }
+
+ if (!desc)
+ return -EINVAL;
+
+ chip_bus_lock(desc);
+ raw_spin_lock_irqsave(&desc->lock, flags);
+
+ /*
+ * There can be multiple actions per IRQ descriptor, find the right
+ * one based on the dev_id:
+ */
+ action_ptr = &desc->action;
+ for (;;) {
+ action = *action_ptr;
+
+ if (!action) {
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+ chip_bus_sync_unlock(desc);
+ WARN(1, "Trying to update already-free IRQ %d \
+ (dev_id %p to %p)\n",
+ irq, dev_id, new_dev_id);
+ return -ENXIO;
+ }
+
+ if (action->dev_id == dev_id) {
+ action->dev_id = new_dev_id;
+ break;
+ }
+ action_ptr = &action->next;
+ }
+
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+ chip_bus_sync_unlock(desc);
+
+ /*
+ * Make sure it's not being used on another CPU:
+ * There is a risk of UAF for old *dev_id, if it is
+ * freed in a short time after this func returns
+ */
+ synchronize_irq(irq);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(update_irq_devid);
+
+/**
* request_any_context_irq - allocate an interrupt line
* @irq: Interrupt line to allocate
* @handler: Function to be called when the IRQ occurs.
--
1.8.3.1

2019-08-12 08:55:38

by Ben Luo

[permalink] [raw]
Subject: [PATCH v2 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops

When userspace (e.g. qemu) triggers a switch between KVM
irqfd and userspace eventfd, only dev_id of irq action
(i.e. the "trigger" in this patch's context) will be
changed, but a free-then-request-irq action is taken in
current code. And, irq affinity setting in VM will also
trigger a free-then-request-irq action, which actully
changes nothing, but only fires a producer re-registration
to update irte in case that posted-interrupt is in use.

This patch makes use of update_irq_devid() and optimize
both cases above, which reduces the risk of losing interrupt
and also cuts some overhead.

Signed-off-by: Ben Luo <[email protected]>
Reviewed-by: Liu Jiang <[email protected]>
Reviewed-by: Zou Nanhai <[email protected]>
Reviewed-by: Yunsheng Lin <[email protected]>
---
drivers/vfio/pci/vfio_pci_intrs.c | 99 ++++++++++++++++++++++++---------------
1 file changed, 61 insertions(+), 38 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 3fa3f72..541153d 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -284,70 +284,93 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
int vector, int fd, bool msix)
{
+ struct eventfd_ctx *trigger = NULL;
struct pci_dev *pdev = vdev->pdev;
- struct eventfd_ctx *trigger;
int irq, ret;

if (vector < 0 || vector >= vdev->num_ctx)
return -EINVAL;

+ if (fd >= 0) {
+ trigger = eventfd_ctx_fdget(fd);
+ if (IS_ERR(trigger))
+ return PTR_ERR(trigger);
+ }
+
irq = pci_irq_vector(pdev, vector);

if (vdev->ctx[vector].trigger) {
- free_irq(irq, vdev->ctx[vector].trigger);
- irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
- kfree(vdev->ctx[vector].name);
- eventfd_ctx_put(vdev->ctx[vector].trigger);
- vdev->ctx[vector].trigger = NULL;
+ if (vdev->ctx[vector].trigger == trigger) {
+ irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+ } else if (trigger) {
+ ret = update_irq_devid(irq,
+ vdev->ctx[vector].trigger, trigger);
+ if (unlikely(ret)) {
+ dev_info(&pdev->dev,
+ "update_irq_devid %d (token %p) fails: %d\n",
+ irq, vdev->ctx[vector].trigger, ret);
+ eventfd_ctx_put(trigger);
+ return ret;
+ }
+ irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+ eventfd_ctx_put(vdev->ctx[vector].trigger);
+ vdev->ctx[vector].producer.token = trigger;
+ vdev->ctx[vector].trigger = trigger;
+ } else {
+ free_irq(irq, vdev->ctx[vector].trigger);
+ irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+ kfree(vdev->ctx[vector].name);
+ eventfd_ctx_put(vdev->ctx[vector].trigger);
+ vdev->ctx[vector].trigger = NULL;
+ }
}

if (fd < 0)
return 0;

- vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
- msix ? "x" : "", vector,
- pci_name(pdev));
- if (!vdev->ctx[vector].name)
- return -ENOMEM;
+ if (!vdev->ctx[vector].trigger) {
+ vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
+ msix ? "x" : "", vector,
+ pci_name(pdev));
+ if (!vdev->ctx[vector].name) {
+ eventfd_ctx_put(trigger);
+ return -ENOMEM;
+ }

- trigger = eventfd_ctx_fdget(fd);
- if (IS_ERR(trigger)) {
- kfree(vdev->ctx[vector].name);
- return PTR_ERR(trigger);
- }
+ /*
+ * The MSIx vector table resides in device memory which may be cleared
+ * via backdoor resets. We don't allow direct access to the vector
+ * table so even if a userspace driver attempts to save/restore around
+ * such a reset it would be unsuccessful. To avoid this, restore the
+ * cached value of the message prior to enabling.
+ */
+ if (msix) {
+ struct msi_msg msg;

- /*
- * The MSIx vector table resides in device memory which may be cleared
- * via backdoor resets. We don't allow direct access to the vector
- * table so even if a userspace driver attempts to save/restore around
- * such a reset it would be unsuccessful. To avoid this, restore the
- * cached value of the message prior to enabling.
- */
- if (msix) {
- struct msi_msg msg;
+ get_cached_msi_msg(irq, &msg);
+ pci_write_msi_msg(irq, &msg);
+ }

- get_cached_msi_msg(irq, &msg);
- pci_write_msi_msg(irq, &msg);
- }
+ ret = request_irq(irq, vfio_msihandler, 0,
+ vdev->ctx[vector].name, trigger);
+ if (ret) {
+ kfree(vdev->ctx[vector].name);
+ eventfd_ctx_put(trigger);
+ return ret;
+ }

- ret = request_irq(irq, vfio_msihandler, 0,
- vdev->ctx[vector].name, trigger);
- if (ret) {
- kfree(vdev->ctx[vector].name);
- eventfd_ctx_put(trigger);
- return ret;
+ vdev->ctx[vector].producer.token = trigger;
+ vdev->ctx[vector].producer.irq = irq;
+ vdev->ctx[vector].trigger = trigger;
}

- vdev->ctx[vector].producer.token = trigger;
- vdev->ctx[vector].producer.irq = irq;
+ /* always update irte for posted mode */
ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
if (unlikely(ret))
dev_info(&pdev->dev,
"irq bypass producer (token %p) registration fails: %d\n",
vdev->ctx[vector].producer.token, ret);

- vdev->ctx[vector].trigger = trigger;
-
return 0;
}

--
1.8.3.1