This series hardens interrupt code relative to eventfd registration
across several vfio bus drivers, ensuring that NULL eventfds cannot
be triggered by users. Several other more minor issues were discovered
and fixed along the way.
Thanks to Reinette for identifying this latent vulnerability. Thanks,
Alex
Alex Williamson (7):
vfio/pci: Disable auto-enable of exclusive INTx IRQ
vfio/pci: Lock external INTx masking ops
vfio: Introduce interface to flush virqfd inject workqueue
vfio/pci: Create persistent INTx handler
vfio/platform: Disable virqfds on cleanup
vfio/platform: Create persistent IRQ handlers
vfio/fsl-mc: Block calling interrupt handler without trigger
drivers/vfio/fsl-mc/vfio_fsl_mc_intr.c | 7 +-
drivers/vfio/pci/vfio_pci_intrs.c | 176 +++++++++++++---------
drivers/vfio/platform/vfio_platform_irq.c | 109 ++++++++++----
drivers/vfio/virqfd.c | 21 +++
include/linux/vfio.h | 2 +
5 files changed, 209 insertions(+), 106 deletions(-)
--
2.43.2
Mask operations through config space changes to DisINTx may race INTx
configuration changes via ioctl. Create wrappers that add locking for
paths outside of the core interrupt code.
In particular, irq_type is updated holding igate, therefore testing
is_intx() requires holding igate. For example clearing DisINTx from
config space can otherwise race changes of the interrupt configuration.
This aligns interfaces which may trigger the INTx eventfd into two
camps, one side serialized by igate and the other only enabled while
INTx is configured. A subsequent patch introduces synchronization for
the latter flows.
Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
Reported-by: Reinette Chatre <[email protected]>
Signed-off-by: Alex Williamson <[email protected]>
---
drivers/vfio/pci/vfio_pci_intrs.c | 34 +++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 136101179fcb..75c85eec21b3 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -99,13 +99,15 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
}
/* Returns true if the INTx vfio_pci_irq_ctx.masked value is changed. */
-bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
+static bool __vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
{
struct pci_dev *pdev = vdev->pdev;
struct vfio_pci_irq_ctx *ctx;
unsigned long flags;
bool masked_changed = false;
+ lockdep_assert_held(&vdev->igate);
+
spin_lock_irqsave(&vdev->irqlock, flags);
/*
@@ -143,6 +145,17 @@ bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
return masked_changed;
}
+bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
+{
+ bool mask_changed;
+
+ mutex_lock(&vdev->igate);
+ mask_changed = __vfio_pci_intx_mask(vdev);
+ mutex_unlock(&vdev->igate);
+
+ return mask_changed;
+}
+
/*
* If this is triggered by an eventfd, we can't call eventfd_signal
* or else we'll deadlock on the eventfd wait queue. Return >0 when
@@ -194,12 +207,21 @@ static int vfio_pci_intx_unmask_handler(void *opaque, void *unused)
return ret;
}
-void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev)
+static void __vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev)
{
+ lockdep_assert_held(&vdev->igate);
+
if (vfio_pci_intx_unmask_handler(vdev, NULL) > 0)
vfio_send_intx_eventfd(vdev, NULL);
}
+void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev)
+{
+ mutex_lock(&vdev->igate);
+ __vfio_pci_intx_unmask(vdev);
+ mutex_unlock(&vdev->igate);
+}
+
static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
{
struct vfio_pci_core_device *vdev = dev_id;
@@ -563,11 +585,11 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_core_device *vdev,
return -EINVAL;
if (flags & VFIO_IRQ_SET_DATA_NONE) {
- vfio_pci_intx_unmask(vdev);
+ __vfio_pci_intx_unmask(vdev);
} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
uint8_t unmask = *(uint8_t *)data;
if (unmask)
- vfio_pci_intx_unmask(vdev);
+ __vfio_pci_intx_unmask(vdev);
} else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
struct vfio_pci_irq_ctx *ctx = vfio_irq_ctx_get(vdev, 0);
int32_t fd = *(int32_t *)data;
@@ -594,11 +616,11 @@ static int vfio_pci_set_intx_mask(struct vfio_pci_core_device *vdev,
return -EINVAL;
if (flags & VFIO_IRQ_SET_DATA_NONE) {
- vfio_pci_intx_mask(vdev);
+ __vfio_pci_intx_mask(vdev);
} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
uint8_t mask = *(uint8_t *)data;
if (mask)
- vfio_pci_intx_mask(vdev);
+ __vfio_pci_intx_mask(vdev);
} else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
return -ENOTTY; /* XXX implement me */
}
--
2.43.2
A vulnerability exists where the eventfd for INTx signaling can be
deconfigured, which unregisters the IRQ handler but still allows
eventfds to be signaled with a NULL context through the SET_IRQS ioctl
or through unmask irqfd if the device interrupt is pending.
Ideally this could be solved with some additional locking; the igate
mutex serializes the ioctl and config space accesses, and the interrupt
handler is unregistered relative to the trigger, but the irqfd path
runs asynchronous to those. The igate mutex cannot be acquired from the
atomic context of the eventfd wake function. Disabling the irqfd
relative to the eventfd registration is potentially incompatible with
existing userspace.
As a result, the solution implemented here moves configuration of the
INTx interrupt handler to track the lifetime of the INTx context object
and irq_type configuration, rather than registration of a particular
trigger eventfd. Synchronization is added between the ioctl path and
eventfd_signal() wrapper such that the eventfd trigger can be
dynamically updated relative to in-flight interrupts or irqfd callbacks.
Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
Reported-by: Reinette Chatre <[email protected]>
Signed-off-by: Alex Williamson <[email protected]>
---
drivers/vfio/pci/vfio_pci_intrs.c | 145 ++++++++++++++++--------------
1 file changed, 78 insertions(+), 67 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 75c85eec21b3..fb5392b749ff 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -90,11 +90,15 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
if (likely(is_intx(vdev) && !vdev->virq_disabled)) {
struct vfio_pci_irq_ctx *ctx;
+ struct eventfd_ctx *trigger;
ctx = vfio_irq_ctx_get(vdev, 0);
if (WARN_ON_ONCE(!ctx))
return;
- eventfd_signal(ctx->trigger);
+
+ trigger = READ_ONCE(ctx->trigger);
+ if (likely(trigger))
+ eventfd_signal(trigger);
}
}
@@ -253,100 +257,100 @@ static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
return ret;
}
-static int vfio_intx_enable(struct vfio_pci_core_device *vdev)
+static int vfio_intx_enable(struct vfio_pci_core_device *vdev,
+ struct eventfd_ctx *trigger)
{
+ struct pci_dev *pdev = vdev->pdev;
struct vfio_pci_irq_ctx *ctx;
+ unsigned long irqflags;
+ char *name;
+ int ret;
if (!is_irq_none(vdev))
return -EINVAL;
- if (!vdev->pdev->irq)
+ if (!pdev->irq)
return -ENODEV;
+ name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-intx(%s)", pci_name(pdev));
+ if (!name)
+ return -ENOMEM;
+
ctx = vfio_irq_ctx_alloc(vdev, 0);
if (!ctx)
return -ENOMEM;
+ ctx->name = name;
+ ctx->trigger = trigger;
+
/*
- * If the virtual interrupt is masked, restore it. Devices
- * supporting DisINTx can be masked at the hardware level
- * here, non-PCI-2.3 devices will have to wait until the
- * interrupt is enabled.
+ * Fill the initial masked state based on virq_disabled. After
+ * enable, changing the DisINTx bit in vconfig directly changes INTx
+ * masking. igate prevents races during setup, once running masked
+ * is protected via irqlock.
+ *
+ * Devices supporting DisINTx also reflect the current mask state in
+ * the physical DisINTx bit, which is not affected during IRQ setup.
+ *
+ * Devices without DisINTx support require an exclusive interrupt.
+ * IRQ masking is performed at the IRQ chip. Again, igate protects
+ * against races during setup and IRQ handlers and irqfds are not
+ * yet active, therefore masked is stable and can be used to
+ * conditionally auto-enable the IRQ.
+ *
+ * irq_type must be stable while the IRQ handler is registered,
+ * therefore it must be set before request_irq().
*/
ctx->masked = vdev->virq_disabled;
- if (vdev->pci_2_3)
- pci_intx(vdev->pdev, !ctx->masked);
+ if (vdev->pci_2_3) {
+ pci_intx(pdev, !ctx->masked);
+ irqflags = IRQF_SHARED;
+ } else {
+ irqflags = ctx->masked ? IRQF_NO_AUTOEN : 0;
+ }
vdev->irq_type = VFIO_PCI_INTX_IRQ_INDEX;
+ ret = request_irq(pdev->irq, vfio_intx_handler,
+ irqflags, ctx->name, vdev);
+ if (ret) {
+ vdev->irq_type = VFIO_PCI_NUM_IRQS;
+ kfree(name);
+ vfio_irq_ctx_free(vdev, ctx, 0);
+ return ret;
+ }
+
return 0;
}
-static int vfio_intx_set_signal(struct vfio_pci_core_device *vdev, int fd)
+static int vfio_intx_set_signal(struct vfio_pci_core_device *vdev,
+ struct eventfd_ctx *trigger)
{
struct pci_dev *pdev = vdev->pdev;
- unsigned long irqflags = IRQF_SHARED;
struct vfio_pci_irq_ctx *ctx;
- struct eventfd_ctx *trigger;
- unsigned long flags;
- int ret;
+ struct eventfd_ctx *old;
ctx = vfio_irq_ctx_get(vdev, 0);
if (WARN_ON_ONCE(!ctx))
return -EINVAL;
- if (ctx->trigger) {
- free_irq(pdev->irq, vdev);
- kfree(ctx->name);
- eventfd_ctx_put(ctx->trigger);
- ctx->trigger = NULL;
- }
-
- if (fd < 0) /* Disable only */
- return 0;
-
- ctx->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-intx(%s)",
- pci_name(pdev));
- if (!ctx->name)
- return -ENOMEM;
-
- trigger = eventfd_ctx_fdget(fd);
- if (IS_ERR(trigger)) {
- kfree(ctx->name);
- return PTR_ERR(trigger);
- }
+ old = ctx->trigger;
- ctx->trigger = trigger;
+ WRITE_ONCE(ctx->trigger, trigger);
- /*
- * Devices without DisINTx support require an exclusive interrupt,
- * IRQ masking is performed at the IRQ chip. The masked status is
- * protected by vdev->irqlock. Setup the IRQ without auto-enable and
- * unmask as necessary below under lock. DisINTx is unmodified by
- * the IRQ configuration and may therefore use auto-enable.
- */
- if (!vdev->pci_2_3)
- irqflags = IRQF_NO_AUTOEN;
-
- ret = request_irq(pdev->irq, vfio_intx_handler,
- irqflags, ctx->name, vdev);
- if (ret) {
- ctx->trigger = NULL;
- kfree(ctx->name);
- eventfd_ctx_put(trigger);
- return ret;
+ /* Releasing an old ctx requires synchronizing in-flight users */
+ if (old) {
+ synchronize_irq(pdev->irq);
+ vfio_virqfd_flush_thread(&ctx->unmask);
+ eventfd_ctx_put(old);
}
- spin_lock_irqsave(&vdev->irqlock, flags);
- if (!vdev->pci_2_3 && !ctx->masked)
- enable_irq(pdev->irq);
- spin_unlock_irqrestore(&vdev->irqlock, flags);
-
return 0;
}
static void vfio_intx_disable(struct vfio_pci_core_device *vdev)
{
+ struct pci_dev *pdev = vdev->pdev;
struct vfio_pci_irq_ctx *ctx;
ctx = vfio_irq_ctx_get(vdev, 0);
@@ -354,10 +358,13 @@ static void vfio_intx_disable(struct vfio_pci_core_device *vdev)
if (ctx) {
vfio_virqfd_disable(&ctx->unmask);
vfio_virqfd_disable(&ctx->mask);
+ free_irq(pdev->irq, vdev);
+ if (ctx->trigger)
+ eventfd_ctx_put(ctx->trigger);
+ kfree(ctx->name);
+ vfio_irq_ctx_free(vdev, ctx, 0);
}
- vfio_intx_set_signal(vdev, -1);
vdev->irq_type = VFIO_PCI_NUM_IRQS;
- vfio_irq_ctx_free(vdev, ctx, 0);
}
/*
@@ -641,19 +648,23 @@ static int vfio_pci_set_intx_trigger(struct vfio_pci_core_device *vdev,
return -EINVAL;
if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+ struct eventfd_ctx *trigger = NULL;
int32_t fd = *(int32_t *)data;
int ret;
- if (is_intx(vdev))
- return vfio_intx_set_signal(vdev, fd);
+ if (fd >= 0) {
+ trigger = eventfd_ctx_fdget(fd);
+ if (IS_ERR(trigger))
+ return PTR_ERR(trigger);
+ }
- ret = vfio_intx_enable(vdev);
- if (ret)
- return ret;
+ if (is_intx(vdev))
+ ret = vfio_intx_set_signal(vdev, trigger);
+ else
+ ret = vfio_intx_enable(vdev, trigger);
- ret = vfio_intx_set_signal(vdev, fd);
- if (ret)
- vfio_intx_disable(vdev);
+ if (ret && trigger)
+ eventfd_ctx_put(trigger);
return ret;
}
--
2.43.2
In order to synchronize changes that can affect the thread callback,
introduce an interface to force a flush of the inject workqueue. The
irqfd pointer is only valid under spinlock, but the workqueue cannot
be flushed under spinlock. Therefore the flush work for the irqfd is
queued under spinlock. The vfio_irqfd_cleanup_wq workqueue is re-used
for queuing this work such that flushing the workqueue is also ordered
relative to shutdown.
Signed-off-by: Alex Williamson <[email protected]>
---
drivers/vfio/virqfd.c | 21 +++++++++++++++++++++
include/linux/vfio.h | 2 ++
2 files changed, 23 insertions(+)
diff --git a/drivers/vfio/virqfd.c b/drivers/vfio/virqfd.c
index 29c564b7a6e1..532269133801 100644
--- a/drivers/vfio/virqfd.c
+++ b/drivers/vfio/virqfd.c
@@ -101,6 +101,13 @@ static void virqfd_inject(struct work_struct *work)
virqfd->thread(virqfd->opaque, virqfd->data);
}
+static void virqfd_flush_inject(struct work_struct *work)
+{
+ struct virqfd *virqfd = container_of(work, struct virqfd, flush_inject);
+
+ flush_work(&virqfd->inject);
+}
+
int vfio_virqfd_enable(void *opaque,
int (*handler)(void *, void *),
void (*thread)(void *, void *),
@@ -124,6 +131,7 @@ int vfio_virqfd_enable(void *opaque,
INIT_WORK(&virqfd->shutdown, virqfd_shutdown);
INIT_WORK(&virqfd->inject, virqfd_inject);
+ INIT_WORK(&virqfd->flush_inject, virqfd_flush_inject);
irqfd = fdget(fd);
if (!irqfd.file) {
@@ -213,3 +221,16 @@ void vfio_virqfd_disable(struct virqfd **pvirqfd)
flush_workqueue(vfio_irqfd_cleanup_wq);
}
EXPORT_SYMBOL_GPL(vfio_virqfd_disable);
+
+void vfio_virqfd_flush_thread(struct virqfd **pvirqfd)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&virqfd_lock, flags);
+ if (*pvirqfd && (*pvirqfd)->thread)
+ queue_work(vfio_irqfd_cleanup_wq, &(*pvirqfd)->flush_inject);
+ spin_unlock_irqrestore(&virqfd_lock, flags);
+
+ flush_workqueue(vfio_irqfd_cleanup_wq);
+}
+EXPORT_SYMBOL_GPL(vfio_virqfd_flush_thread);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 89b265bc6ec3..8b1a29820409 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -356,6 +356,7 @@ struct virqfd {
wait_queue_entry_t wait;
poll_table pt;
struct work_struct shutdown;
+ struct work_struct flush_inject;
struct virqfd **pvirqfd;
};
@@ -363,5 +364,6 @@ int vfio_virqfd_enable(void *opaque, int (*handler)(void *, void *),
void (*thread)(void *, void *), void *data,
struct virqfd **pvirqfd, int fd);
void vfio_virqfd_disable(struct virqfd **pvirqfd);
+void vfio_virqfd_flush_thread(struct virqfd **pvirqfd);
#endif /* VFIO_H */
--
2.43.2
irqfds for mask and unmask that are not specifically disabled by the
user are leaked. Remove any irqfds during cleanup
Cc: Eric Auger <[email protected]>
Fixes: a7fa7c77cf15 ("vfio/platform: implement IRQ masking/unmasking via an eventfd")
Signed-off-by: Alex Williamson <[email protected]>
---
drivers/vfio/platform/vfio_platform_irq.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c
index 61a1bfb68ac7..a311d70c695b 100644
--- a/drivers/vfio/platform/vfio_platform_irq.c
+++ b/drivers/vfio/platform/vfio_platform_irq.c
@@ -321,8 +321,13 @@ void vfio_platform_irq_cleanup(struct vfio_platform_device *vdev)
{
int i;
- for (i = 0; i < vdev->num_irqs; i++)
+ for (i = 0; i < vdev->num_irqs; i++) {
+ int hwirq = vdev->get_irq(vdev, i);
+
+ vfio_virqfd_disable(&vdev->irqs[i].mask);
+ vfio_virqfd_disable(&vdev->irqs[i].unmask);
vfio_set_trigger(vdev, i, -1, NULL);
+ }
vdev->num_irqs = 0;
kfree(vdev->irqs);
--
2.43.2
The vfio-platform SET_IRQS ioctl currently allows loopback triggering of
an interrupt before a signaling eventfd has been configured by the user,
which thereby allows a NULL pointer dereference.
Rather than register the IRQ relative to a valid trigger, register all
IRQs in a disabled state in the device open path. This allows mask
operations on the IRQ to nest within the overall enable state governed
by a valid eventfd signal. This decouples @masked, protected by the
@locked spinlock from @trigger, protected via the @igate mutex.
In doing so, it's guaranteed that changes to @trigger cannot race the
IRQ handlers because the IRQ handler is synchronously disabled before
modifying the trigger, and loopback triggering of the IRQ via ioctl is
safe due to serialization with trigger changes via igate.
For compatibility, request_irq() failures are maintained to be local to
the SET_IRQS ioctl rather than a fatal error in the open device path.
This allows, for example, a userspace driver with polling mode support
to continue to work regardless of moving the request_irq() call site.
This necessarily blocks all SET_IRQS access to the failed index.
Cc: Eric Auger <[email protected]>
Fixes: 57f972e2b341 ("vfio/platform: trigger an interrupt via eventfd")
Signed-off-by: Alex Williamson <[email protected]>
---
drivers/vfio/platform/vfio_platform_irq.c | 102 +++++++++++++++-------
1 file changed, 70 insertions(+), 32 deletions(-)
diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c
index a311d70c695b..e72f5b10c266 100644
--- a/drivers/vfio/platform/vfio_platform_irq.c
+++ b/drivers/vfio/platform/vfio_platform_irq.c
@@ -136,6 +136,16 @@ static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev,
return 0;
}
+/*
+ * The trigger eventfd is guaranteed valid in the interrupt path
+ * and protected by the igate mutex when triggered via ioctl.
+ */
+static void vfio_send_eventfd(struct vfio_platform_irq *irq_ctx)
+{
+ if (likely(irq_ctx->trigger))
+ eventfd_signal(irq_ctx->trigger);
+}
+
static irqreturn_t vfio_automasked_irq_handler(int irq, void *dev_id)
{
struct vfio_platform_irq *irq_ctx = dev_id;
@@ -155,7 +165,7 @@ static irqreturn_t vfio_automasked_irq_handler(int irq, void *dev_id)
spin_unlock_irqrestore(&irq_ctx->lock, flags);
if (ret == IRQ_HANDLED)
- eventfd_signal(irq_ctx->trigger);
+ vfio_send_eventfd(irq_ctx);
return ret;
}
@@ -164,52 +174,40 @@ static irqreturn_t vfio_irq_handler(int irq, void *dev_id)
{
struct vfio_platform_irq *irq_ctx = dev_id;
- eventfd_signal(irq_ctx->trigger);
+ vfio_send_eventfd(irq_ctx);
return IRQ_HANDLED;
}
static int vfio_set_trigger(struct vfio_platform_device *vdev, int index,
- int fd, irq_handler_t handler)
+ int fd)
{
struct vfio_platform_irq *irq = &vdev->irqs[index];
struct eventfd_ctx *trigger;
- int ret;
if (irq->trigger) {
- irq_clear_status_flags(irq->hwirq, IRQ_NOAUTOEN);
- free_irq(irq->hwirq, irq);
- kfree(irq->name);
+ disable_irq(irq->hwirq);
eventfd_ctx_put(irq->trigger);
irq->trigger = NULL;
}
if (fd < 0) /* Disable only */
return 0;
- irq->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-irq[%d](%s)",
- irq->hwirq, vdev->name);
- if (!irq->name)
- return -ENOMEM;
trigger = eventfd_ctx_fdget(fd);
- if (IS_ERR(trigger)) {
- kfree(irq->name);
+ if (IS_ERR(trigger))
return PTR_ERR(trigger);
- }
irq->trigger = trigger;
- irq_set_status_flags(irq->hwirq, IRQ_NOAUTOEN);
- ret = request_irq(irq->hwirq, handler, 0, irq->name, irq);
- if (ret) {
- kfree(irq->name);
- eventfd_ctx_put(trigger);
- irq->trigger = NULL;
- return ret;
- }
-
- if (!irq->masked)
- enable_irq(irq->hwirq);
+ /*
+ * irq->masked effectively provides nested disables within the overall
+ * enable relative to trigger. Specifically request_irq() is called
+ * with NO_AUTOEN, therefore the IRQ is initially disabled. The user
+ * may only further disable the IRQ with a MASK operations because
+ * irq->masked is initially false.
+ */
+ enable_irq(irq->hwirq);
return 0;
}
@@ -228,7 +226,7 @@ static int vfio_platform_set_irq_trigger(struct vfio_platform_device *vdev,
handler = vfio_irq_handler;
if (!count && (flags & VFIO_IRQ_SET_DATA_NONE))
- return vfio_set_trigger(vdev, index, -1, handler);
+ return vfio_set_trigger(vdev, index, -1);
if (start != 0 || count != 1)
return -EINVAL;
@@ -236,7 +234,7 @@ static int vfio_platform_set_irq_trigger(struct vfio_platform_device *vdev,
if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
int32_t fd = *(int32_t *)data;
- return vfio_set_trigger(vdev, index, fd, handler);
+ return vfio_set_trigger(vdev, index, fd);
}
if (flags & VFIO_IRQ_SET_DATA_NONE) {
@@ -260,6 +258,14 @@ int vfio_platform_set_irqs_ioctl(struct vfio_platform_device *vdev,
unsigned start, unsigned count, uint32_t flags,
void *data) = NULL;
+ /*
+ * For compatibility, errors from request_irq() are local to the
+ * SET_IRQS path and reflected in the name pointer. This allows,
+ * for example, polling mode fallback for an exclusive IRQ failure.
+ */
+ if (IS_ERR(vdev->irqs[index].name))
+ return PTR_ERR(vdev->irqs[index].name);
+
switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
case VFIO_IRQ_SET_ACTION_MASK:
func = vfio_platform_set_irq_mask;
@@ -280,7 +286,7 @@ int vfio_platform_set_irqs_ioctl(struct vfio_platform_device *vdev,
int vfio_platform_irq_init(struct vfio_platform_device *vdev)
{
- int cnt = 0, i;
+ int cnt = 0, i, ret = 0;
while (vdev->get_irq(vdev, cnt) >= 0)
cnt++;
@@ -292,29 +298,56 @@ int vfio_platform_irq_init(struct vfio_platform_device *vdev)
for (i = 0; i < cnt; i++) {
int hwirq = vdev->get_irq(vdev, i);
+ irq_handler_t handler = vfio_irq_handler;
- if (hwirq < 0)
+ if (hwirq < 0) {
+ ret = -EINVAL;
goto err;
+ }
spin_lock_init(&vdev->irqs[i].lock);
vdev->irqs[i].flags = VFIO_IRQ_INFO_EVENTFD;
- if (irq_get_trigger_type(hwirq) & IRQ_TYPE_LEVEL_MASK)
+ if (irq_get_trigger_type(hwirq) & IRQ_TYPE_LEVEL_MASK) {
vdev->irqs[i].flags |= VFIO_IRQ_INFO_MASKABLE
| VFIO_IRQ_INFO_AUTOMASKED;
+ handler = vfio_automasked_irq_handler;
+ }
vdev->irqs[i].count = 1;
vdev->irqs[i].hwirq = hwirq;
vdev->irqs[i].masked = false;
+ vdev->irqs[i].name = kasprintf(GFP_KERNEL_ACCOUNT,
+ "vfio-irq[%d](%s)", hwirq,
+ vdev->name);
+ if (!vdev->irqs[i].name) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ ret = request_irq(hwirq, handler, IRQF_NO_AUTOEN,
+ vdev->irqs[i].name, &vdev->irqs[i]);
+ if (ret) {
+ kfree(vdev->irqs[i].name);
+ vdev->irqs[i].name = ERR_PTR(ret);
+ }
}
vdev->num_irqs = cnt;
return 0;
err:
+ for (--i; i >= 0; i--) {
+ int hwirq = vdev->get_irq(vdev, i);
+
+ if (!IS_ERR(vdev->irqs[i].name)) {
+ free_irq(hwirq, &vdev->irqs[i]);
+ kfree(vdev->irqs[i].name);
+ }
+ }
kfree(vdev->irqs);
- return -EINVAL;
+ return ret;
}
void vfio_platform_irq_cleanup(struct vfio_platform_device *vdev)
@@ -326,7 +359,12 @@ void vfio_platform_irq_cleanup(struct vfio_platform_device *vdev)
vfio_virqfd_disable(&vdev->irqs[i].mask);
vfio_virqfd_disable(&vdev->irqs[i].unmask);
- vfio_set_trigger(vdev, i, -1, NULL);
+ if (!IS_ERR(vdev->irqs[i].name)) {
+ free_irq(hwirq, &vdev->irqs[i]);
+ if (vdev->irqs[i].trigger)
+ eventfd_ctx_put(vdev->irqs[i].trigger);
+ kfree(vdev->irqs[i].name);
+ }
}
vdev->num_irqs = 0;
--
2.43.2
Currently for devices requiring masking at the irqchip for INTx, ie.
devices without DisINTx support, the IRQ is enabled in request_irq()
and subsequently disabled as necessary to align with the masked status
flag. This presents a window where the interrupt could fire between
these events, resulting in the IRQ incrementing the disable depth twice.
This would be unrecoverable for a user since the masked flag prevents
nested enables through vfio.
Instead, invert the logic using IRQF_NO_AUTOEN such that exclusive INTx
is never auto-enabled, then unmask as required.
Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
Signed-off-by: Alex Williamson <[email protected]>
---
drivers/vfio/pci/vfio_pci_intrs.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 237beac83809..136101179fcb 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -296,8 +296,15 @@ static int vfio_intx_set_signal(struct vfio_pci_core_device *vdev, int fd)
ctx->trigger = trigger;
+ /*
+ * Devices without DisINTx support require an exclusive interrupt,
+ * IRQ masking is performed at the IRQ chip. The masked status is
+ * protected by vdev->irqlock. Setup the IRQ without auto-enable and
+ * unmask as necessary below under lock. DisINTx is unmodified by
+ * the IRQ configuration and may therefore use auto-enable.
+ */
if (!vdev->pci_2_3)
- irqflags = 0;
+ irqflags = IRQF_NO_AUTOEN;
ret = request_irq(pdev->irq, vfio_intx_handler,
irqflags, ctx->name, vdev);
@@ -308,13 +315,9 @@ static int vfio_intx_set_signal(struct vfio_pci_core_device *vdev, int fd)
return ret;
}
- /*
- * INTx disable will stick across the new irq setup,
- * disable_irq won't.
- */
spin_lock_irqsave(&vdev->irqlock, flags);
- if (!vdev->pci_2_3 && ctx->masked)
- disable_irq_nosync(pdev->irq);
+ if (!vdev->pci_2_3 && !ctx->masked)
+ enable_irq(pdev->irq);
spin_unlock_irqrestore(&vdev->irqlock, flags);
return 0;
--
2.43.2
The eventfd_ctx trigger pointer of the vfio_fsl_mc_irq object is
initially NULL and may become NULL if the user sets the trigger
eventfd to -1. The interrupt handler itself is guaranteed that
trigger is always valid between request_irq() and free_irq(), but
the loopback testing mechanisms to invoke the handler function
need to test the trigger. The triggering and setting ioctl paths
both make use of igate and are therefore mutually exclusive.
The vfio-fsl-mc driver does not make use of irqfds, nor does it
support any sort of masking operations, therefore unlike vfio-pci
and vfio-platform, the flow can remain essentially unchanged.
Cc: Diana Craciun <[email protected]>
Fixes: cc0ee20bd969 ("vfio/fsl-mc: trigger an interrupt via eventfd")
Signed-off-by: Alex Williamson <[email protected]>
---
drivers/vfio/fsl-mc/vfio_fsl_mc_intr.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/vfio/fsl-mc/vfio_fsl_mc_intr.c b/drivers/vfio/fsl-mc/vfio_fsl_mc_intr.c
index d62fbfff20b8..82b2afa9b7e3 100644
--- a/drivers/vfio/fsl-mc/vfio_fsl_mc_intr.c
+++ b/drivers/vfio/fsl-mc/vfio_fsl_mc_intr.c
@@ -141,13 +141,14 @@ static int vfio_fsl_mc_set_irq_trigger(struct vfio_fsl_mc_device *vdev,
irq = &vdev->mc_irqs[index];
if (flags & VFIO_IRQ_SET_DATA_NONE) {
- vfio_fsl_mc_irq_handler(hwirq, irq);
+ if (irq->trigger)
+ eventfd_signal(irq->trigger);
} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
u8 trigger = *(u8 *)data;
- if (trigger)
- vfio_fsl_mc_irq_handler(hwirq, irq);
+ if (trigger && irq->trigger)
+ eventfd_signal(irq->trigger);
}
return 0;
--
2.43.2
> From: Alex Williamson <[email protected]>
> Sent: Thursday, March 7, 2024 5:15 AM
>
> Currently for devices requiring masking at the irqchip for INTx, ie.
> devices without DisINTx support, the IRQ is enabled in request_irq()
> and subsequently disabled as necessary to align with the masked status
> flag. This presents a window where the interrupt could fire between
> these events, resulting in the IRQ incrementing the disable depth twice.
Can you elaborate the last point about disable depth?
> This would be unrecoverable for a user since the masked flag prevents
> nested enables through vfio.
What is 'nested enables'?
>
> Instead, invert the logic using IRQF_NO_AUTOEN such that exclusive INTx
> is never auto-enabled, then unmask as required.
>
> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
> Signed-off-by: Alex Williamson <[email protected]>
But this patch looks good to me:
Reviewed-by: Kevin Tian <[email protected]>
with one nit...
>
> + /*
> + * Devices without DisINTx support require an exclusive interrupt,
> + * IRQ masking is performed at the IRQ chip. The masked status is
"exclusive interrupt, with IRQ masking performed at..."
> From: Alex Williamson <[email protected]>
> Sent: Thursday, March 7, 2024 5:15 AM
>
> Mask operations through config space changes to DisINTx may race INTx
> configuration changes via ioctl. Create wrappers that add locking for
> paths outside of the core interrupt code.
>
> In particular, irq_type is updated holding igate, therefore testing
> is_intx() requires holding igate. For example clearing DisINTx from
> config space can otherwise race changes of the interrupt configuration.
>
Looks the suspend path still checks irq_type w/o holding igate:
vdev->pm_intx_masked = ((vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX) &&
vfio_pci_intx_mask(vdev));
Is it with assumption that no change of configuration is possible at
that point?
> From: Alex Williamson <[email protected]>
> Sent: Thursday, March 7, 2024 5:15 AM
>
> Currently for devices requiring masking at the irqchip for INTx, ie.
> devices without DisINTx support, the IRQ is enabled in request_irq()
> and subsequently disabled as necessary to align with the masked status
> flag. This presents a window where the interrupt could fire between
> these events, resulting in the IRQ incrementing the disable depth twice.
> This would be unrecoverable for a user since the masked flag prevents
> nested enables through vfio.
>
> Instead, invert the logic using IRQF_NO_AUTOEN such that exclusive INTx
> is never auto-enabled, then unmask as required.
>
> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
> Signed-off-by: Alex Williamson <[email protected]>
CC stable?
> From: Alex Williamson <[email protected]>
> Sent: Thursday, March 7, 2024 5:15 AM
>
> In order to synchronize changes that can affect the thread callback,
> introduce an interface to force a flush of the inject workqueue. The
> irqfd pointer is only valid under spinlock, but the workqueue cannot
> be flushed under spinlock. Therefore the flush work for the irqfd is
> queued under spinlock. The vfio_irqfd_cleanup_wq workqueue is re-used
> for queuing this work such that flushing the workqueue is also ordered
> relative to shutdown.
>
> Signed-off-by: Alex Williamson <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Hi Alex,
kernel test robot noticed the following build warnings:
[auto build test WARNING on awilliam-vfio/next]
[also build test WARNING on linus/master v6.8-rc7 next-20240307]
[cannot apply to awilliam-vfio/for-linus]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Alex-Williamson/vfio-pci-Disable-auto-enable-of-exclusive-INTx-IRQ/20240307-051931
base: https://github.com/awilliam/linux-vfio.git next
patch link: https://lore.kernel.org/r/20240306211445.1856768-8-alex.williamson%40redhat.com
patch subject: [PATCH 7/7] vfio/fsl-mc: Block calling interrupt handler without trigger
config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20240307/[email protected]/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240307/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
>> drivers/vfio/fsl-mc/vfio_fsl_mc_intr.c:111:11: warning: variable 'hwirq' set but not used [-Wunused-but-set-variable]
111 | int ret, hwirq;
| ^
1 warning generated.
vim +/hwirq +111 drivers/vfio/fsl-mc/vfio_fsl_mc_intr.c
cc0ee20bd96971 Diana Craciun 2020-10-05 104
2e0d29561f593a Diana Craciun 2020-10-05 105 static int vfio_fsl_mc_set_irq_trigger(struct vfio_fsl_mc_device *vdev,
2e0d29561f593a Diana Craciun 2020-10-05 106 unsigned int index, unsigned int start,
2e0d29561f593a Diana Craciun 2020-10-05 107 unsigned int count, u32 flags,
2e0d29561f593a Diana Craciun 2020-10-05 108 void *data)
2e0d29561f593a Diana Craciun 2020-10-05 109 {
cc0ee20bd96971 Diana Craciun 2020-10-05 110 struct fsl_mc_device *mc_dev = vdev->mc_dev;
cc0ee20bd96971 Diana Craciun 2020-10-05 @111 int ret, hwirq;
cc0ee20bd96971 Diana Craciun 2020-10-05 112 struct vfio_fsl_mc_irq *irq;
cc0ee20bd96971 Diana Craciun 2020-10-05 113 struct device *cont_dev = fsl_mc_cont_dev(&mc_dev->dev);
cc0ee20bd96971 Diana Craciun 2020-10-05 114 struct fsl_mc_device *mc_cont = to_fsl_mc_device(cont_dev);
cc0ee20bd96971 Diana Craciun 2020-10-05 115
159246378d8483 Diana Craciun 2020-10-15 116 if (!count && (flags & VFIO_IRQ_SET_DATA_NONE))
159246378d8483 Diana Craciun 2020-10-15 117 return vfio_set_trigger(vdev, index, -1);
159246378d8483 Diana Craciun 2020-10-15 118
cc0ee20bd96971 Diana Craciun 2020-10-05 119 if (start != 0 || count != 1)
2e0d29561f593a Diana Craciun 2020-10-05 120 return -EINVAL;
cc0ee20bd96971 Diana Craciun 2020-10-05 121
da119f387e9464 Jason Gunthorpe 2021-08-05 122 mutex_lock(&vdev->vdev.dev_set->lock);
cc0ee20bd96971 Diana Craciun 2020-10-05 123 ret = fsl_mc_populate_irq_pool(mc_cont,
cc0ee20bd96971 Diana Craciun 2020-10-05 124 FSL_MC_IRQ_POOL_MAX_TOTAL_IRQS);
cc0ee20bd96971 Diana Craciun 2020-10-05 125 if (ret)
cc0ee20bd96971 Diana Craciun 2020-10-05 126 goto unlock;
cc0ee20bd96971 Diana Craciun 2020-10-05 127
cc0ee20bd96971 Diana Craciun 2020-10-05 128 ret = vfio_fsl_mc_irqs_allocate(vdev);
cc0ee20bd96971 Diana Craciun 2020-10-05 129 if (ret)
cc0ee20bd96971 Diana Craciun 2020-10-05 130 goto unlock;
da119f387e9464 Jason Gunthorpe 2021-08-05 131 mutex_unlock(&vdev->vdev.dev_set->lock);
cc0ee20bd96971 Diana Craciun 2020-10-05 132
cc0ee20bd96971 Diana Craciun 2020-10-05 133 if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
cc0ee20bd96971 Diana Craciun 2020-10-05 134 s32 fd = *(s32 *)data;
cc0ee20bd96971 Diana Craciun 2020-10-05 135
cc0ee20bd96971 Diana Craciun 2020-10-05 136 return vfio_set_trigger(vdev, index, fd);
cc0ee20bd96971 Diana Craciun 2020-10-05 137 }
cc0ee20bd96971 Diana Craciun 2020-10-05 138
d86a6d47bcc6b4 Thomas Gleixner 2021-12-10 139 hwirq = vdev->mc_dev->irqs[index]->virq;
cc0ee20bd96971 Diana Craciun 2020-10-05 140
cc0ee20bd96971 Diana Craciun 2020-10-05 141 irq = &vdev->mc_irqs[index];
cc0ee20bd96971 Diana Craciun 2020-10-05 142
cc0ee20bd96971 Diana Craciun 2020-10-05 143 if (flags & VFIO_IRQ_SET_DATA_NONE) {
dce72fdf5c6be9 Alex Williamson 2024-03-06 144 if (irq->trigger)
dce72fdf5c6be9 Alex Williamson 2024-03-06 145 eventfd_signal(irq->trigger);
cc0ee20bd96971 Diana Craciun 2020-10-05 146
cc0ee20bd96971 Diana Craciun 2020-10-05 147 } else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
cc0ee20bd96971 Diana Craciun 2020-10-05 148 u8 trigger = *(u8 *)data;
cc0ee20bd96971 Diana Craciun 2020-10-05 149
dce72fdf5c6be9 Alex Williamson 2024-03-06 150 if (trigger && irq->trigger)
dce72fdf5c6be9 Alex Williamson 2024-03-06 151 eventfd_signal(irq->trigger);
cc0ee20bd96971 Diana Craciun 2020-10-05 152 }
cc0ee20bd96971 Diana Craciun 2020-10-05 153
cc0ee20bd96971 Diana Craciun 2020-10-05 154 return 0;
cc0ee20bd96971 Diana Craciun 2020-10-05 155
cc0ee20bd96971 Diana Craciun 2020-10-05 156 unlock:
da119f387e9464 Jason Gunthorpe 2021-08-05 157 mutex_unlock(&vdev->vdev.dev_set->lock);
cc0ee20bd96971 Diana Craciun 2020-10-05 158 return ret;
cc0ee20bd96971 Diana Craciun 2020-10-05 159
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Thu, 7 Mar 2024 08:28:45 +0000
"Tian, Kevin" <[email protected]> wrote:
> > From: Alex Williamson <[email protected]>
> > Sent: Thursday, March 7, 2024 5:15 AM
> >
> > Currently for devices requiring masking at the irqchip for INTx, ie.
> > devices without DisINTx support, the IRQ is enabled in request_irq()
> > and subsequently disabled as necessary to align with the masked status
> > flag. This presents a window where the interrupt could fire between
> > these events, resulting in the IRQ incrementing the disable depth twice.
>
> Can you elaborate the last point about disable depth?
Each irq_desc maintains a depth field, a disable increments the depth,
an enable decrements. On the disable transition from 0 to 1 the IRQ
chip is disabled, on the enable transition from 1 to 0 the IRQ chip is
enabled.
Therefore if an interrupt fires between request_irq() and
disable_irq(), vfio_intx_handler() will disable the IRQ (depth 0->1).
Note that masked is not tested here, the interrupt is expected to be
exclusive for non-pci_2_3 devices. @masked would be redundantly set to
true. The setup call path would increment depth to 2. The order these
happen is not important so long as the interrupt is in-flight before
the setup path disables the IRQ.
> > This would be unrecoverable for a user since the masked flag prevents
> > nested enables through vfio.
>
> What is 'nested enables'?
In the case above we have masked true and disable depth 2. If the user
now unmasks the interrupt then depth is reduced to 1, the IRQ is still
disabled, and masked is false. The masked value is now out of sync
with the IRQ line and prevents the user from unmasking again. The
disable depth is stuck at 1.
Nested enables would be if we allowed the user to unmask a line that we
think is already unmasked.
> > Instead, invert the logic using IRQF_NO_AUTOEN such that exclusive INTx
> > is never auto-enabled, then unmask as required.
> >
> > Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
> > Signed-off-by: Alex Williamson <[email protected]>
>
> But this patch looks good to me:
>
> Reviewed-by: Kevin Tian <[email protected]>
>
> with one nit...
>
> >
> > + /*
> > + * Devices without DisINTx support require an exclusive interrupt,
> > + * IRQ masking is performed at the IRQ chip. The masked status is
>
> "exclusive interrupt, with IRQ masking performed at..."
TBH, the difference is too subtle for me. With my version above you
could replace the comma with a period, I think it has the same meaning.
However, "...exclusive interrupt, with IRQ masking performed at..."
almost suggests that we need a specific type of exclusive interrupt
with this property. There's nothing unique about the exclusive
interrupt, we could arbitrarily decide we want an exclusive interrupt
for DisINTx masking if we wanted to frustrate a lot of users.
Performing masking at the IRQ chip is actually what necessitates the
exclusive interrupt here. Thanks,
Alex
On Thu, 7 Mar 2024 08:37:53 +0000
"Tian, Kevin" <[email protected]> wrote:
> > From: Alex Williamson <[email protected]>
> > Sent: Thursday, March 7, 2024 5:15 AM
> >
> > Mask operations through config space changes to DisINTx may race INTx
> > configuration changes via ioctl. Create wrappers that add locking for
> > paths outside of the core interrupt code.
> >
> > In particular, irq_type is updated holding igate, therefore testing
> > is_intx() requires holding igate. For example clearing DisINTx from
> > config space can otherwise race changes of the interrupt configuration.
> >
>
> Looks the suspend path still checks irq_type w/o holding igate:
>
> vdev->pm_intx_masked = ((vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX) &&
> vfio_pci_intx_mask(vdev));
>
> Is it with assumption that no change of configuration is possible at
> that point?
Yes, I believe this is relatively safe because userspace is frozen at
this point. That's not however to claim that irq_type is absolutely
used consistently after this series. I just didn't see the other
violations rise to the same level as the fixes in this series and
wanted to avoid the distraction. I've stashed a number of patches that
I'd eventually like to post as follow-ups to this series. Thanks,
Alex
On Thu, 7 Mar 2024 08:39:16 +0000
"Tian, Kevin" <[email protected]> wrote:
> > From: Alex Williamson <[email protected]>
> > Sent: Thursday, March 7, 2024 5:15 AM
> >
> > Currently for devices requiring masking at the irqchip for INTx, ie.
> > devices without DisINTx support, the IRQ is enabled in request_irq()
> > and subsequently disabled as necessary to align with the masked status
> > flag. This presents a window where the interrupt could fire between
> > these events, resulting in the IRQ incrementing the disable depth twice.
> > This would be unrecoverable for a user since the masked flag prevents
> > nested enables through vfio.
> >
> > Instead, invert the logic using IRQF_NO_AUTOEN such that exclusive INTx
> > is never auto-enabled, then unmask as required.
> >
> > Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
> > Signed-off-by: Alex Williamson <[email protected]>
>
> CC stable?
I've always found that having a Fixes: tag is sufficient to get picked
up for stable, so I typically don't do both. If it helps out someone's
process I'd be happy to though. Thanks,
Alex
> From: Alex Williamson <[email protected]>
> Sent: Thursday, March 7, 2024 5:15 AM
>
> A vulnerability exists where the eventfd for INTx signaling can be
> deconfigured, which unregisters the IRQ handler but still allows
> eventfds to be signaled with a NULL context through the SET_IRQS ioctl
> or through unmask irqfd if the device interrupt is pending.
>
> Ideally this could be solved with some additional locking; the igate
> mutex serializes the ioctl and config space accesses, and the interrupt
> handler is unregistered relative to the trigger, but the irqfd path
> runs asynchronous to those. The igate mutex cannot be acquired from the
> atomic context of the eventfd wake function. Disabling the irqfd
> relative to the eventfd registration is potentially incompatible with
> existing userspace.
>
> As a result, the solution implemented here moves configuration of the
> INTx interrupt handler to track the lifetime of the INTx context object
> and irq_type configuration, rather than registration of a particular
> trigger eventfd. Synchronization is added between the ioctl path and
> eventfd_signal() wrapper such that the eventfd trigger can be
> dynamically updated relative to in-flight interrupts or irqfd callbacks.
>
> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
> Reported-by: Reinette Chatre <[email protected]>
> Signed-off-by: Alex Williamson <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
> From: Alex Williamson <[email protected]>
> Sent: Thursday, March 7, 2024 5:15 AM
>
> @@ -321,8 +321,13 @@ void vfio_platform_irq_cleanup(struct
> vfio_platform_device *vdev)
> {
> int i;
>
> - for (i = 0; i < vdev->num_irqs; i++)
> + for (i = 0; i < vdev->num_irqs; i++) {
> + int hwirq = vdev->get_irq(vdev, i);
hwirq is unused.
> +
> + vfio_virqfd_disable(&vdev->irqs[i].mask);
> + vfio_virqfd_disable(&vdev->irqs[i].unmask);
> vfio_set_trigger(vdev, i, -1, NULL);
> + }
>
> vdev->num_irqs = 0;
> kfree(vdev->irqs);
> --
> 2.43.2
> From: Alex Williamson <[email protected]>
> Sent: Friday, March 8, 2024 4:21 AM
>
> On Thu, 7 Mar 2024 08:37:53 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Alex Williamson <[email protected]>
> > > Sent: Thursday, March 7, 2024 5:15 AM
> > >
> > > Mask operations through config space changes to DisINTx may race INTx
> > > configuration changes via ioctl. Create wrappers that add locking for
> > > paths outside of the core interrupt code.
> > >
> > > In particular, irq_type is updated holding igate, therefore testing
> > > is_intx() requires holding igate. For example clearing DisINTx from
> > > config space can otherwise race changes of the interrupt configuration.
> > >
> >
> > Looks the suspend path still checks irq_type w/o holding igate:
> >
> > vdev->pm_intx_masked = ((vdev->irq_type ==
> VFIO_PCI_INTX_IRQ_INDEX) &&
> > vfio_pci_intx_mask(vdev));
> >
> > Is it with assumption that no change of configuration is possible at
> > that point?
>
> Yes, I believe this is relatively safe because userspace is frozen at
> this point. That's not however to claim that irq_type is absolutely
> used consistently after this series. I just didn't see the other
> violations rise to the same level as the fixes in this series and
> wanted to avoid the distraction. I've stashed a number of patches that
> I'd eventually like to post as follow-ups to this series. Thanks,
>
Reviewed-by: Kevin Tian <[email protected]>
> From: Alex Williamson <[email protected]>
> Sent: Friday, March 8, 2024 4:17 AM
>
> On Thu, 7 Mar 2024 08:28:45 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Alex Williamson <[email protected]>
> > > Sent: Thursday, March 7, 2024 5:15 AM
> > >
> > > Currently for devices requiring masking at the irqchip for INTx, ie.
> > > devices without DisINTx support, the IRQ is enabled in request_irq()
> > > and subsequently disabled as necessary to align with the masked status
> > > flag. This presents a window where the interrupt could fire between
> > > these events, resulting in the IRQ incrementing the disable depth twice.
> >
> > Can you elaborate the last point about disable depth?
>
> Each irq_desc maintains a depth field, a disable increments the depth,
> an enable decrements. On the disable transition from 0 to 1 the IRQ
> chip is disabled, on the enable transition from 1 to 0 the IRQ chip is
> enabled.
>
> Therefore if an interrupt fires between request_irq() and
> disable_irq(), vfio_intx_handler() will disable the IRQ (depth 0->1).
> Note that masked is not tested here, the interrupt is expected to be
> exclusive for non-pci_2_3 devices. @masked would be redundantly set to
> true. The setup call path would increment depth to 2. The order these
> happen is not important so long as the interrupt is in-flight before
> the setup path disables the IRQ.
>
> > > This would be unrecoverable for a user since the masked flag prevents
> > > nested enables through vfio.
> >
> > What is 'nested enables'?
>
> In the case above we have masked true and disable depth 2. If the user
> now unmasks the interrupt then depth is reduced to 1, the IRQ is still
> disabled, and masked is false. The masked value is now out of sync
> with the IRQ line and prevents the user from unmasking again. The
> disable depth is stuck at 1.
>
> Nested enables would be if we allowed the user to unmask a line that we
> think is already unmasked.
Thanks! clear to me now.
>
> > > Instead, invert the logic using IRQF_NO_AUTOEN such that exclusive INTx
> > > is never auto-enabled, then unmask as required.
> > >
> > > Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
> > > Signed-off-by: Alex Williamson <[email protected]>
> >
> > But this patch looks good to me:
> >
> > Reviewed-by: Kevin Tian <[email protected]>
> >
> > with one nit...
> >
> > >
> > > + /*
> > > + * Devices without DisINTx support require an exclusive interrupt,
> > > + * IRQ masking is performed at the IRQ chip. The masked status is
> >
> > "exclusive interrupt, with IRQ masking performed at..."
>
> TBH, the difference is too subtle for me. With my version above you
> could replace the comma with a period, I think it has the same meaning.
> However, "...exclusive interrupt, with IRQ masking performed at..."
> almost suggests that we need a specific type of exclusive interrupt
> with this property. There's nothing unique about the exclusive
> interrupt, we could arbitrarily decide we want an exclusive interrupt
> for DisINTx masking if we wanted to frustrate a lot of users.
>
> Performing masking at the IRQ chip is actually what necessitates the
> exclusive interrupt here. Thanks,
>
make sense. and I saw you replaced the commaon with a period in patch4.
> From: Alex Williamson <[email protected]>
> Sent: Friday, March 8, 2024 4:24 AM
>
> On Thu, 7 Mar 2024 08:39:16 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Alex Williamson <[email protected]>
> > > Sent: Thursday, March 7, 2024 5:15 AM
> > >
> > > Currently for devices requiring masking at the irqchip for INTx, ie.
> > > devices without DisINTx support, the IRQ is enabled in request_irq()
> > > and subsequently disabled as necessary to align with the masked status
> > > flag. This presents a window where the interrupt could fire between
> > > these events, resulting in the IRQ incrementing the disable depth twice.
> > > This would be unrecoverable for a user since the masked flag prevents
> > > nested enables through vfio.
> > >
> > > Instead, invert the logic using IRQF_NO_AUTOEN such that exclusive INTx
> > > is never auto-enabled, then unmask as required.
> > >
> > > Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
> > > Signed-off-by: Alex Williamson <[email protected]>
> >
> > CC stable?
>
> I've always found that having a Fixes: tag is sufficient to get picked
> up for stable, so I typically don't do both. If it helps out someone's
> process I'd be happy to though. Thanks,
>
According to "Documentation/process/submitting-patches.rst":
Note: Attaching a Fixes: tag does not subvert the stable kernel rules
process nor the requirement to Cc: [email protected] on all stable
patch candidates. For more information, please read
Documentation/process/stable-kernel-rules.rst.
Probably it's fine as long as the stable kernel maintainers don't complain. ????
On Fri, 8 Mar 2024 07:23:21 +0000
"Tian, Kevin" <[email protected]> wrote:
> > From: Alex Williamson <[email protected]>
> > Sent: Friday, March 8, 2024 4:24 AM
> >
> > On Thu, 7 Mar 2024 08:39:16 +0000
> > "Tian, Kevin" <[email protected]> wrote:
> >
> > > > From: Alex Williamson <[email protected]>
> > > > Sent: Thursday, March 7, 2024 5:15 AM
> > > >
> > > > Currently for devices requiring masking at the irqchip for INTx, ie.
> > > > devices without DisINTx support, the IRQ is enabled in request_irq()
> > > > and subsequently disabled as necessary to align with the masked status
> > > > flag. This presents a window where the interrupt could fire between
> > > > these events, resulting in the IRQ incrementing the disable depth twice.
> > > > This would be unrecoverable for a user since the masked flag prevents
> > > > nested enables through vfio.
> > > >
> > > > Instead, invert the logic using IRQF_NO_AUTOEN such that exclusive INTx
> > > > is never auto-enabled, then unmask as required.
> > > >
> > > > Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
> > > > Signed-off-by: Alex Williamson <[email protected]>
> > >
> > > CC stable?
> >
> > I've always found that having a Fixes: tag is sufficient to get picked
> > up for stable, so I typically don't do both. If it helps out someone's
> > process I'd be happy to though. Thanks,
> >
>
> According to "Documentation/process/submitting-patches.rst":
>
> Note: Attaching a Fixes: tag does not subvert the stable kernel rules
> process nor the requirement to Cc: [email protected] on all stable
> patch candidates. For more information, please read
> Documentation/process/stable-kernel-rules.rst.
>
> Probably it's fine as long as the stable kernel maintainers don't complain. ????
I think the stable maintainers are far more aggressive than the
documentation would suggest, but it doesn't hurt to include the Cc,
I'll add it next version. Thanks,
Alex
On Thu, Mar 07, 2024 at 01:23:48PM -0700, Alex Williamson wrote:
> On Thu, 7 Mar 2024 08:39:16 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Alex Williamson <[email protected]>
> > > Sent: Thursday, March 7, 2024 5:15 AM
> > >
> > > Currently for devices requiring masking at the irqchip for INTx, ie.
> > > devices without DisINTx support, the IRQ is enabled in request_irq()
> > > and subsequently disabled as necessary to align with the masked status
> > > flag. This presents a window where the interrupt could fire between
> > > these events, resulting in the IRQ incrementing the disable depth twice.
> > > This would be unrecoverable for a user since the masked flag prevents
> > > nested enables through vfio.
> > >
> > > Instead, invert the logic using IRQF_NO_AUTOEN such that exclusive INTx
> > > is never auto-enabled, then unmask as required.
> > >
> > > Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
> > > Signed-off-by: Alex Williamson <[email protected]>
> >
> > CC stable?
>
> I've always found that having a Fixes: tag is sufficient to get picked
> up for stable, so I typically don't do both. If it helps out someone's
> process I'd be happy to though. Thanks,
It helps other distros in the ecosystem to flag patches that really
should be backported. Not everyone runs their backport trees as
agressively as a stable does.
Jason
On Fri, 8 Mar 2024 07:16:02 +0000
"Tian, Kevin" <[email protected]> wrote:
> > From: Alex Williamson <[email protected]>
> > Sent: Thursday, March 7, 2024 5:15 AM
> >
> > @@ -321,8 +321,13 @@ void vfio_platform_irq_cleanup(struct
> > vfio_platform_device *vdev)
> > {
> > int i;
> >
> > - for (i = 0; i < vdev->num_irqs; i++)
> > + for (i = 0; i < vdev->num_irqs; i++) {
> > + int hwirq = vdev->get_irq(vdev, i);
>
> hwirq is unused.
Yep, poor split with the next patch. I'll update the next patch to use
vdev->irqs[i].hwirq here and in the unwind on init to avoid this. Thanks,
Alex
> > +
> > + vfio_virqfd_disable(&vdev->irqs[i].mask);
> > + vfio_virqfd_disable(&vdev->irqs[i].unmask);
> > vfio_set_trigger(vdev, i, -1, NULL);
> > + }
> >
> > vdev->num_irqs = 0;
> > kfree(vdev->irqs);
> > --
> > 2.43.2
>
Hi Alex,
On 3/6/2024 1:14 PM, Alex Williamson wrote:
> Mask operations through config space changes to DisINTx may race INTx
> configuration changes via ioctl. Create wrappers that add locking for
> paths outside of the core interrupt code.
>
> In particular, irq_type is updated holding igate, therefore testing
> is_intx() requires holding igate. For example clearing DisINTx from
> config space can otherwise race changes of the interrupt configuration.
>
> This aligns interfaces which may trigger the INTx eventfd into two
> camps, one side serialized by igate and the other only enabled while
> INTx is configured. A subsequent patch introduces synchronization for
> the latter flows.
>
> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
> Reported-by: Reinette Chatre <[email protected]>
> Signed-off-by: Alex Williamson <[email protected]>
> ---
Thank you very much.
Reviewed-by: Reinette Chatre <[email protected]>
Reinette
Hi Alex,
On 3/6/2024 1:14 PM, Alex Williamson wrote:
> In order to synchronize changes that can affect the thread callback,
> introduce an interface to force a flush of the inject workqueue. The
> irqfd pointer is only valid under spinlock, but the workqueue cannot
> be flushed under spinlock. Therefore the flush work for the irqfd is
> queued under spinlock. The vfio_irqfd_cleanup_wq workqueue is re-used
> for queuing this work such that flushing the workqueue is also ordered
> relative to shutdown.
>
> Signed-off-by: Alex Williamson <[email protected]>
> ---
Thank you very much.
Reviewed-by: Reinette Chatre <[email protected]>
Reinette
Hi Alex,
On 3/6/2024 1:14 PM, Alex Williamson wrote:
> A vulnerability exists where the eventfd for INTx signaling can be
> deconfigured, which unregisters the IRQ handler but still allows
> eventfds to be signaled with a NULL context through the SET_IRQS ioctl
> or through unmask irqfd if the device interrupt is pending.
>
> Ideally this could be solved with some additional locking; the igate
> mutex serializes the ioctl and config space accesses, and the interrupt
> handler is unregistered relative to the trigger, but the irqfd path
> runs asynchronous to those. The igate mutex cannot be acquired from the
> atomic context of the eventfd wake function. Disabling the irqfd
> relative to the eventfd registration is potentially incompatible with
> existing userspace.
>
> As a result, the solution implemented here moves configuration of the
> INTx interrupt handler to track the lifetime of the INTx context object
> and irq_type configuration, rather than registration of a particular
> trigger eventfd. Synchronization is added between the ioctl path and
> eventfd_signal() wrapper such that the eventfd trigger can be
> dynamically updated relative to in-flight interrupts or irqfd callbacks.
>
> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
> Reported-by: Reinette Chatre <[email protected]>
> Signed-off-by: Alex Williamson <[email protected]>
> ---
Thank you very much.
Reviewed-by: Reinette Chatre <[email protected]>
Reinette