2023-04-18 17:31:37

by Reinette Chatre

[permalink] [raw]
Subject: [PATCH V3 00/10] vfio/pci: Support dynamic allocation of MSI-X interrupts

Changes since V2:
- V2: https://lore.kernel.org/lkml/[email protected]/
- During testing of V2 "kernel test robot" reported issues resulting from
include/linux/pci.h missing a stub for pci_msix_can_alloc_dyn() when
CONFIG_PCI_MSI=n. A separate fix was sent to address this. The fix can
be found in the kernel (since v6.3-rc7) as
commit 195d8e5da3ac ("PCI/MSI: Provide missing stub for pci_msix_can_alloc_dyn()")
- Biggest change is the transition to "active contexts" for both MSI and MSI-X.
Interrupt contexts have always been allocated when the interrupts are
allocated while they are only used while interrupts are
enabled. In this series interrupt contexts are made dynamic, while doing
so their allocation is moved to match how they are used: allocated when
interrupts are enabled. Whether a Linux interrupt number exists determines
whether an interrupt can be enabled.
Previous policy (up to V2) that an allocated interrupt has an interrupt
context no longer applies. Instead, an interrupt context has a
handler/trigger, aka "active contexts". (Alex)
- Re-ordered patches in support of "active contexts".
- Only free interrupts on MSI-X teardown and otherwise use the
allocated interrupts as a cache. (Alex)
- Using unsigned int for the vector broke the unwind loop within
vfio_msi_set_block(). (Alex)
- Introduce new "has_dyn_msix" property of virtual device instead of
querying support every time. (Alex)
- Some smaller changes, please refer to individual patches.

Changes since RFC V1:
- RFC V1: https://lore.kernel.org/lkml/[email protected]/
- Improved changelogs.
- Simplify interface so that vfio_irq_ctx_alloc_single() returns pointer to
allocated context. (Alex)
- Remove vfio_irq_ctx_range_allocated() and associated attempts to maintain
invalid error path behavior. (Alex and Kevin)
- Add pointer to interrupt context as function parameter to
vfio_irq_ctx_free(). (Alex)
- Ensure variables are initialized. (Dan Carpenter)
- Only support dynamic allocation if device supports it. (Alex)

Qemu allocates interrupts incrementally at the time the guest unmasks an
interrupt, for example each time a Linux guest runs request_irq().

Dynamic allocation of MSI-X interrupts was not possible until v6.2 [1].
This prompted Qemu to, when allocating a new interrupt, first release all
previously allocated interrupts (including disable of MSI-X) followed
by re-allocation of all interrupts that includes the new interrupt.
Please see [2] for a detailed discussion about this issue.

Releasing and re-allocating interrupts may be acceptable if all
interrupts are unmasked during device initialization. If unmasking of
interrupts occur during runtime this may result in lost interrupts.
For example, consider an accelerator device with multiple work queues,
each work queue having a dedicated interrupt. A work queue can be
enabled at any time with its associated interrupt unmasked while other
work queues are already active. Having all interrupts released and MSI-X
disabled to enable the new work queue will impact active work queues.

This series builds on the recent interrupt sub-system core changes
that added support for dynamic MSI-X allocation after initial MSI-X
enabling.

Add support for dynamic MSI-X allocation to vfio-pci. A flag
indicating lack of support for dynamic allocation already exist:
VFIO_IRQ_INFO_NORESIZE and has always been set for MSI and MSI-X. With
support for dynamic MSI-X the flag is cleared for MSI-X when supported,
enabling Qemu to modify its behavior.

Any feedback is appreciated

Reinette

[1] commit 34026364df8e ("PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X")
[2] https://lore.kernel.org/kvm/MWHPR11MB188603D0D809C1079F5817DC8C099@MWHPR11MB1886.namprd11.prod.outlook.com/#t

Reinette Chatre (10):
vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable
vfio/pci: Remove negative check on unsigned vector
vfio/pci: Prepare for dynamic interrupt context storage
vfio/pci: Move to single error path
vfio/pci: Use xarray for interrupt context storage
vfio/pci: Remove interrupt context counter
vfio/pci: Update stale comment
vfio/pci: Probe and store ability to support dynamic MSI-X
vfio/pci: Support dynamic MSI-X
vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X

drivers/vfio/pci/vfio_pci_core.c | 10 +-
drivers/vfio/pci/vfio_pci_intrs.c | 320 +++++++++++++++++++++---------
include/linux/vfio_pci_core.h | 4 +-
include/uapi/linux/vfio.h | 3 +
4 files changed, 234 insertions(+), 103 deletions(-)


base-commit: 6a8f57ae2eb07ab39a6f0ccad60c760743051026
--
2.34.1


2023-04-18 17:31:54

by Reinette Chatre

[permalink] [raw]
Subject: [PATCH V3 04/10] vfio/pci: Move to single error path

Enabling and disabling of an interrupt involves several steps
that can fail. Cleanup after failure is done when the error
is encountered, resulting in some repetitive code.

Support for dynamic contexts will introduce more steps during
interrupt enabling and disabling.

Transition to centralized exit path in preparation for dynamic
contexts to eliminate duplicate error handling code.

Signed-off-by: Reinette Chatre <[email protected]>
---
Changes since V2:
- Move patch to earlier in series in support of the change to
dynamic context management.
- Do not add the "ctx->name = NULL" in error path. It is not done
in baseline and will not be needed when transitioning to
dynamic context management.
- Update changelog to not make this change specific to dynamic
MSI-X.

Changes since RFC V1:
- Improve changelog.

drivers/vfio/pci/vfio_pci_intrs.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index b664fbb6d2f2..9e17e59a4d60 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -418,8 +418,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,

trigger = eventfd_ctx_fdget(fd);
if (IS_ERR(trigger)) {
- kfree(ctx->name);
- return PTR_ERR(trigger);
+ ret = PTR_ERR(trigger);
+ goto out_free_name;
}

/*
@@ -439,11 +439,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,

ret = request_irq(irq, vfio_msihandler, 0, ctx->name, trigger);
vfio_pci_memory_unlock_and_restore(vdev, cmd);
- if (ret) {
- kfree(ctx->name);
- eventfd_ctx_put(trigger);
- return ret;
- }
+ if (ret)
+ goto out_put_eventfd_ctx;

ctx->producer.token = trigger;
ctx->producer.irq = irq;
@@ -458,6 +455,12 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
ctx->trigger = trigger;

return 0;
+
+out_put_eventfd_ctx:
+ eventfd_ctx_put(trigger);
+out_free_name:
+ kfree(ctx->name);
+ return ret;
}

static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
--
2.34.1

2023-04-18 17:31:58

by Reinette Chatre

[permalink] [raw]
Subject: [PATCH V3 05/10] vfio/pci: Use xarray for interrupt context storage

Interrupt context is statically allocated at the time interrupts
are allocated. Following allocation, the context is managed by
directly accessing the elements of the array using the vector
as index. The storage is released when interrupts are disabled.

It is possible to dynamically allocate a single MSI-X interrupt
after MSI-X is enabled. A dynamic storage for interrupt context
is needed to support this. Replace the interrupt context array with an
xarray (similar to what the core uses as store for MSI descriptors)
that can support the dynamic expansion while maintaining the
custom that uses the vector as index.

With a dynamic storage it is no longer required to pre-allocate
interrupt contexts at the time the interrupts are allocated.
MSI and MSI-X interrupt contexts are only used when interrupts are
enabled. Their allocation can thus be delayed until interrupt enabling.
Only enabled interrupts will have associated interrupt contexts.
Whether an interrupt has been allocated (a Linux irq number exists
for it) becomes the criteria for whether an interrupt can be enabled.

Signed-off-by: Reinette Chatre <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
---
Changes since V2:
- Only allocate contexts as they are used, or "active". (Alex)
- Move vfio_irq_ctx_free() from later patch to prevent open-coding
the same within vfio_irq_ctx_free_all(). This evolved into
vfio_irq_ctx_free() used for dynamic context allocation and
vfio_irq_ctx_free_all() removed because of it. (Alex)
- With vfio_irq_ctx_alloc_num() removed, rename vfio_irq_ctx_alloc_single()
to vfio_irq_ctx_alloc().

Changes since RFC V1:
- Let vfio_irq_ctx_alloc_single() return pointer to allocated
context. (Alex)
- Transition INTx allocation to simplified vfio_irq_ctx_alloc_single().
- Improve accuracy of changelog.

drivers/vfio/pci/vfio_pci_core.c | 1 +
drivers/vfio/pci/vfio_pci_intrs.c | 91 ++++++++++++++++---------------
include/linux/vfio_pci_core.h | 2 +-
3 files changed, 48 insertions(+), 46 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..ae0e161c7fc9 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -2102,6 +2102,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
INIT_LIST_HEAD(&vdev->vma_list);
INIT_LIST_HEAD(&vdev->sriov_pfs_item);
init_rwsem(&vdev->memory_lock);
+ xa_init(&vdev->ctx);

return 0;
}
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 9e17e59a4d60..117cd384b3ad 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -52,25 +52,33 @@ static
struct vfio_pci_irq_ctx *vfio_irq_ctx_get(struct vfio_pci_core_device *vdev,
unsigned long index)
{
- if (index >= vdev->num_ctx)
- return NULL;
- return &vdev->ctx[index];
+ return xa_load(&vdev->ctx, index);
}

-static void vfio_irq_ctx_free_all(struct vfio_pci_core_device *vdev)
+static void vfio_irq_ctx_free(struct vfio_pci_core_device *vdev,
+ struct vfio_pci_irq_ctx *ctx, unsigned long index)
{
- kfree(vdev->ctx);
+ xa_erase(&vdev->ctx, index);
+ kfree(ctx);
}

-static int vfio_irq_ctx_alloc_num(struct vfio_pci_core_device *vdev,
- unsigned long num)
+static struct vfio_pci_irq_ctx *
+vfio_irq_ctx_alloc(struct vfio_pci_core_device *vdev, unsigned long index)
{
- vdev->ctx = kcalloc(num, sizeof(struct vfio_pci_irq_ctx),
- GFP_KERNEL_ACCOUNT);
- if (!vdev->ctx)
- return -ENOMEM;
+ struct vfio_pci_irq_ctx *ctx;
+ int ret;

- return 0;
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
+ if (!ctx)
+ return NULL;
+
+ ret = xa_insert(&vdev->ctx, index, ctx, GFP_KERNEL_ACCOUNT);
+ if (ret) {
+ kfree(ctx);
+ return NULL;
+ }
+
+ return ctx;
}

/*
@@ -218,7 +226,6 @@ static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
static int vfio_intx_enable(struct vfio_pci_core_device *vdev)
{
struct vfio_pci_irq_ctx *ctx;
- int ret;

if (!is_irq_none(vdev))
return -EINVAL;
@@ -226,15 +233,9 @@ static int vfio_intx_enable(struct vfio_pci_core_device *vdev)
if (!vdev->pdev->irq)
return -ENODEV;

- ret = vfio_irq_ctx_alloc_num(vdev, 1);
- if (ret)
- return ret;
-
- ctx = vfio_irq_ctx_get(vdev, 0);
- if (!ctx) {
- vfio_irq_ctx_free_all(vdev);
- return -EINVAL;
- }
+ ctx = vfio_irq_ctx_alloc(vdev, 0);
+ if (!ctx)
+ return -ENOMEM;

vdev->num_ctx = 1;

@@ -325,7 +326,7 @@ static void vfio_intx_disable(struct vfio_pci_core_device *vdev)
vfio_intx_set_signal(vdev, -1);
vdev->irq_type = VFIO_PCI_NUM_IRQS;
vdev->num_ctx = 0;
- vfio_irq_ctx_free_all(vdev);
+ vfio_irq_ctx_free(vdev, ctx, 0);
}

/*
@@ -349,10 +350,6 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
if (!is_irq_none(vdev))
return -EINVAL;

- ret = vfio_irq_ctx_alloc_num(vdev, nvec);
- if (ret)
- return ret;
-
/* return the number of supported vectors if we can't get all: */
cmd = vfio_pci_memory_lock_and_enable(vdev);
ret = pci_alloc_irq_vectors(pdev, 1, nvec, flag);
@@ -360,7 +357,6 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
if (ret > 0)
pci_free_irq_vectors(pdev);
vfio_pci_memory_unlock_and_restore(vdev, cmd);
- vfio_irq_ctx_free_all(vdev);
return ret;
}
vfio_pci_memory_unlock_and_restore(vdev, cmd);
@@ -392,12 +388,13 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
if (vector >= vdev->num_ctx)
return -EINVAL;

- ctx = vfio_irq_ctx_get(vdev, vector);
- if (!ctx)
- return -EINVAL;
irq = pci_irq_vector(pdev, vector);
+ if (irq < 0)
+ return -EINVAL;

- if (ctx->trigger) {
+ ctx = vfio_irq_ctx_get(vdev, vector);
+
+ if (ctx) {
irq_bypass_unregister_producer(&ctx->producer);

cmd = vfio_pci_memory_lock_and_enable(vdev);
@@ -405,16 +402,22 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
vfio_pci_memory_unlock_and_restore(vdev, cmd);
kfree(ctx->name);
eventfd_ctx_put(ctx->trigger);
- ctx->trigger = NULL;
+ vfio_irq_ctx_free(vdev, ctx, vector);
}

if (fd < 0)
return 0;

+ ctx = vfio_irq_ctx_alloc(vdev, vector);
+ if (!ctx)
+ return -ENOMEM;
+
ctx->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-msi%s[%d](%s)",
msix ? "x" : "", vector, pci_name(pdev));
- if (!ctx->name)
- return -ENOMEM;
+ if (!ctx->name) {
+ ret = -ENOMEM;
+ goto out_free_ctx;
+ }

trigger = eventfd_ctx_fdget(fd);
if (IS_ERR(trigger)) {
@@ -460,6 +463,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
eventfd_ctx_put(trigger);
out_free_name:
kfree(ctx->name);
+out_free_ctx:
+ vfio_irq_ctx_free(vdev, ctx, vector);
return ret;
}

@@ -489,16 +494,13 @@ static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
{
struct pci_dev *pdev = vdev->pdev;
struct vfio_pci_irq_ctx *ctx;
- unsigned int i;
+ unsigned long i;
u16 cmd;

- for (i = 0; i < vdev->num_ctx; i++) {
- ctx = vfio_irq_ctx_get(vdev, i);
- if (ctx) {
- vfio_virqfd_disable(&ctx->unmask);
- vfio_virqfd_disable(&ctx->mask);
- vfio_msi_set_vector_signal(vdev, i, -1, msix);
- }
+ xa_for_each(&vdev->ctx, i, ctx) {
+ vfio_virqfd_disable(&ctx->unmask);
+ vfio_virqfd_disable(&ctx->mask);
+ vfio_msi_set_vector_signal(vdev, i, -1, msix);
}

cmd = vfio_pci_memory_lock_and_enable(vdev);
@@ -514,7 +516,6 @@ static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)

vdev->irq_type = VFIO_PCI_NUM_IRQS;
vdev->num_ctx = 0;
- vfio_irq_ctx_free_all(vdev);
}

/*
@@ -654,7 +655,7 @@ static int vfio_pci_set_msi_trigger(struct vfio_pci_core_device *vdev,

for (i = start; i < start + count; i++) {
ctx = vfio_irq_ctx_get(vdev, i);
- if (!ctx || !ctx->trigger)
+ if (!ctx)
continue;
if (flags & VFIO_IRQ_SET_DATA_NONE) {
eventfd_signal(ctx->trigger, 1);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 367fd79226a3..61d7873a3973 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -59,7 +59,7 @@ struct vfio_pci_core_device {
struct perm_bits *msi_perm;
spinlock_t irqlock;
struct mutex igate;
- struct vfio_pci_irq_ctx *ctx;
+ struct xarray ctx;
int num_ctx;
int irq_type;
int num_regions;
--
2.34.1

2023-04-18 17:32:02

by Reinette Chatre

[permalink] [raw]
Subject: [PATCH V3 09/10] vfio/pci: Support dynamic MSI-X

Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq()
enables an individual MSI-X interrupt to be allocated and freed after
MSI-X enabling.

Use dynamic MSI-X (if supported by the device) to allocate an interrupt
after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
the time a valid eventfd is assigned. This is different behavior from
a range provided during MSI-X enabling where interrupts are allocated
for the entire range whether a valid eventfd is provided for each
interrupt or not.

Do not dynamically free interrupts, leave that to when MSI-X is
disabled.

Signed-off-by: Reinette Chatre <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
---
The get_cached_msi_msg()/pci_write_msi_msg() behavior is kept
similar to the scenario when MSI-X is enabled with triggers
provided for new interrupts. get_cached_msi_msg()/pci_write_msi_msg()
follows for interrupts recently allocated with pci_msix_alloc_irq_at()
just like get_cached_msi_msg()/pci_write_msi_msg() is done for
interrupts recently allocated with pci_alloc_irq_vectors().

Changes since V2:
- Move vfio_irq_ctx_free() to earlier in series to support
earlier usage. (Alex)
- Use consistent terms in changelog: MSI-x changed to MSI-X.
- Make dynamic interrupt context creation generic across all
MSI/MSI-X interrupts. This resulted in code moving to earlier
in series as part of xarray introduction patch. (Alex)
- Remove the local allow_dyn_alloc and direct calling of
pci_msix_can_alloc_dyn(), use the new vdev->has_dyn_msix
introduced earlier instead. (Alex)
- Stop tracking new allocations (remove "new_ctx"). (Alex)
- Introduce new wrapper that returns Linux interrupt number or
dynamically allocate a new interrupt. Wrapper can be used for
all interrupt cases. (Alex)
- Only free dynamic MSI-X interrupts on MSI-X teardown. (Alex)

Changes since RFC V1:
- Add pointer to interrupt context as function parameter to
vfio_irq_ctx_free(). (Alex)
- Initialize new_ctx to false. (Dan Carpenter)
- Only support dynamic allocation if device supports it. (Alex)

drivers/vfio/pci/vfio_pci_intrs.c | 73 +++++++++++++++++++++++++++----
1 file changed, 65 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index bdda7f46c2be..c1a3e224c867 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -372,27 +372,74 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
return 0;
}

+/*
+ * Return Linux IRQ number of an MSI or MSI-X device interrupt vector.
+ * If a Linux IRQ number is not available then a new interrupt will be
+ * allocated if dynamic MSI-X is supported.
+ */
+static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
+ unsigned int vector, bool msix)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ struct msi_map map;
+ int irq;
+ u16 cmd;
+
+ irq = pci_irq_vector(pdev, vector);
+ if (irq > 0 || !msix || !vdev->has_dyn_msix)
+ return irq;
+
+ cmd = vfio_pci_memory_lock_and_enable(vdev);
+ map = pci_msix_alloc_irq_at(pdev, vector, NULL);
+ vfio_pci_memory_unlock_and_restore(vdev, cmd);
+
+ return map.index < 0 ? map.index : map.virq;
+}
+
+/*
+ * Free interrupt if it can be re-allocated dynamically (while MSI-X
+ * is enabled).
+ */
+static void vfio_msi_free_irq(struct vfio_pci_core_device *vdev,
+ unsigned int vector, bool msix)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ struct msi_map map;
+ int irq;
+ u16 cmd;
+
+ if (!msix || !vdev->has_dyn_msix)
+ return;
+
+ irq = pci_irq_vector(pdev, vector);
+ map = (struct msi_map) { .index = vector, .virq = irq };
+
+ if (WARN_ON(irq < 0))
+ return;
+
+ cmd = vfio_pci_memory_lock_and_enable(vdev);
+ pci_msix_free_irq(pdev, map);
+ vfio_pci_memory_unlock_and_restore(vdev, cmd);
+}
+
static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
unsigned int vector, int fd, bool msix)
{
struct pci_dev *pdev = vdev->pdev;
struct vfio_pci_irq_ctx *ctx;
struct eventfd_ctx *trigger;
- int irq, ret;
+ int irq = -EINVAL, ret;
u16 cmd;

- irq = pci_irq_vector(pdev, vector);
- if (irq < 0)
- return -EINVAL;
-
ctx = vfio_irq_ctx_get(vdev, vector);

if (ctx) {
irq_bypass_unregister_producer(&ctx->producer);
-
+ irq = pci_irq_vector(pdev, vector);
cmd = vfio_pci_memory_lock_and_enable(vdev);
free_irq(irq, ctx->trigger);
vfio_pci_memory_unlock_and_restore(vdev, cmd);
+ /* Interrupt stays allocated, will be freed at MSI-X disable. */
kfree(ctx->name);
eventfd_ctx_put(ctx->trigger);
vfio_irq_ctx_free(vdev, ctx, vector);
@@ -401,9 +448,17 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
if (fd < 0)
return 0;

+ if (irq == -EINVAL) {
+ irq = vfio_msi_alloc_irq(vdev, vector, msix);
+ if (irq < 0)
+ return irq;
+ }
+
ctx = vfio_irq_ctx_alloc(vdev, vector);
- if (!ctx)
- return -ENOMEM;
+ if (!ctx) {
+ ret = -ENOMEM;
+ goto out_free_irq;
+ }

ctx->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-msi%s[%d](%s)",
msix ? "x" : "", vector, pci_name(pdev));
@@ -456,6 +511,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
kfree(ctx->name);
out_free_ctx:
vfio_irq_ctx_free(vdev, ctx, vector);
+out_free_irq:
+ vfio_msi_free_irq(vdev, vector, msix);
return ret;
}

--
2.34.1

2023-04-18 17:32:16

by Reinette Chatre

[permalink] [raw]
Subject: [PATCH V3 10/10] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X

Dynamic MSI-X is supported. Clear VFIO_IRQ_INFO_NORESIZE
to provide guidance to user space.

Signed-off-by: Reinette Chatre <[email protected]>
---
Changes since V2:
- Use new vdev->has_dyn_msix property instead of calling
pci_msix_can_alloc_dyn() directly. (Alex)

Changes since RFC V1:
- Only advertise VFIO_IRQ_INFO_NORESIZE for MSI-X devices that
can actually support dynamic allocation. (Alex)

drivers/vfio/pci/vfio_pci_core.c | 4 +++-
include/uapi/linux/vfio.h | 3 +++
2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a3635a8e54c8..4050ad3388c2 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1114,7 +1114,9 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
info.flags |=
(VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED);
- else
+ else if ((info.index != VFIO_PCI_MSIX_IRQ_INDEX) ||
+ (info.index == VFIO_PCI_MSIX_IRQ_INDEX &&
+ !vdev->has_dyn_msix))
info.flags |= VFIO_IRQ_INFO_NORESIZE;

return copy_to_user(arg, &info, minsz) ? -EFAULT : 0;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0552e8dcf0cb..1a36134cae5c 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -511,6 +511,9 @@ struct vfio_region_info_cap_nvlink2_lnkspd {
* then add and unmask vectors, it's up to userspace to make the decision
* whether to allocate the maximum supported number of vectors or tear
* down setup and incrementally increase the vectors as each is enabled.
+ * Absence of the NORESIZE flag indicates that vectors can be enabled
+ * and disabled dynamically without impacting other vectors within the
+ * index.
*/
struct vfio_irq_info {
__u32 argsz;
--
2.34.1

2023-04-18 17:33:18

by Reinette Chatre

[permalink] [raw]
Subject: [PATCH V3 08/10] vfio/pci: Probe and store ability to support dynamic MSI-X

Not all MSI-X devices support dynamic MSI-X allocation. Whether
a device supports dynamic MSI-X should be queried using
pci_msix_can_alloc_dyn().

Instead of scattering code with pci_msix_can_alloc_dyn(),
probe this ability once and store it as a property of the
virtual device.

Suggested-by: Alex Williamson <[email protected]>
Signed-off-by: Reinette Chatre <[email protected]>
---
Changes since V2:
- New patch. (Alex)

drivers/vfio/pci/vfio_pci_core.c | 5 ++++-
include/linux/vfio_pci_core.h | 1 +
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ae0e161c7fc9..a3635a8e54c8 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -530,8 +530,11 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
vdev->msix_bar = table & PCI_MSIX_TABLE_BIR;
vdev->msix_offset = table & PCI_MSIX_TABLE_OFFSET;
vdev->msix_size = ((flags & PCI_MSIX_FLAGS_QSIZE) + 1) * 16;
- } else
+ vdev->has_dyn_msix = pci_msix_can_alloc_dyn(pdev);
+ } else {
vdev->msix_bar = 0xFF;
+ vdev->has_dyn_msix = false;
+ }

if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
vdev->has_vga = true;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 148fd1ae6c1c..4f070f2d6fde 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -67,6 +67,7 @@ struct vfio_pci_core_device {
u8 msix_bar;
u16 msix_size;
u32 msix_offset;
+ bool has_dyn_msix;
u32 rbar[7];
bool pci_2_3;
bool virq_disabled;
--
2.34.1

2023-04-18 22:40:55

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH V3 08/10] vfio/pci: Probe and store ability to support dynamic MSI-X

On Tue, 18 Apr 2023 10:29:19 -0700
Reinette Chatre <[email protected]> wrote:

> Not all MSI-X devices support dynamic MSI-X allocation. Whether
> a device supports dynamic MSI-X should be queried using
> pci_msix_can_alloc_dyn().
>
> Instead of scattering code with pci_msix_can_alloc_dyn(),
> probe this ability once and store it as a property of the
> virtual device.
>
> Suggested-by: Alex Williamson <[email protected]>
> Signed-off-by: Reinette Chatre <[email protected]>
> ---
> Changes since V2:
> - New patch. (Alex)
>
> drivers/vfio/pci/vfio_pci_core.c | 5 ++++-
> include/linux/vfio_pci_core.h | 1 +
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index ae0e161c7fc9..a3635a8e54c8 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -530,8 +530,11 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
> vdev->msix_bar = table & PCI_MSIX_TABLE_BIR;
> vdev->msix_offset = table & PCI_MSIX_TABLE_OFFSET;
> vdev->msix_size = ((flags & PCI_MSIX_FLAGS_QSIZE) + 1) * 16;
> - } else
> + vdev->has_dyn_msix = pci_msix_can_alloc_dyn(pdev);
> + } else {
> vdev->msix_bar = 0xFF;
> + vdev->has_dyn_msix = false;
> + }
>
> if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
> vdev->has_vga = true;
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 148fd1ae6c1c..4f070f2d6fde 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -67,6 +67,7 @@ struct vfio_pci_core_device {
> u8 msix_bar;
> u16 msix_size;
> u32 msix_offset;
> + bool has_dyn_msix;
> u32 rbar[7];
> bool pci_2_3;
> bool virq_disabled;

Nit, the whole data structure probably needs to be sorted with pahole,
but creating a hole here for locality to other msix fields should
probably be secondary to keeping the structure well packed, which
suggests including this new field among the bools below. Thanks,

Alex

2023-04-18 22:41:15

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH V3 09/10] vfio/pci: Support dynamic MSI-X

On Tue, 18 Apr 2023 10:29:20 -0700
Reinette Chatre <[email protected]> wrote:

> Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq()
> enables an individual MSI-X interrupt to be allocated and freed after
> MSI-X enabling.
>
> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
> the time a valid eventfd is assigned. This is different behavior from
> a range provided during MSI-X enabling where interrupts are allocated
> for the entire range whether a valid eventfd is provided for each
> interrupt or not.
>
> Do not dynamically free interrupts, leave that to when MSI-X is
> disabled.

But we do, sometimes, even if essentially only on the error path. Is
that worthwhile? It seems like we could entirely remove
vfio_msi_free_irq() and rely only on pci_free_irq_vectors() on MSI/X
teardown.

I'd probably also add a comment in the commit log about the theory
behind not dynamically freeing irqs, ie. latency, reliability, and
whatever else we used to justify it. Thanks,

Alex

> Signed-off-by: Reinette Chatre <[email protected]>
> Link: https://lore.kernel.org/lkml/[email protected]/
> ---
> The get_cached_msi_msg()/pci_write_msi_msg() behavior is kept
> similar to the scenario when MSI-X is enabled with triggers
> provided for new interrupts. get_cached_msi_msg()/pci_write_msi_msg()
> follows for interrupts recently allocated with pci_msix_alloc_irq_at()
> just like get_cached_msi_msg()/pci_write_msi_msg() is done for
> interrupts recently allocated with pci_alloc_irq_vectors().
>
> Changes since V2:
> - Move vfio_irq_ctx_free() to earlier in series to support
> earlier usage. (Alex)
> - Use consistent terms in changelog: MSI-x changed to MSI-X.
> - Make dynamic interrupt context creation generic across all
> MSI/MSI-X interrupts. This resulted in code moving to earlier
> in series as part of xarray introduction patch. (Alex)
> - Remove the local allow_dyn_alloc and direct calling of
> pci_msix_can_alloc_dyn(), use the new vdev->has_dyn_msix
> introduced earlier instead. (Alex)
> - Stop tracking new allocations (remove "new_ctx"). (Alex)
> - Introduce new wrapper that returns Linux interrupt number or
> dynamically allocate a new interrupt. Wrapper can be used for
> all interrupt cases. (Alex)
> - Only free dynamic MSI-X interrupts on MSI-X teardown. (Alex)
>
> Changes since RFC V1:
> - Add pointer to interrupt context as function parameter to
> vfio_irq_ctx_free(). (Alex)
> - Initialize new_ctx to false. (Dan Carpenter)
> - Only support dynamic allocation if device supports it. (Alex)
>
> drivers/vfio/pci/vfio_pci_intrs.c | 73 +++++++++++++++++++++++++++----
> 1 file changed, 65 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index bdda7f46c2be..c1a3e224c867 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -372,27 +372,74 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
> return 0;
> }
>
> +/*
> + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector.
> + * If a Linux IRQ number is not available then a new interrupt will be
> + * allocated if dynamic MSI-X is supported.
> + */
> +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
> + unsigned int vector, bool msix)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + struct msi_map map;
> + int irq;
> + u16 cmd;
> +
> + irq = pci_irq_vector(pdev, vector);
> + if (irq > 0 || !msix || !vdev->has_dyn_msix)
> + return irq;
> +
> + cmd = vfio_pci_memory_lock_and_enable(vdev);
> + map = pci_msix_alloc_irq_at(pdev, vector, NULL);
> + vfio_pci_memory_unlock_and_restore(vdev, cmd);
> +
> + return map.index < 0 ? map.index : map.virq;
> +}
> +
> +/*
> + * Free interrupt if it can be re-allocated dynamically (while MSI-X
> + * is enabled).
> + */
> +static void vfio_msi_free_irq(struct vfio_pci_core_device *vdev,
> + unsigned int vector, bool msix)
> +{
> + struct pci_dev *pdev = vdev->pdev;
> + struct msi_map map;
> + int irq;
> + u16 cmd;
> +
> + if (!msix || !vdev->has_dyn_msix)
> + return;
> +
> + irq = pci_irq_vector(pdev, vector);
> + map = (struct msi_map) { .index = vector, .virq = irq };
> +
> + if (WARN_ON(irq < 0))
> + return;
> +
> + cmd = vfio_pci_memory_lock_and_enable(vdev);
> + pci_msix_free_irq(pdev, map);
> + vfio_pci_memory_unlock_and_restore(vdev, cmd);
> +}
> +
> static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
> unsigned int vector, int fd, bool msix)
> {
> struct pci_dev *pdev = vdev->pdev;
> struct vfio_pci_irq_ctx *ctx;
> struct eventfd_ctx *trigger;
> - int irq, ret;
> + int irq = -EINVAL, ret;
> u16 cmd;
>
> - irq = pci_irq_vector(pdev, vector);
> - if (irq < 0)
> - return -EINVAL;
> -
> ctx = vfio_irq_ctx_get(vdev, vector);
>
> if (ctx) {
> irq_bypass_unregister_producer(&ctx->producer);
> -
> + irq = pci_irq_vector(pdev, vector);
> cmd = vfio_pci_memory_lock_and_enable(vdev);
> free_irq(irq, ctx->trigger);
> vfio_pci_memory_unlock_and_restore(vdev, cmd);
> + /* Interrupt stays allocated, will be freed at MSI-X disable. */
> kfree(ctx->name);
> eventfd_ctx_put(ctx->trigger);
> vfio_irq_ctx_free(vdev, ctx, vector);
> @@ -401,9 +448,17 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
> if (fd < 0)
> return 0;
>
> + if (irq == -EINVAL) {
> + irq = vfio_msi_alloc_irq(vdev, vector, msix);
> + if (irq < 0)
> + return irq;
> + }
> +
> ctx = vfio_irq_ctx_alloc(vdev, vector);
> - if (!ctx)
> - return -ENOMEM;
> + if (!ctx) {
> + ret = -ENOMEM;
> + goto out_free_irq;
> + }
>
> ctx->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-msi%s[%d](%s)",
> msix ? "x" : "", vector, pci_name(pdev));
> @@ -456,6 +511,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
> kfree(ctx->name);
> out_free_ctx:
> vfio_irq_ctx_free(vdev, ctx, vector);
> +out_free_irq:
> + vfio_msi_free_irq(vdev, vector, msix);
> return ret;
> }
>

2023-04-18 22:42:28

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH V3 10/10] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X

On Tue, 18 Apr 2023 10:29:21 -0700
Reinette Chatre <[email protected]> wrote:

> Dynamic MSI-X is supported. Clear VFIO_IRQ_INFO_NORESIZE
> to provide guidance to user space.
>
> Signed-off-by: Reinette Chatre <[email protected]>
> ---
> Changes since V2:
> - Use new vdev->has_dyn_msix property instead of calling
> pci_msix_can_alloc_dyn() directly. (Alex)
>
> Changes since RFC V1:
> - Only advertise VFIO_IRQ_INFO_NORESIZE for MSI-X devices that
> can actually support dynamic allocation. (Alex)
>
> drivers/vfio/pci/vfio_pci_core.c | 4 +++-
> include/uapi/linux/vfio.h | 3 +++
> 2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a3635a8e54c8..4050ad3388c2 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1114,7 +1114,9 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
> if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
> info.flags |=
> (VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED);
> - else
> + else if ((info.index != VFIO_PCI_MSIX_IRQ_INDEX) ||
> + (info.index == VFIO_PCI_MSIX_IRQ_INDEX &&
> + !vdev->has_dyn_msix))

Isn't this the same as:

(info.index != VFIO_PCI_MSIX_IRQ_INDEX || !vdev->has_dyn_msix)

Thanks,
Alex

> info.flags |= VFIO_IRQ_INFO_NORESIZE;
>
> return copy_to_user(arg, &info, minsz) ? -EFAULT : 0;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 0552e8dcf0cb..1a36134cae5c 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -511,6 +511,9 @@ struct vfio_region_info_cap_nvlink2_lnkspd {
> * then add and unmask vectors, it's up to userspace to make the decision
> * whether to allocate the maximum supported number of vectors or tear
> * down setup and incrementally increase the vectors as each is enabled.
> + * Absence of the NORESIZE flag indicates that vectors can be enabled
> + * and disabled dynamically without impacting other vectors within the
> + * index.
> */
> struct vfio_irq_info {
> __u32 argsz;

2023-04-19 18:15:04

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH V3 08/10] vfio/pci: Probe and store ability to support dynamic MSI-X

Hi Alex,

On 4/18/2023 3:38 PM, Alex Williamson wrote:
> On Tue, 18 Apr 2023 10:29:19 -0700
> Reinette Chatre <[email protected]> wrote:
>

...

>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index 148fd1ae6c1c..4f070f2d6fde 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -67,6 +67,7 @@ struct vfio_pci_core_device {
>> u8 msix_bar;
>> u16 msix_size;
>> u32 msix_offset;
>> + bool has_dyn_msix;
>> u32 rbar[7];
>> bool pci_2_3;
>> bool virq_disabled;
>
> Nit, the whole data structure probably needs to be sorted with pahole,
> but creating a hole here for locality to other msix fields should
> probably be secondary to keeping the structure well packed, which
> suggests including this new field among the bools below. Thanks,

Thanks for catching this. Moving it one field lower as shown in the
delta patch below seems to improve the layout:

diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 4f070f2d6fde..d730d78754a2 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -67,8 +67,8 @@ struct vfio_pci_core_device {
u8 msix_bar;
u16 msix_size;
u32 msix_offset;
- bool has_dyn_msix;
u32 rbar[7];
+ bool has_dyn_msix;
bool pci_2_3;
bool virq_disabled;
bool reset_works;


Combined with the other changes to this struct (new struct xarray
for the context, and removing int num_ctx) the bools are no longer
together on a single cache line. Placing has_dyn_msix as shown above
keeps it on the same cache line as the other msix_* fields.

After this change the layout of this struct appears to be improved.
Before this patch series (v6.3-rc7):
/* size: 2496, cachelines: 39, members: 46 */
/* sum members: 2485, holes: 4, sum holes: 11 */
/* paddings: 2, sum paddings: 11 */
/* forced alignments: 1 */

After this patch series (v6.3-rc7 + V3 + delta patch):
/* size: 2568, cachelines: 41, members: 46 */
/* sum members: 2562, holes: 2, sum holes: 6 */
/* paddings: 2, sum paddings: 11 */
/* forced alignments: 1 */
/* last cacheline: 8 bytes */

Reinette

2023-04-19 18:16:54

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH V3 09/10] vfio/pci: Support dynamic MSI-X

Hi Alex,

On 4/18/2023 3:38 PM, Alex Williamson wrote:
> On Tue, 18 Apr 2023 10:29:20 -0700
> Reinette Chatre <[email protected]> wrote:
>
>> Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq()
>> enables an individual MSI-X interrupt to be allocated and freed after
>> MSI-X enabling.
>>
>> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
>> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
>> the time a valid eventfd is assigned. This is different behavior from
>> a range provided during MSI-X enabling where interrupts are allocated
>> for the entire range whether a valid eventfd is provided for each
>> interrupt or not.
>>
>> Do not dynamically free interrupts, leave that to when MSI-X is
>> disabled.
>
> But we do, sometimes, even if essentially only on the error path. Is
> that worthwhile? It seems like we could entirely remove
> vfio_msi_free_irq() and rely only on pci_free_irq_vectors() on MSI/X
> teardown.

Yes, it is only on the error path where dynamic MSI-X interrupts are
removed. I do not know how to determine if it is worthwhile. On the
kernel side failure seems unlikely since it would mean memory cannot
be allocated or request_irq() failed. In these cases it may not be
worthwhile since user space may try again and having the interrupt
already allocated would be helpful. The remaining error seems to be
if user space provided an invalid eventfd. An allocation in response
to wrong user input is a concern to me. Should we consider
buggy/malicious users? I am uncertain here so would defer to your
guidance.

> I'd probably also add a comment in the commit log about the theory
> behind not dynamically freeing irqs, ie. latency, reliability, and
> whatever else we used to justify it. Thanks,

Sure. How about something like below to replace the final sentence of
the changelog:

"When a guest disables an interrupt, user space (Qemu) does not
disable the interrupt but instead assigns it a different trigger. A
common flow is thus for the VFIO_DEVICE_SET_IRQS ioctl() to be
used to assign a new eventfd to an already enabled interrupt. Freeing
and re-allocating an interrupt in this scenario would add unnecessary
latency as well as uncertainty since the re-allocation may fail. Do
not dynamically free interrupts when an interrupt is disabled, instead
support a subsequent re-enable to draw from the initial allocation when
possible. Leave freeing of interrupts to when MSI-X is disabled."

Reinette

2023-04-19 18:17:39

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH V3 10/10] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X

Hi Alex,

On 4/18/2023 3:38 PM, Alex Williamson wrote:
> On Tue, 18 Apr 2023 10:29:21 -0700
> Reinette Chatre <[email protected]> wrote:
>

...

>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index a3635a8e54c8..4050ad3388c2 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -1114,7 +1114,9 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
>> if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
>> info.flags |=
>> (VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED);
>> - else
>> + else if ((info.index != VFIO_PCI_MSIX_IRQ_INDEX) ||
>> + (info.index == VFIO_PCI_MSIX_IRQ_INDEX &&
>> + !vdev->has_dyn_msix))
>
> Isn't this the same as:
>
> (info.index != VFIO_PCI_MSIX_IRQ_INDEX || !vdev->has_dyn_msix)
>

Yes, it is. Will fix. Thank you very much.

Reinette

2023-04-19 21:44:55

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH V3 09/10] vfio/pci: Support dynamic MSI-X

On Wed, 19 Apr 2023 11:13:29 -0700
Reinette Chatre <[email protected]> wrote:

> Hi Alex,
>
> On 4/18/2023 3:38 PM, Alex Williamson wrote:
> > On Tue, 18 Apr 2023 10:29:20 -0700
> > Reinette Chatre <[email protected]> wrote:
> >
> >> Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq()
> >> enables an individual MSI-X interrupt to be allocated and freed after
> >> MSI-X enabling.
> >>
> >> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
> >> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
> >> the time a valid eventfd is assigned. This is different behavior from
> >> a range provided during MSI-X enabling where interrupts are allocated
> >> for the entire range whether a valid eventfd is provided for each
> >> interrupt or not.
> >>
> >> Do not dynamically free interrupts, leave that to when MSI-X is
> >> disabled.
> >
> > But we do, sometimes, even if essentially only on the error path. Is
> > that worthwhile? It seems like we could entirely remove
> > vfio_msi_free_irq() and rely only on pci_free_irq_vectors() on MSI/X
> > teardown.
>
> Yes, it is only on the error path where dynamic MSI-X interrupts are
> removed. I do not know how to determine if it is worthwhile. On the
> kernel side failure seems unlikely since it would mean memory cannot
> be allocated or request_irq() failed. In these cases it may not be
> worthwhile since user space may try again and having the interrupt
> already allocated would be helpful. The remaining error seems to be
> if user space provided an invalid eventfd. An allocation in response
> to wrong user input is a concern to me. Should we consider
> buggy/malicious users? I am uncertain here so would defer to your
> guidance.

I don't really see that a malicious user can exploit anything here,
their irq allocation is bound by the device support and they're
entitled to make use of the full vector set of the device by virtue of
having ownership of the device. All the MSI-X allocated irqs are freed
when the interrupt mode is changed or the device is released regardless.

The end result is also no different than if the user had not configured
the vector when enabling MSI-X or configured it and later de-configured
with a -1 eventfd. The irq is allocated but not attached to a ctx.
We're intentionally using this as a cache.

Also, as implemented here in v3, we might be freeing from the original
allocation rather than a new, dynamically allocated irq.

My thinking is that if we think there's a benefit to caching any
allocated irqs, we should do so consistently. We don't currently know
if the irq was allocated now or previously. Tracking that would add
complexity for little benefit. The user can get to the same end result
of an allocated, unused irq in numerous way, the state itself is not
erroneous, and is actually in support of caching irq allocations.
Removing the de-allocation of a single vector further simplifies the
code as there exists only one path where irqs are freed, ie.
pci_free_irq_vectors().

So I'd lean towards removing vfio_msi_free_irq().

> > I'd probably also add a comment in the commit log about the theory
> > behind not dynamically freeing irqs, ie. latency, reliability, and
> > whatever else we used to justify it. Thanks,
>
> Sure. How about something like below to replace the final sentence of
> the changelog:
>
> "When a guest disables an interrupt, user space (Qemu) does not
> disable the interrupt but instead assigns it a different trigger. A
> common flow is thus for the VFIO_DEVICE_SET_IRQS ioctl() to be
> used to assign a new eventfd to an already enabled interrupt. Freeing
> and re-allocating an interrupt in this scenario would add unnecessary
> latency as well as uncertainty since the re-allocation may fail. Do
> not dynamically free interrupts when an interrupt is disabled, instead
> support a subsequent re-enable to draw from the initial allocation when
> possible. Leave freeing of interrupts to when MSI-X is disabled."

There are other means besides caching irqs that could achieve the above,
for instance if a trigger is simply swapped from one eventfd to another,
that all happens within vfio_msi_set_vector_signal() where we could
hold the irq for the transition.

I think I might justify it as:

The PCI-MSIX API requires that some number of irqs are
allocated for an initial set of vectors when enabling MSI-X on
the device. When dynamic MSIX allocation is not supported, the
vector table, and thus the allocated irq set can only be resized
by disabling and re-enabling MSIX with a different range. In
that case the irq allocation is essentially a cache for
configuring vectors within the previously allocated vector
range. When dynamic MSIX allocation is supported, the API
still requires some initial set of irqs to be allocated, but
also supports allocating and freeing specific irq vectors both
within and beyond the initially allocated range.

For consistency between modes, as well as to reduce latency and
improve reliability of allocations, and also simplicity, this
implementation only releases irqs via pci_free_irq_vectors()
when either the interrupt mode changes or the device is
released.

Does that cover the key points for someone that might want to revisit
this decision later? Thanks,

Alex

2023-04-19 22:18:16

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH V3 09/10] vfio/pci: Support dynamic MSI-X

Hi Alex,

On 4/19/2023 2:38 PM, Alex Williamson wrote:
> On Wed, 19 Apr 2023 11:13:29 -0700
> Reinette Chatre <[email protected]> wrote:
>> On 4/18/2023 3:38 PM, Alex Williamson wrote:
>>> On Tue, 18 Apr 2023 10:29:20 -0700
>>> Reinette Chatre <[email protected]> wrote:
>>>
>>>> Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq()
>>>> enables an individual MSI-X interrupt to be allocated and freed after
>>>> MSI-X enabling.
>>>>
>>>> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
>>>> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
>>>> the time a valid eventfd is assigned. This is different behavior from
>>>> a range provided during MSI-X enabling where interrupts are allocated
>>>> for the entire range whether a valid eventfd is provided for each
>>>> interrupt or not.
>>>>
>>>> Do not dynamically free interrupts, leave that to when MSI-X is
>>>> disabled.
>>>
>>> But we do, sometimes, even if essentially only on the error path. Is
>>> that worthwhile? It seems like we could entirely remove
>>> vfio_msi_free_irq() and rely only on pci_free_irq_vectors() on MSI/X
>>> teardown.
>>
>> Yes, it is only on the error path where dynamic MSI-X interrupts are
>> removed. I do not know how to determine if it is worthwhile. On the
>> kernel side failure seems unlikely since it would mean memory cannot
>> be allocated or request_irq() failed. In these cases it may not be
>> worthwhile since user space may try again and having the interrupt
>> already allocated would be helpful. The remaining error seems to be
>> if user space provided an invalid eventfd. An allocation in response
>> to wrong user input is a concern to me. Should we consider
>> buggy/malicious users? I am uncertain here so would defer to your
>> guidance.
>
> I don't really see that a malicious user can exploit anything here,
> their irq allocation is bound by the device support and they're
> entitled to make use of the full vector set of the device by virtue of
> having ownership of the device. All the MSI-X allocated irqs are freed
> when the interrupt mode is changed or the device is released regardless.
>
> The end result is also no different than if the user had not configured
> the vector when enabling MSI-X or configured it and later de-configured
> with a -1 eventfd. The irq is allocated but not attached to a ctx.
> We're intentionally using this as a cache.
>
> Also, as implemented here in v3, we might be freeing from the original
> allocation rather than a new, dynamically allocated irq.

Great point.

>
> My thinking is that if we think there's a benefit to caching any
> allocated irqs, we should do so consistently. We don't currently know
> if the irq was allocated now or previously. Tracking that would add
> complexity for little benefit. The user can get to the same end result
> of an allocated, unused irq in numerous way, the state itself is not
> erroneous, and is actually in support of caching irq allocations.
> Removing the de-allocation of a single vector further simplifies the
> code as there exists only one path where irqs are freed, ie.
> pci_free_irq_vectors().
>
> So I'd lean towards removing vfio_msi_free_irq().

Thank you for your detailed analysis. I understand and agree.
I will remove vfio_msi_free_irq().

>
>>> I'd probably also add a comment in the commit log about the theory
>>> behind not dynamically freeing irqs, ie. latency, reliability, and
>>> whatever else we used to justify it. Thanks,
>>
>> Sure. How about something like below to replace the final sentence of
>> the changelog:
>>
>> "When a guest disables an interrupt, user space (Qemu) does not
>> disable the interrupt but instead assigns it a different trigger. A
>> common flow is thus for the VFIO_DEVICE_SET_IRQS ioctl() to be
>> used to assign a new eventfd to an already enabled interrupt. Freeing
>> and re-allocating an interrupt in this scenario would add unnecessary
>> latency as well as uncertainty since the re-allocation may fail. Do
>> not dynamically free interrupts when an interrupt is disabled, instead
>> support a subsequent re-enable to draw from the initial allocation when
>> possible. Leave freeing of interrupts to when MSI-X is disabled."
>
> There are other means besides caching irqs that could achieve the above,
> for instance if a trigger is simply swapped from one eventfd to another,
> that all happens within vfio_msi_set_vector_signal() where we could
> hold the irq for the transition.
>
> I think I might justify it as:
>
> The PCI-MSIX API requires that some number of irqs are
> allocated for an initial set of vectors when enabling MSI-X on
> the device. When dynamic MSIX allocation is not supported, the
> vector table, and thus the allocated irq set can only be resized
> by disabling and re-enabling MSIX with a different range. In
> that case the irq allocation is essentially a cache for
> configuring vectors within the previously allocated vector
> range. When dynamic MSIX allocation is supported, the API
> still requires some initial set of irqs to be allocated, but
> also supports allocating and freeing specific irq vectors both
> within and beyond the initially allocated range.
>
> For consistency between modes, as well as to reduce latency and
> improve reliability of allocations, and also simplicity, this
> implementation only releases irqs via pci_free_irq_vectors()
> when either the interrupt mode changes or the device is
> released.
>
> Does that cover the key points for someone that might want to revisit
> this decision later? Thanks,

It does so clearly, yes. Thank you so much for taking the time to
write this. I will include it in the changelog.

Reinette


2023-04-24 17:43:52

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V3 08/10] vfio/pci: Probe and store ability to support dynamic MSI-X

On Wed, Apr 19, 2023 at 11:11:48AM -0700, Reinette Chatre wrote:
> Hi Alex,
>
> On 4/18/2023 3:38 PM, Alex Williamson wrote:
> > On Tue, 18 Apr 2023 10:29:19 -0700
> > Reinette Chatre <[email protected]> wrote:
> >
>
> ...
>
> >> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> >> index 148fd1ae6c1c..4f070f2d6fde 100644
> >> --- a/include/linux/vfio_pci_core.h
> >> +++ b/include/linux/vfio_pci_core.h
> >> @@ -67,6 +67,7 @@ struct vfio_pci_core_device {
> >> u8 msix_bar;
> >> u16 msix_size;
> >> u32 msix_offset;
> >> + bool has_dyn_msix;
> >> u32 rbar[7];
> >> bool pci_2_3;
> >> bool virq_disabled;
> >
> > Nit, the whole data structure probably needs to be sorted with pahole,
> > but creating a hole here for locality to other msix fields should
> > probably be secondary to keeping the structure well packed, which
> > suggests including this new field among the bools below. Thanks,
>
> Thanks for catching this. Moving it one field lower as shown in the
> delta patch below seems to improve the layout:
>
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 4f070f2d6fde..d730d78754a2 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -67,8 +67,8 @@ struct vfio_pci_core_device {
> u8 msix_bar;
> u16 msix_size;
> u32 msix_offset;
> - bool has_dyn_msix;
> u32 rbar[7];
> + bool has_dyn_msix;
> bool pci_2_3;
> bool virq_disabled;
> bool reset_works;

Also, Linus on record as strongly disliking these lists of bools

If they don't need read_once/etc stuff then use a list of bitfields

bool abc:1;
bool xyz:1;

Jason

2023-04-24 23:53:49

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH V3 08/10] vfio/pci: Probe and store ability to support dynamic MSI-X

Hi Jason,

On 4/24/2023 10:43 AM, Jason Gunthorpe wrote:
> On Wed, Apr 19, 2023 at 11:11:48AM -0700, Reinette Chatre wrote:
>> On 4/18/2023 3:38 PM, Alex Williamson wrote:
>>> On Tue, 18 Apr 2023 10:29:19 -0700
>>> Reinette Chatre <[email protected]> wrote:

...

>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index 4f070f2d6fde..d730d78754a2 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -67,8 +67,8 @@ struct vfio_pci_core_device {
>> u8 msix_bar;
>> u16 msix_size;
>> u32 msix_offset;
>> - bool has_dyn_msix;
>> u32 rbar[7];
>> + bool has_dyn_msix;
>> bool pci_2_3;
>> bool virq_disabled;
>> bool reset_works;
>
> Also, Linus on record as strongly disliking these lists of bools

This looks like an example:
https://lkml.org/lkml/2017/11/21/384

>
> If they don't need read_once/etc stuff then use a list of bitfields

I do not see any direct usage of read_once in the driver, but it is not
clear to me what falls under the "etc" umbrella. Do you consider all
the bools in struct vfio_pci_core_device to be candidates for
transition?

>
> bool abc:1;
> bool xyz:1;
>

I think a base type of unsigned int since it appears to be the custom
and (if I understand correctly) was preferred at the time Linus wrote
the message I found.

Looking ahead there seems be be a bigger task here. A quick search
revealed a few other instances of vfio using "bool" in a struct. It
does not all qualify for your "lists of bools" comment, but they
may need a closer look because of the "please don't use "bool" in
structures at all" comment made by Linus in the email I found.

vfio_device::iommufd_attached
vfio_container::noiommu
vfio_platform_irq::masked
vfio_platform_device::reset_required
vfio_iommu::v2
vfio_iommu::nesting
vfio_iommu::dirty_page_tracking
vfio_dma::iommu_mapped
vfio_dma::lock_cap
vfio_dma::vaddr_invalid
vfio_iommu_group::pinned_page_dirty_scope
tce_container::enabled
tce_container::v2
tce_container::def_window_pending

Reinette

2023-04-25 15:03:01

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V3 08/10] vfio/pci: Probe and store ability to support dynamic MSI-X

On Mon, Apr 24, 2023 at 04:52:08PM -0700, Reinette Chatre wrote:
> Hi Jason,
>
> On 4/24/2023 10:43 AM, Jason Gunthorpe wrote:
> > On Wed, Apr 19, 2023 at 11:11:48AM -0700, Reinette Chatre wrote:
> >> On 4/18/2023 3:38 PM, Alex Williamson wrote:
> >>> On Tue, 18 Apr 2023 10:29:19 -0700
> >>> Reinette Chatre <[email protected]> wrote:
>
> ...
>
> >> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> >> index 4f070f2d6fde..d730d78754a2 100644
> >> --- a/include/linux/vfio_pci_core.h
> >> +++ b/include/linux/vfio_pci_core.h
> >> @@ -67,8 +67,8 @@ struct vfio_pci_core_device {
> >> u8 msix_bar;
> >> u16 msix_size;
> >> u32 msix_offset;
> >> - bool has_dyn_msix;
> >> u32 rbar[7];
> >> + bool has_dyn_msix;
> >> bool pci_2_3;
> >> bool virq_disabled;
> >> bool reset_works;
> >
> > Also, Linus on record as strongly disliking these lists of bools
>
> This looks like an example:
> https://lkml.org/lkml/2017/11/21/384
>
> >
> > If they don't need read_once/etc stuff then use a list of bitfields
>
> I do not see any direct usage of read_once in the driver, but it is not
> clear to me what falls under the "etc" umbrella.

Anything that might assume atomicity, smp_store_release, set_bit, and others

> Do you consider all the bools in struct vfio_pci_core_device to be
> candidates for transition?

Yes, group them ito into a bitfield.

> I think a base type of unsigned int since it appears to be the custom
> and (if I understand correctly) was preferred at the time Linus wrote
> the message I found.

It doesn't matter a lot, using "bool" means the compiler adds extra
code to ensure "foo = 4" stores true, and the underyling size is not
well defined (but we don't care here)

> Looking ahead there seems be be a bigger task here. A quick search
> revealed a few other instances of vfio using "bool" in a struct. It
> does not all qualify for your "lists of bools" comment, but they
> may need a closer look because of the "please don't use "bool" in
> structures at all" comment made by Linus in the email I found.

IMHO bool is helpful for clarity, it says it is a flag. In these cases
we won't gain anything by using u8 instead

Lists of bools however start to get a little silly when we use maybe 4
bytes per bool (though x86-64 is using 1 byte in structs)

Jason

2023-04-25 16:57:35

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH V3 08/10] vfio/pci: Probe and store ability to support dynamic MSI-X

Hi Jason,

On 4/25/2023 7:51 AM, Jason Gunthorpe wrote:
> On Mon, Apr 24, 2023 at 04:52:08PM -0700, Reinette Chatre wrote:
>> On 4/24/2023 10:43 AM, Jason Gunthorpe wrote:
>>> On Wed, Apr 19, 2023 at 11:11:48AM -0700, Reinette Chatre wrote:
>>>> On 4/18/2023 3:38 PM, Alex Williamson wrote:
>>>>> On Tue, 18 Apr 2023 10:29:19 -0700
>>>>> Reinette Chatre <[email protected]> wrote:
>>
>> ...
>>
>>>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>>>> index 4f070f2d6fde..d730d78754a2 100644
>>>> --- a/include/linux/vfio_pci_core.h
>>>> +++ b/include/linux/vfio_pci_core.h
>>>> @@ -67,8 +67,8 @@ struct vfio_pci_core_device {
>>>> u8 msix_bar;
>>>> u16 msix_size;
>>>> u32 msix_offset;
>>>> - bool has_dyn_msix;
>>>> u32 rbar[7];
>>>> + bool has_dyn_msix;
>>>> bool pci_2_3;
>>>> bool virq_disabled;
>>>> bool reset_works;
>>>
>>> Also, Linus on record as strongly disliking these lists of bools
>>
>> This looks like an example:
>> https://lkml.org/lkml/2017/11/21/384
>>
>>>
>>> If they don't need read_once/etc stuff then use a list of bitfields
>>
>> I do not see any direct usage of read_once in the driver, but it is not
>> clear to me what falls under the "etc" umbrella.
>
> Anything that might assume atomicity, smp_store_release, set_bit, and others
>
>> Do you consider all the bools in struct vfio_pci_core_device to be
>> candidates for transition?
>
> Yes, group them ito into a bitfield.

Will do.

>
>> I think a base type of unsigned int since it appears to be the custom
>> and (if I understand correctly) was preferred at the time Linus wrote
>> the message I found.
>
> It doesn't matter a lot, using "bool" means the compiler adds extra
> code to ensure "foo = 4" stores true, and the underyling size is not
> well defined (but we don't care here)

Looking further outside that email thread I do see using base type of bool
is common. I will use that. Doing so also reduces the churn of this
transition since only the data structure changes, not the code.
>> Looking ahead there seems be be a bigger task here. A quick search
>> revealed a few other instances of vfio using "bool" in a struct. It
>> does not all qualify for your "lists of bools" comment, but they
>> may need a closer look because of the "please don't use "bool" in
>> structures at all" comment made by Linus in the email I found.
>
> IMHO bool is helpful for clarity, it says it is a flag. In these cases
> we won't gain anything by using u8 instead
>
> Lists of bools however start to get a little silly when we use maybe 4
> bytes per bool (though x86-64 is using 1 byte in structs)
>

Thank you very much for catching this and providing guidance. I plan to
include this change to struct vfio_pci_core_device as a prep
patch within this series.

Reinette