2023-09-29 23:55:55

by Jakub Sitnicki

Subject: [PATCH 0/2] Support multiple interrupts for virtio over MMIO devices

# Intro

This patch set enables virtio-mmio devices to use multiple interrupts.

The elevator pitch would be:

"""
To keep the complexity down to a minimum, but at the same time get to the
same performance level as virtio-pci devices, we:

1) keep using the legacy interrupts,
2) have a predefined, device type specific, mapping of IRQs to virtqueues, and
3) rely on vhost offload for both data and notifications (irqfd/ioeventfd).
"""

As this is an RFC, we aim to (i) present our use case, and (ii) get an
idea of whether we are going in the right direction.

Otherwise we have kept changes down to a working minimum, where we can
already demonstrate the performance benefits.

At this point, we have not:
- drafted any change proposals to the VIRTIO spec, or
- added support for virtio-mmio driver "configuration backends" other than
the kernel command line (that is, ACPI and DT).

# Motivation

This work aims to enable lightweight VMs (like QEMU microvm, Firecracker,
Cloud Hypervisor), which rely on the virtio MMIO transport, to utilize a
multi-queue virtio NIC to its full potential when multiple vCPUs are
available.

Currently with MMIO transport, it is not possible to process vNIC queue
events in parallel because there is just one interrupt per virtio-mmio
device, and hence one CPU processing the virtqueue events.

We are looking to change that, so that the vNIC performance (measured in
pps) scales together with the number of vNIC queues, and allocated vCPUs.

Our goal is to reach the same pps level as virtio-pci vNIC delivers today.

# Prior Work

So far we have seen two attempts at making virtio-mmio devices use multiple
IRQs. First in 2014 [1], then in 2020 [2]. At least that is all we could
find.

Gathering from discussions and review feedback, the pitfalls in the
previous submissions were:

1. lack of proof that there are performance benefits (see [1]),
2. code complexity (see [2]),
3. no reference VMM (QEMU) implementation ([1] and [2]).

We try not to repeat these mistakes.

[1] https://lore.kernel.org/r/[email protected]/
[2] https://lore.kernel.org/r/[email protected]/

# Benchmark Setup and Results

Traffic flow:

host -> guest (reflect in XDP native) -> host

host-guest-host with an XDP program reflecting UDP packets is just one of
the production use cases we are interested in.

Another one is a typical host-to-guest scenario, where UDP flows are
terminated in the guest. The latter, however, takes more work to benchmark
because it requires manual sender throttling to avoid very high losses on
the receiver.

Setup details:

- guest:
- Linux v6.5 + this patchset
- 8 vCPUs
- 16 vNIC queues (8 in use + 8 for lockless XDP TX)
- host
- VMM - QEMU v8.1.0 + PoC changes (see below)
- vhost offload enabled
- iperf3 v3.12 used as sender and receiver
- traffic pattern
- 8 uni-directional, small-packet UDP flows
- flow steering - one flow per vNIC RX queue
- CPU affinity
- iperf clients, iperf servers, KVM vCPU threads, and vhost threads pinned
to their own logical CPUs
- all used logical CPUs on the same NUMA node

Recorded receiver pps:

                      virtio-pci      virtio-mmio     virtio-mmio
                      8+8+1 IRQs      8 IRQs          1 IRQ

rx pps (mean ± rsd):  217,743 ± 2.4%  221,741 ± 2.7%  48,910 ± 0.03%
pkt loss (min … max): 1.8% … 2.3%     2.9% … 3.6%     82.1% … 89.3%

rx pps is the average over 8 receivers, each receiving one UDP flow.
pkt loss is not aggregated; the loss for each UDP flow falls within the
given range.

If anyone would like to reproduce these results, we would be happy to share
detailed setup steps and tooling (scripts).

# PoC QEMU changes

QEMU is the only VMM known to us where we can compare the performance of
both the virtio PCI and MMIO transports with a multi-queue virtio NIC and
vhost offload.

Hence, accompanying these patches, we also have rather raw, not yet
review-ready, QEMU code changes that we used to test and benchmark a
virtio-mmio device with multiple IRQs.

The tag with changes is available at:

https://github.com/jsitnicki/qemu/commits/virtio-mmio-multi-irq-rfc1

# Open Questions

- Do we need a feature flag, for example VIRTIO_F_MULTI_IRQ, for the guest to
inform the VMM that it understands the feature?

Or can we assume that the VMM assigns multiple IRQs to a virtio-mmio device
only if the guest is compatible?

Looking forward to your feedback.

Jakub Sitnicki (2):
virtio-mmio: Parse a range of IRQ numbers passed on the command line
virtio-mmio: Support multiple interrupts per device

drivers/virtio/virtio_mmio.c | 179 ++++++++++++++++++++++++++++++++-----------
1 file changed, 135 insertions(+), 44 deletions(-)


2023-09-30 00:22:19

by Jakub Sitnicki

Subject: [PATCH 2/2] virtio-mmio: Support multiple interrupts per device

Some virtual devices, such as the virtio network device, can use multiple
virtqueues (or multiple pairs of virtqueues in the case of a vNIC). In such
a case, when there are multiple vCPUs present, it is possible to process
virtqueue events in parallel. Each vCPU can service a subset of all
virtqueues when notified that there is work to carry out.

However, the current virtio-mmio transport implementation poses a
limitation. Only one vCPU can service notifications from any of the
virtqueues of a single virtio device. This is because a virtio-mmio device
driver supports registering just one interrupt per device. With such a
setup, we are not able to scale virtqueue event processing among vCPUs.

Now, with more than one IRQ resource registered for a virtio-mmio platform
device, we can address this limitation.

First, we request multiple IRQs when creating virtqueues for a device.

Then, map each virtqueue to one of the IRQs assigned to the device. The
mapping is done in a device type specific manner. For instance, a network
device will want each RX/TX virtqueue pair mapped to a different IRQ
line. Other device types might require a different mapping scheme. We
currently provide a mapping for virtio-net device type.

Finally, when handling an interrupt, we service only the virtqueues
associated with the IRQ line that triggered the event.

Signed-off-by: Jakub Sitnicki <[email protected]>
---
drivers/virtio/virtio_mmio.c | 102 +++++++++++++++++++++++++++++++++++--------
1 file changed, 83 insertions(+), 19 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 06a587b23542..180c51c27704 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -69,6 +69,7 @@
#include <linux/spinlock.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>
+#include <uapi/linux/virtio_ids.h>
#include <uapi/linux/virtio_mmio.h>
#include <linux/virtio_ring.h>

@@ -93,6 +94,10 @@ struct virtio_mmio_device {
/* a list of queues so we can dispatch IRQs */
spinlock_t lock;
struct list_head virtqueues;
+
+ /* IRQ range allocated to the device */
+ unsigned int irq_base;
+ unsigned int num_irqs;
};

struct virtio_mmio_vq_info {
@@ -101,6 +106,9 @@ struct virtio_mmio_vq_info {

/* the list node for the virtqueues list */
struct list_head node;
+
+ /* IRQ mapped to virtqueue */
+ unsigned int irq;
};


@@ -297,7 +305,7 @@ static bool vm_notify_with_data(struct virtqueue *vq)
return true;
}

-/* Notify all virtqueues on an interrupt. */
+/* Notify all or some virtqueues on an interrupt. */
static irqreturn_t vm_interrupt(int irq, void *opaque)
{
struct virtio_mmio_device *vm_dev = opaque;
@@ -308,20 +316,31 @@ static irqreturn_t vm_interrupt(int irq, void *opaque)

/* Read and acknowledge interrupts */
status = readl(vm_dev->base + VIRTIO_MMIO_INTERRUPT_STATUS);
- writel(status, vm_dev->base + VIRTIO_MMIO_INTERRUPT_ACK);

if (unlikely(status & VIRTIO_MMIO_INT_CONFIG)) {
+ writel(status & VIRTIO_MMIO_INT_CONFIG, vm_dev->base + VIRTIO_MMIO_INTERRUPT_ACK);
virtio_config_changed(&vm_dev->vdev);
ret = IRQ_HANDLED;
}

if (likely(status & VIRTIO_MMIO_INT_VRING)) {
+ writel(status & VIRTIO_MMIO_INT_VRING, vm_dev->base + VIRTIO_MMIO_INTERRUPT_ACK);
spin_lock_irqsave(&vm_dev->lock, flags);
list_for_each_entry(info, &vm_dev->virtqueues, node)
ret |= vring_interrupt(irq, info->vq);
spin_unlock_irqrestore(&vm_dev->lock, flags);
}

+ /* Notify only affected vrings if device uses multiple interrupts */
+ if (vm_dev->num_irqs > 1) {
+ spin_lock_irqsave(&vm_dev->lock, flags);
+ list_for_each_entry(info, &vm_dev->virtqueues, node) {
+ if (info->irq == irq)
+ ret |= vring_interrupt(irq, info->vq);
+ }
+ spin_unlock_irqrestore(&vm_dev->lock, flags);
+ }
+
return ret;
}

@@ -356,11 +375,15 @@ static void vm_del_vqs(struct virtio_device *vdev)
{
struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
struct virtqueue *vq, *n;
+ int i, irq;
+
+ for (i = 0; i < vm_dev->num_irqs; i++) {
+ irq = vm_dev->irq_base + i;
+ devm_free_irq(&vdev->dev, irq, vm_dev);
+ }

list_for_each_entry_safe(vq, n, &vdev->vqs, list)
vm_del_vq(vq);
-
- free_irq(platform_get_irq(vm_dev->pdev, 0), vm_dev);
}

static void vm_synchronize_cbs(struct virtio_device *vdev)
@@ -488,6 +511,18 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned int in
return ERR_PTR(err);
}

+/* Map virtqueue to zero-based interrupt number */
+static unsigned int vq2irq(const struct virtqueue *vq)
+{
+ switch (vq->vdev->id.device) {
+ case VIRTIO_ID_NET:
+ /* interrupt shared by rx/tx virtqueue pair */
+ return vq->index / 2;
+ default:
+ return 0;
+ }
+}
+
static int vm_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
struct virtqueue *vqs[],
vq_callback_t *callbacks[],
@@ -496,19 +531,9 @@ static int vm_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
struct irq_affinity *desc)
{
struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
- int irq = platform_get_irq(vm_dev->pdev, 0);
- int i, err, queue_idx = 0;
-
- if (irq < 0)
- return irq;
-
- err = request_irq(irq, vm_interrupt, IRQF_SHARED,
- dev_name(&vdev->dev), vm_dev);
- if (err)
- return err;
-
- if (of_property_read_bool(vm_dev->pdev->dev.of_node, "wakeup-source"))
- enable_irq_wake(irq);
+ struct virtio_mmio_vq_info *info;
+ int i, err, irq, nirqs, queue_idx = 0;
+ unsigned int irq_base = UINT_MAX;

for (i = 0; i < nvqs; ++i) {
if (!names[i]) {
@@ -519,12 +544,51 @@ static int vm_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
vqs[i] = vm_setup_vq(vdev, queue_idx++, callbacks[i], names[i],
ctx ? ctx[i] : false);
if (IS_ERR(vqs[i])) {
- vm_del_vqs(vdev);
- return PTR_ERR(vqs[i]);
+ err = PTR_ERR(vqs[i]);
+ goto fail_vq;
}
}

+ nirqs = platform_irq_count(vm_dev->pdev);
+ if (nirqs < 0) {
+ err = nirqs;
+ goto fail_vq;
+ }
+
+ for (i = 0; i < nirqs; i++) {
+ irq = platform_get_irq(vm_dev->pdev, i);
+ if (irq < 0)
+ goto fail_irq;
+ if (irq < irq_base)
+ irq_base = irq;
+
+ err = devm_request_irq(&vdev->dev, irq, vm_interrupt,
+ IRQF_SHARED, NULL, vm_dev);
+ if (err)
+ goto fail_irq;
+
+ if (of_property_read_bool(vm_dev->pdev->dev.of_node, "wakeup-source"))
+ enable_irq_wake(irq);
+ }
+
+ for (i = 0; i < nvqs; i++) {
+ irq = vq2irq(vqs[i]);
+ info = vqs[i]->priv;
+ info->irq = irq_base + (irq % nirqs);
+ }
+
+ vm_dev->irq_base = irq_base;
+ vm_dev->num_irqs = nirqs;
+
return 0;
+
+fail_irq:
+ while (i--)
+ devm_free_irq(&vdev->dev, irq_base + i, vm_dev);
+fail_vq:
+ vm_del_vqs(vdev);
+
+ return err;
}

static const char *vm_bus_name(struct virtio_device *vdev)

--
2.41.0

2023-09-30 02:07:39

by Jakub Sitnicki

Subject: [PATCH 1/2] virtio-mmio: Parse a range of IRQ numbers passed on the command line

virtio-mmio devices in theory can use multiple interrupts. However, the
existing virtio-mmio command line configuration format does not provide a
way to express it.

Extend the configuration format so that the user can specify an IRQ
range. This prepares the ground for registering multiple IRQs per device.

Because we need to stay backward compatible with VMMs which don't yet use
the new format, we fall back to the old one if we fail to find an IRQ range
in the passed parameters.
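For illustration, with the extended format, a device backed by eight IRQ
lines could be declared as below (the size, base address, IRQ numbers, and
id here are arbitrary placeholders, modeled on the example already in the
driver comment):

    virtio_mmio.device=0x100@0x100b0000:48-55:0

This registers one memory resource plus eight IRQ resources (48 through 55)
for platform device virtio-mmio.0.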

Signed-off-by: Jakub Sitnicki <[email protected]>
---
drivers/virtio/virtio_mmio.c | 77 ++++++++++++++++++++++++++++++--------------
1 file changed, 52 insertions(+), 25 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 97760f611295..06a587b23542 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -39,11 +39,12 @@
* 3. Kernel module (or command line) parameter. Can be used more than once -
* one device will be created for each one. Syntax:
*
- * [virtio_mmio.]device=<size>@<baseaddr>:<irq>[:<id>]
+ * [virtio_mmio.]device=<size>@<baseaddr>:<irq>[-<irq last>][:<id>]
* where:
* <size> := size (can use standard suffixes like K, M or G)
* <baseaddr> := physical base address
- * <irq> := interrupt number (as passed to request_irq())
+ * <irq> := first interrupt number (as passed to request_irq())
+ * <irq last> := (optional) last interrupt number
* <id> := (optional) platform device id
* eg.:
* virtio_mmio.device=0x100@0x100b0000:48 \
@@ -712,36 +713,38 @@ static int vm_cmdline_set(const char *device,
const struct kernel_param *kp)
{
int err;
- struct resource resources[2] = {};
char *str;
long long base, size;
- unsigned int irq;
+ unsigned int irq, irq_last;
int processed, consumed = 0;
struct platform_device *pdev;
+ struct resource *resources, *res;
+ size_t num_resources;

/* Consume "size" part of the command line parameter */
size = memparse(device, &str);

- /* Get "@<base>:<irq>[:<id>]" chunks */
- processed = sscanf(str, "@%lli:%u%n:%d%n",
- &base, &irq, &consumed,
- &vm_cmdline_id, &consumed);
+ /* Get "@<base>:<irq>-<irq last>[:<id>]" chunks */
+ processed = sscanf(str, "@%lli:%u-%u%n:%d%n",
+ &base, &irq, &irq_last, &consumed,
+ &vm_cmdline_id, &consumed);
+ /* Fall back to "@<base>:<irq>[:<id>]" */
+ if (processed < 3) {
+ processed = sscanf(str, "@%lli:%u%n:%d%n",
+ &base, &irq, &consumed,
+ &vm_cmdline_id, &consumed);
+ irq_last = irq;
+ }

/*
* sscanf() must process at least 2 chunks; also there
* must be no extra characters after the last chunk, so
* str[consumed] must be '\0'
*/
- if (processed < 2 || str[consumed] || irq == 0)
+ if (processed < 2 || str[consumed] ||
+ irq == 0 || irq_last == 0 || irq > irq_last)
return -EINVAL;

- resources[0].flags = IORESOURCE_MEM;
- resources[0].start = base;
- resources[0].end = base + size - 1;
-
- resources[1].flags = IORESOURCE_IRQ;
- resources[1].start = resources[1].end = irq;
-
if (!vm_cmdline_parent_registered) {
err = device_register(&vm_cmdline_parent);
if (err) {
@@ -752,16 +755,34 @@ static int vm_cmdline_set(const char *device,
vm_cmdline_parent_registered = 1;
}

- pr_info("Registering device virtio-mmio.%d at 0x%llx-0x%llx, IRQ %d.\n",
- vm_cmdline_id,
- (unsigned long long)resources[0].start,
- (unsigned long long)resources[0].end,
- (int)resources[1].start);
+ num_resources = 1; /* mem */
+ num_resources += irq_last - irq + 1;
+ resources = devm_kcalloc(&vm_cmdline_parent, num_resources,
+ sizeof(*resources), GFP_KERNEL);
+ if (!resources)
+ return -ENOMEM;
+
+ resources[0].flags = IORESOURCE_MEM;
+ resources[0].start = base;
+ resources[0].end = base + size - 1;
+
+ for (res = &resources[1]; irq <= irq_last; irq++, res++) {
+ res->flags = IORESOURCE_IRQ;
+ res->start = irq;
+ res->end = irq;
+ }
+
+ pr_info("Registering device virtio-mmio.%d at 0x%llx-0x%llx, IRQs %u-%u.\n",
+ vm_cmdline_id,
+ (unsigned long long)resources[0].start,
+ (unsigned long long)resources[0].end,
+ irq, irq_last);

pdev = platform_device_register_resndata(&vm_cmdline_parent,
"virtio-mmio", vm_cmdline_id++,
- resources, ARRAY_SIZE(resources), NULL, 0);
+ resources, num_resources, NULL, 0);

+ devm_kfree(&vm_cmdline_parent, resources);
return PTR_ERR_OR_ZERO(pdev);
}

@@ -770,12 +791,18 @@ static int vm_cmdline_get_device(struct device *dev, void *data)
char *buffer = data;
unsigned int len = strlen(buffer);
struct platform_device *pdev = to_platform_device(dev);
+ size_t nres = pdev->num_resources;
+ bool irq_range = nres > 2;

- snprintf(buffer + len, PAGE_SIZE - len, "0x%llx@0x%llx:%llu:%d\n",
+ len += snprintf(buffer + len, PAGE_SIZE - len, "0x%llx@0x%llx:%llu",
pdev->resource[0].end - pdev->resource[0].start + 1ULL,
(unsigned long long)pdev->resource[0].start,
- (unsigned long long)pdev->resource[1].start,
- pdev->id);
+ (unsigned long long)pdev->resource[1].start);
+ if (irq_range)
+ len += snprintf(buffer + len, PAGE_SIZE - len, "-%llu",
+ (unsigned long long)pdev->resource[nres - 1].start);
+ snprintf(buffer + len, PAGE_SIZE - len, ":%d\n", pdev->id);
+
return 0;
}


--
2.41.0

2023-09-30 17:16:56

by Jakub Sitnicki

Subject: Re: [PATCH 0/2] Support multiple interrupts for virtio over MMIO devices

Subject prefix should have been [RFC]. My mistake.

2023-10-10 06:53:40

by Jason Wang

Subject: Re: [PATCH 2/2] virtio-mmio: Support multiple interrupts per device

On Sat, Sep 30, 2023 at 4:46 AM Jakub Sitnicki <[email protected]> wrote:
>
> Some virtual devices, such as the virtio network device, can use multiple
> virtqueues (or multiple pairs of virtqueues in the case of a vNIC). In such
> a case, when there are multiple vCPUs present, it is possible to process
> virtqueue events in parallel. Each vCPU can service a subset of all
> virtqueues when notified that there is work to carry out.
>
> However, the current virtio-mmio transport implementation poses a
> limitation. Only one vCPU can service notifications from any of the
> virtqueues of a single virtio device. This is because a virtio-mmio device
> driver supports registering just one interrupt per device. With such a
> setup, we are not able to scale virtqueue event processing among vCPUs.
>
> Now, with more than one IRQ resource registered for a virtio-mmio platform
> device, we can address this limitation.
>
> First, we request multiple IRQs when creating virtqueues for a device.
>
> Then, map each virtqueue to one of the IRQs assigned to the device. The
> mapping is done in a device type specific manner. For instance, a network
> device will want each RX/TX virtqueue pair mapped to a different IRQ
> line. Other device types might require a different mapping scheme. We
> currently provide a mapping for virtio-net device type.
>
> Finally, when handling an interrupt, we service only the virtqueues
> associated with the IRQ line that triggered the event.
>
> Signed-off-by: Jakub Sitnicki <[email protected]>
> ---
> drivers/virtio/virtio_mmio.c | 102 +++++++++++++++++++++++++++++++++++--------
> 1 file changed, 83 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
> index 06a587b23542..180c51c27704 100644
> --- a/drivers/virtio/virtio_mmio.c
> +++ b/drivers/virtio/virtio_mmio.c
> @@ -69,6 +69,7 @@
> #include <linux/spinlock.h>
> #include <linux/virtio.h>
> #include <linux/virtio_config.h>
> +#include <uapi/linux/virtio_ids.h>
> #include <uapi/linux/virtio_mmio.h>
> #include <linux/virtio_ring.h>
>
> @@ -93,6 +94,10 @@ struct virtio_mmio_device {
> /* a list of queues so we can dispatch IRQs */
> spinlock_t lock;
> struct list_head virtqueues;
> +
> + /* IRQ range allocated to the device */
> + unsigned int irq_base;
> + unsigned int num_irqs;
> };
>
> struct virtio_mmio_vq_info {
> @@ -101,6 +106,9 @@ struct virtio_mmio_vq_info {
>
> /* the list node for the virtqueues list */
> struct list_head node;
> +
> + /* IRQ mapped to virtqueue */
> + unsigned int irq;
> };
>
>
> @@ -297,7 +305,7 @@ static bool vm_notify_with_data(struct virtqueue *vq)
> return true;
> }
>
> -/* Notify all virtqueues on an interrupt. */
> +/* Notify all or some virtqueues on an interrupt. */
> static irqreturn_t vm_interrupt(int irq, void *opaque)
> {
> struct virtio_mmio_device *vm_dev = opaque;
> @@ -308,20 +316,31 @@ static irqreturn_t vm_interrupt(int irq, void *opaque)
>
> /* Read and acknowledge interrupts */
> status = readl(vm_dev->base + VIRTIO_MMIO_INTERRUPT_STATUS);
> - writel(status, vm_dev->base + VIRTIO_MMIO_INTERRUPT_ACK);
>
> if (unlikely(status & VIRTIO_MMIO_INT_CONFIG)) {
> + writel(status & VIRTIO_MMIO_INT_CONFIG, vm_dev->base + VIRTIO_MMIO_INTERRUPT_ACK);
> virtio_config_changed(&vm_dev->vdev);
> ret = IRQ_HANDLED;
> }
>
> if (likely(status & VIRTIO_MMIO_INT_VRING)) {
> + writel(status & VIRTIO_MMIO_INT_VRING, vm_dev->base + VIRTIO_MMIO_INTERRUPT_ACK);
> spin_lock_irqsave(&vm_dev->lock, flags);
> list_for_each_entry(info, &vm_dev->virtqueues, node)
> ret |= vring_interrupt(irq, info->vq);
> spin_unlock_irqrestore(&vm_dev->lock, flags);
> }
>
> + /* Notify only affected vrings if device uses multiple interrupts */
> + if (vm_dev->num_irqs > 1) {
> + spin_lock_irqsave(&vm_dev->lock, flags);
> + list_for_each_entry(info, &vm_dev->virtqueues, node) {
> + if (info->irq == irq)
> + ret |= vring_interrupt(irq, info->vq);
> + }
> + spin_unlock_irqrestore(&vm_dev->lock, flags);
> + }
> +
> return ret;
> }
>
> @@ -356,11 +375,15 @@ static void vm_del_vqs(struct virtio_device *vdev)
> {
> struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
> struct virtqueue *vq, *n;
> + int i, irq;
> +
> + for (i = 0; i < vm_dev->num_irqs; i++) {
> + irq = vm_dev->irq_base + i;
> + devm_free_irq(&vdev->dev, irq, vm_dev);
> + }
>
> list_for_each_entry_safe(vq, n, &vdev->vqs, list)
> vm_del_vq(vq);
> -
> - free_irq(platform_get_irq(vm_dev->pdev, 0), vm_dev);
> }
>
> static void vm_synchronize_cbs(struct virtio_device *vdev)
> @@ -488,6 +511,18 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned int in
> return ERR_PTR(err);
> }
>
> +/* Map virtqueue to zero-based interrupt number */
> +static unsigned int vq2irq(const struct virtqueue *vq)
> +{
> + switch (vq->vdev->id.device) {
> + case VIRTIO_ID_NET:
> + /* interrupt shared by rx/tx virtqueue pair */
> + return vq->index / 2;
> + default:
> + return 0;
> + }

Transport drivers should have no knowledge of a specific type of device.

> +}
> +
> static int vm_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
> struct virtqueue *vqs[],
> vq_callback_t *callbacks[],
> @@ -496,19 +531,9 @@ static int vm_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
> struct irq_affinity *desc)
> {
> struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
> - int irq = platform_get_irq(vm_dev->pdev, 0);
> - int i, err, queue_idx = 0;
> -
> - if (irq < 0)
> - return irq;
> -
> - err = request_irq(irq, vm_interrupt, IRQF_SHARED,
> - dev_name(&vdev->dev), vm_dev);
> - if (err)
> - return err;
> -
> - if (of_property_read_bool(vm_dev->pdev->dev.of_node, "wakeup-source"))
> - enable_irq_wake(irq);
> + struct virtio_mmio_vq_info *info;
> + int i, err, irq, nirqs, queue_idx = 0;
> + unsigned int irq_base = UINT_MAX;
>
> for (i = 0; i < nvqs; ++i) {
> if (!names[i]) {
> @@ -519,12 +544,51 @@ static int vm_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
> vqs[i] = vm_setup_vq(vdev, queue_idx++, callbacks[i], names[i],
> ctx ? ctx[i] : false);
> if (IS_ERR(vqs[i])) {
> - vm_del_vqs(vdev);
> - return PTR_ERR(vqs[i]);
> + err = PTR_ERR(vqs[i]);
> + goto fail_vq;
> }
> }
>
> + nirqs = platform_irq_count(vm_dev->pdev);
> + if (nirqs < 0) {
> + err = nirqs;
> + goto fail_vq;
> + }
> +
> + for (i = 0; i < nirqs; i++) {
> + irq = platform_get_irq(vm_dev->pdev, i);
> + if (irq < 0)
> + goto fail_irq;
> + if (irq < irq_base)
> + irq_base = irq;
> +
> + err = devm_request_irq(&vdev->dev, irq, vm_interrupt,
> + IRQF_SHARED, NULL, vm_dev);
> + if (err)
> + goto fail_irq;
> +
> + if (of_property_read_bool(vm_dev->pdev->dev.of_node, "wakeup-source"))
> + enable_irq_wake(irq);

Could we simply use the same policy as PCI (vp_find_vqs_msix())?

Thanks

> + }
> +
> + for (i = 0; i < nvqs; i++) {
> + irq = vq2irq(vqs[i]);
> + info = vqs[i]->priv;
> + info->irq = irq_base + (irq % nirqs);
> + }
> +
> + vm_dev->irq_base = irq_base;
> + vm_dev->num_irqs = nirqs;
> +
> return 0;
> +
> +fail_irq:
> + while (i--)
> + devm_free_irq(&vdev->dev, irq_base + i, vm_dev);
> +fail_vq:
> + vm_del_vqs(vdev);
> +
> + return err;
> }
>
> static const char *vm_bus_name(struct virtio_device *vdev)
>
> --
> 2.41.0
>

2023-10-14 15:49:53

by Jakub Sitnicki

Subject: Re: [PATCH 2/2] virtio-mmio: Support multiple interrupts per device

Sorry for the delay in my response. I've been away at a conference.

On Tue, Oct 10, 2023 at 02:52 PM +08, Jason Wang wrote:
> On Sat, Sep 30, 2023 at 4:46 AM Jakub Sitnicki <[email protected]> wrote:
>>
>> Some virtual devices, such as the virtio network device, can use multiple
>> virtqueues (or multiple pairs of virtqueues in the case of a vNIC). In such
>> a case, when there are multiple vCPUs present, it is possible to process
>> virtqueue events in parallel. Each vCPU can service a subset of all
>> virtqueues when notified that there is work to carry out.
>>
>> However, the current virtio-mmio transport implementation poses a
>> limitation. Only one vCPU can service notifications from any of the
>> virtqueues of a single virtio device. This is because a virtio-mmio device
>> driver supports registering just one interrupt per device. With such a
>> setup, we are not able to scale virtqueue event processing among vCPUs.
>>
>> Now, with more than one IRQ resource registered for a virtio-mmio platform
>> device, we can address this limitation.
>>
>> First, we request multiple IRQs when creating virtqueues for a device.
>>
>> Then, map each virtqueue to one of the IRQs assigned to the device. The
>> mapping is done in a device type specific manner. For instance, a network
>> device will want each RX/TX virtqueue pair mapped to a different IRQ
>> line. Other device types might require a different mapping scheme. We
>> currently provide a mapping for virtio-net device type.
>>
>> Finally, when handling an interrupt, we service only the virtqueues
>> associated with the IRQ line that triggered the event.
>>
>> Signed-off-by: Jakub Sitnicki <[email protected]>
>> ---
>> drivers/virtio/virtio_mmio.c | 102 +++++++++++++++++++++++++++++++++++--------
>> 1 file changed, 83 insertions(+), 19 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
>> index 06a587b23542..180c51c27704 100644
>> --- a/drivers/virtio/virtio_mmio.c
>> +++ b/drivers/virtio/virtio_mmio.c

[...]

>> @@ -488,6 +511,18 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned int in
>> return ERR_PTR(err);
>> }
>>
>> +/* Map virtqueue to zero-based interrupt number */
>> +static unsigned int vq2irq(const struct virtqueue *vq)
>> +{
>> + switch (vq->vdev->id.device) {
>> + case VIRTIO_ID_NET:
>> + /* interrupt shared by rx/tx virtqueue pair */
>> + return vq->index / 2;
>> + default:
>> + return 0;
>> + }
>
> Transport drivers should have no knowledge of a specific type of device.
>

Makes sense. This breaks layering. I will see how to pull this into the
device driver. Perhaps this can be communicated through the set_vq_affinity
op.

>> +}
>> +
>> static int vm_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
>> struct virtqueue *vqs[],
>> vq_callback_t *callbacks[],

[...]

>> @@ -519,12 +544,51 @@ static int vm_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
>> vqs[i] = vm_setup_vq(vdev, queue_idx++, callbacks[i], names[i],
>> ctx ? ctx[i] : false);
>> if (IS_ERR(vqs[i])) {
>> - vm_del_vqs(vdev);
>> - return PTR_ERR(vqs[i]);
>> + err = PTR_ERR(vqs[i]);
>> + goto fail_vq;
>> }
>> }
>>
>> + nirqs = platform_irq_count(vm_dev->pdev);
>> + if (nirqs < 0) {
>> + err = nirqs;
>> + goto fail_vq;
>> + }
>> +
>> + for (i = 0; i < nirqs; i++) {
>> + irq = platform_get_irq(vm_dev->pdev, i);
>> + if (irq < 0)
>> + goto fail_irq;
>> + if (irq < irq_base)
>> + irq_base = irq;
>> +
>> + err = devm_request_irq(&vdev->dev, irq, vm_interrupt,
>> + IRQF_SHARED, NULL, vm_dev);
>> + if (err)
>> + goto fail_irq;
>> +
>> + if (of_property_read_bool(vm_dev->pdev->dev.of_node, "wakeup-source"))
>> + enable_irq_wake(irq);
>
> Could we simply use the same policy as PCI (vp_find_vqs_msix())?

Reading that routine, the PCI policy is:

1) Best option: one for change interrupt, one per vq.
2) Second best: one for change, shared for all vqs.

It would be great to be able to go with option (1), but we have only a
limited number of legacy IRQs to spread among MMIO devices: 48 IRQs at
most in a 2 x IOAPIC setup.

Having one IRQ per VQ would mean fewer Rx/Tx queue pairs for a vNIC, less
than 24. From our target workload PoV, however, we would ideally like to
support at least 32 queue pairs.

Hence the idea to have one IRQ per Rx/Tx VQ pair. Not as ideal as (1),
but a lot better than (2).

Comparing this to PCI: virtio-net, with one interrupt per VQ, will map
each Rx/Tx VQ pair to the same CPU.

We could achieve the same VQ-CPU affinity setup for MMIO, but with fewer
interrupt vectors.

Thanks for feedback.