In the current VFIO MSI/MSI-X implementation, the Linux host kernel
allocates MSI/MSI-X vectors when userspace requests them through vfio ioctls.
Vfio creates irqfd mappings to notify userspace when an MSI/MSI-X
interrupt is raised.
The guest OS sees an emulated MSI/MSI-X controller and receives an interrupt
when the kernel signals it via irqfd.

The host kernel allocates MSI/MSI-X using standard Linux routines
such as pci_enable_msix_range() and pci_enable_msi_range().
These routines, together with request_irq() in the host kernel, set up
the MSI/MSI-X vectors with the physical MSI/MSI-X addresses provided by
the interrupt controller driver in the host kernel.

This means that when a device is assigned to a guest OS, the MSI/MSI-X
addresses present in the PCIe EP are the PAs programmed by the host
Linux kernel.

On x86 the MSI/MSI-X physical address range is reserved, the iommu is aware
of these addresses, and translation is bypassed for this address range.

Unlike x86, ARM/ARM64 does not reserve an MSI/MSI-X physical address range,
and all transactions, including MSI, go through the iommu/smmu without bypass.
This requires extending the current vfio MSI layer with additional
functionality for ARM/ARM64 by
1. Programming an IOVA (referred to as the MSI virtual doorbell address)
   into the device's MSI vector as the MSI address.
   This IOVA will be provided by userspace based on the
   MSI/MSI-X addresses reserved for the guest.
2. Creating an IOMMU mapping between this IOVA and the
   physical address (PA) assigned to the MSI vector.

This RFC proposes a solution for MSI/MSI-X passthrough on ARM/ARM64.
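
For illustration, a minimal userspace sketch of the intended flow is below.
It assumes the uapi additions from patch 2 (struct vfio_pci_msi_virt_doorbell
and VFIO_DEVICE_PCI_MSI_VIRT_DOORBELL) are available in a patched
linux/vfio.h; the helper name, the choice of vector 0 and the doorbell IOVA
are only examples, error handling is omitted, and MSI-X itself is configured
beforehand via the existing VFIO_DEVICE_SET_IRQS ioctl.

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>		/* patched header from this series */

/* Illustrative only: set a virtual doorbell for MSI-X vector 0. */
static int set_virt_doorbell(int device_fd, uint64_t doorbell_iova)
{
	struct vfio_pci_msi_virt_doorbell *db;
	size_t sz = sizeof(*db) + sizeof(uint64_t);
	int ret;

	db = calloc(1, sz);
	if (!db)
		return -1;
	db->argsz = sz;
	db->flags = VFIO_PCI_MSI_SET_DOORBELL | VFIO_PCI_IS_MSIX;
	db->start = 0;			/* first MSI-X vector */
	db->count = 1;			/* one doorbell address follows */
	db->data[0] = doorbell_iova;	/* IOVA chosen by userspace, e.g. the
					 * guest's MSI doorbell address */

	/* The kernel maps doorbell_iova -> MSI PA in the IOMMU and programs
	 * doorbell_iova into the vector's MSI-X address registers. */
	ret = ioctl(device_fd, VFIO_DEVICE_PCI_MSI_VIRT_DOORBELL, db);
	free(db);
	return ret;
}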
Pranavkumar Sawargaonkar (2):
drivers: vfio: iommu map and unmap device specific memory from
kernel.
drivers: vfio: pci: Add virtual MSI doorbell support.
drivers/vfio/pci/vfio_pci.c | 32 ++++++++++++++++++
drivers/vfio/pci/vfio_pci_intrs.c | 64 +++++++++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_private.h | 3 ++
drivers/vfio/vfio.c | 29 ++++++++++++++++
drivers/vfio/vfio_iommu_type1.c | 60 ++++++++++++++++++++++++++++++++
include/linux/vfio.h | 11 +++++-
include/uapi/linux/vfio.h | 19 +++++++++++
7 files changed, 217 insertions(+), 1 deletion(-)
--
1.7.9.5
In vfio we map and unmap various regions from userspace using the
"VFIO_IOMMU_MAP_DMA" and "VFIO_IOMMU_UNMAP_DMA" ioctls.

Some device regions (like MSI in the case of PCI) are not exposed
to userspace via mmap. These regions might still require the vfio driver
to create an iommu mapping, because their transactions go through
an iommu, as is the case on ARM/ARM64.

As this memory is not mmapped into userspace and might need to be mapped
with different memory attributes than user memory, we cannot use the
VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA ioctls.

This patch extends "vfio_iommu_driver_ops" with
device_map() and device_unmap() methods implemented by the vfio iommu driver.
These methods can be used by other vfio device drivers, such as vfio-pci,
to create and destroy simple iommu mappings for regions like MSI/MSI-X.

This patch also implements these methods for the vfio iommu type1 driver.
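
For illustration, a minimal sketch of how a vfio bus driver might use the
new helpers (the example function names are hypothetical, the prot flags are
only an example, and the real caller is the vfio-pci change in patch 2):

#include <linux/mm.h>
#include <linux/iommu.h>
#include <linux/vfio.h>

static int example_map_msi_doorbell(struct vfio_device *device,
				    unsigned long iova, phys_addr_t paddr)
{
	/* Map one page of the physical MSI doorbell at the IOVA chosen by
	 * userspace, in every IOMMU domain of the device's container. */
	return vfio_device_iommu_map(device, iova & PAGE_MASK,
				     paddr & PAGE_MASK, PAGE_SIZE,
				     IOMMU_READ | IOMMU_WRITE);
}

static void example_unmap_msi_doorbell(struct vfio_device *device,
				       unsigned long iova)
{
	vfio_device_iommu_unmap(device, iova & PAGE_MASK, PAGE_SIZE);
}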
Signed-off-by: Ankit Jindal <[email protected]>
Signed-off-by: Pranavkumar Sawargaonkar <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Christoffer Dall <[email protected]>
---
drivers/vfio/vfio.c | 29 +++++++++++++++++++
drivers/vfio/vfio_iommu_type1.c | 60 +++++++++++++++++++++++++++++++++++++++
include/linux/vfio.h | 11 ++++++-
3 files changed, 99 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 2fb29df..7897c47 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -143,6 +143,35 @@ void vfio_unregister_iommu_driver(const struct vfio_iommu_driver_ops *ops)
}
EXPORT_SYMBOL_GPL(vfio_unregister_iommu_driver);
+int vfio_device_iommu_map(struct vfio_device *device, unsigned long iova,
+ phys_addr_t paddr, size_t size, int prot)
+{
+ struct vfio_container *container = device->group->container;
+ const struct vfio_iommu_driver_ops *ops = container->iommu_driver->ops;
+ int ret;
+
+ if (!ops->device_map)
+ return -EINVAL;
+
+ ret = ops->device_map(container->iommu_data, iova, paddr, size, prot);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_device_iommu_map);
+
+void vfio_device_iommu_unmap(struct vfio_device *device, unsigned long iova,
+ size_t size)
+{
+ struct vfio_container *container = device->group->container;
+ const struct vfio_iommu_driver_ops *ops = container->iommu_driver->ops;
+
+ if (!ops->device_unmap)
+ return;
+
+ ops->device_unmap(container->iommu_data, iova, size);
+}
+EXPORT_SYMBOL_GPL(vfio_device_iommu_unmap);
+
/**
* Group minor allocation/free - both called with vfio.group_lock held
*/
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 57d8c37..e41995d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1025,6 +1025,64 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
return -ENOTTY;
}
+static int vfio_iommu_type1_device_map(void *iommu_data, unsigned long iova,
+ phys_addr_t paddr, size_t size,
+ int prot)
+{
+ struct vfio_iommu *iommu = iommu_data;
+ struct vfio_domain *d;
+ int ret;
+
+ mutex_lock(&iommu->lock);
+
+ list_for_each_entry(d, &iommu->domain_list, next) {
+
+ if (iommu_iova_to_phys(d->domain, iova))
+ continue;
+
+ ret = iommu_map(d->domain, iova, paddr,
+ size, prot | d->prot);
+
+ if (ret) {
+ if (ret != -EBUSY)
+ goto unwind;
+ }
+
+ cond_resched();
+ }
+
+ mutex_unlock(&iommu->lock);
+
+ return 0;
+
+unwind:
+ list_for_each_entry_continue_reverse(d, &iommu->domain_list, next)
+ iommu_unmap(d->domain, iova, size);
+
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
+static void vfio_iommu_type1_device_unmap(void *iommu_data, unsigned long iova,
+ size_t size)
+{
+ struct vfio_iommu *iommu = iommu_data;
+ struct vfio_domain *d;
+
+ mutex_lock(&iommu->lock);
+
+ list_for_each_entry(d, &iommu->domain_list, next) {
+
+ if (!iommu_iova_to_phys(d->domain, iova))
+ continue;
+
+ iommu_unmap(d->domain, iova, size);
+ cond_resched();
+ }
+
+ mutex_unlock(&iommu->lock);
+}
+
static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
.name = "vfio-iommu-type1",
.owner = THIS_MODULE,
@@ -1033,6 +1091,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
.ioctl = vfio_iommu_type1_ioctl,
.attach_group = vfio_iommu_type1_attach_group,
.detach_group = vfio_iommu_type1_detach_group,
+ .device_map = vfio_iommu_type1_device_map,
+ .device_unmap = vfio_iommu_type1_device_unmap,
};
static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index ddb4409..ef0d974 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -52,6 +52,12 @@ extern void *vfio_del_group_dev(struct device *dev);
extern struct vfio_device *vfio_device_get_from_dev(struct device *dev);
extern void vfio_device_put(struct vfio_device *device);
extern void *vfio_device_data(struct vfio_device *device);
+extern int vfio_device_iommu_map(struct vfio_device *device,
+ unsigned long iova,
+ phys_addr_t paddr,
+ size_t size, int prot);
+extern void vfio_device_iommu_unmap(struct vfio_device *device,
+ unsigned long iova, size_t size);
/**
* struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
@@ -72,7 +78,10 @@ struct vfio_iommu_driver_ops {
struct iommu_group *group);
void (*detach_group)(void *iommu_data,
struct iommu_group *group);
-
+ int (*device_map)(void *iommu_data, unsigned long iova,
+ phys_addr_t paddr, size_t size, int prot);
+ void (*device_unmap)(void *iommu_data, unsigned long iova,
+ size_t size);
};
extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
--
1.7.9.5
On ARM/ARM64, MSI transactions go through the iommu/smmu.
This means an iommu mapping has to be created for the MSI addresses.

This patch adds a new ioctl, "VFIO_DEVICE_PCI_MSI_VIRT_DOORBELL".
Userspace can call this ioctl to do the following:
1. Create a virtual doorbell mapping between the
   MSI IOVA, termed the "virtual msi doorbell" (known by userspace), and the
   MSI PA (known by the kernel).
2. Set the MSI/MSI-X vector with the virtual doorbell address instead of
   the PA.
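
For illustration, a minimal userspace sketch of the teardown path, as a
counterpart to the setup example in the cover letter (it assumes the patched
linux/vfio.h below; the helper name and the choice of vector 0 are only
examples, and error handling is omitted):

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>		/* patched header from this series */

/* Illustrative only: remove the virtual doorbell for MSI-X vector 0. */
static int clear_virt_doorbell(int device_fd, uint64_t doorbell_iova)
{
	struct vfio_pci_msi_virt_doorbell *db;
	size_t sz = sizeof(*db) + sizeof(uint64_t);
	int ret;

	db = calloc(1, sz);
	if (!db)
		return -1;
	db->argsz = sz;
	db->flags = VFIO_PCI_MSI_CLEAR_DOORBELL | VFIO_PCI_IS_MSIX;
	db->start = 0;
	db->count = 1;
	db->data[0] = doorbell_iova;	/* same IOVA used when setting the
					 * doorbell; its mapping is removed */

	/* The kernel unmaps the IOVA and restores the cached physical MSI
	 * message for the vector. */
	ret = ioctl(device_fd, VFIO_DEVICE_PCI_MSI_VIRT_DOORBELL, db);
	free(db);
	return ret;
}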
Signed-off-by: Ankit Jindal <[email protected]>
Signed-off-by: Pranavkumar Sawargaonkar <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Christoffer Dall <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 32 ++++++++++++++++++
drivers/vfio/pci/vfio_pci_intrs.c | 64 +++++++++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_private.h | 3 ++
include/uapi/linux/vfio.h | 19 +++++++++++
4 files changed, 118 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 964ad57..9c92707 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -784,6 +784,38 @@ hot_reset_release:
kfree(groups);
return ret;
+ } else if (cmd == VFIO_DEVICE_PCI_MSI_VIRT_DOORBELL) {
+ struct vfio_pci_msi_virt_doorbell hdr;
+ u64 *data = NULL;
+ int ret = 0;
+ size_t size = sizeof(uint64_t);
+
+ minsz = offsetofend(struct vfio_pci_msi_virt_doorbell, count);
+
+ if (copy_from_user(&hdr, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (hdr.argsz < minsz)
+ return -EINVAL;
+
+ if (hdr.argsz - minsz < hdr.count * size)
+ return -EINVAL;
+
+ data = memdup_user((void __user *)(arg + minsz),
+ hdr.count * size);
+ if (IS_ERR(data))
+ return PTR_ERR(data);
+
+ mutex_lock(&vdev->igate);
+
+ ret = vfio_pci_msi_virt_doorbell(vdev, hdr.flags,
+ hdr.start, hdr.count, data);
+
+ mutex_unlock(&vdev->igate);
+
+ kfree(data);
+
+ return ret;
}
return -ENOTTY;
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 1f577b4..22a25b8 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -16,6 +16,7 @@
#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/eventfd.h>
+#include <linux/irq.h>
#include <linux/msi.h>
#include <linux/pci.h>
#include <linux/file.h>
@@ -352,8 +353,10 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
pci_write_msi_msg(irq, &msg);
}
+
ret = request_irq(irq, vfio_msihandler, 0,
vdev->ctx[vector].name, trigger);
+
if (ret) {
kfree(vdev->ctx[vector].name);
eventfd_ctx_put(trigger);
@@ -673,3 +676,64 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
return func(vdev, index, start, count, flags, data);
}
+
+int vfio_pci_msi_virt_doorbell(struct vfio_pci_device *vdev, uint32_t flags,
+ unsigned start, unsigned count,
+ void *data)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ int irq;
+ bool msix = flags & VFIO_PCI_IS_MSIX;
+ struct msi_msg msg;
+ struct irq_data *d;
+ unsigned long msi_paddr, msi_iova;
+ struct vfio_device *device;
+ int ret = 0;
+ int i, j;
+
+ for (i = 0, j = start; i < count && !ret; i++, j++) {
+
+ device = vfio_device_get_from_dev(&pdev->dev);
+ irq = msix ? vdev->msix[j].vector :
+ pdev->irq + j;
+
+ if (flags & VFIO_PCI_MSI_SET_DOORBELL) {
+ /* Get the MSI message to extract Physical Address */
+ d = irq_get_irq_data(irq);
+ irq_chip_compose_msi_msg(d, &msg);
+ msi_paddr = ((u64)msg.address_hi << 32) | msg.address_lo;
+ } else {
+ /*
+ * Restore the cached value of the message prior
+ * to the virtual doorbell setting.
+ */
+ get_cached_msi_msg(irq, &msg);
+ }
+
+ /* MSI IPA/GPA i.e. virtual doorbell address */
+ msi_iova = (unsigned long) ((unsigned long *) data)[i];
+
+ if (flags & VFIO_PCI_MSI_SET_DOORBELL) {
+ ret = vfio_device_iommu_map(device,
+ (msi_iova & PAGE_MASK),
+ (msi_paddr & PAGE_MASK),
+ PAGE_SIZE,
+ IOMMU_READ | IOMMU_WRITE |
+ IOMMU_CACHE);
+
+ /* Fill MSI GPA/IPA as a new MSI doorbell address. */
+ msg.address_hi = (u32)(msi_iova >> 32);
+ msg.address_lo = msi_iova & 0xFFFFFFFF;
+ } else if (flags & VFIO_PCI_MSI_CLEAR_DOORBELL) {
+ vfio_device_iommu_unmap(device, (msi_iova & PAGE_MASK),
+ PAGE_SIZE);
+ } else {
+ return -EINVAL;
+ }
+
+ pci_write_msi_msg(irq, &msg);
+ }
+
+ return ret;
+}
+
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index ae0e1b4..ec76e45 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -73,6 +73,9 @@ extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev,
uint32_t flags, unsigned index,
unsigned start, unsigned count, void *data);
+extern int vfio_pci_msi_virt_doorbell(struct vfio_pci_device *vdev,
+ uint32_t flags, unsigned start,
+ unsigned count, void *data);
extern ssize_t vfio_pci_config_rw(struct vfio_pci_device *vdev,
char __user *buf, size_t count,
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9fd7b5d..12384f5 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -379,6 +379,25 @@ struct vfio_pci_hot_reset {
#define VFIO_DEVICE_PCI_HOT_RESET _IO(VFIO_TYPE, VFIO_BASE + 13)
+/**
+ * VFIO_DEVICE_PCI_MSI_VIRT_DOORBELL - _IOW(VFIO_TYPE, VFIO_BASE + 14,
+ * struct vfio_pci_msi_virt_doorbell)
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_pci_msi_virt_doorbell {
+ __u32 argsz;
+ __u32 flags;
+#define VFIO_PCI_MSI_CLEAR_DOORBELL (1 << 0) /* Remove virtual doorbell */
+#define VFIO_PCI_MSI_SET_DOORBELL (1 << 1) /* Set virtual doorbell */
+#define VFIO_PCI_IS_MSIX (1 << 2) /* Is MSI-X ? */
+ __u32 start;
+ __u32 count;
+ __u64 data[];
+};
+
+#define VFIO_DEVICE_PCI_MSI_VIRT_DOORBELL _IO(VFIO_TYPE, VFIO_BASE + 14)
+
/* -------- API for Type1 VFIO IOMMU -------- */
/**
--
1.7.9.5
On Fri, 2015-07-24 at 14:33 +0530, Pranavkumar Sawargaonkar wrote:
> In current VFIO MSI/MSI-X implementation, linux host kernel
> allocates MSI/MSI-X vectors when userspace requests through vfio ioctls.
> Vfio creates irqfd mappings to notify MSI/MSI-X interrupts
> to the userspace when raised.
> Guest OS will see emulated MSI/MSI-X controller and receives an interrupt
> when kernel notifies the same via irqfd.
>
> Host kernel allocates MSI/MSI-X using standard linux routines
> like pci_enable_msix_range() and pci_enable_msi_range().
> These routines along with requset_irq() in host kernel sets up
> MSI/MSI-X vectors with Physical MSI/MSI-X addresses provided by
> interrupt controller driver in host kernel.
>
> This means when a device is assigned with the guest OS, MSI/MSI-X addresses
> present in PCIe EP are the PAs programmed by the host linux kernel.
>
> In x86 MSI/MSI-X physical address range is reserved and iommu is aware
> about these addreses and transalation is bypassed for these address range.
>
> Unlike x86, ARM/ARM64 does not reserve MSI/MSI-X Physical address range and
> all the transactions including MSI go through iommu/smmu without bypass.
> This requires extending current vfio MSI layer with additional functionality
> for ARM/ARM64 by
> 1. Programing IOVA (referred as a MSI virtual doorbell address)
> in device's MSI vector as a MSI address.
> This IOVA will be provided by the userspace based on the
> MSI/MSI-X addresses reserved for the guest.
> 2. Create an IOMMU mapping between this IOVA and
> Physical address (PA) assigned to the MSI vector.
>
> This RFC is proposing a solution for MSI/MSI-X passthrough for ARM/ARM64.
Hi Pranavkumar,
Freescale has the same, or very similar, need, so any solution in this
space will need to work for both ARM and powerpc. I'm not a big fan of
this approach as it seems to require the user to configure MSI/X via
ioctl and then call a separate ioctl mapping the doorbells. That's more
code for the user, more code to get wrong and potentially a gap between
configuring MSI/X and enabling mappings where we could see IOMMU faults.
If we know that doorbell mappings are required, why can't we set aside a
bank of IOVA space and have them mapped automatically as MSI/X is being
configured? Then the user's need for special knowledge and handling of
this case is limited to setup. The IOVA space will be mapped and used
as needed, we only need the user to specify the IOVA space reserved for
this. Thanks,
Alex
Hi Alex,
On 28/07/15 17:21, Alex Williamson wrote:
> On Fri, 2015-07-24 at 14:33 +0530, Pranavkumar Sawargaonkar wrote:
>> In current VFIO MSI/MSI-X implementation, linux host kernel
>> allocates MSI/MSI-X vectors when userspace requests through vfio ioctls.
>> Vfio creates irqfd mappings to notify MSI/MSI-X interrupts
>> to the userspace when raised.
>> Guest OS will see emulated MSI/MSI-X controller and receives an interrupt
>> when kernel notifies the same via irqfd.
>>
>> Host kernel allocates MSI/MSI-X using standard linux routines
>> like pci_enable_msix_range() and pci_enable_msi_range().
>> These routines along with requset_irq() in host kernel sets up
>> MSI/MSI-X vectors with Physical MSI/MSI-X addresses provided by
>> interrupt controller driver in host kernel.
>>
>> This means when a device is assigned with the guest OS, MSI/MSI-X addresses
>> present in PCIe EP are the PAs programmed by the host linux kernel.
>>
>> In x86 MSI/MSI-X physical address range is reserved and iommu is aware
>> about these addreses and transalation is bypassed for these address range.
>>
>> Unlike x86, ARM/ARM64 does not reserve MSI/MSI-X Physical address range and
>> all the transactions including MSI go through iommu/smmu without bypass.
>> This requires extending current vfio MSI layer with additional functionality
>> for ARM/ARM64 by
>> 1. Programing IOVA (referred as a MSI virtual doorbell address)
>> in device's MSI vector as a MSI address.
>> This IOVA will be provided by the userspace based on the
>> MSI/MSI-X addresses reserved for the guest.
>> 2. Create an IOMMU mapping between this IOVA and
>> Physical address (PA) assigned to the MSI vector.
>>
>> This RFC is proposing a solution for MSI/MSI-X passthrough for ARM/ARM64.
>
>
> Hi Pranavkumar,
>
> Freescale has the same, or very similar, need, so any solution in this
> space will need to work for both ARM and powerpc. I'm not a big fan of
> this approach as it seems to require the user to configure MSI/X via
> ioctl and then call a separate ioctl mapping the doorbells. That's more
> code for the user, more code to get wrong and potentially a gap between
> configuring MSI/X and enabling mappings where we could see IOMMU faults.
>
> If we know that doorbell mappings are required, why can't we set aside a
> bank of IOVA space and have them mapped automatically as MSI/X is being
> configured? Then the user's need for special knowledge and handling of
> this case is limited to setup. The IOVA space will be mapped and used
> as needed, we only need the user to specify the IOVA space reserved for
> this. Thanks,
I guess my immediate worry is that it seems to impose a fixed mapping
for all the guests, which would restrict the "shape" of the mappings we
give to a guest. Or did you intend for that IOVA mapping to be defined
on a "per userspace instance" basis?
Thanks,
M.
--
Jazz is not dead. It just smells funny...
Hi Alex,
> -----Original Message-----
> From: Alex Williamson [mailto:[email protected]]
> Sent: Tuesday, July 28, 2015 9:52 PM
> To: Pranavkumar Sawargaonkar
> Cc: [email protected]; [email protected]; linux-arm-
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; Bhushan Bharat-R65777; Yoder
> Stuart-B08248
> Subject: Re: [RFC 0/2] VFIO: Add virtual MSI doorbell support.
>
> On Fri, 2015-07-24 at 14:33 +0530, Pranavkumar Sawargaonkar wrote:
> > In current VFIO MSI/MSI-X implementation, linux host kernel allocates
> > MSI/MSI-X vectors when userspace requests through vfio ioctls.
> > Vfio creates irqfd mappings to notify MSI/MSI-X interrupts to the
> > userspace when raised.
> > Guest OS will see emulated MSI/MSI-X controller and receives an
> > interrupt when kernel notifies the same via irqfd.
> >
> > Host kernel allocates MSI/MSI-X using standard linux routines like
> > pci_enable_msix_range() and pci_enable_msi_range().
> > These routines along with requset_irq() in host kernel sets up
> > MSI/MSI-X vectors with Physical MSI/MSI-X addresses provided by
> > interrupt controller driver in host kernel.
> >
> > This means when a device is assigned with the guest OS, MSI/MSI-X
> > addresses present in PCIe EP are the PAs programmed by the host linux
> kernel.
> >
> > In x86 MSI/MSI-X physical address range is reserved and iommu is aware
> > about these addreses and transalation is bypassed for these address range.
> >
> > Unlike x86, ARM/ARM64 does not reserve MSI/MSI-X Physical address
> > range and all the transactions including MSI go through iommu/smmu
> without bypass.
> > This requires extending current vfio MSI layer with additional
> > functionality for ARM/ARM64 by 1. Programing IOVA (referred as a MSI
> > virtual doorbell address)
> > in device's MSI vector as a MSI address.
> > This IOVA will be provided by the userspace based on the
> > MSI/MSI-X addresses reserved for the guest.
> > 2. Create an IOMMU mapping between this IOVA and
> > Physical address (PA) assigned to the MSI vector.
> >
> > This RFC is proposing a solution for MSI/MSI-X passthrough for
> ARM/ARM64.
>
>
> Hi Pranavkumar,
>
> Freescale has the same, or very similar, need, so any solution in this space
> will need to work for both ARM and powerpc. I'm not a big fan of this
> approach as it seems to require the user to configure MSI/X via ioctl and then
> call a separate ioctl mapping the doorbells. That's more code for the user,
> more code to get wrong and potentially a gap between configuring MSI/X
> and enabling mappings where we could see IOMMU faults.
>
> If we know that doorbell mappings are required, why can't we set aside a
> bank of IOVA space and have them mapped automatically as MSI/X is being
> configured? Then the user's need for special knowledge and handling of this
> case is limited to setup. The IOVA space will be mapped and used as needed,
> we only need the user to specify the IOVA space reserved for this. Thanks,
We probably need a mix of both to support Freescale PowerPC and ARM based
machines.
In this mixed mode the kernel vfio driver will reserve some IOVA for mapping
the MSI page(s). If any other iova mapping overlaps with this, an error will
be returned to user-space. Ideally the reserved IOVA should be chosen in such
a way that it never overlaps, which is easy on some systems but can be tricky
on others, like Freescale PowerPC. This alone is not sufficient for at least
Freescale PowerPC based SOCs, because of a hardware limitation: we need to
fit this reserved iova address within the aperture decided by user-space. So
if we allow user-space to change this reserved iova address to a value
decided by user-space itself, then we can support both ARM and PowerPC based
solutions.
I have an implementation ready/tested with this approach, and if this
approach looks good then I can submit an RFC patch.
Thanks
-Bharat
>
> Alex
On Tue, 2015-07-28 at 17:55 +0100, Marc Zyngier wrote:
> Hi Alex,
>
> On 28/07/15 17:21, Alex Williamson wrote:
> > On Fri, 2015-07-24 at 14:33 +0530, Pranavkumar Sawargaonkar wrote:
> >> In current VFIO MSI/MSI-X implementation, linux host kernel
> >> allocates MSI/MSI-X vectors when userspace requests through vfio ioctls.
> >> Vfio creates irqfd mappings to notify MSI/MSI-X interrupts
> >> to the userspace when raised.
> >> Guest OS will see emulated MSI/MSI-X controller and receives an interrupt
> >> when kernel notifies the same via irqfd.
> >>
> >> Host kernel allocates MSI/MSI-X using standard linux routines
> >> like pci_enable_msix_range() and pci_enable_msi_range().
> >> These routines along with requset_irq() in host kernel sets up
> >> MSI/MSI-X vectors with Physical MSI/MSI-X addresses provided by
> >> interrupt controller driver in host kernel.
> >>
> >> This means when a device is assigned with the guest OS, MSI/MSI-X addresses
> >> present in PCIe EP are the PAs programmed by the host linux kernel.
> >>
> >> In x86 MSI/MSI-X physical address range is reserved and iommu is aware
> >> about these addreses and transalation is bypassed for these address range.
> >>
> >> Unlike x86, ARM/ARM64 does not reserve MSI/MSI-X Physical address range and
> >> all the transactions including MSI go through iommu/smmu without bypass.
> >> This requires extending current vfio MSI layer with additional functionality
> >> for ARM/ARM64 by
> >> 1. Programing IOVA (referred as a MSI virtual doorbell address)
> >> in device's MSI vector as a MSI address.
> >> This IOVA will be provided by the userspace based on the
> >> MSI/MSI-X addresses reserved for the guest.
> >> 2. Create an IOMMU mapping between this IOVA and
> >> Physical address (PA) assigned to the MSI vector.
> >>
> >> This RFC is proposing a solution for MSI/MSI-X passthrough for ARM/ARM64.
> >
> >
> > Hi Pranavkumar,
> >
> > Freescale has the same, or very similar, need, so any solution in this
> > space will need to work for both ARM and powerpc. I'm not a big fan of
> > this approach as it seems to require the user to configure MSI/X via
> > ioctl and then call a separate ioctl mapping the doorbells. That's more
> > code for the user, more code to get wrong and potentially a gap between
> > configuring MSI/X and enabling mappings where we could see IOMMU faults.
> >
> > If we know that doorbell mappings are required, why can't we set aside a
> > bank of IOVA space and have them mapped automatically as MSI/X is being
> > configured? Then the user's need for special knowledge and handling of
> > this case is limited to setup. The IOVA space will be mapped and used
> > as needed, we only need the user to specify the IOVA space reserved for
> > this. Thanks,
>
> I guess my immediate worry is that it seems to impose a fixed mapping
> for all the guests, which would restrict the "shape" of the mappings we
> give to a guest. Or did you intend for that IOVA mapping to be defined
> on a "per userspace instance" basis?
Hi Marc,
Right, I'm not suggesting a fixed mapping imposed on the user, I'm
suggesting the user can set aside a range of IOVA space for VFIO to make
use of for this purpose. The user would be explicitly defining that
range of reserved IOVA space.
We effectively have your first scenario on x86, where the platform
restriction shapes the VM. When we have an x86 guest on x86 host, the
guest and host implicit reserved regions align and we don't have any
problems. However if we wanted to run an ARM64 guest with an assigned
device on an x86 host, we'd need to "shape" the VM to avoid memory
overlapping the implicitly reserved range.
Ideally we could extend the interfaces to support both of these: a
mechanism to discover implicitly reserved ranges and one for the user to
set explicitly reserved ranges. Thanks,
Alex
On Tue, 2015-07-28 at 17:23 +0000, Bhushan Bharat wrote:
> Hi Alex,
>
> > -----Original Message-----
> > From: Alex Williamson [mailto:[email protected]]
> > Sent: Tuesday, July 28, 2015 9:52 PM
> > To: Pranavkumar Sawargaonkar
> > Cc: [email protected]; [email protected]; linux-arm-
> > [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; Bhushan Bharat-R65777; Yoder
> > Stuart-B08248
> > Subject: Re: [RFC 0/2] VFIO: Add virtual MSI doorbell support.
> >
> > On Fri, 2015-07-24 at 14:33 +0530, Pranavkumar Sawargaonkar wrote:
> > > In current VFIO MSI/MSI-X implementation, linux host kernel allocates
> > > MSI/MSI-X vectors when userspace requests through vfio ioctls.
> > > Vfio creates irqfd mappings to notify MSI/MSI-X interrupts to the
> > > userspace when raised.
> > > Guest OS will see emulated MSI/MSI-X controller and receives an
> > > interrupt when kernel notifies the same via irqfd.
> > >
> > > Host kernel allocates MSI/MSI-X using standard linux routines like
> > > pci_enable_msix_range() and pci_enable_msi_range().
> > > These routines along with requset_irq() in host kernel sets up
> > > MSI/MSI-X vectors with Physical MSI/MSI-X addresses provided by
> > > interrupt controller driver in host kernel.
> > >
> > > This means when a device is assigned with the guest OS, MSI/MSI-X
> > > addresses present in PCIe EP are the PAs programmed by the host linux
> > kernel.
> > >
> > > In x86 MSI/MSI-X physical address range is reserved and iommu is aware
> > > about these addreses and transalation is bypassed for these address range.
> > >
> > > Unlike x86, ARM/ARM64 does not reserve MSI/MSI-X Physical address
> > > range and all the transactions including MSI go through iommu/smmu
> > without bypass.
> > > This requires extending current vfio MSI layer with additional
> > > functionality for ARM/ARM64 by 1. Programing IOVA (referred as a MSI
> > > virtual doorbell address)
> > > in device's MSI vector as a MSI address.
> > > This IOVA will be provided by the userspace based on the
> > > MSI/MSI-X addresses reserved for the guest.
> > > 2. Create an IOMMU mapping between this IOVA and
> > > Physical address (PA) assigned to the MSI vector.
> > >
> > > This RFC is proposing a solution for MSI/MSI-X passthrough for
> > ARM/ARM64.
> >
> >
> > Hi Pranavkumar,
> >
> > Freescale has the same, or very similar, need, so any solution in this space
> > will need to work for both ARM and powerpc. I'm not a big fan of this
> > approach as it seems to require the user to configure MSI/X via ioctl and then
> > call a separate ioctl mapping the doorbells. That's more code for the user,
> > more code to get wrong and potentially a gap between configuring MSI/X
> > and enabling mappings where we could see IOMMU faults.
> >
> > If we know that doorbell mappings are required, why can't we set aside a
> > bank of IOVA space and have them mapped automatically as MSI/X is being
> > configured? Then the user's need for special knowledge and handling of this
> > case is limited to setup. The IOVA space will be mapped and used as needed,
> > we only need the user to specify the IOVA space reserved for this. Thanks,
>
> We probably need a mix of both to support Freescale PowerPC and ARM
> based machines.
> In this mix mode kernel vfio driver will reserve some IOVA for mapping
> MSI page/s.
If vfio is reserving pages independently from the user, this becomes
what Marc called "shaping" the VM and what x86 effectively does. An
interface extension should expose these implicit regions so the user can
avoid them for DMA memory mapping.
> If any other iova mapping will overlap with this then it will return
> error and user-space. Ideally this should be choosen in such a way
> that it never overlap, which is easy on some systems but can be tricky
> on some other system like Freescale PowerPC. This is not sufficient
> for at-least Freescale PowerPC based SOC. This is because of hardware
> limitation, where we need to fit this reserved iova address within
> aperture decided by user-space. So if we allow user-space to change
> this reserved iova address to a value decided by user-spece itself
> then we can support both ARM/PowerPC based solutions.
Yes, that's my intention, to allow userspace to specify the reserved
region. I believe you have some additional restrictions on the number
of MSI banks available and whether MSI banks can be shared, but I would
hope that doesn't preclude a shared interface with ARM.
> I have some implementation ready/tested with this approach and if this
> approach looks good then I can submit a RFC patch.
Yes, please post. Thanks,
Alex
Hi Bharat,
On 28 July 2015 at 23:28, Alex Williamson <[email protected]> wrote:
> On Tue, 2015-07-28 at 17:23 +0000, Bhushan Bharat wrote:
>> Hi Alex,
>>
>> > -----Original Message-----
>> > From: Alex Williamson [mailto:[email protected]]
>> > Sent: Tuesday, July 28, 2015 9:52 PM
>> > To: Pranavkumar Sawargaonkar
>> > Cc: [email protected]; [email protected]; linux-arm-
>> > [email protected]; [email protected];
>> > [email protected]; [email protected]; [email protected];
>> > [email protected]; [email protected]; [email protected];
>> > [email protected]; [email protected]; Bhushan Bharat-R65777; Yoder
>> > Stuart-B08248
>> > Subject: Re: [RFC 0/2] VFIO: Add virtual MSI doorbell support.
>> >
>> > On Fri, 2015-07-24 at 14:33 +0530, Pranavkumar Sawargaonkar wrote:
>> > > In current VFIO MSI/MSI-X implementation, linux host kernel allocates
>> > > MSI/MSI-X vectors when userspace requests through vfio ioctls.
>> > > Vfio creates irqfd mappings to notify MSI/MSI-X interrupts to the
>> > > userspace when raised.
>> > > Guest OS will see emulated MSI/MSI-X controller and receives an
>> > > interrupt when kernel notifies the same via irqfd.
>> > >
>> > > Host kernel allocates MSI/MSI-X using standard linux routines like
>> > > pci_enable_msix_range() and pci_enable_msi_range().
>> > > These routines along with requset_irq() in host kernel sets up
>> > > MSI/MSI-X vectors with Physical MSI/MSI-X addresses provided by
>> > > interrupt controller driver in host kernel.
>> > >
>> > > This means when a device is assigned with the guest OS, MSI/MSI-X
>> > > addresses present in PCIe EP are the PAs programmed by the host linux
>> > kernel.
>> > >
>> > > In x86 MSI/MSI-X physical address range is reserved and iommu is aware
>> > > about these addreses and transalation is bypassed for these address range.
>> > >
>> > > Unlike x86, ARM/ARM64 does not reserve MSI/MSI-X Physical address
>> > > range and all the transactions including MSI go through iommu/smmu
>> > without bypass.
>> > > This requires extending current vfio MSI layer with additional
>> > > functionality for ARM/ARM64 by 1. Programing IOVA (referred as a MSI
>> > > virtual doorbell address)
>> > > in device's MSI vector as a MSI address.
>> > > This IOVA will be provided by the userspace based on the
>> > > MSI/MSI-X addresses reserved for the guest.
>> > > 2. Create an IOMMU mapping between this IOVA and
>> > > Physical address (PA) assigned to the MSI vector.
>> > >
>> > > This RFC is proposing a solution for MSI/MSI-X passthrough for
>> > ARM/ARM64.
>> >
>> >
>> > Hi Pranavkumar,
>> >
>> > Freescale has the same, or very similar, need, so any solution in this space
>> > will need to work for both ARM and powerpc. I'm not a big fan of this
>> > approach as it seems to require the user to configure MSI/X via ioctl and then
>> > call a separate ioctl mapping the doorbells. That's more code for the user,
>> > more code to get wrong and potentially a gap between configuring MSI/X
>> > and enabling mappings where we could see IOMMU faults.
>> >
>> > If we know that doorbell mappings are required, why can't we set aside a
>> > bank of IOVA space and have them mapped automatically as MSI/X is being
>> > configured? Then the user's need for special knowledge and handling of this
>> > case is limited to setup. The IOVA space will be mapped and used as needed,
>> > we only need the user to specify the IOVA space reserved for this. Thanks,
>>
>> We probably need a mix of both to support Freescale PowerPC and ARM
>> based machines.
>> In this mix mode kernel vfio driver will reserve some IOVA for mapping
>> MSI page/s.
>
> If vfio is reserving pages independently from the user, this becomes
> what Marc called "shaping" the VM and what x86 effectively does. An
> interface extension should expose these implicit regions so the user can
> avoid them for DMA memory mapping.
>
>> If any other iova mapping will overlap with this then it will return
>> error and user-space. Ideally this should be choosen in such a way
>> that it never overlap, which is easy on some systems but can be tricky
>> on some other system like Freescale PowerPC. This is not sufficient
>> for at-least Freescale PowerPC based SOC. This is because of hardware
>> limitation, where we need to fit this reserved iova address within
>> aperture decided by user-space. So if we allow user-space to change
>> this reserved iova address to a value decided by user-spece itself
>> then we can support both ARM/PowerPC based solutions.
>
> Yes, that's my intention, to allow userspace to specify the reserved
> region. I believe you have some additional restrictions on the number
> of MSI banks available and whether MSI banks can be shared, but I would
> hope that doesn't preclude a shared interface with ARM.
>
>> I have some implementation ready/tested with this approach and if this
>> approach looks good then I can submit a RFC patch.
>
> Yes, please post. Thanks,
Could you please share a tentative timeline by which you will be
posting your patches?
Also, are you planning to post counterpart patches for qemu or kvmtool?
Thanks,
Pranav
> -----Original Message-----
> From: Pranavkumar Sawargaonkar [mailto:[email protected]]
> Sent: Tuesday, August 04, 2015 11:18 AM
> To: Bhushan Bharat-R65777
> Cc: [email protected]; Alex Williamson; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; Yoder Stuart-B08248
> Subject: Re: [RFC 0/2] VFIO: Add virtual MSI doorbell support.
>
> Hi Bharat,
>
> On 28 July 2015 at 23:28, Alex Williamson <[email protected]>
> wrote:
> > On Tue, 2015-07-28 at 17:23 +0000, Bhushan Bharat wrote:
> >> Hi Alex,
> >>
> >> > -----Original Message-----
> >> > From: Alex Williamson [mailto:[email protected]]
> >> > Sent: Tuesday, July 28, 2015 9:52 PM
> >> > To: Pranavkumar Sawargaonkar
> >> > Cc: [email protected]; [email protected]; linux-arm-
> >> > [email protected]; [email protected];
> >> > [email protected]; [email protected];
> >> > [email protected]; [email protected]; [email protected];
> >> > [email protected]; [email protected]; [email protected];
> >> > Bhushan Bharat-R65777; Yoder
> >> > Stuart-B08248
> >> > Subject: Re: [RFC 0/2] VFIO: Add virtual MSI doorbell support.
> >> >
> >> > On Fri, 2015-07-24 at 14:33 +0530, Pranavkumar Sawargaonkar wrote:
> >> > > In current VFIO MSI/MSI-X implementation, linux host kernel
> >> > > allocates MSI/MSI-X vectors when userspace requests through vfio
> ioctls.
> >> > > Vfio creates irqfd mappings to notify MSI/MSI-X interrupts to the
> >> > > userspace when raised.
> >> > > Guest OS will see emulated MSI/MSI-X controller and receives an
> >> > > interrupt when kernel notifies the same via irqfd.
> >> > >
> >> > > Host kernel allocates MSI/MSI-X using standard linux routines
> >> > > like
> >> > > pci_enable_msix_range() and pci_enable_msi_range().
> >> > > These routines along with requset_irq() in host kernel sets up
> >> > > MSI/MSI-X vectors with Physical MSI/MSI-X addresses provided by
> >> > > interrupt controller driver in host kernel.
> >> > >
> >> > > This means when a device is assigned with the guest OS, MSI/MSI-X
> >> > > addresses present in PCIe EP are the PAs programmed by the host
> >> > > linux
> >> > kernel.
> >> > >
> >> > > In x86 MSI/MSI-X physical address range is reserved and iommu is
> >> > > aware about these addreses and transalation is bypassed for these
> address range.
> >> > >
> >> > > Unlike x86, ARM/ARM64 does not reserve MSI/MSI-X Physical address
> >> > > range and all the transactions including MSI go through
> >> > > iommu/smmu
> >> > without bypass.
> >> > > This requires extending current vfio MSI layer with additional
> >> > > functionality for ARM/ARM64 by 1. Programing IOVA (referred as a
> >> > > MSI virtual doorbell address)
> >> > > in device's MSI vector as a MSI address.
> >> > > This IOVA will be provided by the userspace based on the
> >> > > MSI/MSI-X addresses reserved for the guest.
> >> > > 2. Create an IOMMU mapping between this IOVA and
> >> > > Physical address (PA) assigned to the MSI vector.
> >> > >
> >> > > This RFC is proposing a solution for MSI/MSI-X passthrough for
> >> > ARM/ARM64.
> >> >
> >> >
> >> > Hi Pranavkumar,
> >> >
> >> > Freescale has the same, or very similar, need, so any solution in
> >> > this space will need to work for both ARM and powerpc. I'm not a
> >> > big fan of this approach as it seems to require the user to
> >> > configure MSI/X via ioctl and then call a separate ioctl mapping
> >> > the doorbells. That's more code for the user, more code to get
> >> > wrong and potentially a gap between configuring MSI/X and enabling
> mappings where we could see IOMMU faults.
> >> >
> >> > If we know that doorbell mappings are required, why can't we set
> >> > aside a bank of IOVA space and have them mapped automatically as
> >> > MSI/X is being configured? Then the user's need for special
> >> > knowledge and handling of this case is limited to setup. The IOVA
> >> > space will be mapped and used as needed, we only need the user to
> >> > specify the IOVA space reserved for this. Thanks,
> >>
> >> We probably need a mix of both to support Freescale PowerPC and ARM
> >> based machines.
> >> In this mix mode kernel vfio driver will reserve some IOVA for
> >> mapping MSI page/s.
> >
> > If vfio is reserving pages independently from the user, this becomes
> > what Marc called "shaping" the VM and what x86 effectively does. An
> > interface extension should expose these implicit regions so the user
> > can avoid them for DMA memory mapping.
> >
> >> If any other iova mapping will overlap with this then it will return
> >> error and user-space. Ideally this should be choosen in such a way
> >> that it never overlap, which is easy on some systems but can be
> >> tricky on some other system like Freescale PowerPC. This is not
> >> sufficient for at-least Freescale PowerPC based SOC. This is because
> >> of hardware limitation, where we need to fit this reserved iova
> >> address within aperture decided by user-space. So if we allow
> >> user-space to change this reserved iova address to a value decided by
> >> user-spece itself then we can support both ARM/PowerPC based
> solutions.
> >
> > Yes, that's my intention, to allow userspace to specify the reserved
> > region. I believe you have some additional restrictions on the number
> > of MSI banks available and whether MSI banks can be shared, but I
> > would hope that doesn't preclude a shared interface with ARM.
> >
> >> I have some implementation ready/tested with this approach and if
> >> this approach looks good then I can submit a RFC patch.
> >
> > Yes, please post. Thanks,
>
> Could you please share a tentative timeline by which you will be posting your
> patches ?
I have not touched that code for a while; I am planning to send the patch in a couple of weeks.
> Also are you planning to post counterpart patches for qemu or kvmtool ?
I will send only QEMU side changes.
Thanks
-Bharat
>
> Thanks,
> Pranav