Please see the commit log and comments in patch 1 for a general
explanation of the problems that this series tries to address. The
general problem is that we have several cases where we want to expose
variable sized information to the user, whether it's sparse mmaps for
a region, as implemented here, or DMA mapping ranges of an IOMMU, or
reserved MSI mapping ranges, etc. Extending data structures is hard;
extending them to report variable sized data is really hard. After
considering several options, I think the best approach is to copy how
PCI does capabilities. This allows the ioctl to only expose the
capabilities that are relevant for them, avoids data structures that
are too complicated to parse, and avoids creating a new ioctl each
time we think of something else that we'd like to report. This method
also doesn't preclude extensions to the fixed structure since the
offset of these capabilities is entirely dynamic.
Comments welcome, I'll also follow-up to the QEMU and KVM lists with
an RFC making use of this for mmaps skipping over the MSI-X table.
Thanks,
Alex
---
Alex Williamson (3):
vfio: Define capability chains
vfio: Define sparse mmap capability for regions
vfio/pci: Include sparse mmap capability for MSI-X table regions
drivers/vfio/pci/vfio_pci.c | 101 +++++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/vfio.h | 53 ++++++++++++++++++++++-
2 files changed, 152 insertions(+), 2 deletions(-)
We have a few cases where we need to extend the data returned from the
INFO ioctls in VFIO. For instance we already have devices exposed
through vfio-pci where VFIO_DEVICE_GET_REGION_INFO reports the region
as mmap-capable, but really only supports sparse mmaps, avoiding the
MSI-X table. If we wanted to provide in-kernel emulation or extended
functionality for devices, we'd also want the ability to tell the
user not to mmap various regions, rather than forcing them to figure
it out on their own.
Another example is VFIO_IOMMU_GET_INFO. We'd really like to expose
the actual IOVA capabilities of the IOMMU rather than letting the
user assume the address space they have available to them. We could
add IOVA base and size fields to struct vfio_iommu_type1_info, but
what if we have multiple IOVA ranges. For instance x86 uses a range
of addresses at 0xfee00000 for MSI vectors. These typically are not
available for standard DMA IOVA mappings and splits our available IOVA
space into two ranges. POWER systems have both an IOVA window below
4G as well as dynamic data window which they can use to remap all of
guest memory.
Representing variable sized arrays within a fixed structure makes it
very difficult to parse, we'd therefore like to put this data beyond
fixed fields within the data structures. One way to do this is to
emulate capabilities in PCI configuration space. A new flag indciates
whether capabilties are supported and a new fixed field reports the
offset of the first entry. Users can then walk the chain to find
capabilities, adding capabilities does not require additional fields
in the fixed structure, and parsing variable sized data becomes
trivial.
This patch outlines the theory and base header structure, which
should be shared by all future users.
Signed-off-by: Alex Williamson <[email protected]>
---
include/uapi/linux/vfio.h | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 751b69f..432569f 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -59,6 +59,33 @@
#define VFIO_TYPE (';')
#define VFIO_BASE 100
+/*
+ * For extension of INFO ioctls, VFIO makes use of a capability chain
+ * designed after PCI/e capabilities. A flag bit indicates whether
+ * this capability chain is supported and a field defined in the fixed
+ * structure defines the offset of the first capability in the chain.
+ * This field is only valid when the corresponding bit in the flags
+ * bitmap is set. This offset field is relative to the start of the
+ * INFO buffer, as is the next field within each capability header.
+ * The id within the header is a shared address space per INFO ioctl,
+ * while the version field is specific to the capability id. The
+ * contents following the header are specific to the capability id.
+ */
+struct vfio_info_cap_header {
+ __u16 id; /* Identifies capability */
+ __u16 version; /* Version specific to the capability ID */
+ __u32 next; /* Offset of next capability */
+};
+
+/*
+ * Callers of INFO ioctls passing insufficiently sized buffers will see
+ * the capability chain flag bit set, a zero value for the first capability
+ * offset (if available within the provided argsz), and argsz will be
+ * updated to report the necessary buffer size. For compatibility, the
+ * INFO ioctl will not report error in this case, but the capability chain
+ * will not be available.
+ */
+
/* -------- IOCTLs for VFIO file descriptor (/dev/vfio/vfio) -------- */
/**
We can't always support mmap across an entire device region, for
example we deny mmaps covering the MSI-X table of PCI devices, but
we don't really have a way to report it. We expect the user to
implicitly know this restriction. We also can't split the region
because vfio-pci defines an API with fixed region index to BAR
number mapping. We therefore define a new capability which lists
areas within the region that may be mmap'd. In addition to the
MSI-X case, this potentially enables in-kernel emulation and
extensions to devices.
Signed-off-by: Alex Williamson <[email protected]>
---
include/uapi/linux/vfio.h | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 432569f..d3f6499 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -221,13 +221,37 @@ struct vfio_region_info {
#define VFIO_REGION_INFO_FLAG_READ (1 << 0) /* Region supports read */
#define VFIO_REGION_INFO_FLAG_WRITE (1 << 1) /* Region supports write */
#define VFIO_REGION_INFO_FLAG_MMAP (1 << 2) /* Region supports mmap */
+#define VFIO_REGION_INFO_FLAG_CAPS (1 << 3) /* Info supports caps */
__u32 index; /* Region index */
- __u32 resv; /* Reserved for alignment */
+ __u32 cap_offset; /* Offset within info struct of first cap */
__u64 size; /* Region size (bytes) */
__u64 offset; /* Region offset from start of device fd */
};
#define VFIO_DEVICE_GET_REGION_INFO _IO(VFIO_TYPE, VFIO_BASE + 8)
+/*
+ * The sparse mmap capability allows finer granularity of specifying areas
+ * within a region with mmap support. When specified, the user should only
+ * mmap the offset ranges specified by the areas array. mmaps outside of the
+ * areas specified may fail (such as the range covering a PCI MSI-X table) or
+ * may result in improper device behavior.
+ *
+ * The structures below define version 1 of this capability.
+ */
+#define VFIO_REGION_INFO_CAP_SPARSE_MMAP 1
+
+struct vfio_region_sparse_mmap_area {
+ __u64 offset; /* Offset of mmap'able area within region */
+ __u64 size; /* Size of mmap'able area */
+};
+
+struct vfio_region_info_cap_sparse_mmap {
+ struct vfio_info_cap_header header;
+ __u32 nr_areas;
+ __u32 reserved;
+ struct vfio_region_sparse_mmap_area areas[];
+};
+
/**
* VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
* struct vfio_irq_info)
vfio-pci has never allowed the user to directly mmap the MSI-X vector
table, but we've always relied on implicit knowledge of the user that
they cannot do this. Now that we have capability chains that we can
expose in the region info ioctl and a sparse mmap capability that
represents the sub-areas within the region that can be mmap'd, we can
make the mmap constraints more explicit.
Signed-off-by: Alex Williamson <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 101 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 100 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 32b88bd..46e7aed 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -421,6 +421,77 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
return walk.ret;
}
+struct caps {
+ struct vfio_info_cap_header *buf;
+ size_t size;
+ size_t head;
+};
+
+static void *add_region_info_cap(struct caps *caps,
+ size_t size, u16 id, u16 version)
+{
+ void *tmp;
+ struct vfio_info_cap_header *header;
+
+ /* This would be ridiculous and exceeds the ioctl's abilities */
+ BUG_ON(caps->size + size + sizeof(struct vfio_region_info) > U32_MAX);
+
+ tmp = krealloc(caps->buf, caps->size + size, GFP_KERNEL);
+ if (!tmp) {
+ kfree(caps->buf);
+ caps->size = 0;
+ return ERR_PTR(-ENOMEM);
+ }
+
+ caps->buf = tmp;
+ header = tmp + caps->size;
+ header->id = id;
+ header->version = version;
+ header->next = caps->head;
+ caps->head = caps->size + sizeof(struct vfio_region_info);
+ caps->size += size;
+
+ return header;
+}
+
+static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev, struct caps *caps)
+{
+ struct vfio_region_info_cap_sparse_mmap *sparse;
+ size_t end, size;
+ int nr_areas = 2, i = 0;
+
+ end = pci_resource_len(vdev->pdev, vdev->msix_bar);
+
+ /* If MSI-X table is aligned to the start or end, only one area */
+ if (((vdev->msix_offset & PAGE_MASK) == 0) ||
+ (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) >= end))
+ nr_areas = 1;
+
+ size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
+
+ sparse = add_region_info_cap(caps, size,
+ VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
+ if (IS_ERR(sparse))
+ return PTR_ERR(sparse);
+
+ sparse->nr_areas = nr_areas;
+
+ if (vdev->msix_offset & PAGE_MASK) {
+ sparse->areas[i].offset = 0;
+ sparse->areas[i].size = vdev->msix_offset & PAGE_MASK;
+ i++;
+ }
+
+ if (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) < end) {
+ sparse->areas[i].offset = PAGE_ALIGN(vdev->msix_offset +
+ vdev->msix_size);
+ sparse->areas[i].size = end - sparse->areas[i].offset;
+ i++;
+ }
+
+ return 0;
+}
+
static long vfio_pci_ioctl(void *device_data,
unsigned int cmd, unsigned long arg)
{
@@ -451,6 +522,8 @@ static long vfio_pci_ioctl(void *device_data,
} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
struct pci_dev *pdev = vdev->pdev;
struct vfio_region_info info;
+ struct caps caps = { .buf = NULL, .size = 0, .head = 0 };
+ int ret;
minsz = offsetofend(struct vfio_region_info, offset);
@@ -479,8 +552,15 @@ static long vfio_pci_ioctl(void *device_data,
VFIO_REGION_INFO_FLAG_WRITE;
if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
pci_resource_flags(pdev, info.index) &
- IORESOURCE_MEM && info.size >= PAGE_SIZE)
+ IORESOURCE_MEM && info.size >= PAGE_SIZE) {
info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
+ if (info.index == vdev->msix_bar) {
+ ret = msix_sparse_mmap_cap(vdev, &caps);
+ if (ret)
+ return ret;
+ }
+ }
+
break;
case VFIO_PCI_ROM_REGION_INDEX:
{
@@ -520,6 +600,25 @@ static long vfio_pci_ioctl(void *device_data,
return -EINVAL;
}
+ if (caps.size) {
+ info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
+ if (info.argsz < sizeof(info) + caps.size) {
+ info.argsz = sizeof(info) + caps.size;
+ info.cap_offset = 0;
+ } else {
+ ret = copy_to_user((void __user *)arg +
+ sizeof(info), caps.buf,
+ caps.size);
+ if (ret) {
+ kfree(caps.buf);
+ return ret;
+ }
+ info.cap_offset = caps.head;
+ }
+
+ kfree(caps.buf);
+ }
+
return copy_to_user((void __user *)arg, &info, minsz);
} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
On 11/24/2015 07:43 AM, Alex Williamson wrote:
> Please see the commit log and comments in patch 1 for a general
> explanation of the problems that this series tries to address. The
> general problem is that we have several cases where we want to expose
> variable sized information to the user, whether it's sparse mmaps for
> a region, as implemented here, or DMA mapping ranges of an IOMMU, or
> reserved MSI mapping ranges, etc. Extending data structures is hard;
> extending them to report variable sized data is really hard. After
> considering several options, I think the best approach is to copy how
> PCI does capabilities. This allows the ioctl to only expose the
> capabilities that are relevant for them, avoids data structures that
> are too complicated to parse, and avoids creating a new ioctl each
> time we think of something else that we'd like to report. This method
> also doesn't preclude extensions to the fixed structure since the
> offset of these capabilities is entirely dynamic.
>
> Comments welcome, I'll also follow-up to the QEMU and KVM lists with
> an RFC making use of this for mmaps skipping over the MSI-X table.
> Thanks,
Out of curiosity - could this information be exposed to the userspace via
/sys/bus/pci/devices/xxxx:xx:xx:x/vfio_xxxx? It seems not to change after
vfio_pci driver is bound to a device.
> Alex
>
> ---
>
> Alex Williamson (3):
> vfio: Define capability chains
> vfio: Define sparse mmap capability for regions
> vfio/pci: Include sparse mmap capability for MSI-X table regions
>
>
> drivers/vfio/pci/vfio_pci.c | 101 +++++++++++++++++++++++++++++++++++++++++++
> include/uapi/linux/vfio.h | 53 ++++++++++++++++++++++-
> 2 files changed, 152 insertions(+), 2 deletions(-)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
--
Alexey
On Fri, 2015-12-18 at 13:05 +1100, Alexey Kardashevskiy wrote:
> On 11/24/2015 07:43 AM, Alex Williamson wrote:
> > Please see the commit log and comments in patch 1 for a general
> > explanation of the problems that this series tries to address. The
> > general problem is that we have several cases where we want to
> > expose
> > variable sized information to the user, whether it's sparse mmaps
> > for
> > a region, as implemented here, or DMA mapping ranges of an IOMMU,
> > or
> > reserved MSI mapping ranges, etc. Extending data structures is
> > hard;
> > extending them to report variable sized data is really hard. After
> > considering several options, I think the best approach is to copy
> > how
> > PCI does capabilities. This allows the ioctl to only expose the
> > capabilities that are relevant for them, avoids data structures
> > that
> > are too complicated to parse, and avoids creating a new ioctl each
> > time we think of something else that we'd like to report. This
> > method
> > also doesn't preclude extensions to the fixed structure since the
> > offset of these capabilities is entirely dynamic.
> >
> > Comments welcome, I'll also follow-up to the QEMU and KVM lists
> > with
> > an RFC making use of this for mmaps skipping over the MSI-X table.
> > Thanks,
>
> Out of curiosity - could this information be exposed to the userspace
> via
> /sys/bus/pci/devices/xxxx:xx:xx:x/vfio_xxxx? It seems not to change
> after
> vfio_pci driver is bound to a device.
For what purpose? vfio doesn't have a sysfs interface, why start one?
Thanks,
Alex
On 12/18/2015 01:38 PM, Alex Williamson wrote:
> On Fri, 2015-12-18 at 13:05 +1100, Alexey Kardashevskiy wrote:
>> On 11/24/2015 07:43 AM, Alex Williamson wrote:
>>> Please see the commit log and comments in patch 1 for a general
>>> explanation of the problems that this series tries to address. The
>>> general problem is that we have several cases where we want to
>>> expose
>>> variable sized information to the user, whether it's sparse mmaps
>>> for
>>> a region, as implemented here, or DMA mapping ranges of an IOMMU,
>>> or
>>> reserved MSI mapping ranges, etc. Extending data structures is
>>> hard;
>>> extending them to report variable sized data is really hard. After
>>> considering several options, I think the best approach is to copy
>>> how
>>> PCI does capabilities. This allows the ioctl to only expose the
>>> capabilities that are relevant for them, avoids data structures
>>> that
>>> are too complicated to parse, and avoids creating a new ioctl each
>>> time we think of something else that we'd like to report. This
>>> method
>>> also doesn't preclude extensions to the fixed structure since the
>>> offset of these capabilities is entirely dynamic.
>>>
>>> Comments welcome, I'll also follow-up to the QEMU and KVM lists
>>> with
>>> an RFC making use of this for mmaps skipping over the MSI-X table.
>>> Thanks,
>>
>> Out of curiosity - could this information be exposed to the userspace
>> via
>> /sys/bus/pci/devices/xxxx:xx:xx:x/vfio_xxxx? It seems not to change
>> after
>> vfio_pci driver is bound to a device.
>
> For what purpose? vfio doesn't have a sysfs interface, why start one?
> Thanks,
well, it could simplify debugging a bit if this information was available
from the userspace without programming a test tool doing some ioctl()'s.
Not a big deal though...
--
Alexey