The series is rebased on top of Jason's VFIO refactoring collection:
https://github.com/jgunthorpe/linux/pull/3
We would like to receive review comments on the mdev driver itself
and on the common VFIO IMS support code that was suggested by Jason. The
previous version of the DEV-MSI/IMS code is still under review, and the
IOASID code is still under design.
v6:
- Rebased on top of Jason's recent VFIO refactoring.
- Move VFIO IMS setup code to common (Jason)
- Changed patch ordering to minimize code stubs (Jason)
v5:
- Split out non driver IMS code to its own series.
- Removed common devsec code, Bjorn asked to deal with it post 5.11 and keep
custom code for now.
- Reworked irq_entries for IMS so emulated vector is also included.
- Reworked vidxd_send_interrupt() to take irq_entry directly (data ready for
consumption) (Thomas)
- Removed pointer to msi_entry in irq_entries (Thomas)
- Removed irq_domain check on free entries (Thomas)
- Split out irqbypass management code (Thomas)
- Fix EXPORT_SYMBOL to EXPORT_SYMBOL_GPL (Thomas)
v4:
dev-msi:
- Make interrupt remapping code more readable (Thomas)
- Add flush writes to unmask/write and reset ims slots (Thomas)
- Interrupt Message Storm -> Interrupt Message Store (Thomas)
- Merge in pasid programming code. (Thomas)
mdev:
- Fixed up domain assignment (Thomas)
- Define magic numbers (Thomas)
- Move siov detection code to PCI common (Thomas)
- Remove duplicated MSI entry info (Thomas)
- Convert code to use ims_slot (Thomas)
- Add explanation of pasid programming for IMS entry (Thomas)
- Add int handle release support due to spec 1.1 update.
v3:
Dev-msi:
- No need to add support for 2 different dev-msi irq domains; a common
  one can be used for both cases (with IR enabled/disabled)
- Add arch specific function to specify additions to msi_prepare callback
instead of making the callback a weak function
- Call platform ops directly instead of a wrapper function
- Make mask/unmask callbacks as void functions
- dev->msi_domain should be updated at the device driver level before
  calling dev_msi_alloc_irqs()
- dev_msi_alloc/free_irqs() cannot be used for PCI devices
- Followed the generic layering scheme: infrastructure bits -> arch
  bits -> enabling bits
Mdev:
- Remove set kvm group notifier (Yan Zhao)
- Fix VFIO irq trigger removal (Yan Zhao)
- Add mmio read flush to ims mask (Jason)
v2:
IMS (now dev-msi):
- With recommendations from Jason/Thomas/Dan on making IMS more generic:
- Pass a non-pci generic device(struct device) for IMS management instead of
mdev
- Remove all references to mdev and symbol_get/put
- Remove all references to IMS in common code and replace with dev-msi
- Remove dynamic allocation of platform-msi interrupts: no groups, no
  new msi list or list helpers
- Create a generic dev-msi domain with and without interrupt remapping
enabled.
- Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs apis
mdev:
- Removed unrelated bits from SVA enabling that are not necessary for
  this submission. (Kevin)
- Restructured entire mdev driver series to make reviewing easier (Kevin)
- Made rw emulation more robust (Kevin)
- Removed uuid wq type and added single dedicated wq type (Kevin)
- Locking fixes for vdev (Yan Zhao)
- VFIO MSIX trigger fixes (Yan Zhao)
This code series will match the support of the 5.6 kernel (stage 1) driver,
but for the guest.
The code has dependency on DEV_MSI/IMS enabling code:
https://lore.kernel.org/lkml/[email protected]/
The code has dependency on idxd driver sub-driver cleanup series:
https://lore.kernel.org/dmaengine/162163546245.260470.18336189072934823712.stgit@djiang5-desk3.ch.intel.com/T/#t
The code has dependency on Jason's VFIO refactoring:
https://lore.kernel.org/kvm/[email protected]/
Part 1 of the driver has been accepted in v5.6 kernel. It supports dedicated
workqueue (wq) without Shared Virtual Memory (SVM) support.
Part 2 of the driver supports shared wq and SVM and has been accepted in
kernel v5.11.
The VFIO mediated device framework allows vendor drivers to wrap a portion
of device resources into virtual devices (mdevs). Each mdev can be assigned
to a different guest using the same set of VFIO uAPIs as assigning a
physical device. Access to the mdev resources is served with mixed
policies. For example, vendor drivers typically mark the data-path interface
as pass-through for fast guest operations, and then trap-and-mediate the
control-path interface to avoid undesired interference between mdevs. Some
level of emulation is necessary behind the vfio mdev to compose the virtual
device interface.
This series brings mdev to the idxd driver to enable Intel Scalable IOV
(SIOV), a hardware-assisted mediated pass-through technology. SIOV makes
each DSA wq independently assignable through PASID-granular resource/DMA
isolation. It improves scalability and reduces mediation complexity
compared to purely software-based mdev implementations. Each assigned wq is
configured by the host and exposed to the guest in a read-only configuration
mode, which allows the guest to use the wq without additional setup. This
design greatly reduces the emulation bits and focuses them on handling
commands from guests.
There are two possible avenues to support virtual device composition:
1. VFIO mediated device (mdev) or 2. user space DMA through a char device
(or UACCE). Given the small amount of emulation needed to satisfy our needs,
and given that VFIO mdev already has the infrastructure to support device
passthrough, we feel that VFIO mdev is the better route. For a more in-depth
explanation, see the documentation in
Documentation/driver-api/vfio/mdev-idxd.rst.
This series introduces the “1dwq-v1” mdev type. This mdev type allows
allocation of a single dedicated wq from the available dedicated wqs. After
a workqueue (wq) is enabled, the user generates a uuid. On mdev
creation, the mdev driver code will find a dwq depending on the mdev
type. When the create operation is successful, the user-generated uuid
can be passed to qemu. When the guest boots up, it should discover a
DSA device when doing PCI discovery.
Example usage of the “1dwq-v1” type:
1. Enable a wq with the “mdev” wq type.
2. Generate a uuid.
3. The uuid is written to the mdev class sysfs path:
echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-1dwq-v1/create
4. Pass the following parameter to qemu:
"-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"
The wq exported through mdev will have the read-only config bit set
for configuration. This means that the device does not require the
typical configuration. After enabling the device, the user must set the
wq type and name. That is all that is necessary to enable the wq and start
using it. The single wq configuration is not the only way to create the
mdev; support for multiple wqs per mdev is planned as future work.
The mdev utilizes Interrupt Message Store, or IMS[3], a device-specific
MSI implementation, instead of MSI-X for guest interrupts. This
preserves MSI-X for host usage and also allows a significantly larger
number of interrupt vectors for guest usage.
The idxd driver implements IMS as on-device memory mapped unified
storage. Each interrupt message is stored as a DWORD size data payload
and a 64-bit address (same as MSI-X). Access to the IMS is through the
host idxd driver.
The idxd driver makes use of the generic IMS irq chip and domain which
stores the interrupt messages as an array in device memory. Allocation and
freeing of interrupts happens via the generic msi_domain_alloc/free_irqs()
interface. One only needs to ensure the interrupt domain is stored in
the underlying device struct.
The kernel tree can be found at [7].
[1]: https://lore.kernel.org/lkml/157965011794.73301.15960052071729101309.stgit@djiang5-desk3.ch.intel.com/
[2]: https://software.intel.com/en-us/articles/intel-sdm
[3]: https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
[4]: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
[5]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
[6]: https://intel.github.io/idxd/
[7]: https://github.com/intel/idxd-driver idxd-stage2.5
---
Dave Jiang (20):
vfio/mdev: idxd: add theory of operation documentation for idxd mdev
dmaengine: idxd: add external module driver support for dsa_bus_type
dmaengine: idxd: add IMS offset and size retrieval code
dmaengine: idxd: add portal offset for IMS portals
vfio: mdev: common lib code for setting up Interrupt Message Store
vfio/mdev: idxd: add PCI config for read/write for mdev
vfio/mdev: idxd: Add administrative commands emulation for mdev
vfio/mdev: idxd: Add mdev device context initialization
vfio/mdev: Add mmio read/write support for mdev
vfio/mdev: idxd: add mdev type as a new wq type
vfio/mdev: idxd: Add basic driver setup for idxd mdev
vfio: move VFIO PCI macros to common header
vfio/mdev: idxd: add mdev driver registration and helper functions
vfio/mdev: idxd: add 1dwq-v1 mdev type
vfio/mdev: idxd: ims domain setup for the vdcm
vfio/mdev: idxd: add new wq state for mdev
vfio/mdev: idxd: add error notification from host driver to mediated device
vfio: move vfio_pci_set_ctx_trigger_single to common code
vfio: mdev: Add device request interface
vfio/mdev: idxd: setup request interrupt
.../ABI/stable/sysfs-driver-dma-idxd | 6 +
drivers/dma/Kconfig | 1 +
drivers/dma/idxd/Makefile | 2 +
drivers/dma/idxd/cdev.c | 4 +-
drivers/dma/idxd/device.c | 102 +-
drivers/dma/idxd/idxd.h | 52 +-
drivers/dma/idxd/init.c | 59 +
drivers/dma/idxd/irq.c | 21 +-
drivers/dma/idxd/registers.h | 25 +-
drivers/dma/idxd/sysfs.c | 22 +
drivers/vfio/Makefile | 2 +-
drivers/vfio/mdev/Kconfig | 21 +
drivers/vfio/mdev/Makefile | 4 +
drivers/vfio/mdev/idxd/Makefile | 4 +
drivers/vfio/mdev/idxd/mdev.c | 958 ++++++++++++++++
drivers/vfio/mdev/idxd/mdev.h | 112 ++
drivers/vfio/mdev/idxd/vdev.c | 1016 +++++++++++++++++
drivers/vfio/mdev/mdev_irqs.c | 341 ++++++
drivers/vfio/pci/vfio_pci_intrs.c | 63 +-
drivers/vfio/pci/vfio_pci_private.h | 6 -
drivers/vfio/vfio_common.c | 74 ++
include/linux/mdev.h | 66 ++
include/linux/vfio.h | 10 +
include/uapi/linux/idxd.h | 2 +
24 files changed, 2890 insertions(+), 83 deletions(-)
create mode 100644 drivers/vfio/mdev/idxd/Makefile
create mode 100644 drivers/vfio/mdev/idxd/mdev.c
create mode 100644 drivers/vfio/mdev/idxd/mdev.h
create mode 100644 drivers/vfio/mdev/idxd/vdev.c
create mode 100644 drivers/vfio/mdev/mdev_irqs.c
create mode 100644 drivers/vfio/vfio_common.c
--
Add idxd vfio mediated device theory of operation documentation.
Provide a description of the mdev design and usage, and explain why
vfio mdev was chosen.
Reviewed-by: Ashok Raj <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
---
Documentation/driver-api/vfio/mdev-idxd.rst | 379 +++++++++++++++++++++++++++
MAINTAINERS | 7
2 files changed, 386 insertions(+)
create mode 100644 Documentation/driver-api/vfio/mdev-idxd.rst
diff --git a/Documentation/driver-api/vfio/mdev-idxd.rst b/Documentation/driver-api/vfio/mdev-idxd.rst
new file mode 100644
index 000000000000..5c793638e176
--- /dev/null
+++ b/Documentation/driver-api/vfio/mdev-idxd.rst
@@ -0,0 +1,379 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+IDXD Overview
+=============
+IDXD (Intel Data Accelerator Driver) is the driver for the Intel Data
+Streaming Accelerator (DSA). Intel DSA is a high performance data copy
+and transformation accelerator. In addition to data move operations,
+the device also supports data fill, CRC generation, Data Integrity Field
+(DIF), and memory compare and delta generation. Intel DSA supports
+a variety of PCI-SIG defined capabilities such as Address Translation
+Services (ATS), Process Address Space ID (PASID), Page Request Interface
+(PRI), Message Signaled Interrupts Extended (MSI-X), and Advanced Error
+Reporting (AER). Some of those capabilities enable the device to support
+Shared Virtual Memory (SVM), also known as Shared Virtual Addressing
+(SVA). Intel DSA also supports Intel Scalable I/O Virtualization (SIOV)
+to improve the scalability of device assignment.
+
+
+The Intel DSA device contains the following basic components:
+* Work queue (WQ)
+
+ A WQ is on-device storage used to queue descriptors to the
+ device. Requests are added to a WQ by using new CPU instructions
+ (MOVDIR64B and ENQCMD(S)) to write to the memory mapped “portal”
+ associated with each WQ.
+
+* Engine
+
+ Operation unit that pulls descriptors from WQs and processes them.
+
+* Group
+
+ Abstract container to associate one or more engines with one or more WQs.
+
+
+Two types of WQs are supported:
+* Dedicated WQ (DWQ)
+
+ Usually a single client owns this type of WQ exclusively and submits
+ work to it. The MOVDIR64B instruction is used to submit descriptors to
+ this type of WQ. The instruction is a posted write, therefore the
+ submitter must ensure it does not exceed the WQ depth on submission. The
+ use of PASID is optional with a DWQ. Multiple clients can submit to
+ a DWQ, but synchronization is required because when the WQ is full,
+ a submission is silently dropped (see the sketch after this list).
+
+* Shared WQ (SWQ)
+
+ Multiple clients can submit work to this WQ. The submitter must use
+ ENQCMDS (from supervisor mode) or ENQCMD (from user mode). These
+ instructions are non-posted writes; that means a response is
+ returned for the issued instruction. The EFLAGS.ZF bit is set
+ when the command fails (busy or fail).
+ The use of PASID is mandatory to identify the address space
+ of each client.
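+
+A minimal sketch of the two submission paths, using the x86 helpers that
+wrap these instructions (iosubmit_cmds512() and enqcmds()); the portal
+mapping and descriptor setup are assumed::
+
+  #include <asm/io.h>
+  #include <asm/special_insns.h>
+
+  /* 'portal' is a mapped WQ portal, 'desc' points to a 64-byte descriptor */
+  static int submit_desc(void __iomem *portal, const void *desc, bool dedicated)
+  {
+          if (dedicated) {
+                  /* MOVDIR64B: posted write, no status; caller tracks occupancy */
+                  iosubmit_cmds512(portal, desc, 1);
+                  return 0;
+          }
+          /* ENQCMDS: non-posted; returns -EAGAIN when the shared WQ is full */
+          return enqcmds(portal, desc);
+  }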
+
+
+For more information about the new instructions, see [1][2].
+
+The IDXD driver is broken down into the following usages:
+* In-kernel interface through the dmaengine subsystem API.
+* Userspace DMA support through a character device. mmap(2) is utilized
+ to map the MMIO addresses (or portals) directly for descriptor submission.
+* VFIO mediated device (mdev) supporting device passthrough usages.
+
+This document is only for the mdev usage.
+
+
+=================================
+Assignable Device Interface (ADI)
+=================================
+The term ADI is used to represent the minimal unit of assignment for an
+Intel Scalable IOV device. Each ADI instance refers to the set of device
+backend resources that are allocated, configured and organized as an
+isolated unit.
+
+Intel DSA defines each WQ as an ADI. The MMIO registers of each work queue
+are partitioned into two categories:
+* MMIO registers accessed for data-path operations.
+* MMIO registers accessed for control-path operations.
+
+Data-path MMIO registers of each WQ are contained within
+one or more system page size aligned regions and can be mapped in the
+CPU page table for direct access from the guest. Control-path MMIO
+registers of all WQs are located together but segregated from data-path
+MMIO regions. Therefore, guest updates to control-path registers must
+be intercepted and then go through the host driver to be reflected in
+the device.
+
+Data-path MMIO registers of DSA WQ are portals for submitting descriptors
+to the device. There are four portals per WQ, each being 64 bytes
+in size and located on a separate 4KB page in BAR2. Each portal has
+different implications regarding interrupt message type (MSI vs. IMS)
+and occupancy control (limited vs. unlimited). It is not necessary to
+map all portals to the guest.
+
+Control-path MMIO registers of DSA WQ include global configurations
+(shared by all WQs) and WQ-specific configurations. The owner
+(e.g. the guest) of the WQ is expected to only change WQ-specific
+configurations. Intel DSA spec introduces a “Configuration Support”
+capability which, if cleared, indicates that some fields of WQ
+configuration registers are read-only thus pre-configured by the host.
+
+
+Interrupt Message Store (IMS)
+-----------------------------
+The ADI utilizes Interrupt Message Store (IMS), a device-specific MSI
+implementation, instead of MSI-X for guest interrupts. This
+preserves MSI-X for host usage and also allows a significantly larger
+number of interrupt vectors when serving a large number of guests.
+
+Intel DSA device implements IMS as on-device memory mapped unified
+storage. Each interrupt message is stored as a DWORD size data payload
+and a 64-bit address (same as MSI-X). Access to the IMS is through the
+host idxd driver.
+
+
+ADI Isolation
+-------------
+Operations or functioning of one ADI must not affect the functioning
+of another ADI or the physical device. Upstream memory requests from
+different ADIs are distinguished using a Process Address Space Identifier
+(PASID). With the support of PASID-granular address translation in Intel
+VT-d, the address space targeted by a request from ADI can be a Host
+Virtual Address (HVA), Host I/O Virtual Address (HIOVA), Guest Physical
+Address (GPA), Guest Virtual Address (GVA), Guest I/O Virtual Address
+(GIOVA), etc. The PASID identity for an ADI is expected to be accessed
+or modified by privileged software through the host driver.
+
+=========================
+Virtual DSA (vDSA) Device
+=========================
+The DSA WQ itself is not a PCI device and thus must be composed into a
+virtual DSA device presented to the guest.
+
+The composition logic needs to handle four main requirements:
+* Emulate PCI config space.
+* Map data-path portals for direct access from the guest.
+* Emulate control-path MMIO registers and selectively forward WQ
+ configuration requests through host driver to the device.
+* Forward and emulate WQ interrupts to the guest.
+
+The composition logic tells the guest which aspects of WQ are configurable
+through a combination of capability fields, e.g.:
+* Configuration Support (if cleared, most aspects are not modifiable).
+* WQ Mode Support (if cleared, cannot change between dedicated and
+ shared mode).
+* Dedicated Mode Support.
+* Shared Mode Support.
+* ...
+
+The virtual capability fields are set according to the vDSA
+type. Following is an example of vDSA types and related WQ configurability:
+* Type ‘1dwq-v1’
+ * One DSA gen1 dedicated WQ
+ * Guest cannot share the WQ between its clients (no guest SVA)
+ * Guest cannot change any WQ configuration
+
+In addition, the composition logic needs to serve administrative commands
+(through the virtual CMD register) via the host driver, including:
+* Drain/abort all descriptors submitted by this guest.
+* Drain/abort descriptors associated with a PASID.
+* Enable/disable/reset the WQ (when it’s not shared by multiple VMs).
+* Request interrupt handle.
+
+With this design, vDSA emulation is **greatly simplified**. Only limited
+configurability is handled, with most registers emulated in a simple
+read-only flavor.
+
+=======================================
+Mdev Framework Registration and Release
+=======================================
+
+Intel DSA reports support for Intel Scalable IOV via a PCI Express
+Designated Vendor Specific Extended Capability (DVSEC). In addition,
+PASID-granular address translation capability is required in the
+IOMMU. During host initialization, the IDXD driver should check the
+presence of both capabilities before calling mdev_register_device()
+to register with the VFIO mdev framework and provide a set of ops
+(struct vfio_device_ops). The IOMMU capability is indicated by the
+IOMMU_DEV_FEAT_AUX feature flag with iommu_dev_has_feature() and enabled
+with iommu_dev_enable_feature().
+
+On release, iommu_dev_disable_feature() is called after
+mdev_unregister_device() to disable the IOMMU_DEV_FEAT_AUX flag that
+the driver enabled during host initialization.
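+
+A minimal sketch of this flow (the idxd_mdev_* names and the ops variable
+are illustrative stand-ins, not the actual symbols from this series)::
+
+  static int idxd_mdev_host_init(struct idxd_device *idxd)
+  {
+          struct device *dev = &idxd->pdev->dev;
+          int rc;
+
+          /* the SIOV DVSEC check on the PCI device is elided here */
+          if (!iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_AUX))
+                  return -EOPNOTSUPP;
+
+          rc = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_AUX);
+          if (rc < 0)
+                  return rc;
+
+          rc = mdev_register_device(dev, &idxd_mdev_ops);
+          if (rc < 0)
+                  iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
+          return rc;
+  }
+
+  static void idxd_mdev_host_release(struct idxd_device *idxd)
+  {
+          struct device *dev = &idxd->pdev->dev;
+
+          mdev_unregister_device(dev);
+          iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
+  }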
+
+The vfio_device_ops data structure is filled out by the driver to provide
+a number of ops called by VFIO core::
+
+ struct vfio_device_ops {
+ .open
+ .release
+ .read
+ .write
+ .mmap
+ .ioctl
+ };
+
+The mdev driver provides supported type group attributes. It also
+registers the mdev driver with probe and remove calls::
+
+ struct mdev_driver {
+ .probe
+ .remove
+ .supported_type_groups
+ };
+
+
+Supported_type_groups
+---------------------
+At the moment only one vDSA type is supported.
+
+“1dwq-v1”:
+ Single dedicated WQ (DSA 1.0) with read-only configuration exposed to
+ the guest. On the guest kernel, a vDSA device shows up with a single
+ WQ that is pre-configured by the host. The configuration for the WQ
+ is entirely read-only and cannot be reconfigured. There is no support
+ for guest SVA on this WQ.
+
+ PCI MSI-X vectors are surfaced from the mdev device to the guest kernel.
+ In the current implementation 2 vectors are supported. Vector 0 is used for
+ device misc operations (admin command completion, error reporting, etc.) just
+ like on the host. Vector 1 is used for descriptor completion. Vector 0
+ is emulated by the host driver. The second interrupt vector is backed by
+ an IMS vector on the host.
+
+probe
+------
+API function to create the mdev. mdev_set_iommu_device() is called to
+associate the mdev device with the parent PCI device. This function is
+where the driver sets up and initializes the resources to support a single
+mdev device. vfio_init_group_dev() and vfio_register_group_dev() are called
+in order to associate the 'struct vfio_device' with the 'struct device' from
+the mdev and with the vfio_device_ops.
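+
+A hedged sketch of such a probe; 'struct vdcm_idxd' stands in for the
+per-mdev context this series introduces and idxd_vdcm_ops for its
+vfio_device_ops::
+
+  static int idxd_mdev_probe(struct mdev_device *mdev)
+  {
+          struct vdcm_idxd *vidxd;
+          int rc;
+
+          vidxd = kzalloc(sizeof(*vidxd), GFP_KERNEL);
+          if (!vidxd)
+                  return -ENOMEM;
+
+          vfio_init_group_dev(&vidxd->vdev, &mdev->dev, &idxd_vdcm_ops);
+          /* associate the mdev with the parent PCI device for the IOMMU */
+          mdev_set_iommu_device(mdev, mdev_parent_dev(mdev));
+
+          rc = vfio_register_group_dev(&vidxd->vdev);
+          if (rc < 0) {
+                  kfree(vidxd);
+                  return rc;
+          }
+          dev_set_drvdata(&mdev->dev, vidxd);
+          return 0;
+  }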
+
+remove
+------
+API function that mirrors the probe() function and releases all the
+resources backing the mdev. vfio_unregister_group_dev() is called.
+
+open
+----
+API function that is called down from VFIO userspace when it is ready to claim
+and utilize the mdev.
+
+release
+-------
+The mirror function to open; called when VFIO userspace releases the mdev.
+
+read / write
+------------
+This is where the Intel IDXD driver provides read/write emulation of
+the "slow" path of the mdev, including PCI config space and control-path
+MMIO registers. Typically configuration and administrative commands go
+through this path. This allows the mdev to show up as a virtual PCI
+device in the guest kernel.
+
+The emulation of PCI config space is nothing special; it is simply
+copied from kvmgt. In the future this part might be consolidated to
+reduce duplication.
+
+Emulated MMIO reads are simple memory copies. There are no side effects
+to be emulated upon guest reads.
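+
+In sketch form, with vidxd->bar0 being the shadow MMIO image the driver
+keeps for the virtual device::
+
+  static int vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf,
+                             unsigned int size)
+  {
+          /* no side effects; copy straight out of the shadow image */
+          memcpy(buf, vidxd->bar0 + pos, size);
+          return 0;
+  }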
+
+Emulated MMIO writes are required only for a few registers, due to the
+read-only configuration of the ‘1dwq-v1’ type. The majority of the
+composition logic is hooked into the CMD register for performing
+administrative commands such as WQ drain, abort, enable, disable and reset
+operations. The rest of the emulation handles errors (GENCTRL/SWERROR)
+and interrupts (INTCAUSE/MSIXPERM) on the vDSA device. Future mdev types
+might allow limited WQ configurability, which would then require
+additional emulation of the WQCFG register.
+
+mmap
+----
+This is the function that provides the setup to expose a portion of the
+hardware, known as portals, for direct access on the “fast” path
+through the mmap() syscall. A limited region of the hardware
+is mapped to the guest for direct I/O submission.
+
+There are four portals per WQ: unlimited MSI-X, limited MSI-X, unlimited
+IMS, and limited IMS. Descriptors submitted to limited portals are subject
+to the threshold configuration limitations of shared WQs. The MSI-X portals
+are used for host submissions, and the IMS portals are mapped to the VM for
+guest submission. The host driver provides the IMS portal through the mmap
+function to be mapped into user space in order to expose it directly
+to the guest kernel.
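+
+A hedged sketch of the mapping step, following the same remap pattern as
+the host cdev mmap elsewhere in this series::
+
+  static int vidxd_mmap_portal(struct vdcm_idxd *vidxd,
+                               struct vm_area_struct *vma)
+  {
+          phys_addr_t base = pci_resource_start(vidxd->idxd->pdev, IDXD_WQ_BAR);
+          unsigned long pfn;
+
+          /* unlimited IMS portal of the backing wq */
+          pfn = (base + idxd_get_wq_portal_offset(vidxd->wq->id,
+                                                  IDXD_PORTAL_UNLIMITED,
+                                                  IDXD_IRQ_IMS)) >> PAGE_SHIFT;
+          vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+          return remap_pfn_range(vma, vma->vm_start, pfn, PAGE_SIZE,
+                                 vma->vm_page_prot);
+  }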
+
+ioctl
+-----
+This API function does several things:
+* Provides general device information to VFIO userspace.
+* Provides device region information (PCI, MMIO, etc.).
+* Provides interrupt information.
+* Sets up interrupts for the mediated device.
+* Resets the mdev device.
+
+The PCI device presented by VFIO to the guest kernel will show that it
+supports MSI-X vectors. The Intel idxd driver will support two vectors
+per mdev to back those MSI-X vectors. The first vector is emulated by
+the host driver via eventfd in order to support various non-I/O operations,
+just like the actual device. The second vector is backed by IMS. IMS
+provides additional interrupt vectors on the device, outside of the PCI
+MSI-X specification, in order to support significantly more vectors.
+Eventfd is also used by the second vector to notify the guest kernel.
+However, the irq bypass manager is used to inject the interrupt directly
+into the guest. When the guest submits a descriptor through the IMS portal
+directly to the device, an IMS interrupt is triggered on completion and
+routed to the guest as an MSI-X interrupt.
+
+The idxd driver makes use of the generic IMS irq chip and domain which
+stores the interrupt messages in an array in device memory. Allocation and
+freeing of interrupts happens via the generic msi_domain_alloc/free_irqs()
+interface. The driver only needs to ensure the interrupt domain is stored
+in the underlying device struct.
+
+To allocate IMS, we utilize the IMS array APIs. On host init, we need
+to create the MSI domain::
+
+ struct ims_array_info ims_info;
+ struct device *dev = &pci_dev->dev;
+
+ /* assign the device IMS size */
+ ims_info.max_slots = max_ims_size;
+ /* assign the MMIO base address for the IMS table */
+ ims_info.slots = mmio_base + ims_offset;
+ /* assign the MSI domain to the device */
+ dev->msi_domain = pci_ims_array_create_msi_irq_domain(pci_dev, &ims_info);
+
+When we are ready to allocate the interrupts via the mdev IMS common lib code::
+
+ struct device *dev = &mdev->dev;
+
+ irq_domain = dev_get_msi_domain(dev);
+ /* the irqs are allocated against device of mdev */
+ rc = msi_domain_alloc_irqs(irq_domain, dev, num_vecs);
+
+
+ /* retrieve the Linux irq number for the allocated vector */
+ irq = dev_msi_irq_vector(dev, vector);
+
+ request_irq(irq, interrupt_handler_function, 0, "ims", context);
+
+
+The DSA device is structured such that MSI-X table entry 0 is used for
+admin commands completion, error reporting, and other misc commands. The
+remaining MSI-X table entries are used for WQ completion. For VM support,
+the virtual device also presents a similar layout. Therefore, vector 0
+is emulated by the software. Additional vector(s) are associated with IMS.
+
+The index (slot) for the per device IMS entry is managed by the MSI
+core. The index is the “interrupt handle” that the guest kernel
+needs to program into a DMA descriptor. That interrupt handle tells the
+hardware which IMS vector to trigger the interrupt on for the host.
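+
+On the guest side this looks roughly as follows; int_handle and
+IDXD_OP_FLAG_RCI come from the uapi descriptor definition, while 'handle'
+is the value returned by the command described next::
+
+  struct dsa_hw_desc *desc = get_desc();  /* driver-specific allocation */
+
+  desc->flags |= IDXD_OP_FLAG_RCI;   /* request a completion interrupt */
+  desc->int_handle = handle;         /* interrupt handle from the host */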
+
+The virtual device presents an admin command called “request interrupt
+handle” that is not supported by the physical device. On probe of
+the DSA device in the guest kernel, the guest driver will issue the
+“request interrupt handle” command in order to get the interrupt
+handle for descriptor programming. The host driver will return the
+assigned slot for the IMS entry table to the guest.
+
+reset
+-----
+
+Device reset is emulated through the mdev. Since the mdev is backed by a
+wq rather than the whole device, a reset request does not reset the entire
+device. The host driver simulates a reset of the device by
+aborting all the outstanding descriptors on the wq and then disabling
+the wq. All MMIO registers are reset to pre-programmed values.
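+
+Sketched with helpers this series introduces (error handling elided)::
+
+  static void vidxd_emulated_reset(struct vdcm_idxd *vidxd)
+  {
+          struct idxd_wq *wq = vidxd->wq;
+
+          if (wq->state == IDXD_WQ_ENABLED) {
+                  idxd_wq_abort(wq);      /* abort outstanding descriptors */
+                  idxd_wq_disable(wq);    /* then disable the wq */
+          }
+          vidxd_mmio_init(vidxd);         /* registers back to defaults */
+  }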
+
+==========
+References
+==========
+[1] https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
+[2] https://software.intel.com/en-us/articles/intel-sdm
+[3] https://software.intel.com/sites/default/files/managed/cc/0e/intel-scalable-io-virtualization-technical-specification.pdf
+[4] https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
diff --git a/MAINTAINERS b/MAINTAINERS
index 9450e052f1b1..20f91064a4d1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18878,6 +18878,13 @@ F: drivers/vfio/mdev/
F: include/linux/mdev.h
F: samples/vfio-mdev/
+VFIO MEDIATED DEVICE IDXD DRIVER
+M: Dave Jiang <[email protected]>
+L: [email protected]
+S: Maintained
+F: Documentation/driver-api/vfio/mdev-idxd.rst
+F: drivers/vfio/mdev/idxd/
+
VFIO PLATFORM DRIVER
M: Eric Auger <[email protected]>
L: [email protected]
Device portal offsets are 4k apart, laid out in the order: unlimited
MSIX portal, limited MSIX portal, unlimited IMS portal, limited IMS
portal. Add an additional parameter to calculate the IMS portal
offsets.
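As a worked example of the layout (assuming 4KB pages), WQ 1's portals
land at 0x4000 (unlimited MSIX), 0x5000 (limited MSIX), 0x6000 (unlimited
IMS), and 0x7000 (limited IMS), per ((wq_id * 4) << PAGE_SHIFT) +
prot * 0x1000 + irq_type * 0x2000.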
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/cdev.c | 4 ++--
drivers/dma/idxd/device.c | 2 +-
drivers/dma/idxd/idxd.h | 11 +++--------
3 files changed, 6 insertions(+), 11 deletions(-)
diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c
index e3d29244f752..62a53123fd58 100644
--- a/drivers/dma/idxd/cdev.c
+++ b/drivers/dma/idxd/cdev.c
@@ -202,8 +202,8 @@ static int idxd_cdev_mmap(struct file *filp, struct vm_area_struct *vma)
return rc;
vma->vm_flags |= VM_DONTCOPY;
- pfn = (base + idxd_get_wq_portal_full_offset(wq->id,
- IDXD_PORTAL_LIMITED)) >> PAGE_SHIFT;
+ pfn = (base + idxd_get_wq_portal_offset(wq->id, IDXD_PORTAL_LIMITED,
+ IDXD_IRQ_MSIX)) >> PAGE_SHIFT;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_private_data = ctx;
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 3549a73fc7db..02e9a050b5bb 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -300,7 +300,7 @@ int idxd_wq_map_portal(struct idxd_wq *wq)
resource_size_t start;
start = pci_resource_start(pdev, IDXD_WQ_BAR);
- start += idxd_get_wq_portal_full_offset(wq->id, IDXD_PORTAL_LIMITED);
+ start += idxd_get_wq_portal_offset(wq->id, IDXD_PORTAL_LIMITED, IDXD_IRQ_MSIX);
wq->portal = devm_ioremap(dev, start, IDXD_PORTAL_SIZE);
if (!wq->portal)
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 288e3fe15b3e..e5b90e6970aa 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -459,15 +459,10 @@ enum idxd_interrupt_type {
IDXD_IRQ_IMS,
};
-static inline int idxd_get_wq_portal_offset(enum idxd_portal_prot prot)
+static inline int idxd_get_wq_portal_offset(int wq_id, enum idxd_portal_prot prot,
+ enum idxd_interrupt_type irq_type)
{
- return prot * 0x1000;
-}
-
-static inline int idxd_get_wq_portal_full_offset(int wq_id,
- enum idxd_portal_prot prot)
-{
- return ((wq_id * 4) << PAGE_SHIFT) + idxd_get_wq_portal_offset(prot);
+ return ((wq_id * 4) << PAGE_SHIFT) + prot * 0x1000 + irq_type * 0x2000;
}
static inline void idxd_wq_get(struct idxd_wq *wq)
Add support to allow an external driver to be registered with the
dsa_bus_type and also auto-loaded.
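A hypothetical external sub-driver would then hang off dsa_bus_type
roughly like this (the field and callback names are a sketch, not
verbatim from this series):

    static struct idxd_device_driver idxd_mdev_driver = {
            .probe = idxd_mdev_drv_probe,
            .remove = idxd_mdev_drv_remove,
            .name = "idxd_mdev",
    };
    module_idxd_driver(idxd_mdev_driver);
    MODULE_ALIAS_IDXD_DEVICE(0);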
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/idxd.h | 6 ++++++
drivers/dma/idxd/init.c | 2 ++
drivers/dma/idxd/sysfs.c | 6 ++++++
3 files changed, 14 insertions(+)
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 0970d0e67976..22afaf7ee637 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -483,11 +483,17 @@ static inline int idxd_wq_refcount(struct idxd_wq *wq)
return wq->client_count;
};
+#define MODULE_ALIAS_IDXD_DEVICE(type) MODULE_ALIAS("idxd:t" __stringify(type) "*")
+#define IDXD_DEVICES_MODALIAS_FMT "idxd:t%d"
+
int __must_check __idxd_driver_register(struct idxd_device_driver *idxd_drv,
struct module *module, const char *mod_name);
#define idxd_driver_register(driver) \
__idxd_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
+#define module_idxd_driver(driver) \
+ module_driver(driver, idxd_driver_register, idxd_driver_unregister)
+
void idxd_driver_unregister(struct idxd_device_driver *idxd_drv);
int idxd_register_bus_type(void);
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 30d3ab0c4051..bed9169152f9 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -843,8 +843,10 @@ int __idxd_driver_register(struct idxd_device_driver *idxd_drv, struct module *o
return driver_register(drv);
}
+EXPORT_SYMBOL_GPL(__idxd_driver_register);
void idxd_driver_unregister(struct idxd_device_driver *idxd_drv)
{
driver_unregister(&idxd_drv->drv);
}
+EXPORT_SYMBOL_GPL(idxd_driver_unregister);
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index ff2f1c97ed74..4fcb8833a4df 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -53,11 +53,17 @@ static int idxd_config_bus_remove(struct device *dev)
return 0;
}
+static int idxd_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
+{
+ return add_uevent_var(env, "MODALIAS=" IDXD_DEVICES_MODALIAS_FMT, 0);
+}
+
struct bus_type dsa_bus_type = {
.name = "dsa",
.match = idxd_config_bus_match,
.probe = idxd_config_bus_probe,
.remove = idxd_config_bus_remove,
+ .uevent = idxd_bus_uevent,
};
#define DRIVER_ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) \
Add common helper code to set up IMS once the MSI domain has been
set up by the device driver. The main helper function is
mdev_set_msix_trigger(), which is called for the VFIO ioctl
VFIO_DEVICE_SET_IRQS. The function deals with the setup and
teardown of the emulated and IMS-backed eventfds that get exported
to the guest kernel via VFIO as MSI-X vectors.
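A rough usage sketch from a parent driver's point of view, with two guest
MSI-X vectors where vector 0 is emulated and vector 1 is IMS-backed:

    bool ims_map[2] = { false, true };
    int rc;

    rc = mdev_irqs_init(mdev, 2, ims_map);
    if (rc < 0)
            return rc;

    /* later, from the VFIO_DEVICE_SET_IRQS ioctl path */
    rc = mdev_set_msix_trigger(mdev, index, start, count, flags, data);

    /* on teardown */
    mdev_irqs_free(mdev);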
Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/vfio/mdev/Kconfig | 12 ++
drivers/vfio/mdev/Makefile | 3
drivers/vfio/mdev/mdev_irqs.c | 318 +++++++++++++++++++++++++++++++++++++++++
include/linux/mdev.h | 51 +++++++
4 files changed, 384 insertions(+)
create mode 100644 drivers/vfio/mdev/mdev_irqs.c
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index 763c877a1318..82f79d99a7db 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -9,3 +9,15 @@ config VFIO_MDEV
See Documentation/driver-api/vfio-mediated-device.rst for more details.
If you don't know what do here, say N.
+
+config VFIO_MDEV_IRQS
+ bool "Mediated device driver common lib code for interrupts"
+ depends on VFIO_MDEV
+ select IMS_MSI_ARRAY
+ select IRQ_BYPASS_MANAGER
+ default n
+ help
+ Provide common library code to deal with IMS interrupts for mediated
+ devices.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 7c236ba1b90e..c3f160cae192 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,4 +2,7 @@
mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
+mdev-$(CONFIG_VFIO_MDEV_IRQS) += mdev_irqs.o
+
obj-$(CONFIG_VFIO_MDEV) += mdev.o
+
diff --git a/drivers/vfio/mdev/mdev_irqs.c b/drivers/vfio/mdev/mdev_irqs.c
new file mode 100644
index 000000000000..ed2d11a7c729
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_irqs.c
@@ -0,0 +1,318 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Mediate device IMS library code
+ *
+ * Copyright (c) 2021 Intel Corp. All rights reserved.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/irqchip/irq-ims-msi.h>
+#include <linux/eventfd.h>
+#include <linux/irqreturn.h>
+#include <linux/msi.h>
+#include <linux/vfio.h>
+#include <linux/irqbypass.h>
+#include <linux/mdev.h>
+
+static irqreturn_t mdev_irq_handler(int irq, void *arg)
+{
+ struct eventfd_ctx *trigger = arg;
+
+ eventfd_signal(trigger, 1);
+ return IRQ_HANDLED;
+}
+
+/*
+ * Common helper routine to signal the eventfd that has been set up.
+ *
+ * @mdev [in] : mdev device
+ * @vector [in] : vector index for the eventfd
+ *
+ * No return value.
+ */
+void mdev_msix_send_signal(struct mdev_device *mdev, int vector)
+{
+        struct mdev_irq *mdev_irq = &mdev->mdev_irq;
+        struct eventfd_ctx *trigger;
+
+        /* check irq_entries before dereferencing it for the trigger */
+        if (!mdev_irq->irq_entries || !mdev_irq->irq_entries[vector].trigger) {
+                dev_warn(&mdev->dev, "EventFD %d trigger not setup, can't send!\n", vector);
+                return;
+        }
+        trigger = mdev_irq->irq_entries[vector].trigger;
+        mdev_irq_handler(0, (void *)trigger);
+}
+EXPORT_SYMBOL_GPL(mdev_msix_send_signal);
+
+static int mdev_msix_set_vector_signal(struct mdev_irq *mdev_irq, int vector, int fd)
+{
+ int rc, irq;
+ struct mdev_device *mdev = irq_to_mdev(mdev_irq);
+ struct mdev_irq_entry *entry;
+ struct device *dev = &mdev->dev;
+ struct eventfd_ctx *trigger;
+ char *name;
+ bool pasid_en;
+ u32 auxval;
+
+ if (vector < 0 || vector >= mdev_irq->num)
+ return -EINVAL;
+
+ entry = &mdev_irq->irq_entries[vector];
+
+ if (entry->ims)
+ irq = dev_msi_irq_vector(dev, entry->ims_id);
+ else
+ irq = 0;
+
+ pasid_en = mdev_irq->pasid != INVALID_IOASID ? true : false;
+
+ /* IMS and invalid pasid is not a valid configuration */
+ if (entry->ims && !pasid_en)
+ return -EINVAL;
+
+ if (entry->trigger) {
+ if (irq) {
+ irq_bypass_unregister_producer(&entry->producer);
+ free_irq(irq, entry->trigger);
+ if (pasid_en) {
+ auxval = ims_ctrl_pasid_aux(0, false);
+ irq_set_auxdata(irq, IMS_AUXDATA_CONTROL_WORD, auxval);
+ }
+ }
+ kfree(entry->name);
+ eventfd_ctx_put(entry->trigger);
+ entry->trigger = NULL;
+ }
+
+ if (fd < 0)
+ return 0;
+
+ name = kasprintf(GFP_KERNEL, "vfio-mdev-irq[%d](%s)", vector, dev_name(dev));
+ if (!name)
+ return -ENOMEM;
+
+ trigger = eventfd_ctx_fdget(fd);
+ if (IS_ERR(trigger)) {
+ kfree(name);
+ return PTR_ERR(trigger);
+ }
+
+ entry->name = name;
+ entry->trigger = trigger;
+
+ if (!irq)
+ return 0;
+
+ if (pasid_en) {
+ auxval = ims_ctrl_pasid_aux(mdev_irq->pasid, true);
+ rc = irq_set_auxdata(irq, IMS_AUXDATA_CONTROL_WORD, auxval);
+ if (rc < 0)
+ goto err;
+ }
+
+ rc = request_irq(irq, mdev_irq_handler, 0, name, trigger);
+ if (rc < 0)
+ goto irq_err;
+
+ entry->producer.token = trigger;
+ entry->producer.irq = irq;
+ rc = irq_bypass_register_producer(&entry->producer);
+ if (unlikely(rc)) {
+                dev_warn(dev, "irq bypass producer (token %p) registration fails: %d\n",
+                         entry->producer.token, rc);
+ entry->producer.token = NULL;
+ }
+
+ return 0;
+
+ irq_err:
+ if (pasid_en) {
+ auxval = ims_ctrl_pasid_aux(0, false);
+ irq_set_auxdata(irq, IMS_AUXDATA_CONTROL_WORD, auxval);
+ }
+ err:
+ kfree(name);
+ eventfd_ctx_put(trigger);
+ entry->trigger = NULL;
+ return rc;
+}
+
+static int mdev_msix_set_vector_signals(struct mdev_irq *mdev_irq, unsigned int start,
+ unsigned int count, int *fds)
+{
+ int i, j, rc = 0;
+
+ if (start >= mdev_irq->num || start + count > mdev_irq->num)
+ return -EINVAL;
+
+        for (i = 0, j = start; i < count && !rc; i++, j++) {
+ int fd = fds ? fds[i] : -1;
+
+ rc = mdev_msix_set_vector_signal(mdev_irq, j, fd);
+ }
+
+ if (rc) {
+ for (--j; j >= (int)start; j--)
+ mdev_msix_set_vector_signal(mdev_irq, j, -1);
+ }
+
+ return rc;
+}
+
+static int mdev_msix_enable(struct mdev_irq *mdev_irq, int nvec)
+{
+ struct mdev_device *mdev = irq_to_mdev(mdev_irq);
+ struct device *dev;
+ int rc;
+
+ if (nvec != mdev_irq->num)
+ return -EINVAL;
+
+ if (mdev_irq->ims_num) {
+ dev = &mdev->dev;
+ rc = msi_domain_alloc_irqs(dev_get_msi_domain(dev), dev, mdev_irq->ims_num);
+ if (rc < 0)
+ return rc;
+ }
+
+ mdev_irq->irq_type = VFIO_PCI_MSIX_IRQ_INDEX;
+ return 0;
+}
+
+static int mdev_msix_disable(struct mdev_irq *mdev_irq)
+{
+ struct mdev_device *mdev = irq_to_mdev(mdev_irq);
+ struct device *dev = &mdev->dev;
+ struct irq_domain *irq_domain;
+
+ mdev_msix_set_vector_signals(mdev_irq, 0, mdev_irq->num, NULL);
+ irq_domain = dev_get_msi_domain(&mdev->dev);
+ if (irq_domain)
+ msi_domain_free_irqs(irq_domain, dev);
+ mdev_irq->irq_type = VFIO_PCI_NUM_IRQS;
+ return 0;
+}
+
+/*
+ * Common helper function that sets up the MSIX vectors for the mdev device that are
+ * Interrupt Message Store (IMS) backed. Certain mdev devices can have the first
+ * vector emulated rather than backed by IMS.
+ *
+ * @mdev [in] : mdev device
+ * @index [in] : type of VFIO vectors to setup
+ * @start [in] : start position of the vector index
+ * @count [in] : number of vectors
+ * @flags [in] : VFIO_IRQ action to be taken
+ * @data [in] : data accompanied for the call
+ * Return error code on failure or 0 on success.
+ */
+
+int mdev_set_msix_trigger(struct mdev_device *mdev, unsigned int index,
+ unsigned int start, unsigned int count, u32 flags,
+ void *data)
+{
+ struct mdev_irq *mdev_irq = &mdev->mdev_irq;
+ int i, rc = 0;
+
+ if (count > mdev_irq->num)
+ count = mdev_irq->num;
+
+ if (!count && (flags & VFIO_IRQ_SET_DATA_NONE)) {
+ mdev_msix_disable(mdev_irq);
+ return 0;
+ }
+
+ if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+ int *fds = data;
+
+ if (mdev_irq->irq_type == index)
+ return mdev_msix_set_vector_signals(mdev_irq, start, count, fds);
+
+ rc = mdev_msix_enable(mdev_irq, start + count);
+ if (rc < 0)
+ return rc;
+
+ rc = mdev_msix_set_vector_signals(mdev_irq, start, count, fds);
+ if (rc < 0)
+ mdev_msix_disable(mdev_irq);
+
+ return rc;
+ }
+
+ if (start + count > mdev_irq->num)
+ return -EINVAL;
+
+ for (i = start; i < start + count; i++) {
+ if (!mdev_irq->irq_entries[i].trigger)
+ continue;
+ if (flags & VFIO_IRQ_SET_DATA_NONE) {
+ eventfd_signal(mdev_irq->irq_entries[i].trigger, 1);
+ } else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+ u8 *bools = data;
+
+ if (bools[i - start])
+ eventfd_signal(mdev_irq->irq_entries[i].trigger, 1);
+ }
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mdev_set_msix_trigger);
+
+void mdev_irqs_set_pasid(struct mdev_device *mdev, u32 pasid)
+{
+ mdev->mdev_irq.pasid = pasid;
+}
+EXPORT_SYMBOL_GPL(mdev_irqs_set_pasid);
+
+/*
+ * Initialize and setup the mdev_irq context under mdev.
+ *
+ * @mdev [in] : mdev device
+ * @num [in] : number of vectors
+ * @ims_map [in] : bool array that indicates whether a guest MSIX vector is
+ * backed by an IMS vector or emulated
+ * Return error code on failure or 0 on success.
+ */
+int mdev_irqs_init(struct mdev_device *mdev, int num, bool *ims_map)
+{
+ struct mdev_irq *mdev_irq = &mdev->mdev_irq;
+ int i;
+
+ if (num < 1)
+ return -EINVAL;
+
+ mdev_irq->irq_type = VFIO_PCI_NUM_IRQS;
+ mdev_irq->num = num;
+ mdev_irq->pasid = INVALID_IOASID;
+
+ mdev_irq->irq_entries = kcalloc(num, sizeof(*mdev_irq->irq_entries), GFP_KERNEL);
+ if (!mdev_irq->irq_entries)
+ return -ENOMEM;
+
+ for (i = 0; i < num; i++) {
+ mdev_irq->irq_entries[i].ims = ims_map[i];
+ if (ims_map[i]) {
+ mdev_irq->irq_entries[i].ims_id = mdev_irq->ims_num;
+ mdev_irq->ims_num++;
+ }
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mdev_irqs_init);
+
+/*
+ * Free allocated memory in mdev_irq
+ *
+ * @mdev [in] : mdev device
+ */
+void mdev_irqs_free(struct mdev_device *mdev)
+{
+ kfree(mdev->mdev_irq.irq_entries);
+ memset(&mdev->mdev_irq, 0, sizeof(mdev->mdev_irq));
+}
+EXPORT_SYMBOL_GPL(mdev_irqs_free);
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index 0cd8db2d3422..035c021e8068 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -10,8 +10,26 @@
#ifndef MDEV_H
#define MDEV_H
+#include <linux/irqbypass.h>
+
struct mdev_type;
+struct mdev_irq_entry {
+ struct eventfd_ctx *trigger;
+ struct irq_bypass_producer producer;
+ char *name;
+ bool ims;
+ int ims_id;
+};
+
+struct mdev_irq {
+ struct mdev_irq_entry *irq_entries;
+ int num;
+ int ims_num;
+ int irq_type;
+ int pasid;
+};
+
struct mdev_device {
struct device dev;
guid_t uuid;
@@ -19,8 +37,14 @@ struct mdev_device {
struct mdev_type *type;
struct device *iommu_device;
struct mutex creation_lock;
+ struct mdev_irq mdev_irq;
};
+static inline struct mdev_device *irq_to_mdev(struct mdev_irq *mdev_irq)
+{
+ return container_of(mdev_irq, struct mdev_device, mdev_irq);
+}
+
static inline struct mdev_device *to_mdev_device(struct device *dev)
{
return container_of(dev, struct mdev_device, dev);
@@ -99,4 +123,31 @@ static inline struct mdev_device *mdev_from_dev(struct device *dev)
return dev->bus == &mdev_bus_type ? to_mdev_device(dev) : NULL;
}
+#if IS_ENABLED(CONFIG_VFIO_MDEV_IRQS)
+int mdev_set_msix_trigger(struct mdev_device *mdev, unsigned int index,
+ unsigned int start, unsigned int count, u32 flags,
+ void *data);
+void mdev_msix_send_signal(struct mdev_device *mdev, int vector);
+int mdev_irqs_init(struct mdev_device *mdev, int num, bool *ims_map);
+void mdev_irqs_free(struct mdev_device *mdev);
+void mdev_irqs_set_pasid(struct mdev_device *mdev, u32 pasid);
+#else
+static inline int mdev_set_msix_trigger(struct mdev_device *mdev, unsigned int index,
+ unsigned int start, unsigned int count, u32 flags,
+ void *data)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void mdev_msix_send_signal(struct mdev_device *mdev, int vector) {}
+
+static inline int mdev_irqs_init(struct mdev_device *mdev, int num, bool *ims_map)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void mdev_irqs_free(struct mdev_device *mdev) {}
+static inline void mdev_irqs_set_pasid(struct mdev_device *mdev, u32 pasid) {}
+#endif /* CONFIG_VFIO_MDEV_IRQS */
+
#endif /* MDEV_H */
Add support functions to initialize the vdcm context, the PCI config
space region, and the MMIO region. These regions support the
emulation paths for the mdev.
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/registers.h | 3 +
drivers/vfio/mdev/idxd/mdev.h | 4 +
drivers/vfio/mdev/idxd/vdev.c | 214 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 220 insertions(+), 1 deletion(-)
diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index c2d558e37baf..8ac2be4e174b 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -88,6 +88,9 @@ struct opcap {
u64 bits[4];
};
+#define OPCAP_OFS(op) ((op) - (0x40 * ((op) >> 6)))
+#define OPCAP_BIT(op) BIT_ULL(OPCAP_OFS(op))
+
#define IDXD_OPCAP_OFFSET 0x40
#define IDXD_TABLE_OFFSET 0x60
diff --git a/drivers/vfio/mdev/idxd/mdev.h b/drivers/vfio/mdev/idxd/mdev.h
index e52b50760ee7..91cb2662abd6 100644
--- a/drivers/vfio/mdev/idxd/mdev.h
+++ b/drivers/vfio/mdev/idxd/mdev.h
@@ -16,6 +16,7 @@
#define VIDXD_MSIX_TBL_SZ 0x90
#define VIDXD_MSIX_PERM_TBL_SZ 0x48
+#define VIDXD_VERSION_OFFSET 0
#define VIDXD_MSIX_PERM_OFFSET 0x300
#define VIDXD_GRPCFG_OFFSET 0x400
#define VIDXD_WQCFG_OFFSET 0x500
@@ -74,8 +75,9 @@ static inline u8 vidxd_state(struct vdcm_idxd *vidxd)
int idxd_mdev_get_pasid(struct mdev_device *mdev, struct vfio_device *vdev, u32 *pasid);
+void vidxd_init(struct vdcm_idxd *vidxd);
void vidxd_reset(struct vdcm_idxd *vidxd);
-
+void vidxd_mmio_init(struct vdcm_idxd *vidxd);
int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count);
int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int size);
#endif
diff --git a/drivers/vfio/mdev/idxd/vdev.c b/drivers/vfio/mdev/idxd/vdev.c
index 4ead50947047..78cc2377e637 100644
--- a/drivers/vfio/mdev/idxd/vdev.c
+++ b/drivers/vfio/mdev/idxd/vdev.c
@@ -21,6 +21,62 @@
#include "idxd.h"
#include "mdev.h"
+static u64 idxd_pci_config[] = {
+ 0x0010000000008086ULL,
+ 0x0080000008800000ULL,
+ 0x000000000000000cULL,
+ 0x000000000000000cULL,
+ 0x0000000000000000ULL,
+ 0x2010808600000000ULL,
+ 0x0000004000000000ULL,
+ 0x000000ff00000000ULL,
+ 0x0000060000015011ULL, /* MSI-X capability, hardcoded 2 entries, Encoded as N-1 */
+ 0x0000070000000000ULL,
+ 0x0000000000920010ULL, /* PCIe capability */
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+};
+
+static void vidxd_reset_config(struct vdcm_idxd *vidxd)
+{
+ u16 *devid = (u16 *)(vidxd->cfg + PCI_DEVICE_ID);
+ struct idxd_device *idxd = vidxd->idxd;
+
+ memset(vidxd->cfg, 0, VIDXD_MAX_CFG_SPACE_SZ);
+ memcpy(vidxd->cfg, idxd_pci_config, sizeof(idxd_pci_config));
+
+ if (idxd->data->type == IDXD_TYPE_DSA)
+ *devid = PCI_DEVICE_ID_INTEL_DSA_SPR0;
+ else if (idxd->data->type == IDXD_TYPE_IAX)
+ *devid = PCI_DEVICE_ID_INTEL_IAX_SPR0;
+}
+
+static inline void vidxd_reset_mmio(struct vdcm_idxd *vidxd)
+{
+ memset(&vidxd->bar0, 0, VIDXD_MAX_MMIO_SPACE_SZ);
+}
+
+void vidxd_init(struct vdcm_idxd *vidxd)
+{
+ struct idxd_wq *wq = vidxd->wq;
+
+ vidxd_reset_config(vidxd);
+ vidxd_reset_mmio(vidxd);
+
+ vidxd->bar_size[0] = VIDXD_BAR0_SIZE;
+ vidxd->bar_size[1] = VIDXD_BAR2_SIZE;
+
+ vidxd_mmio_init(vidxd);
+
+ if (wq_dedicated(wq) && wq->state == IDXD_WQ_ENABLED)
+ idxd_wq_disable(wq);
+}
+
void vidxd_send_interrupt(struct vdcm_idxd *vidxd, int vector)
{
struct mdev_device *mdev = vidxd->mdev;
@@ -252,6 +308,163 @@ int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsign
return 0;
}
+static void vidxd_mmio_init_grpcap(struct vdcm_idxd *vidxd)
+{
+ u8 *bar0 = vidxd->bar0;
+ union group_cap_reg *grp_cap = (union group_cap_reg *)(bar0 + IDXD_GRPCAP_OFFSET);
+
+ /* single group for current implementation */
+ grp_cap->num_groups = 1;
+}
+
+static void vidxd_mmio_init_grpcfg(struct vdcm_idxd *vidxd)
+{
+ u8 *bar0 = vidxd->bar0;
+ struct grpcfg *grpcfg = (struct grpcfg *)(bar0 + VIDXD_GRPCFG_OFFSET);
+ struct idxd_wq *wq = vidxd->wq;
+ struct idxd_group *group = wq->group;
+ int i;
+
+ /*
+ * At this point, we are only exporting a single workqueue for
+ * each mdev.
+ */
+ grpcfg->wqs[0] = BIT(0);
+ for (i = 0; i < group->num_engines; i++)
+ grpcfg->engines |= BIT(i);
+ grpcfg->flags.bits = group->grpcfg.flags.bits;
+}
+
+static void vidxd_mmio_init_wqcap(struct vdcm_idxd *vidxd)
+{
+ u8 *bar0 = vidxd->bar0;
+ struct idxd_wq *wq = vidxd->wq;
+ union wq_cap_reg *wq_cap = (union wq_cap_reg *)(bar0 + IDXD_WQCAP_OFFSET);
+
+ wq_cap->total_wq_size = wq->size;
+ wq_cap->num_wqs = 1;
+ wq_cap->dedicated_mode = 1;
+}
+
+static void vidxd_mmio_init_wqcfg(struct vdcm_idxd *vidxd)
+{
+ struct idxd_device *idxd = vidxd->idxd;
+ struct idxd_wq *wq = vidxd->wq;
+ u8 *bar0 = vidxd->bar0;
+ union wqcfg *wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+
+ wqcfg->wq_size = wq->size;
+ wqcfg->wq_thresh = wq->threshold;
+ wqcfg->mode = WQCFG_MODE_DEDICATED;
+ wqcfg->priority = wq->priority;
+ wqcfg->max_xfer_shift = idxd->hw.gen_cap.max_xfer_shift;
+ wqcfg->max_batch_shift = idxd->hw.gen_cap.max_batch_shift;
+}
+
+static void vidxd_mmio_init_engcap(struct vdcm_idxd *vidxd)
+{
+ u8 *bar0 = vidxd->bar0;
+ union engine_cap_reg *engcap = (union engine_cap_reg *)(bar0 + IDXD_ENGCAP_OFFSET);
+ struct idxd_wq *wq = vidxd->wq;
+ struct idxd_group *group = wq->group;
+
+ engcap->num_engines = group->num_engines;
+}
+
+static void vidxd_mmio_init_gencap(struct vdcm_idxd *vidxd)
+{
+ struct idxd_device *idxd = vidxd->idxd;
+ u8 *bar0 = vidxd->bar0;
+ union gen_cap_reg *gencap = (union gen_cap_reg *)(bar0 + IDXD_GENCAP_OFFSET);
+
+ gencap->overlap_copy = idxd->hw.gen_cap.overlap_copy;
+ gencap->cache_control_mem = idxd->hw.gen_cap.cache_control_mem;
+ gencap->cache_control_cache = idxd->hw.gen_cap.cache_control_cache;
+ gencap->cmd_cap = 1;
+ gencap->dest_readback = idxd->hw.gen_cap.dest_readback;
+ gencap->drain_readback = idxd->hw.gen_cap.drain_readback;
+ gencap->max_xfer_shift = idxd->hw.gen_cap.max_xfer_shift;
+ gencap->max_batch_shift = idxd->hw.gen_cap.max_batch_shift;
+ gencap->max_descs_per_engine = idxd->hw.gen_cap.max_descs_per_engine;
+}
+
+static void vidxd_mmio_init_cmdcap(struct vdcm_idxd *vidxd)
+{
+ u8 *bar0 = vidxd->bar0;
+ u32 *cmdcap = (u32 *)(bar0 + IDXD_CMDCAP_OFFSET);
+
+ *cmdcap |= BIT(IDXD_CMD_ENABLE_DEVICE) | BIT(IDXD_CMD_DISABLE_DEVICE) |
+ BIT(IDXD_CMD_DRAIN_ALL) | BIT(IDXD_CMD_ABORT_ALL) |
+ BIT(IDXD_CMD_RESET_DEVICE) | BIT(IDXD_CMD_ENABLE_WQ) |
+ BIT(IDXD_CMD_DISABLE_WQ) | BIT(IDXD_CMD_DRAIN_WQ) |
+ BIT(IDXD_CMD_ABORT_WQ) | BIT(IDXD_CMD_RESET_WQ) |
+ BIT(IDXD_CMD_DRAIN_PASID) | BIT(IDXD_CMD_ABORT_PASID) |
+ BIT(IDXD_CMD_REQUEST_INT_HANDLE) | BIT(IDXD_CMD_RELEASE_INT_HANDLE);
+}
+
+static void vidxd_mmio_init_opcap(struct vdcm_idxd *vidxd)
+{
+ struct idxd_device *idxd = vidxd->idxd;
+ u64 opcode;
+ u8 *bar0 = vidxd->bar0;
+ u64 *opcap = (u64 *)(bar0 + IDXD_OPCAP_OFFSET);
+
+ if (idxd->data->type == IDXD_TYPE_DSA) {
+ opcode = BIT_ULL(DSA_OPCODE_NOOP) | BIT_ULL(DSA_OPCODE_BATCH) |
+ BIT_ULL(DSA_OPCODE_DRAIN) | BIT_ULL(DSA_OPCODE_MEMMOVE) |
+ BIT_ULL(DSA_OPCODE_MEMFILL) | BIT_ULL(DSA_OPCODE_COMPARE) |
+ BIT_ULL(DSA_OPCODE_COMPVAL) | BIT_ULL(DSA_OPCODE_CR_DELTA) |
+ BIT_ULL(DSA_OPCODE_AP_DELTA) | BIT_ULL(DSA_OPCODE_DUALCAST) |
+ BIT_ULL(DSA_OPCODE_CRCGEN) | BIT_ULL(DSA_OPCODE_COPY_CRC) |
+ BIT_ULL(DSA_OPCODE_DIF_CHECK) | BIT_ULL(DSA_OPCODE_DIF_INS) |
+ BIT_ULL(DSA_OPCODE_DIF_STRP) | BIT_ULL(DSA_OPCODE_DIF_UPDT) |
+ BIT_ULL(DSA_OPCODE_CFLUSH);
+ *opcap = opcode;
+ } else if (idxd->data->type == IDXD_TYPE_IAX) {
+ opcode = BIT_ULL(IAX_OPCODE_NOOP) | BIT_ULL(IAX_OPCODE_DRAIN) |
+ BIT_ULL(IAX_OPCODE_MEMMOVE);
+ *opcap = opcode;
+ opcap++;
+ opcode = OPCAP_BIT(IAX_OPCODE_DECOMPRESS) |
+ OPCAP_BIT(IAX_OPCODE_COMPRESS);
+ *opcap = opcode;
+ }
+}
+
+static void vidxd_mmio_init_version(struct vdcm_idxd *vidxd)
+{
+ struct idxd_device *idxd = vidxd->idxd;
+ u32 *version;
+
+ version = (u32 *)(vidxd->bar0 + VIDXD_VERSION_OFFSET);
+ *version = idxd->hw.version;
+}
+
+void vidxd_mmio_init(struct vdcm_idxd *vidxd)
+{
+ u8 *bar0 = vidxd->bar0;
+ union offsets_reg *offsets;
+
+ memset(vidxd->bar0, 0, VIDXD_BAR0_SIZE);
+
+ vidxd_mmio_init_version(vidxd);
+ vidxd_mmio_init_gencap(vidxd);
+ vidxd_mmio_init_wqcap(vidxd);
+ vidxd_mmio_init_grpcap(vidxd);
+ vidxd_mmio_init_engcap(vidxd);
+ vidxd_mmio_init_opcap(vidxd);
+
+ offsets = (union offsets_reg *)(bar0 + IDXD_TABLE_OFFSET);
+ offsets->grpcfg = VIDXD_GRPCFG_OFFSET / 0x100;
+ offsets->wqcfg = VIDXD_WQCFG_OFFSET / 0x100;
+ offsets->msix_perm = VIDXD_MSIX_PERM_OFFSET / 0x100;
+
+ vidxd_mmio_init_cmdcap(vidxd);
+ memset(bar0 + VIDXD_MSIX_PERM_OFFSET, 0, VIDXD_MSIX_PERM_TBL_SZ);
+ vidxd_mmio_init_grpcfg(vidxd);
+ vidxd_mmio_init_wqcfg(vidxd);
+}
+
static void idxd_complete_command(struct vdcm_idxd *vidxd, enum idxd_cmdsts_err val)
{
u8 *bar0 = vidxd->bar0;
@@ -396,6 +609,7 @@ void vidxd_reset(struct vdcm_idxd *vidxd)
}
}
+ vidxd_mmio_init(vidxd);
vwqcfg->wq_state = IDXD_WQ_DISABLED;
gensts->state = IDXD_DEVICE_STATE_DISABLED;
idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
Administrative commands are issued to the command register on the
accelerator device. For the mediated device, the MMIO path is emulated.
Add the command emulation support functions for the mdev.
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/Makefile | 2
drivers/dma/idxd/device.c | 81 ++++++++
drivers/dma/idxd/idxd.h | 7 +
drivers/dma/idxd/registers.h | 12 +
drivers/vfio/mdev/idxd/Makefile | 2
drivers/vfio/mdev/idxd/mdev.c | 44 ++++
drivers/vfio/mdev/idxd/mdev.h | 5
drivers/vfio/mdev/idxd/vdev.c | 406 +++++++++++++++++++++++++++++++++++++++
8 files changed, 555 insertions(+), 4 deletions(-)
create mode 100644 drivers/vfio/mdev/idxd/mdev.c
diff --git a/drivers/dma/idxd/Makefile b/drivers/dma/idxd/Makefile
index 6d11558756f8..4d5352b1b5ce 100644
--- a/drivers/dma/idxd/Makefile
+++ b/drivers/dma/idxd/Makefile
@@ -1,3 +1,5 @@
+ccflags-y += -DDEFAULT_SYMBOL_NAMESPACE=IDXD
+
obj-$(CONFIG_INTEL_IDXD) += idxd.o
idxd-y := init.o irq.o device.o sysfs.o submit.o dma.o cdev.o
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 02e9a050b5bb..99542c9cbc47 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -233,6 +233,7 @@ int idxd_wq_enable(struct idxd_wq *wq)
dev_dbg(dev, "WQ %d enabled\n", wq->id);
return 0;
}
+EXPORT_SYMBOL_GPL(idxd_wq_enable);
int idxd_wq_disable(struct idxd_wq *wq)
{
@@ -259,6 +260,7 @@ int idxd_wq_disable(struct idxd_wq *wq)
dev_dbg(dev, "WQ %d disabled\n", wq->id);
return 0;
}
+EXPORT_SYMBOL_GPL(idxd_wq_disable);
void idxd_wq_drain(struct idxd_wq *wq)
{
@@ -329,7 +331,31 @@ void idxd_wqs_unmap_portal(struct idxd_device *idxd)
}
}
-int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid)
+int idxd_wq_abort(struct idxd_wq *wq)
+{
+ struct idxd_device *idxd = wq->idxd;
+ struct device *dev = &idxd->pdev->dev;
+ u32 operand, stat;
+
+ if (wq->state != IDXD_WQ_ENABLED) {
+ dev_dbg(dev, "WQ %d not active\n", wq->id);
+ return -ENXIO;
+ }
+
+ operand = BIT(wq->id % 16) | ((wq->id / 16) << 16);
+ dev_dbg(dev, "cmd: %u (wq abort) operand: %#x\n", IDXD_CMD_ABORT_WQ, operand);
+ idxd_cmd_exec(idxd, IDXD_CMD_ABORT_WQ, operand, &stat);
+
+ if (stat != IDXD_CMDSTS_SUCCESS) {
+ dev_dbg(dev, "WQ abort failed: %#x\n", stat);
+ return -ENXIO;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(idxd_wq_abort);
+
+int idxd_wq_set_pasid(struct idxd_wq *wq, u32 pasid)
{
struct idxd_device *idxd = wq->idxd;
int rc;
@@ -425,6 +451,48 @@ void idxd_wq_quiesce(struct idxd_wq *wq)
percpu_ref_exit(&wq->wq_active);
}
+void idxd_wq_setup_pasid(struct idxd_wq *wq, int pasid)
+{
+ struct idxd_device *idxd = wq->idxd;
+ int offset;
+
+ lockdep_assert_held(&idxd->dev_lock);
+
+ /* PASID fields are 8 bytes into the WQCFG register */
+ offset = WQCFG_OFFSET(idxd, wq->id, WQCFG_PASID_IDX);
+ wq->wqcfg->pasid_en = 1;
+ wq->wqcfg->pasid = pasid;
+ iowrite32(wq->wqcfg->bits[WQCFG_PASID_IDX], idxd->reg_base + offset);
+}
+EXPORT_SYMBOL_GPL(idxd_wq_setup_pasid);
+
+void idxd_wq_clear_pasid(struct idxd_wq *wq)
+{
+ struct idxd_device *idxd = wq->idxd;
+ int offset;
+
+ lockdep_assert_held(&idxd->dev_lock);
+ offset = WQCFG_OFFSET(idxd, wq->id, WQCFG_PASID_IDX);
+ wq->wqcfg->pasid = 0;
+ wq->wqcfg->pasid_en = 0;
+ iowrite32(wq->wqcfg->bits[WQCFG_PASID_IDX], idxd->reg_base + offset);
+}
+EXPORT_SYMBOL_GPL(idxd_wq_clear_pasid);
+
+void idxd_wq_setup_priv(struct idxd_wq *wq, int priv)
+{
+ struct idxd_device *idxd = wq->idxd;
+ int offset;
+
+ lockdep_assert_held(&idxd->dev_lock);
+
+ /* priv field is 8 bytes into the WQCFG register */
+ offset = WQCFG_OFFSET(idxd, wq->id, WQCFG_PRIV_IDX);
+ wq->wqcfg->priv = !!priv;
+ iowrite32(wq->wqcfg->bits[WQCFG_PRIV_IDX], idxd->reg_base + offset);
+}
+EXPORT_SYMBOL_GPL(idxd_wq_setup_priv);
+
/* Device control bits */
static inline bool idxd_is_enabled(struct idxd_device *idxd)
{
@@ -613,6 +681,17 @@ void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid)
dev_dbg(dev, "pasid %d drained\n", pasid);
}
+void idxd_device_abort_pasid(struct idxd_device *idxd, int pasid)
+{
+ struct device *dev = &idxd->pdev->dev;
+ u32 operand;
+
+ operand = pasid;
+ dev_dbg(dev, "cmd: %u operand: %#x\n", IDXD_CMD_ABORT_PASID, operand);
+ idxd_cmd_exec(idxd, IDXD_CMD_ABORT_PASID, operand, NULL);
+ dev_dbg(dev, "pasid %d aborted\n", pasid);
+}
+
int idxd_device_request_int_handle(struct idxd_device *idxd, int idx, int *handle,
enum idxd_interrupt_type irq_type)
{
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index e5b90e6970aa..34ffa6dad53a 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -529,6 +529,7 @@ void idxd_device_cleanup(struct idxd_device *idxd);
int idxd_device_config(struct idxd_device *idxd);
void idxd_device_wqs_clear_state(struct idxd_device *idxd);
void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid);
+void idxd_device_abort_pasid(struct idxd_device *idxd, int pasid);
int idxd_device_load_config(struct idxd_device *idxd);
int idxd_device_request_int_handle(struct idxd_device *idxd, int idx, int *handle,
enum idxd_interrupt_type irq_type);
@@ -546,10 +547,14 @@ void idxd_wq_reset(struct idxd_wq *wq);
int idxd_wq_map_portal(struct idxd_wq *wq);
void idxd_wq_unmap_portal(struct idxd_wq *wq);
void idxd_wq_disable_cleanup(struct idxd_wq *wq);
-int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid);
+int idxd_wq_set_pasid(struct idxd_wq *wq, u32 pasid);
int idxd_wq_disable_pasid(struct idxd_wq *wq);
void idxd_wq_quiesce(struct idxd_wq *wq);
int idxd_wq_init_percpu_ref(struct idxd_wq *wq);
+int idxd_wq_abort(struct idxd_wq *wq);
+void idxd_wq_setup_pasid(struct idxd_wq *wq, int pasid);
+void idxd_wq_clear_pasid(struct idxd_wq *wq);
+void idxd_wq_setup_priv(struct idxd_wq *wq, int priv);
/* submission */
int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc);
diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index c699c72bd8d2..c2d558e37baf 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -167,6 +167,7 @@ union idxd_command_reg {
};
u32 bits;
} __packed;
+#define IDXD_CMD_INT_MASK 0x80000000
enum idxd_cmd {
IDXD_CMD_ENABLE_DEVICE = 1,
@@ -233,6 +234,7 @@ enum idxd_cmdsts_err {
/* request interrupt handle */
IDXD_CMDSTS_ERR_INVAL_INT_IDX = 0x41,
IDXD_CMDSTS_ERR_NO_HANDLE,
+ IDXD_CMDSTS_ERR_INVAL_INT_IDX_RELEASE,
};
#define IDXD_CMDCAP_OFFSET 0xb0
@@ -352,8 +354,16 @@ union wqcfg {
u32 bits[8];
} __packed;
-#define WQCFG_PASID_IDX 2
+enum idxd_wq_hw_state {
+ IDXD_WQ_DEV_DISABLED = 0,
+ IDXD_WQ_DEV_ENABLED,
+ IDXD_WQ_DEV_BUSY,
+};
+#define WQCFG_PASID_IDX 2
+#define WQCFG_PRIV_IDX 2
+#define WQCFG_MODE_DEDICATED 1
+#define WQCFG_MODE_SHARED 0
/*
* This macro calculates the offset into the WQCFG register
* idxd - struct idxd *
diff --git a/drivers/vfio/mdev/idxd/Makefile b/drivers/vfio/mdev/idxd/Makefile
index ccd3bc1c7ab6..27a08621d120 100644
--- a/drivers/vfio/mdev/idxd/Makefile
+++ b/drivers/vfio/mdev/idxd/Makefile
@@ -1,4 +1,4 @@
ccflags-y += -I$(srctree)/drivers/dma/idxd -DDEFAULT_SYMBOL_NAMESPACE=IDXD
obj-$(CONFIG_VFIO_MDEV_IDXD) += idxd_mdev.o
-idxd_mdev-y := vdev.o
+idxd_mdev-y := mdev.o vdev.o
diff --git a/drivers/vfio/mdev/idxd/mdev.c b/drivers/vfio/mdev/idxd/mdev.c
new file mode 100644
index 000000000000..90ff7cedb8b4
--- /dev/null
+++ b/drivers/vfio/mdev/idxd/mdev.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2021 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/device.h>
+#include <linux/sched/task.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/mm.h>
+#include <linux/mmu_context.h>
+#include <linux/vfio.h>
+#include <linux/mdev.h>
+#include <linux/msi.h>
+#include <linux/intel-iommu.h>
+#include <linux/intel-svm.h>
+#include <linux/kvm_host.h>
+#include <linux/eventfd.h>
+#include <uapi/linux/idxd.h>
+#include "registers.h"
+#include "idxd.h"
+#include "mdev.h"
+
+int idxd_mdev_get_pasid(struct mdev_device *mdev, struct vfio_device *vdev, u32 *pasid)
+{
+ struct vfio_group *vfio_group = vdev->group;
+ struct iommu_domain *iommu_domain;
+ struct device *iommu_device = mdev_get_iommu_device(mdev);
+ int rc;
+
+ iommu_domain = vfio_group_iommu_domain(vfio_group);
+ if (IS_ERR_OR_NULL(iommu_domain))
+ return -ENODEV;
+
+ rc = iommu_aux_get_pasid(iommu_domain, iommu_device);
+ if (rc < 0)
+ return -ENODEV;
+
+ *pasid = (u32)rc;
+ return 0;
+}
+
+MODULE_IMPORT_NS(IDXD);
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/vfio/mdev/idxd/mdev.h b/drivers/vfio/mdev/idxd/mdev.h
index 120c2dc29ba7..e52b50760ee7 100644
--- a/drivers/vfio/mdev/idxd/mdev.h
+++ b/drivers/vfio/mdev/idxd/mdev.h
@@ -30,6 +30,7 @@
#define VIDXD_MAX_WQS 1
struct vdcm_idxd {
+ struct vfio_device vdev;
struct idxd_device *idxd;
struct idxd_wq *wq;
struct mdev_device *mdev;
@@ -71,6 +72,10 @@ static inline u8 vidxd_state(struct vdcm_idxd *vidxd)
return gensts->state;
}
+int idxd_mdev_get_pasid(struct mdev_device *mdev, struct vfio_device *vdev, u32 *pasid);
+
+void vidxd_reset(struct vdcm_idxd *vidxd);
+
int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count);
int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int size);
#endif
diff --git a/drivers/vfio/mdev/idxd/vdev.c b/drivers/vfio/mdev/idxd/vdev.c
index aca4a1228a97..4ead50947047 100644
--- a/drivers/vfio/mdev/idxd/vdev.c
+++ b/drivers/vfio/mdev/idxd/vdev.c
@@ -252,4 +252,410 @@ int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsign
return 0;
}
+static void idxd_complete_command(struct vdcm_idxd *vidxd, enum idxd_cmdsts_err val)
+{
+ u8 *bar0 = vidxd->bar0;
+ u32 *cmd = (u32 *)(bar0 + IDXD_CMD_OFFSET);
+ u32 *cmdsts = (u32 *)(bar0 + IDXD_CMDSTS_OFFSET);
+ u32 *intcause = (u32 *)(bar0 + IDXD_INTCAUSE_OFFSET);
+ struct device *dev = &vidxd->mdev->dev;
+
+ *cmdsts = val;
+ dev_dbg(dev, "%s: cmd: %#x status: %#x\n", __func__, *cmd, val);
+
+ if (*cmd & IDXD_CMD_INT_MASK) {
+ *intcause |= IDXD_INTC_CMD;
+ vidxd_send_interrupt(vidxd, 0);
+ }
+}
+
+static void vidxd_enable(struct vdcm_idxd *vidxd)
+{
+ u8 *bar0 = vidxd->bar0;
+ union gensts_reg *gensts = (union gensts_reg *)(bar0 + IDXD_GENSTATS_OFFSET);
+
+ if (gensts->state == IDXD_DEVICE_STATE_ENABLED)
+ return idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_ENABLED);
+
+ /* Check PCI configuration */
+ if (!(vidxd->cfg[PCI_COMMAND] & PCI_COMMAND_MASTER))
+ return idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_BUSMASTER_EN);
+
+ gensts->state = IDXD_DEVICE_STATE_ENABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_disable(struct vdcm_idxd *vidxd)
+{
+ struct idxd_wq *wq;
+ union wqcfg *vwqcfg;
+ u8 *bar0 = vidxd->bar0;
+ union gensts_reg *gensts = (union gensts_reg *)(bar0 + IDXD_GENSTATS_OFFSET);
+ struct mdev_device *mdev = vidxd->mdev;
+ struct device *dev = &mdev->dev;
+ int rc;
+
+ if (gensts->state == IDXD_DEVICE_STATE_DISABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DIS_DEV_EN);
+ return;
+ }
+
+ vwqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ wq = vidxd->wq;
+
+ rc = idxd_wq_disable(wq);
+ if (rc < 0) {
+ dev_warn(dev, "vidxd disable (wq disable) failed.\n");
+ idxd_complete_command(vidxd, IDXD_CMDSTS_HW_ERR);
+ return;
+ }
+
+ vwqcfg->wq_state = IDXD_WQ_DISABLED;
+ gensts->state = IDXD_DEVICE_STATE_DISABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_drain_all(struct vdcm_idxd *vidxd)
+{
+ struct idxd_wq *wq = vidxd->wq;
+
+ idxd_wq_drain(wq);
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_drain(struct vdcm_idxd *vidxd, int val)
+{
+ u8 *bar0 = vidxd->bar0;
+ union wqcfg *vwqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ struct idxd_wq *wq = vidxd->wq;
+
+ if (vwqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_NOT_EN);
+ return;
+ }
+
+ idxd_wq_drain(wq);
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_abort_all(struct vdcm_idxd *vidxd)
+{
+ struct idxd_wq *wq = vidxd->wq;
+ int rc;
+
+ rc = idxd_wq_abort(wq);
+ if (rc < 0) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_HW_ERR);
+ return;
+ }
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_abort(struct vdcm_idxd *vidxd, int val)
+{
+ u8 *bar0 = vidxd->bar0;
+ union wqcfg *vwqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ struct idxd_wq *wq = vidxd->wq;
+ int rc;
+
+ if (vwqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_NOT_EN);
+ return;
+ }
+
+ rc = idxd_wq_abort(wq);
+ if (rc < 0) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_HW_ERR);
+ return;
+ }
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+void vidxd_reset(struct vdcm_idxd *vidxd)
+{
+ u8 *bar0 = vidxd->bar0;
+ union gensts_reg *gensts = (union gensts_reg *)(bar0 + IDXD_GENSTATS_OFFSET);
+ union wqcfg *vwqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ struct idxd_wq *wq;
+ int rc;
+
+ gensts->state = IDXD_DEVICE_STATE_DRAIN;
+ wq = vidxd->wq;
+
+ if (wq->state == IDXD_WQ_ENABLED) {
+ rc = idxd_wq_abort(wq);
+ if (rc < 0) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_HW_ERR);
+ return;
+ }
+
+ rc = idxd_wq_disable(wq);
+ if (rc < 0) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_HW_ERR);
+ return;
+ }
+ }
+
+ vwqcfg->wq_state = IDXD_WQ_DISABLED;
+ gensts->state = IDXD_DEVICE_STATE_DISABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_reset(struct vdcm_idxd *vidxd, int wq_id_mask)
+{
+ struct idxd_wq *wq;
+ u8 *bar0 = vidxd->bar0;
+ union wqcfg *vwqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ int rc;
+
+ wq = vidxd->wq;
+ if (vwqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_NOT_EN);
+ return;
+ }
+
+ rc = idxd_wq_abort(wq);
+ if (rc < 0) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_HW_ERR);
+ return;
+ }
+
+ rc = idxd_wq_disable(wq);
+ if (rc < 0) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_HW_ERR);
+ return;
+ }
+
+ vwqcfg->wq_state = IDXD_WQ_DEV_DISABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_alloc_int_handle(struct vdcm_idxd *vidxd, int operand)
+{
+ bool ims = !!(operand & CMD_INT_HANDLE_IMS);
+ u32 cmdsts;
+ struct mdev_device *mdev = vidxd->mdev;
+ struct device *dev = &mdev->dev;
+ int ims_idx, vidx;
+
+ vidx = operand & GENMASK(15, 0);
+
+	/* vidx cannot be 0: vector 0 is emulated by the vdcm and needs no IMS handle */
+ if (vidx <= 0 || vidx >= VIDXD_MAX_MSIX_VECS) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_INVAL_INT_IDX);
+ return;
+ }
+
+ if (ims) {
+ dev_warn(dev, "IMS allocation is not implemented yet\n");
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_NO_HANDLE);
+ return;
+ }
+
+ /*
+ * The index coming from the guest driver will start at 1. Vector 0 is
+ * the command interrupt and is emulated by the vdcm. Here we are asking
+ * for the IMS index that's backing the I/O vectors from the relative
+ * index to the mdev device. This index would start at 0. So for a
+ * passed in vidx that is 1, we pass 0 to dev_msi_hwirq() and so forth.
+ */
+ ims_idx = dev_msi_hwirq(dev, vidx - 1);
+ cmdsts = ims_idx << IDXD_CMDSTS_RES_SHIFT;
+ dev_dbg(dev, "requested index %d handle %d\n", vidx, ims_idx);
+ idxd_complete_command(vidxd, cmdsts);
+}
+
+static void vidxd_release_int_handle(struct vdcm_idxd *vidxd, int operand)
+{
+ struct mdev_device *mdev = vidxd->mdev;
+ struct device *dev = &mdev->dev;
+ bool ims = !!(operand & CMD_INT_HANDLE_IMS);
+ int handle, i;
+ bool found = false;
+
+ handle = operand & GENMASK(15, 0);
+ if (ims) {
+ dev_dbg(dev, "IMS allocation is not implemented yet\n");
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_INVAL_INT_IDX_RELEASE);
+ return;
+ }
+
+	/* IMS-backed entries start at 1; vector 0 is the emulated vector */
+	for (i = 1; i < VIDXD_MAX_MSIX_VECS; i++) {
+		/* same relative indexing as allocation: vector i maps to IMS slot i - 1 */
+		if (dev_msi_hwirq(dev, i - 1) == handle) {
+			found = true;
+			break;
+		}
+	}
+
+	if (!found) {
+		dev_dbg(dev, "Freeing unallocated int handle.\n");
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_INVAL_INT_IDX_RELEASE);
+		return;
+	}
+
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_enable(struct vdcm_idxd *vidxd, int wq_id)
+{
+ struct idxd_wq *wq;
+ u8 *bar0 = vidxd->bar0;
+ union wq_cap_reg *wqcap;
+ struct mdev_device *mdev = vidxd->mdev;
+ struct device *dev = &mdev->dev;
+ struct idxd_device *idxd;
+ union wqcfg *vwqcfg;
+ unsigned long flags;
+ u32 wq_pasid;
+ int priv, rc;
+
+ if (wq_id >= VIDXD_MAX_WQS) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_INVAL_WQIDX);
+ return;
+ }
+
+ idxd = vidxd->idxd;
+ wq = vidxd->wq;
+
+ vwqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET + wq_id * 32);
+ wqcap = (union wq_cap_reg *)(bar0 + IDXD_WQCAP_OFFSET);
+
+ if (vidxd_state(vidxd) != IDXD_DEVICE_STATE_ENABLED) {
+		idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_NOT_EN);
+ return;
+ }
+
+ if (vwqcfg->wq_state != IDXD_WQ_DEV_DISABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_ENABLED);
+ return;
+ }
+
+ if (wq_dedicated(wq) && wqcap->dedicated_mode == 0) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_MODE);
+ return;
+ }
+
+ priv = 1;
+ rc = idxd_mdev_get_pasid(mdev, &vidxd->vdev, &wq_pasid);
+ if (rc < 0) {
+ dev_warn(dev, "idxd pasid setup failed wq %d: %d\n", wq->id, rc);
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_PASID_EN);
+ return;
+ }
+
+ dev_dbg(dev, "program pasid %d in wq %d\n", wq_pasid, wq->id);
+ spin_lock_irqsave(&idxd->dev_lock, flags);
+ idxd_wq_setup_pasid(wq, wq_pasid);
+ idxd_wq_setup_priv(wq, priv);
+ spin_unlock_irqrestore(&idxd->dev_lock, flags);
+ rc = idxd_wq_enable(wq);
+ if (rc < 0) {
+ dev_dbg(dev, "vidxd enable wq %d failed\n", wq->id);
+ spin_lock_irqsave(&idxd->dev_lock, flags);
+ idxd_wq_clear_pasid(wq);
+ idxd_wq_setup_priv(wq, 0);
+ spin_unlock_irqrestore(&idxd->dev_lock, flags);
+ idxd_complete_command(vidxd, IDXD_CMDSTS_HW_ERR);
+ return;
+ }
+
+ vwqcfg->wq_state = IDXD_WQ_DEV_ENABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_disable(struct vdcm_idxd *vidxd, int wq_id_mask)
+{
+ struct idxd_wq *wq;
+ u8 *bar0 = vidxd->bar0;
+ union wqcfg *vwqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ int rc;
+
+ wq = vidxd->wq;
+ if (vwqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_NOT_EN);
+ return;
+ }
+
+ rc = idxd_wq_disable(wq);
+ if (rc < 0) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_HW_ERR);
+ return;
+ }
+
+ vwqcfg->wq_state = IDXD_WQ_DEV_DISABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static bool command_supported(struct vdcm_idxd *vidxd, u32 cmd)
+{
+ u8 *bar0 = vidxd->bar0;
+ u32 *cmd_cap = (u32 *)(bar0 + IDXD_CMDCAP_OFFSET);
+
+ return !!(*cmd_cap & BIT(cmd));
+}
+
+static void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val)
+{
+ union idxd_command_reg *reg = (union idxd_command_reg *)(vidxd->bar0 + IDXD_CMD_OFFSET);
+ union gensts_reg *gensts = (union gensts_reg *)(vidxd->bar0 + IDXD_GENSTATS_OFFSET);
+ struct mdev_device *mdev = vidxd->mdev;
+ struct device *dev = &mdev->dev;
+
+ reg->bits = val;
+
+ dev_dbg(dev, "%s: cmd code: %u reg: %x\n", __func__, reg->cmd, reg->bits);
+ if (!command_supported(vidxd, reg->cmd)) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_INVAL_CMD);
+ return;
+ }
+
+ if (gensts->state == IDXD_DEVICE_STATE_HALT) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_HW_ERR);
+ return;
+ }
+
+ switch (reg->cmd) {
+ case IDXD_CMD_ENABLE_DEVICE:
+ vidxd_enable(vidxd);
+ break;
+ case IDXD_CMD_DISABLE_DEVICE:
+ vidxd_disable(vidxd);
+ break;
+ case IDXD_CMD_DRAIN_ALL:
+ vidxd_drain_all(vidxd);
+ break;
+ case IDXD_CMD_ABORT_ALL:
+ vidxd_abort_all(vidxd);
+ break;
+ case IDXD_CMD_RESET_DEVICE:
+ vidxd_reset(vidxd);
+ break;
+ case IDXD_CMD_ENABLE_WQ:
+ vidxd_wq_enable(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_DISABLE_WQ:
+ vidxd_wq_disable(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_DRAIN_WQ:
+ vidxd_wq_drain(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_ABORT_WQ:
+ vidxd_wq_abort(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_RESET_WQ:
+ vidxd_wq_reset(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_REQUEST_INT_HANDLE:
+ vidxd_alloc_int_handle(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_RELEASE_INT_HANDLE:
+ vidxd_release_int_handle(vidxd, reg->operand);
+ break;
+ default:
+ idxd_complete_command(vidxd, IDXD_CMDSTS_INVAL_CMD);
+ break;
+ }
+}
+
+MODULE_IMPORT_NS(IDXD);
MODULE_LICENSE("GPL v2");
The BAR0 MMIO path for the mdev is emulated. Add read/write support
functions to handle MMIO accesses when the guest reads from or writes to
the device. The support functions deal with the BAR0 MMIO region of
the mdev.
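As a rough illustration (not the patch code itself), the write path below
only accepts naturally aligned, power-of-two sized accesses of at most four
bytes; that check reduces to:

  #include <stdbool.h>
  #include <stdint.h>

  static bool mmio_access_ok(uint32_t offset, unsigned int size)
  {
          if (size == 0 || (size & (size - 1)))   /* power-of-two sizes only */
                  return false;
          if (offset & (size - 1))                /* naturally aligned */
                  return false;
          return size <= sizeof(uint32_t);        /* cap at 32 bits */
  }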
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/device.c | 1
drivers/dma/idxd/registers.h | 6 ++
drivers/vfio/mdev/idxd/mdev.h | 3 +
drivers/vfio/mdev/idxd/vdev.c | 119 +++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/idxd.h | 1
5 files changed, 129 insertions(+), 1 deletion(-)
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 99542c9cbc47..2ea6015e0d53 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -277,6 +277,7 @@ void idxd_wq_drain(struct idxd_wq *wq)
operand = BIT(wq->id % 16) | ((wq->id / 16) << 16);
idxd_cmd_exec(idxd, IDXD_CMD_DRAIN_WQ, operand, NULL);
}
+EXPORT_SYMBOL_GPL(idxd_wq_drain);
void idxd_wq_reset(struct idxd_wq *wq)
{
diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index 8ac2be4e174b..cf3d513a18e0 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -201,7 +201,9 @@ union cmdsts_reg {
};
u32 bits;
} __packed;
-#define IDXD_CMDSTS_ACTIVE 0x80000000
+
+#define IDXD_CMDS_ACTIVE_BIT 31
+#define IDXD_CMDSTS_ACTIVE BIT(IDXD_CMDS_ACTIVE_BIT)
#define IDXD_CMDSTS_ERR_MASK 0xff
#define IDXD_CMDSTS_RES_SHIFT 8
@@ -285,6 +287,8 @@ union msix_perm {
u32 bits;
} __packed;
+#define IDXD_MSIX_PERM_MASK 0xfffff00c
+#define IDXD_MSIX_PERM_IGNORE 0x3
#define MSIX_ENTRY_MASK_INT 0x1
#define MSIX_ENTRY_CTRL_BYTE 12
diff --git a/drivers/vfio/mdev/idxd/mdev.h b/drivers/vfio/mdev/idxd/mdev.h
index 91cb2662abd6..f696fe38e374 100644
--- a/drivers/vfio/mdev/idxd/mdev.h
+++ b/drivers/vfio/mdev/idxd/mdev.h
@@ -80,4 +80,7 @@ void vidxd_reset(struct vdcm_idxd *vidxd);
void vidxd_mmio_init(struct vdcm_idxd *vidxd);
int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count);
int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int size);
+int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size);
+int vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size);
+
#endif
diff --git a/drivers/vfio/mdev/idxd/vdev.c b/drivers/vfio/mdev/idxd/vdev.c
index 78cc2377e637..d2416765ce7e 100644
--- a/drivers/vfio/mdev/idxd/vdev.c
+++ b/drivers/vfio/mdev/idxd/vdev.c
@@ -42,6 +42,8 @@ static u64 idxd_pci_config[] = {
0x0000000000000000ULL,
};
+static void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val);
+
static void vidxd_reset_config(struct vdcm_idxd *vidxd)
{
u16 *devid = (u16 *)(vidxd->cfg + PCI_DEVICE_ID);
@@ -141,6 +143,123 @@ static void vidxd_report_pci_error(struct vdcm_idxd *vidxd)
send_halt_interrupt(vidxd);
}
+static void vidxd_report_swerror(struct vdcm_idxd *vidxd, unsigned int error)
+{
+ vidxd_set_swerr(vidxd, error);
+ send_swerr_interrupt(vidxd);
+}
+
+int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
+{
+ u32 offset = pos & (vidxd->bar_size[0] - 1);
+ u8 *bar0 = vidxd->bar0;
+ struct device *dev = &vidxd->mdev->dev;
+
+ dev_dbg(dev, "vidxd mmio W %d %x %x: %llx\n", vidxd->wq->id, size,
+ offset, get_reg_val(buf, size));
+
+ if (((size & (size - 1)) != 0) || (offset & (size - 1)) != 0)
+ return -EINVAL;
+
+	/* If we don't limit this, we could potentially write out of bounds */
+ if (size > sizeof(u32))
+ return -EINVAL;
+
+ switch (offset) {
+ case IDXD_GENCFG_OFFSET ... IDXD_GENCFG_OFFSET + 3:
+		/*
+		 * GENCFG is writable only while the device is disabled; the
+		 * emulation stores the value but does not act on it.
+		 */
+		if (vidxd_state(vidxd) == IDXD_DEVICE_STATE_DISABLED) {
+			dev_warn(dev, "Guest write to emulated GENCFG register\n");
+ memcpy(bar0 + offset, buf, size);
+ }
+ break;
+
+ case IDXD_GENCTRL_OFFSET:
+ memcpy(bar0 + offset, buf, size);
+ break;
+
+ case IDXD_INTCAUSE_OFFSET:
+ bar0[offset] &= ~(get_reg_val(buf, 1) & GENMASK(4, 0));
+ break;
+
+ case IDXD_CMD_OFFSET: {
+ u32 *cmdsts = (u32 *)(bar0 + IDXD_CMDSTS_OFFSET);
+ u32 val = get_reg_val(buf, size);
+
+ if (size != sizeof(u32))
+ return -EINVAL;
+
+ /* Check and set command in progress */
+ if (test_and_set_bit(IDXD_CMDS_ACTIVE_BIT, (unsigned long *)cmdsts) == 0)
+ vidxd_do_command(vidxd, val);
+ else
+ vidxd_report_swerror(vidxd, DSA_ERR_CMD_REG);
+ break;
+ }
+
+ case IDXD_SWERR_OFFSET:
+ /* W1C */
+ bar0[offset] &= ~(get_reg_val(buf, 1) & GENMASK(1, 0));
+ break;
+
+ case VIDXD_MSIX_TABLE_OFFSET ... VIDXD_MSIX_TABLE_OFFSET + VIDXD_MSIX_TBL_SZ - 1: {
+ int index = (offset - VIDXD_MSIX_TABLE_OFFSET) / 0x10;
+ u8 *msix_entry = &bar0[VIDXD_MSIX_TABLE_OFFSET + index * 0x10];
+ u64 *pba = (u64 *)(bar0 + VIDXD_MSIX_PBA_OFFSET);
+ u8 ctrl, new_mask;
+ int ims_index, ims_off;
+ u32 ims_ctrl, ims_mask;
+ struct idxd_device *idxd = vidxd->idxd;
+
+ memcpy(bar0 + offset, buf, size);
+ ctrl = msix_entry[MSIX_ENTRY_CTRL_BYTE];
+
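+		/* on unmask, deliver any interrupt latched in the PBA while masked */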
+ new_mask = ctrl & MSIX_ENTRY_MASK_INT;
+ if (!new_mask && test_and_clear_bit(index, (unsigned long *)pba))
+ vidxd_send_interrupt(vidxd, index);
+
+ if (index == 0)
+ break;
+
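+		/* vectors >= 1 are IMS-backed: mirror the guest's mask bit into the IMS entry */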
+ ims_index = dev_msi_hwirq(dev, index - 1);
+ ims_off = idxd->ims_offset + ims_index * 16 + sizeof(u64);
+ ims_ctrl = ioread32(idxd->reg_base + ims_off);
+ ims_mask = ims_ctrl & MSIX_ENTRY_MASK_INT;
+
+ if (new_mask == ims_mask)
+ break;
+
+ if (new_mask)
+ ims_ctrl |= MSIX_ENTRY_MASK_INT;
+ else
+ ims_ctrl &= ~MSIX_ENTRY_MASK_INT;
+
+ iowrite32(ims_ctrl, idxd->reg_base + ims_off);
+ /* readback to flush */
+ ims_ctrl = ioread32(idxd->reg_base + ims_off);
+ break;
+ }
+
+ case VIDXD_MSIX_PERM_OFFSET ... VIDXD_MSIX_PERM_OFFSET + VIDXD_MSIX_PERM_TBL_SZ - 1:
+ memcpy(bar0 + offset, buf, size);
+ break;
+ } /* offset */
+
+ return 0;
+}
+
+int vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
+{
+ u32 offset = pos & (vidxd->bar_size[0] - 1);
+ struct device *dev = &vidxd->mdev->dev;
+
+ memcpy(buf, vidxd->bar0 + offset, size);
+
+ dev_dbg(dev, "vidxd mmio R %d %x %x: %llx\n",
+ vidxd->wq->id, size, offset, get_reg_val(buf, size));
+ return 0;
+}
+
int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count)
{
u32 offset = pos & 0xfff;
diff --git a/include/uapi/linux/idxd.h b/include/uapi/linux/idxd.h
index 751f6107217c..e8c39849a526 100644
--- a/include/uapi/linux/idxd.h
+++ b/include/uapi/linux/idxd.h
@@ -90,6 +90,7 @@ enum dsa_completion_status {
DSA_COMP_HW_ERR_DRB,
DSA_COMP_TRANSLATION_FAIL,
DSA_ERR_PCI_CFG = 0x51,
+ DSA_ERR_CMD_REG,
};
enum iax_completion_status {
Add "mdev" wq type and support helpers. The mdev wq type marks a wq
for use as a VFIO mediated device.
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/idxd.h | 6 ++++++
drivers/dma/idxd/sysfs.c | 5 +++++
2 files changed, 11 insertions(+)
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 34ffa6dad53a..cbb046c2921f 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -129,6 +129,7 @@ enum idxd_wq_type {
IDXD_WQT_NONE = 0,
IDXD_WQT_KERNEL,
IDXD_WQT_USER,
+ IDXD_WQT_MDEV,
};
struct idxd_cdev {
@@ -424,6 +425,11 @@ static inline bool is_idxd_wq_user(struct idxd_wq *wq)
return wq->type == IDXD_WQT_USER;
}
+static inline bool is_idxd_wq_mdev(struct idxd_wq *wq)
+{
+ return (wq->type == IDXD_WQT_MDEV);
+}
+
static inline bool wq_dedicated(struct idxd_wq *wq)
{
return test_bit(WQ_FLAG_DEDICATED, &wq->flags);
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index 6583c9c2e992..3d3a84be2c9b 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -17,6 +17,7 @@ static char *idxd_wq_type_names[] = {
[IDXD_WQT_NONE] = "none",
[IDXD_WQT_KERNEL] = "kernel",
[IDXD_WQT_USER] = "user",
+ [IDXD_WQT_MDEV] = "mdev",
};
static bool is_idxd_dev_drv(struct device_driver *drv)
@@ -860,6 +861,8 @@ static ssize_t wq_type_show(struct device *dev,
return sysfs_emit(buf, "%s\n", idxd_wq_type_names[IDXD_WQT_KERNEL]);
case IDXD_WQT_USER:
return sysfs_emit(buf, "%s\n", idxd_wq_type_names[IDXD_WQT_USER]);
+ case IDXD_WQT_MDEV:
+ return sysfs_emit(buf, "%s\n", idxd_wq_type_names[IDXD_WQT_MDEV]);
case IDXD_WQT_NONE:
default:
return sysfs_emit(buf, "%s\n", idxd_wq_type_names[IDXD_WQT_NONE]);
@@ -885,6 +888,8 @@ static ssize_t wq_type_store(struct device *dev,
wq->type = IDXD_WQT_KERNEL;
else if (sysfs_streq(buf, idxd_wq_type_names[IDXD_WQT_USER]))
wq->type = IDXD_WQT_USER;
+ else if (sysfs_streq(buf, idxd_wq_type_names[IDXD_WQT_MDEV]))
+ wq->type = IDXD_WQT_MDEV;
else
return -EINVAL;
The mediated device emulates the PCI config space accesses (read/write)
from the guest. Add PCI config read/write functions to support these
accesses from the guest.
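For reference, the per-byte write emulation added below boils down to
masking the source with a writable-bits bitmap and giving the PCI_STATUS
high byte write-1-to-clear semantics. A minimal standalone model of one
byte (names are stand-ins, not the patch code):

  #include <stdint.h>

  /* returns the new value of one config byte after a guest write */
  static uint8_t emulate_cfg_byte(uint8_t old, uint8_t src, uint8_t mask, int rw1c)
  {
          uint8_t val = src & mask;       /* drop read-only bits */

          if (rw1c)                       /* writing 1 clears the bit */
                  val = (~val & old) & mask;

          return (old & ~mask) | val;
  }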
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/registers.h | 4 +
drivers/vfio/mdev/Kconfig | 9 +
drivers/vfio/mdev/Makefile | 1
drivers/vfio/mdev/idxd/Makefile | 4 +
drivers/vfio/mdev/idxd/mdev.h | 76 ++++++++++++
drivers/vfio/mdev/idxd/vdev.c | 255 +++++++++++++++++++++++++++++++++++++++
include/uapi/linux/idxd.h | 1
7 files changed, 350 insertions(+)
create mode 100644 drivers/vfio/mdev/idxd/Makefile
create mode 100644 drivers/vfio/mdev/idxd/mdev.h
create mode 100644 drivers/vfio/mdev/idxd/vdev.c
diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index c970c3f025f0..c699c72bd8d2 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -155,6 +155,7 @@ enum idxd_device_reset_type {
#define IDXD_INTC_CMD 0x02
#define IDXD_INTC_OCCUPY 0x04
#define IDXD_INTC_PERFMON_OVFL 0x08
+#define IDXD_INTC_HALT 0x10
#define IDXD_CMD_OFFSET 0xa0
union idxd_command_reg {
@@ -279,6 +280,9 @@ union msix_perm {
u32 bits;
} __packed;
+#define MSIX_ENTRY_MASK_INT 0x1
+#define MSIX_ENTRY_CTRL_BYTE 12
+
union group_flags {
struct {
u32 tc_a:3;
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index 82f79d99a7db..57f89457e9a7 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -21,3 +21,12 @@ config VFIO_MDEV_IRQS
devices.
If you don't know what to do here, say N.
+
+config VFIO_MDEV_IDXD
+ tristate "VFIO Mediated device driver for Intel IDXD"
+ depends on VFIO && VFIO_MDEV && X86_64
+ select VFIO_MDEV_IMS
+ select IMS_MSI_ARRAY
+ default n
+ help
VFIO based mediated device driver for Intel accelerator devices (IDXD).
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index c3f160cae192..7e3f5fae4bf1 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -6,3 +6,4 @@ mdev-$(CONFIG_VFIO_MDEV_IRQS) += mdev_irqs.o
obj-$(CONFIG_VFIO_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MDEV_IDXD) += idxd/
diff --git a/drivers/vfio/mdev/idxd/Makefile b/drivers/vfio/mdev/idxd/Makefile
new file mode 100644
index 000000000000..ccd3bc1c7ab6
--- /dev/null
+++ b/drivers/vfio/mdev/idxd/Makefile
@@ -0,0 +1,4 @@
+ccflags-y += -I$(srctree)/drivers/dma/idxd -DDEFAULT_SYMBOL_NAMESPACE=IDXD
+
+obj-$(CONFIG_VFIO_MDEV_IDXD) += idxd_mdev.o
+idxd_mdev-y := vdev.o
diff --git a/drivers/vfio/mdev/idxd/mdev.h b/drivers/vfio/mdev/idxd/mdev.h
new file mode 100644
index 000000000000..120c2dc29ba7
--- /dev/null
+++ b/drivers/vfio/mdev/idxd/mdev.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2020 Intel Corporation. All rights rsvd. */
+
+#ifndef _IDXD_MDEV_H_
+#define _IDXD_MDEV_H_
+
+/* two 64-bit BARs implemented */
+#define VIDXD_MAX_BARS 2
+#define VIDXD_MAX_CFG_SPACE_SZ 4096
+#define VIDXD_MAX_MMIO_SPACE_SZ 8192
+#define VIDXD_MSIX_TBL_SZ_OFFSET 0x42
+#define VIDXD_CAP_CTRL_SZ 0x100
+#define VIDXD_GRP_CTRL_SZ 0x100
+#define VIDXD_WQ_CTRL_SZ 0x100
+#define VIDXD_WQ_OCPY_INT_SZ 0x20
+#define VIDXD_MSIX_TBL_SZ 0x90
+#define VIDXD_MSIX_PERM_TBL_SZ 0x48
+
+#define VIDXD_MSIX_PERM_OFFSET 0x300
+#define VIDXD_GRPCFG_OFFSET 0x400
+#define VIDXD_WQCFG_OFFSET 0x500
+#define VIDXD_MSIX_TABLE_OFFSET 0x600
+#define VIDXD_MSIX_PBA_OFFSET 0x700
+#define VIDXD_IMS_OFFSET 0x1000
+
+#define VIDXD_BAR0_SIZE 0x2000
+#define VIDXD_BAR2_SIZE 0x2000
+#define VIDXD_MAX_MSIX_VECS 2
+#define VIDXD_MAX_MSIX_ENTRIES VIDXD_MAX_MSIX_VECS
+#define VIDXD_MAX_WQS 1
+
+struct vdcm_idxd {
+ struct idxd_device *idxd;
+ struct idxd_wq *wq;
+ struct mdev_device *mdev;
+ int num_wqs;
+
+ u64 bar_val[VIDXD_MAX_BARS];
+ u64 bar_size[VIDXD_MAX_BARS];
+ u8 cfg[VIDXD_MAX_CFG_SPACE_SZ];
+ u8 bar0[VIDXD_MAX_MMIO_SPACE_SZ];
+ struct mutex dev_lock; /* lock for vidxd resources */
+};
+
+static inline u64 get_reg_val(void *buf, int size)
+{
+ u64 val = 0;
+
+ switch (size) {
+ case 8:
+ val = *(u64 *)buf;
+ break;
+ case 4:
+ val = *(u32 *)buf;
+ break;
+ case 2:
+ val = *(u16 *)buf;
+ break;
+ case 1:
+ val = *(u8 *)buf;
+ break;
+ }
+
+ return val;
+}
+
+static inline u8 vidxd_state(struct vdcm_idxd *vidxd)
+{
+ union gensts_reg *gensts = (union gensts_reg *)(vidxd->bar0 + IDXD_GENSTATS_OFFSET);
+
+ return gensts->state;
+}
+
+int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count);
+int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int size);
+#endif
diff --git a/drivers/vfio/mdev/idxd/vdev.c b/drivers/vfio/mdev/idxd/vdev.c
new file mode 100644
index 000000000000..aca4a1228a97
--- /dev/null
+++ b/drivers/vfio/mdev/idxd/vdev.c
@@ -0,0 +1,255 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/device.h>
+#include <linux/sched/task.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/mm.h>
+#include <linux/mmu_context.h>
+#include <linux/vfio.h>
+#include <linux/mdev.h>
+#include <linux/msi.h>
+#include <linux/intel-iommu.h>
+#include <linux/intel-svm.h>
+#include <linux/kvm_host.h>
+#include <linux/eventfd.h>
+#include <uapi/linux/idxd.h>
+#include "registers.h"
+#include "idxd.h"
+#include "mdev.h"
+
+void vidxd_send_interrupt(struct vdcm_idxd *vidxd, int vector)
+{
+ struct mdev_device *mdev = vidxd->mdev;
+ u8 *bar0 = vidxd->bar0;
+ u8 *msix_entry = &bar0[VIDXD_MSIX_TABLE_OFFSET + vector * 0x10];
+ u64 *pba = (u64 *)(bar0 + VIDXD_MSIX_PBA_OFFSET);
+ u8 ctrl;
+
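+	/* a masked vector is latched in the PBA and delivered when unmasked */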
+ ctrl = msix_entry[MSIX_ENTRY_CTRL_BYTE];
+ if (ctrl & MSIX_ENTRY_MASK_INT)
+ set_bit(vector, (unsigned long *)pba);
+ else
+ mdev_msix_send_signal(mdev, vector);
+}
+
+static void vidxd_set_swerr(struct vdcm_idxd *vidxd, unsigned int error)
+{
+ union sw_err_reg *swerr = (union sw_err_reg *)(vidxd->bar0 + IDXD_SWERR_OFFSET);
+
+ if (!swerr->valid) {
+ memset(swerr, 0, sizeof(*swerr));
+ swerr->valid = 1;
+ swerr->error = error;
+ } else if (!swerr->overflow) {
+ swerr->overflow = 1;
+ }
+}
+
+static inline void send_swerr_interrupt(struct vdcm_idxd *vidxd)
+{
+ union genctrl_reg *genctrl = (union genctrl_reg *)(vidxd->bar0 + IDXD_GENCTRL_OFFSET);
+ u32 *intcause = (u32 *)(vidxd->bar0 + IDXD_INTCAUSE_OFFSET);
+
+ if (!genctrl->softerr_int_en)
+ return;
+
+ *intcause |= IDXD_INTC_ERR;
+ vidxd_send_interrupt(vidxd, 0);
+}
+
+static inline void send_halt_interrupt(struct vdcm_idxd *vidxd)
+{
+ union genctrl_reg *genctrl = (union genctrl_reg *)(vidxd->bar0 + IDXD_GENCTRL_OFFSET);
+ u32 *intcause = (u32 *)(vidxd->bar0 + IDXD_INTCAUSE_OFFSET);
+
+ if (!genctrl->halt_int_en)
+ return;
+
+ *intcause |= IDXD_INTC_HALT;
+ vidxd_send_interrupt(vidxd, 0);
+}
+
+static void vidxd_report_pci_error(struct vdcm_idxd *vidxd)
+{
+ union gensts_reg *gensts = (union gensts_reg *)(vidxd->bar0 + IDXD_GENSTATS_OFFSET);
+
+ vidxd_set_swerr(vidxd, DSA_ERR_PCI_CFG);
+ /* set device to halt */
+ gensts->reset_type = IDXD_DEVICE_RESET_FLR;
+ gensts->state = IDXD_DEVICE_STATE_HALT;
+
+ send_halt_interrupt(vidxd);
+}
+
+int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count)
+{
+ u32 offset = pos & 0xfff;
+ struct device *dev = &vidxd->mdev->dev;
+
+ memcpy(buf, &vidxd->cfg[offset], count);
+
+ dev_dbg(dev, "vidxd pci R %d %x %x: %llx\n",
+ vidxd->wq->id, count, offset, get_reg_val(buf, count));
+
+ return 0;
+}
+
+/*
+ * Much of the emulation code has been borrowed from Intel i915 cfg space
+ * emulation code.
+ * drivers/gpu/drm/i915/gvt/cfg_space.c:
+ */
+
+/*
+ * Bitmap for writable bits (RW or RW1C bits, but cannot co-exist in one
+ * byte) byte by byte in standard pci configuration space. (not the full
+ * 256 bytes.)
+ */
+static const u8 pci_cfg_space_rw_bmp[PCI_INTERRUPT_LINE + 4] = {
+ [PCI_COMMAND] = 0xff, 0x07,
+	[PCI_STATUS] = 0x00, 0xf9, /* the only RW1C byte */
+ [PCI_CACHE_LINE_SIZE] = 0xff,
+ [PCI_BASE_ADDRESS_0 ... PCI_CARDBUS_CIS - 1] = 0xff,
+ [PCI_ROM_ADDRESS] = 0x01, 0xf8, 0xff, 0xff,
+ [PCI_INTERRUPT_LINE] = 0xff,
+};
+
+static void _pci_cfg_mem_write(struct vdcm_idxd *vidxd, unsigned int off, u8 *src,
+ unsigned int bytes)
+{
+ u8 *cfg_base = vidxd->cfg;
+ u8 mask, new, old;
+ int i = 0;
+
+ for (; i < bytes && (off + i < sizeof(pci_cfg_space_rw_bmp)); i++) {
+ mask = pci_cfg_space_rw_bmp[off + i];
+ old = cfg_base[off + i];
+ new = src[i] & mask;
+
+		/*
+ * The PCI_STATUS high byte has RW1C bits, here
+ * emulates clear by writing 1 for these bits.
+ * Writing a 0b to RW1C bits has no effect.
+ */
+ if (off + i == PCI_STATUS + 1)
+ new = (~new & old) & mask;
+
+ cfg_base[off + i] = (old & ~mask) | new;
+ }
+
+ /* For other configuration space directly copy as it is. */
+ if (i < bytes)
+ memcpy(cfg_base + off + i, src + i, bytes - i);
+}
+
+static inline void _write_pci_bar(struct vdcm_idxd *vidxd, u32 offset, u32 val, bool low)
+{
+ u32 *pval;
+
+	/* BAR offset should be 32-bit aligned */
+ offset = rounddown(offset, 4);
+ pval = (u32 *)(vidxd->cfg + offset);
+
+ if (low) {
+ /*
+ * only update bit 31 - bit 4,
+ * leave the bit 3 - bit 0 unchanged.
+ */
+ *pval = (val & GENMASK(31, 4)) | (*pval & GENMASK(3, 0));
+ } else {
+ *pval = val;
+ }
+}
+
+static int _pci_cfg_bar_write(struct vdcm_idxd *vidxd, unsigned int offset, void *p_data,
+ unsigned int bytes)
+{
+ u32 new = *(u32 *)(p_data);
+ bool lo = IS_ALIGNED(offset, 8);
+ u64 size;
+ unsigned int bar_id;
+
+ /*
+ * Power-up software can determine how much address
+ * space the device requires by writing a value of
+ * all 1's to the register and then reading the value
+ * back. The device will return 0's in all don't-care
+ * address bits.
+ */
+ if (new == 0xffffffff) {
+ switch (offset) {
+ case PCI_BASE_ADDRESS_0:
+ case PCI_BASE_ADDRESS_1:
+ case PCI_BASE_ADDRESS_2:
+ case PCI_BASE_ADDRESS_3:
+ bar_id = (offset - PCI_BASE_ADDRESS_0) / 8;
+ size = vidxd->bar_size[bar_id];
+ _write_pci_bar(vidxd, offset, size >> (lo ? 0 : 32), lo);
+ break;
+ default:
+ /* Unimplemented BARs */
+ _write_pci_bar(vidxd, offset, 0x0, false);
+ }
+ } else {
+ switch (offset) {
+ case PCI_BASE_ADDRESS_0:
+ case PCI_BASE_ADDRESS_1:
+ case PCI_BASE_ADDRESS_2:
+ case PCI_BASE_ADDRESS_3:
+ _write_pci_bar(vidxd, offset, new, lo);
+ break;
+ default:
+ break;
+ }
+ }
+ return 0;
+}
+
+int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int size)
+{
+ struct device *dev = &vidxd->idxd->pdev->dev;
+
+ if (size > 4)
+ return -EINVAL;
+
+ if (pos + size > VIDXD_MAX_CFG_SPACE_SZ)
+ return -EINVAL;
+
+ dev_dbg(dev, "vidxd pci W %d %x %x: %llx\n", vidxd->wq->id, size, pos,
+ get_reg_val(buf, size));
+
+ /* First check if it's PCI_COMMAND */
+ if (IS_ALIGNED(pos, 2) && pos == PCI_COMMAND) {
+ bool new_bme;
+ bool bme;
+
+ if (size > 2)
+ return -EINVAL;
+
+ new_bme = !!(get_reg_val(buf, 2) & PCI_COMMAND_MASTER);
+ bme = !!(vidxd->cfg[pos] & PCI_COMMAND_MASTER);
+ _pci_cfg_mem_write(vidxd, pos, buf, size);
+
+ /* Flag error if turning off BME while device is enabled */
+ if ((bme && !new_bme) && vidxd_state(vidxd) == IDXD_DEVICE_STATE_ENABLED)
+ vidxd_report_pci_error(vidxd);
+ return 0;
+ }
+
+ switch (pos) {
+ case PCI_BASE_ADDRESS_0 ... PCI_BASE_ADDRESS_5:
+ if (!IS_ALIGNED(pos, 4))
+ return -EINVAL;
+ return _pci_cfg_bar_write(vidxd, pos, buf, size);
+
+ default:
+ _pci_cfg_mem_write(vidxd, pos, buf, size);
+ }
+ return 0;
+}
+
+MODULE_LICENSE("GPL v2");
diff --git a/include/uapi/linux/idxd.h b/include/uapi/linux/idxd.h
index e33997b4d750..751f6107217c 100644
--- a/include/uapi/linux/idxd.h
+++ b/include/uapi/linux/idxd.h
@@ -89,6 +89,7 @@ enum dsa_completion_status {
DSA_COMP_HW_ERR1,
DSA_COMP_HW_ERR_DRB,
DSA_COMP_TRANSLATION_FAIL,
+ DSA_ERR_PCI_CFG = 0x51,
};
enum iax_completion_status {
When a device error occurs, the mediated device needs to be notified so
that the error can be forwarded to the guest. Add support to notify the
specific mdev when an error is wq specific and to broadcast errors to all
mdevs when the error is device-wide.
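The forwarding follows the SWERR convention: the first error is latched in
full, and any further error while SWERR is still valid only sets the
overflow flag. A minimal standalone sketch of that convention (illustrative
only):

  struct toy_swerr {
          unsigned int valid    : 1;
          unsigned int overflow : 1;
          unsigned int error    : 8;
  };

  static void toy_set_swerr(struct toy_swerr *swerr, unsigned int error)
  {
          if (!swerr->valid) {
                  swerr->valid = 1;       /* first error recorded in full */
                  swerr->error = error;
          } else if (!swerr->overflow) {
                  swerr->overflow = 1;    /* later errors only flag overflow */
          }
  }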
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/idxd.h | 7 +++++++
drivers/dma/idxd/irq.c | 21 +++++++++++++++++++--
drivers/vfio/mdev/idxd/mdev.c | 5 +++++
drivers/vfio/mdev/idxd/mdev.h | 1 +
drivers/vfio/mdev/idxd/vdev.c | 23 +++++++++++++++++++++++
5 files changed, 55 insertions(+), 2 deletions(-)
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 5e8da9019c46..b1d94463fce5 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -50,10 +50,17 @@ enum idxd_type {
#define IDXD_NAME_SIZE 128
#define IDXD_PMU_EVENT_MAX 64
+struct idxd_wq;
+
+struct idxd_device_ops {
+ void (*notify_error)(struct idxd_wq *wq);
+};
+
struct idxd_device_driver {
const char *name;
int (*probe)(struct device *dev);
void (*remove)(struct device *dev);
+ struct idxd_device_ops *ops;
struct device_driver drv;
};
diff --git a/drivers/dma/idxd/irq.c b/drivers/dma/idxd/irq.c
index d1a635ecc7f3..b8b6c93f4480 100644
--- a/drivers/dma/idxd/irq.c
+++ b/drivers/dma/idxd/irq.c
@@ -104,6 +104,7 @@ static int idxd_device_schedule_fault_process(struct idxd_device *idxd,
static int process_misc_interrupts(struct idxd_device *idxd, u32 cause)
{
+ struct idxd_device_driver *wq_drv;
struct device *dev = &idxd->pdev->dev;
union gensts_reg gensts;
u32 val = 0;
@@ -123,16 +124,32 @@ static int process_misc_interrupts(struct idxd_device *idxd, u32 cause)
int id = idxd->sw_err.wq_idx;
struct idxd_wq *wq = idxd->wqs[id];
- if (wq->type == IDXD_WQT_USER)
+ if (is_idxd_wq_user(wq)) {
wake_up_interruptible(&wq->err_queue);
+ } else if (is_idxd_wq_mdev(wq)) {
+ struct device *conf_dev = wq_confdev(wq);
+ struct device_driver *drv = conf_dev->driver;
+
+			/* container_of() on a NULL driver is not NULL; check drv first */
+			if (drv) {
+				wq_drv = container_of(drv, struct idxd_device_driver, drv);
+				if (wq_drv->ops && wq_drv->ops->notify_error)
+					wq_drv->ops->notify_error(wq);
+			}
+ }
} else {
int i;
for (i = 0; i < idxd->max_wqs; i++) {
struct idxd_wq *wq = idxd->wqs[i];
- if (wq->type == IDXD_WQT_USER)
+ if (is_idxd_wq_user(wq)) {
wake_up_interruptible(&wq->err_queue);
+ } else if (is_idxd_wq_mdev(wq)) {
+ struct device *conf_dev = wq_confdev(wq);
+ struct device_driver *drv = conf_dev->driver;
+
+				/* container_of() on a NULL driver is not NULL; check drv first */
+				if (drv) {
+					wq_drv = container_of(drv, struct idxd_device_driver, drv);
+					if (wq_drv->ops && wq_drv->ops->notify_error)
+						wq_drv->ops->notify_error(wq);
+				}
+ }
}
}
diff --git a/drivers/vfio/mdev/idxd/mdev.c b/drivers/vfio/mdev/idxd/mdev.c
index 3257505fb7c7..25d1ac67b0c9 100644
--- a/drivers/vfio/mdev/idxd/mdev.c
+++ b/drivers/vfio/mdev/idxd/mdev.c
@@ -904,10 +904,15 @@ static void idxd_mdev_drv_remove(struct device *dev)
put_device(dev);
}
+static struct idxd_device_ops mdev_wq_ops = {
+ .notify_error = idxd_wq_vidxd_send_errors,
+};
+
static struct idxd_device_driver idxd_mdev_driver = {
.probe = idxd_mdev_drv_probe,
.remove = idxd_mdev_drv_remove,
.name = idxd_mdev_drv_name,
+ .ops = &mdev_wq_ops,
};
static int __init idxd_mdev_init(void)
diff --git a/drivers/vfio/mdev/idxd/mdev.h b/drivers/vfio/mdev/idxd/mdev.h
index dd4290bce772..10358831da6a 100644
--- a/drivers/vfio/mdev/idxd/mdev.h
+++ b/drivers/vfio/mdev/idxd/mdev.h
@@ -107,5 +107,6 @@ int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigne
int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int size);
int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size);
int vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size);
+void idxd_wq_vidxd_send_errors(struct idxd_wq *wq);
#endif
diff --git a/drivers/vfio/mdev/idxd/vdev.c b/drivers/vfio/mdev/idxd/vdev.c
index a444a0af8b5f..2a066e483dd8 100644
--- a/drivers/vfio/mdev/idxd/vdev.c
+++ b/drivers/vfio/mdev/idxd/vdev.c
@@ -151,6 +151,29 @@ static void vidxd_report_swerror(struct vdcm_idxd *vidxd, unsigned int error)
send_swerr_interrupt(vidxd);
}
+void idxd_wq_vidxd_send_errors(struct idxd_wq *wq)
+{
+ struct vdcm_idxd *vidxd = wq->vidxd;
+ struct idxd_device *idxd = vidxd->idxd;
+ u8 *bar0 = vidxd->bar0;
+ union sw_err_reg *swerr = (union sw_err_reg *)(bar0 + IDXD_SWERR_OFFSET);
+ int i;
+
+ if (swerr->valid) {
+ if (!swerr->overflow) {
+ swerr->overflow = 1;
+ send_swerr_interrupt(vidxd);
+ }
+ return;
+ }
+
+ lockdep_assert_held(&idxd->dev_lock);
+ for (i = 0; i < 4; i++)
+ swerr->bits[i] = idxd->sw_err.bits[i];
+
+ send_swerr_interrupt(vidxd);
+}
+
int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
{
u32 offset = pos & (vidxd->bar_size[0] - 1);
Add retrieval code for Interrupt Message Store (IMS) related info
(table offset and size). IMS is used to back the MSI-X vectors that
support the descriptor completion interrupts for the mediated device.
The SIOV spec [1] specifies that IMS is detected via DVSEC. See [2] for
the upstream discussion on whether the device driver or a platform
feature should do the detection; the latest agreement is that IMS
detection should be handled from the platform perspective. Given that
DSA 1.0 and all foreseeable future devices are expected to support IMS,
the driver simply checks the IMS size field to determine whether IMS is
supported.
[1]: https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
[2]: https://lore.kernel.org/dmaengine/[email protected]/
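In other words, detection reduces to a size computation from GENCAP plus a
non-zero check. A sketch under those assumptions (not the driver code
itself):

  /* GENCAP.max_ims_mult is a multiplier of 256 IMS entries in this series */
  static unsigned int ims_size_from_gencap(unsigned int max_ims_mult)
  {
          return max_ims_mult * 256;
  }

  static int ims_supported(unsigned int ims_size)
  {
          return ims_size != 0;   /* no DVSEC probe; the size field decides */
  }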
Signed-off-by: Dave Jiang <[email protected]>
---
Documentation/ABI/stable/sysfs-driver-dma-idxd | 6 ++++++
drivers/dma/idxd/idxd.h | 2 ++
drivers/dma/idxd/init.c | 4 ++++
drivers/dma/idxd/sysfs.c | 9 +++++++++
4 files changed, 21 insertions(+)
diff --git a/Documentation/ABI/stable/sysfs-driver-dma-idxd b/Documentation/ABI/stable/sysfs-driver-dma-idxd
index 55285c136cf0..884065b2e85c 100644
--- a/Documentation/ABI/stable/sysfs-driver-dma-idxd
+++ b/Documentation/ABI/stable/sysfs-driver-dma-idxd
@@ -129,6 +129,12 @@ KernelVersion: 5.10.0
Contact: [email protected]
Description: The last executed device administrative command's status/error.
+What: /sys/bus/dsa/devices/dsa<m>/ims_size
+Date: May 3, 2021
+KernelVersion: 5.14.0
+Contact: [email protected]
+Description: The total number of vectors available for Interrupt Message Store.
+
What: /sys/bus/dsa/devices/wq<m>.<n>/block_on_fault
Date: Oct 27, 2020
KernelVersion: 5.11.0
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 22afaf7ee637..288e3fe15b3e 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -266,6 +266,7 @@ struct idxd_device {
int num_groups;
+ u32 ims_offset;
u32 msix_perm_offset;
u32 wqcfg_offset;
u32 grpcfg_offset;
@@ -273,6 +274,7 @@ struct idxd_device {
u64 max_xfer_bytes;
u32 max_batch_size;
+ int ims_size;
int max_groups;
int max_engines;
int max_tokens;
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index bed9169152f9..16ff37be2d26 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -381,6 +381,8 @@ static void idxd_read_table_offsets(struct idxd_device *idxd)
dev_dbg(dev, "IDXD Work Queue Config Offset: %#x\n", idxd->wqcfg_offset);
idxd->msix_perm_offset = offsets.msix_perm * IDXD_TABLE_MULT;
dev_dbg(dev, "IDXD MSIX Permission Offset: %#x\n", idxd->msix_perm_offset);
+ idxd->ims_offset = offsets.ims * IDXD_TABLE_MULT;
+ dev_dbg(dev, "IDXD IMS Offset: %#x\n", idxd->ims_offset);
idxd->perfmon_offset = offsets.perfmon * IDXD_TABLE_MULT;
dev_dbg(dev, "IDXD Perfmon Offset: %#x\n", idxd->perfmon_offset);
}
@@ -403,6 +405,8 @@ static void idxd_read_caps(struct idxd_device *idxd)
dev_dbg(dev, "max xfer size: %llu bytes\n", idxd->max_xfer_bytes);
idxd->max_batch_size = 1U << idxd->hw.gen_cap.max_batch_shift;
dev_dbg(dev, "max batch size: %u\n", idxd->max_batch_size);
+ idxd->ims_size = idxd->hw.gen_cap.max_ims_mult * 256ULL;
+ dev_dbg(dev, "IMS size: %u\n", idxd->ims_size);
if (idxd->hw.gen_cap.config_en)
set_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags);
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index 4fcb8833a4df..6583c9c2e992 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -1166,6 +1166,14 @@ static ssize_t numa_node_show(struct device *dev,
}
static DEVICE_ATTR_RO(numa_node);
+static ssize_t ims_size_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ struct idxd_device *idxd = confdev_to_idxd(dev);
+
+ return sysfs_emit(buf, "%u\n", idxd->ims_size);
+}
+static DEVICE_ATTR_RO(ims_size);
+
static ssize_t max_batch_size_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -1352,6 +1360,7 @@ static struct attribute *idxd_device_attributes[] = {
&dev_attr_max_work_queues_size.attr,
&dev_attr_max_engines.attr,
&dev_attr_numa_node.attr,
+ &dev_attr_ims_size.attr,
&dev_attr_max_batch_size.attr,
&dev_attr_max_transfer_size.attr,
&dev_attr_op_cap.attr,
With mdev needing to use the same function, move the function to a
common place where both vfio-pci and mdev can use it.
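A hypothetical caller sketch (foo_device and its req_trigger field are
stand-ins) showing the calling convention the common helper keeps, mirroring
the vfio_pci_set_req_trigger() conversion below:

  static int foo_set_req_trigger(struct foo_device *fdev, unsigned int index,
                                 unsigned int start, unsigned int count,
                                 u32 flags, void *data)
  {
          if (index != VFIO_PCI_REQ_IRQ_INDEX || start != 0 || count > 1)
                  return -EINVAL;

          return vfio_set_ctx_trigger_single(&fdev->req_trigger, count,
                                             flags, data);
  }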
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/vfio/Makefile | 2 +
drivers/vfio/pci/vfio_pci_intrs.c | 63 ++------------------------------
drivers/vfio/vfio_common.c | 74 +++++++++++++++++++++++++++++++++++++
include/linux/vfio.h | 4 ++
4 files changed, 83 insertions(+), 60 deletions(-)
create mode 100644 drivers/vfio/vfio_common.c
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index fee73f3d9480..fc7fd2412dee 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
vfio_virqfd-y := virqfd.o
-obj-$(CONFIG_VFIO) += vfio.o
+obj-$(CONFIG_VFIO) += vfio.o vfio_common.o
obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 9c8efad3a859..926cff00146c 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -561,61 +561,6 @@ static int vfio_pci_set_msi_trigger(struct vfio_pci_device *vdev,
return 0;
}
-static int vfio_pci_set_ctx_trigger_single(struct eventfd_ctx **ctx,
- unsigned int count, uint32_t flags,
- void *data)
-{
- /* DATA_NONE/DATA_BOOL enables loopback testing */
- if (flags & VFIO_IRQ_SET_DATA_NONE) {
- if (*ctx) {
- if (count) {
- eventfd_signal(*ctx, 1);
- } else {
- eventfd_ctx_put(*ctx);
- *ctx = NULL;
- }
- return 0;
- }
- } else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
- uint8_t trigger;
-
- if (!count)
- return -EINVAL;
-
- trigger = *(uint8_t *)data;
- if (trigger && *ctx)
- eventfd_signal(*ctx, 1);
-
- return 0;
- } else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
- int32_t fd;
-
- if (!count)
- return -EINVAL;
-
- fd = *(int32_t *)data;
- if (fd == -1) {
- if (*ctx)
- eventfd_ctx_put(*ctx);
- *ctx = NULL;
- } else if (fd >= 0) {
- struct eventfd_ctx *efdctx;
-
- efdctx = eventfd_ctx_fdget(fd);
- if (IS_ERR(efdctx))
- return PTR_ERR(efdctx);
-
- if (*ctx)
- eventfd_ctx_put(*ctx);
-
- *ctx = efdctx;
- }
- return 0;
- }
-
- return -EINVAL;
-}
-
static int vfio_pci_set_err_trigger(struct vfio_pci_device *vdev,
unsigned index, unsigned start,
unsigned count, uint32_t flags, void *data)
@@ -623,8 +568,8 @@ static int vfio_pci_set_err_trigger(struct vfio_pci_device *vdev,
if (index != VFIO_PCI_ERR_IRQ_INDEX || start != 0 || count > 1)
return -EINVAL;
- return vfio_pci_set_ctx_trigger_single(&vdev->err_trigger,
- count, flags, data);
+ return vfio_set_ctx_trigger_single(&vdev->err_trigger,
+ count, flags, data);
}
static int vfio_pci_set_req_trigger(struct vfio_pci_device *vdev,
@@ -634,8 +579,8 @@ static int vfio_pci_set_req_trigger(struct vfio_pci_device *vdev,
if (index != VFIO_PCI_REQ_IRQ_INDEX || start != 0 || count > 1)
return -EINVAL;
- return vfio_pci_set_ctx_trigger_single(&vdev->req_trigger,
- count, flags, data);
+ return vfio_set_ctx_trigger_single(&vdev->req_trigger,
+ count, flags, data);
}
int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
diff --git a/drivers/vfio/vfio_common.c b/drivers/vfio/vfio_common.c
new file mode 100644
index 000000000000..b209d57c7312
--- /dev/null
+++ b/drivers/vfio/vfio_common.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2021 Intel Corp. All rights reserved.
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson <[email protected]>
+ * VFIO common helper functions
+ */
+
+#include <linux/eventfd.h>
+#include <linux/vfio.h>
+
+/*
+ * Common helper to set single eventfd trigger
+ *
+ * @ctx [out] : address of eventfd ctx to be written to
+ * @count [in] : number of vectors (should be 1)
+ * @flags [in] : VFIO IRQ flags
+ * @data [in] : data from ioctl
+ */
+int vfio_set_ctx_trigger_single(struct eventfd_ctx **ctx,
+ unsigned int count, u32 flags,
+ void *data)
+{
+ /* DATA_NONE/DATA_BOOL enables loopback testing */
+ if (flags & VFIO_IRQ_SET_DATA_NONE) {
+ if (*ctx) {
+ if (count) {
+ eventfd_signal(*ctx, 1);
+ } else {
+ eventfd_ctx_put(*ctx);
+ *ctx = NULL;
+ }
+ return 0;
+ }
+ } else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
+ u8 trigger;
+
+ if (!count)
+ return -EINVAL;
+
+		trigger = *(u8 *)data;
+ if (trigger && *ctx)
+ eventfd_signal(*ctx, 1);
+
+ return 0;
+ } else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+ s32 fd;
+
+ if (!count)
+ return -EINVAL;
+
+ fd = *(s32 *)data;
+ if (fd == -1) {
+ if (*ctx)
+ eventfd_ctx_put(*ctx);
+ *ctx = NULL;
+ } else if (fd >= 0) {
+ struct eventfd_ctx *efdctx;
+
+ efdctx = eventfd_ctx_fdget(fd);
+ if (IS_ERR(efdctx))
+ return PTR_ERR(efdctx);
+
+ if (*ctx)
+ eventfd_ctx_put(*ctx);
+
+ *ctx = efdctx;
+ }
+ return 0;
+ }
+
+ return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(vfio_set_ctx_trigger_single);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index ed5ca027eb49..aa7cb0e1b8b2 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -235,4 +235,8 @@ extern int vfio_virqfd_enable(void *opaque,
void *data, struct virqfd **pvirqfd, int fd);
extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
+/* common lib functions */
+extern int vfio_set_ctx_trigger_single(struct eventfd_ctx **ctx,
+ unsigned int count, u32 flags,
+ void *data);
#endif /* VFIO_H */
When a dedicated wq is enabled as mdev, we must disable the wq on the
device in order to program the pasid to the wq. Introduce a
software-only wq state IDXD_WQ_LOCKED to prevent the user from modifying
the configuration while the mdev wq is in this state. A locked wq is not
in the DISABLED state, which prevents any modifications to the
configuration, and it is not in the ENABLED state either, which prevents
any actions that require an enabled wq.
For mdev, the dwq is disabled and set to the LOCKED state upon mdev
creation. When ->open() is called on the mdev and a pasid is programmed
into the WQCFG, the dwq is enabled again and moves to the ENABLED state.
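A condensed model of the state handling this patch adds (illustrative
only): LOCKED short-circuits the hardware disable/reset because the wq is
already disabled in hardware, while reading as neither DISABLED nor ENABLED
to the rest of the driver:

  enum toy_wq_state { TOY_WQ_DISABLED, TOY_WQ_ENABLED, TOY_WQ_LOCKED };

  static int toy_wq_disable(enum toy_wq_state *state)
  {
          if (*state == TOY_WQ_LOCKED)    /* hardware is already disabled */
                  return 0;
          if (*state != TOY_WQ_ENABLED)   /* nothing to do */
                  return 0;
          /* ... issue the real DISABLE_WQ command to the device here ... */
          *state = TOY_WQ_DISABLED;
          return 0;
  }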
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/device.c | 16 ++++++++++++++++
drivers/dma/idxd/idxd.h | 1 +
drivers/dma/idxd/sysfs.c | 2 ++
drivers/vfio/mdev/idxd/mdev.c | 5 +++++
drivers/vfio/mdev/idxd/vdev.c | 4 +++-
5 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index cef41a273cc1..c46b6bc055bd 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -243,6 +243,14 @@ int idxd_wq_disable(struct idxd_wq *wq)
dev_dbg(dev, "Disabling WQ %d\n", wq->id);
+ /*
+ * When the wq is in LOCKED state, it means it is disabled but
+ * also at the same time is "enabled" as far as the user is
+ * concerned. So a call to disable the hardware can be skipped.
+ */
+ if (wq->state == IDXD_WQ_LOCKED)
+ return 0;
+
if (wq->state != IDXD_WQ_ENABLED) {
dev_dbg(dev, "WQ %d in wrong state: %d\n", wq->id, wq->state);
return 0;
@@ -285,6 +293,14 @@ void idxd_wq_reset(struct idxd_wq *wq)
struct device *dev = &idxd->pdev->dev;
u32 operand;
+ /*
+ * When the wq is in LOCKED state, it means it is disabled but
+ * also at the same time is "enabled" as far as the user is
+ * concerned. So a call to reset the hardware can be skipped.
+ */
+ if (wq->state == IDXD_WQ_LOCKED)
+ return;
+
if (wq->state != IDXD_WQ_ENABLED) {
dev_dbg(dev, "WQ %d in wrong state: %d\n", wq->id, wq->state);
return;
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 81c78add74dd..5e8da9019c46 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -120,6 +120,7 @@ struct idxd_pmu {
enum idxd_wq_state {
IDXD_WQ_DISABLED = 0,
IDXD_WQ_ENABLED,
+ IDXD_WQ_LOCKED,
};
enum idxd_wq_flag {
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index 3d3a84be2c9b..435ad3c62ad6 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -584,6 +584,8 @@ static ssize_t wq_state_show(struct device *dev,
return sysfs_emit(buf, "disabled\n");
case IDXD_WQ_ENABLED:
return sysfs_emit(buf, "enabled\n");
+ case IDXD_WQ_LOCKED:
+ return sysfs_emit(buf, "locked\n");
}
return sysfs_emit(buf, "unknown\n");
diff --git a/drivers/vfio/mdev/idxd/mdev.c b/drivers/vfio/mdev/idxd/mdev.c
index 7dac024e2852..3257505fb7c7 100644
--- a/drivers/vfio/mdev/idxd/mdev.c
+++ b/drivers/vfio/mdev/idxd/mdev.c
@@ -894,6 +894,11 @@ static void idxd_mdev_drv_remove(struct device *dev)
struct idxd_device *idxd = wq->idxd;
drv_disable_wq(wq);
+ mutex_lock(&wq->wq_lock);
+ if (wq->state == IDXD_WQ_LOCKED)
+ wq->state = IDXD_WQ_DISABLED;
+ mutex_unlock(&wq->wq_lock);
+
dev_info(dev, "wq %s disabled\n", dev_name(dev));
kref_put_mutex(&idxd->mdev_kref, idxd_mdev_host_release, &idxd->kref_lock);
put_device(dev);
diff --git a/drivers/vfio/mdev/idxd/vdev.c b/drivers/vfio/mdev/idxd/vdev.c
index 67da4c122a96..a444a0af8b5f 100644
--- a/drivers/vfio/mdev/idxd/vdev.c
+++ b/drivers/vfio/mdev/idxd/vdev.c
@@ -75,8 +75,10 @@ void vidxd_init(struct vdcm_idxd *vidxd)
vidxd_mmio_init(vidxd);
- if (wq_dedicated(wq) && wq->state == IDXD_WQ_ENABLED)
+ if (wq_dedicated(wq) && wq->state == IDXD_WQ_ENABLED) {
idxd_wq_disable(wq);
+ wq->state = IDXD_WQ_LOCKED;
+ }
}
void vidxd_send_interrupt(struct vdcm_idxd *vidxd, int vector)
Similar to commit 6140a8f56238 ("vfio-pci: Add device request interface"),
add a request interface for mdev that allows userspace to opt in to
receiving a device request notification, indicating that the device
should be released.
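From the userspace side, opting in looks like any other VFIO eventfd
registration: a VFIO_DEVICE_SET_IRQS call on VFIO_PCI_REQ_IRQ_INDEX with a
single eventfd. A sketch of that call against the standard VFIO uAPI
(error handling trimmed; not part of the patch):

  #include <stdint.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/eventfd.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  static int register_req_eventfd(int device_fd)
  {
          char buf[sizeof(struct vfio_irq_set) + sizeof(int32_t)];
          struct vfio_irq_set *set = (struct vfio_irq_set *)buf;
          int efd = eventfd(0, EFD_CLOEXEC);

          if (efd < 0)
                  return -1;

          memset(buf, 0, sizeof(buf));
          set->argsz = sizeof(buf);
          set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
          set->index = VFIO_PCI_REQ_IRQ_INDEX;
          set->start = 0;
          set->count = 1;
          memcpy(set->data, &efd, sizeof(int32_t));

          if (ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set) < 0) {
                  close(efd);
                  return -1;
          }
          return efd;     /* poll/read this fd to see release requests */
  }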
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/vfio/mdev/mdev_irqs.c | 23 +++++++++++++++++++++++
include/linux/mdev.h | 15 +++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/drivers/vfio/mdev/mdev_irqs.c b/drivers/vfio/mdev/mdev_irqs.c
index ed2d11a7c729..11b1f8df020c 100644
--- a/drivers/vfio/mdev/mdev_irqs.c
+++ b/drivers/vfio/mdev/mdev_irqs.c
@@ -316,3 +316,26 @@ void mdev_irqs_free(struct mdev_device *mdev)
memset(&mdev->mdev_irq, 0, sizeof(mdev->mdev_irq));
}
EXPORT_SYMBOL_GPL(mdev_irqs_free);
+
+void vfio_mdev_request(struct vfio_device *vdev, unsigned int count)
+{
+ struct device *dev = vdev->dev;
+ struct mdev_device *mdev = to_mdev_device(dev);
+
+ if (mdev->req_trigger) {
+ dev_dbg(dev, "Requesting device from user\n");
+ eventfd_signal(mdev->req_trigger, 1);
+ }
+}
+EXPORT_SYMBOL_GPL(vfio_mdev_request);
+
+int vfio_mdev_set_req_trigger(struct mdev_device *mdev, unsigned int index,
+ unsigned int start, unsigned int count, u32 flags,
+ void *data)
+{
+ if (index != VFIO_PCI_REQ_IRQ_INDEX || start != 0 || count != 1)
+ return -EINVAL;
+
+ return vfio_set_ctx_trigger_single(&mdev->req_trigger, count, flags, data);
+}
+EXPORT_SYMBOL_GPL(vfio_mdev_set_req_trigger);
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index 035c021e8068..db73d58f5e81 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -11,6 +11,8 @@
#define MDEV_H
#include <linux/irqbypass.h>
+#include <linux/eventfd.h>
+#include <linux/vfio.h>
struct mdev_type;
@@ -38,6 +40,7 @@ struct mdev_device {
struct device *iommu_device;
struct mutex creation_lock;
struct mdev_irq mdev_irq;
+ struct eventfd_ctx *req_trigger;
};
static inline struct mdev_device *irq_to_mdev(struct mdev_irq *mdev_irq)
@@ -131,6 +134,10 @@ void mdev_msix_send_signal(struct mdev_device *mdev, int vector);
int mdev_irqs_init(struct mdev_device *mdev, int num, bool *ims_map);
void mdev_irqs_free(struct mdev_device *mdev);
void mdev_irqs_set_pasid(struct mdev_device *mdev, u32 pasid);
+void vfio_mdev_request(struct vfio_device *vdev, unsigned int count);
+int vfio_mdev_set_req_trigger(struct mdev_device *mdev, unsigned int index,
+ unsigned int start, unsigned int count, u32 flags,
+ void *data);
#else
static inline int mdev_set_msix_trigger(struct mdev_device *mdev, unsigned int index,
unsigned int start, unsigned int count, u32 flags,
@@ -148,6 +155,14 @@ static inline int mdev_irqs_init(struct mdev_device *mdev, int num, bool *ims_ma
void mdev_irqs_free(struct mdev_device *mdev) {}
void mdev_irqs_set_pasid(struct mdev_device *mdev, u32 pasid) {}
+static inline void vfio_mdev_request(struct vfio_device *vdev, unsigned int count) {}
+
+static inline int vfio_mdev_set_req_trigger(struct mdev_device *mdev, unsigned int index,
+ unsigned int start, unsigned int count, u32 flags,
+ void *data)
+{
+ return -EOPNOTSUPP;
+}
#endif /* CONFIG_VFIO_MDEV_IMS */
#endif /* MDEV_H */
Move some VFIO_PCI macros to a common header as they will be shared between
mdev and vfio_pci.
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/vfio/pci/vfio_pci_private.h | 6 ------
include/linux/vfio.h | 6 ++++++
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index a17943911fcb..e644f981509c 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -18,12 +18,6 @@
#ifndef VFIO_PCI_PRIVATE_H
#define VFIO_PCI_PRIVATE_H
-#define VFIO_PCI_OFFSET_SHIFT 40
-
-#define VFIO_PCI_OFFSET_TO_INDEX(off) (off >> VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_OFFSET_MASK (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
-
/* Special capability IDs predefined access */
#define PCI_CAP_ID_INVALID 0xFF /* default raw access */
#define PCI_CAP_ID_INVALID_VIRT 0xFE /* default virt access */
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 3b372fa57ef4..ed5ca027eb49 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -15,6 +15,12 @@
#include <linux/poll.h>
#include <uapi/linux/vfio.h>
+#define VFIO_PCI_OFFSET_SHIFT 40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off) ((off) >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
struct vfio_device {
struct device *dev;
const struct vfio_device_ops *ops;
Create a mediated device through the VFIO mediated device framework. The
mdev framework allows the driver to create a mediated device backed by a
portion of the device's resources. The driver will emulate the slow path,
such as the PCI config space, MMIO bar, and the command registers. The
descriptor submission portal(s) will be mmapped to the guest so that the
guest kernel or applications can submit descriptors directly. The mediated
device support code in the idxd driver will be referred to as the Virtual
Device Composition Module (vdcm). Add basic plumbing to fill out the
mdev_parent_ops struct that VFIO mdev requires to support a mediated
device.
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/idxd.h | 1
drivers/vfio/mdev/idxd/mdev.c | 638 +++++++++++++++++++++++++++++++++++++++++
drivers/vfio/mdev/idxd/mdev.h | 25 ++
3 files changed, 664 insertions(+)
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 4d2532175705..0d9e2710fc76 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -198,6 +198,7 @@ struct idxd_wq {
u64 max_xfer_bytes;
u32 max_batch_size;
bool ats_dis;
+ struct vdcm_idxd *vidxd;
};
struct idxd_engine {
diff --git a/drivers/vfio/mdev/idxd/mdev.c b/drivers/vfio/mdev/idxd/mdev.c
index 25cd62b803f8..e484095baeea 100644
--- a/drivers/vfio/mdev/idxd/mdev.c
+++ b/drivers/vfio/mdev/idxd/mdev.c
@@ -41,12 +41,650 @@ int idxd_mdev_get_pasid(struct mdev_device *mdev, struct vfio_device *vdev, u32
return 0;
}
+static int idxd_vdcm_set_irqs(struct vdcm_idxd *vidxd, uint32_t flags,
+ unsigned int index, unsigned int start,
+ unsigned int count, void *data);
+
+static int idxd_vdcm_get_irq_count(struct vfio_device *vdev, int type)
+{
+ if (type == VFIO_PCI_MSIX_IRQ_INDEX)
+ return VIDXD_MAX_MSIX_VECS;
+
+ return 0;
+}
+
+static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev_device *mdev,
+ struct vdcm_idxd_type *type)
+{
+ struct vdcm_idxd *vidxd;
+ struct idxd_wq *wq = NULL;
+
+ if (!wq)
+ return ERR_PTR(-ENODEV);
+
+ vidxd = kzalloc(sizeof(*vidxd), GFP_KERNEL);
+ if (!vidxd)
+ return ERR_PTR(-ENOMEM);
+
+ mutex_init(&vidxd->dev_lock);
+ vidxd->idxd = idxd;
+ vidxd->mdev = mdev;
+ vidxd->type = type;
+ vidxd->num_wqs = VIDXD_MAX_WQS;
+
+ mutex_lock(&wq->wq_lock);
+ idxd_wq_get(wq);
+ wq->vidxd = vidxd;
+ vidxd->wq = wq;
+ mutex_unlock(&wq->wq_lock);
+ vidxd_init(vidxd);
+
+ return vidxd;
+}
+
+static struct vdcm_idxd_type idxd_mdev_types[IDXD_MDEV_TYPES];
+
+static struct vdcm_idxd_type *idxd_vdcm_get_type(struct mdev_device *mdev)
+{
+ return &idxd_mdev_types[mdev_get_type_group_id(mdev)];
+}
+
+static const struct vfio_device_ops idxd_mdev_ops;
+
+static int idxd_vdcm_probe(struct mdev_device *mdev)
+{
+ struct vdcm_idxd *vidxd;
+ struct vdcm_idxd_type *type;
+ struct device *dev, *parent;
+ struct idxd_device *idxd;
+ bool ims_map[VIDXD_MAX_MSIX_VECS];
+ int rc;
+
+ parent = mdev_parent_dev(mdev);
+ idxd = dev_get_drvdata(parent);
+ dev = &mdev->dev;
+ mdev_set_iommu_device(mdev, parent);
+ type = idxd_vdcm_get_type(mdev);
+
+ vidxd = vdcm_vidxd_create(idxd, mdev, type);
+ if (IS_ERR(vidxd)) {
+ dev_err(dev, "failed to create vidxd: %ld\n", PTR_ERR(vidxd));
+ return PTR_ERR(vidxd);
+ }
+
+ vfio_init_group_dev(&vidxd->vdev, &mdev->dev, &idxd_mdev_ops);
+
+ ims_map[0] = 0;
+ ims_map[1] = 1;
+ rc = mdev_irqs_init(mdev, VIDXD_MAX_MSIX_VECS, ims_map);
+ if (rc < 0)
+ goto err;
+
+ rc = vfio_register_group_dev(&vidxd->vdev);
+ if (rc < 0)
+ goto err_group_register;
+ dev_set_drvdata(dev, vidxd);
+
+ return 0;
+
+err_group_register:
+ mdev_irqs_free(mdev);
+err:
+ kfree(vidxd);
+ return rc;
+}
+
+static void idxd_vdcm_remove(struct mdev_device *mdev)
+{
+ struct vdcm_idxd *vidxd = dev_get_drvdata(&mdev->dev);
+ struct idxd_wq *wq = vidxd->wq;
+
+ vfio_unregister_group_dev(&vidxd->vdev);
+ mdev_irqs_free(mdev);
+ mutex_lock(&wq->wq_lock);
+ idxd_wq_put(wq);
+ mutex_unlock(&wq->wq_lock);
+
+ kfree(vidxd);
+}
+
+static int idxd_vdcm_open(struct vfio_device *vdev)
+{
+ return 0;
+}
+
+static void idxd_vdcm_close(struct vfio_device *vdev)
+{
+ struct vdcm_idxd *vidxd = vdev_to_vidxd(vdev);
+
+ mutex_lock(&vidxd->dev_lock);
+ idxd_vdcm_set_irqs(vidxd, VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
+ VFIO_PCI_MSIX_IRQ_INDEX, 0, 0, NULL);
+
+ /* Re-initialize the VIDXD to a pristine state for re-use */
+ vidxd_init(vidxd);
+ mutex_unlock(&vidxd->dev_lock);
+}
+
+static ssize_t idxd_vdcm_rw(struct vfio_device *vdev, char *buf, size_t count, loff_t *ppos,
+ enum idxd_vdcm_rw mode)
+{
+ struct vdcm_idxd *vidxd = vdev_to_vidxd(vdev);
+ unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ struct device *dev = vdev->dev;
+ int rc = -EINVAL;
+
+ if (index >= VFIO_PCI_NUM_REGIONS) {
+ dev_err(dev, "invalid index: %u\n", index);
+ return -EINVAL;
+ }
+
+ switch (index) {
+ case VFIO_PCI_CONFIG_REGION_INDEX:
+ if (mode == IDXD_VDCM_WRITE)
+ rc = vidxd_cfg_write(vidxd, pos, buf, count);
+ else
+ rc = vidxd_cfg_read(vidxd, pos, buf, count);
+ break;
+ case VFIO_PCI_BAR0_REGION_INDEX:
+ case VFIO_PCI_BAR1_REGION_INDEX:
+ if (mode == IDXD_VDCM_WRITE)
+ rc = vidxd_mmio_write(vidxd, vidxd->bar_val[0] + pos, buf, count);
+ else
+ rc = vidxd_mmio_read(vidxd, vidxd->bar_val[0] + pos, buf, count);
+ break;
+ case VFIO_PCI_BAR2_REGION_INDEX:
+ case VFIO_PCI_BAR3_REGION_INDEX:
+ case VFIO_PCI_BAR4_REGION_INDEX:
+ case VFIO_PCI_BAR5_REGION_INDEX:
+ case VFIO_PCI_VGA_REGION_INDEX:
+ case VFIO_PCI_ROM_REGION_INDEX:
+ default:
+ dev_err(dev, "unsupported region: %u\n", index);
+ }
+
+ return rc == 0 ? count : rc;
+}
+
+static ssize_t idxd_vdcm_read(struct vfio_device *vdev, char __user *buf, size_t count,
+ loff_t *ppos)
+{
+ struct vdcm_idxd *vidxd = vdev_to_vidxd(vdev);
+ unsigned int done = 0;
+ int rc;
+
+ mutex_lock(&vidxd->dev_lock);
+ while (count) {
+ size_t filled;
+
+ if (count >= 4 && !(*ppos % 4)) {
+ u32 val;
+
+ rc = idxd_vdcm_rw(vdev, (char *)&val, sizeof(val),
+ ppos, IDXD_VDCM_READ);
+ if (rc <= 0)
+ goto read_err;
+
+ if (copy_to_user(buf, &val, sizeof(val)))
+ goto read_err;
+
+ filled = 4;
+ } else if (count >= 2 && !(*ppos % 2)) {
+ u16 val;
+
+ rc = idxd_vdcm_rw(vdev, (char *)&val, sizeof(val),
+ ppos, IDXD_VDCM_READ);
+ if (rc <= 0)
+ goto read_err;
+
+ if (copy_to_user(buf, &val, sizeof(val)))
+ goto read_err;
+
+ filled = 2;
+ } else {
+ u8 val;
+
+ rc = idxd_vdcm_rw(vdev, &val, sizeof(val), ppos,
+ IDXD_VDCM_READ);
+ if (rc <= 0)
+ goto read_err;
+
+ if (copy_to_user(buf, &val, sizeof(val)))
+ goto read_err;
+
+ filled = 1;
+ }
+
+ count -= filled;
+ done += filled;
+ *ppos += filled;
+ buf += filled;
+ }
+
+ mutex_unlock(&vidxd->dev_lock);
+ return done;
+
+ read_err:
+ mutex_unlock(&vidxd->dev_lock);
+ return -EFAULT;
+}
+
+static ssize_t idxd_vdcm_write(struct vfio_device *vdev, const char __user *buf, size_t count,
+ loff_t *ppos)
+{
+ struct vdcm_idxd *vidxd = vdev_to_vidxd(vdev);
+ unsigned int done = 0;
+ int rc;
+
+ mutex_lock(&vidxd->dev_lock);
+ while (count) {
+ size_t filled;
+
+ if (count >= 4 && !(*ppos % 4)) {
+ u32 val;
+
+ if (copy_from_user(&val, buf, sizeof(val)))
+ goto write_err;
+
+ rc = idxd_vdcm_rw(vdev, (char *)&val, sizeof(val),
+ ppos, IDXD_VDCM_WRITE);
+ if (rc <= 0)
+ goto write_err;
+
+ filled = 4;
+ } else if (count >= 2 && !(*ppos % 2)) {
+ u16 val;
+
+ if (copy_from_user(&val, buf, sizeof(val)))
+ goto write_err;
+
+ rc = idxd_vdcm_rw(vdev, (char *)&val,
+ sizeof(val), ppos, IDXD_VDCM_WRITE);
+ if (rc <= 0)
+ goto write_err;
+
+ filled = 2;
+ } else {
+ u8 val;
+
+ if (copy_from_user(&val, buf, sizeof(val)))
+ goto write_err;
+
+ rc = idxd_vdcm_rw(vdev, &val, sizeof(val),
+ ppos, IDXD_VDCM_WRITE);
+ if (rc <= 0)
+ goto write_err;
+
+ filled = 1;
+ }
+
+ count -= filled;
+ done += filled;
+ *ppos += filled;
+ buf += filled;
+ }
+
+ mutex_unlock(&vidxd->dev_lock);
+ return done;
+
+write_err:
+ mutex_unlock(&vidxd->dev_lock);
+ return -EFAULT;
+}
+
+static int idxd_vdcm_mmap(struct vfio_device *vdev, struct vm_area_struct *vma)
+{
+ unsigned int wq_idx;
+ unsigned long req_size, pgoff = 0, offset;
+ pgprot_t pg_prot;
+ struct vdcm_idxd *vidxd = vdev_to_vidxd(vdev);
+ struct idxd_wq *wq = vidxd->wq;
+ struct idxd_device *idxd = vidxd->idxd;
+ enum idxd_portal_prot virt_portal, phys_portal;
+ phys_addr_t base = pci_resource_start(idxd->pdev, IDXD_WQ_BAR);
+ struct device *dev = vdev->dev;
+
+ if (!(vma->vm_flags & VM_SHARED))
+ return -EINVAL;
+
+ pg_prot = vma->vm_page_prot;
+ req_size = vma->vm_end - vma->vm_start;
+ if (req_size > PAGE_SIZE)
+ return -EINVAL;
+
+ vma->vm_flags |= VM_DONTCOPY;
+
+ offset = (vma->vm_pgoff << PAGE_SHIFT) &
+ ((1ULL << VFIO_PCI_OFFSET_SHIFT) - 1);
+
+ wq_idx = offset >> (PAGE_SHIFT + 2);
+ if (wq_idx >= 1) {
+ dev_err(dev, "mapping invalid wq %d off %lx\n",
+ wq_idx, offset);
+ return -EINVAL;
+ }
+
+ /*
+ * Check and see if the guest wants to map to the limited or unlimited portal.
+ * The driver will allow mapping to unlimited portal only if the wq is a
+ * dedicated wq. Otherwise, it goes to limited.
+ */
+ virt_portal = ((offset >> PAGE_SHIFT) & 0x3) == 1;
+ phys_portal = IDXD_PORTAL_LIMITED;
+ if (virt_portal == IDXD_PORTAL_UNLIMITED && wq_dedicated(wq))
+ phys_portal = IDXD_PORTAL_UNLIMITED;
+
+ /* We always map IMS portals to the guest */
+ pgoff = (base + idxd_get_wq_portal_offset(wq->id, phys_portal,
+ IDXD_IRQ_IMS)) >> PAGE_SHIFT;
+
+ dev_dbg(dev, "mmap %lx %lx %lx %lx\n", vma->vm_start, pgoff, req_size,
+ pgprot_val(pg_prot));
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vma->vm_pgoff = pgoff;
+
+ return remap_pfn_range(vma, vma->vm_start, pgoff, req_size, pg_prot);
+}
+
+static void vidxd_vdcm_reset(struct vdcm_idxd *vidxd)
+{
+ vidxd_reset(vidxd);
+}
+
+static int idxd_vdcm_set_irqs(struct vdcm_idxd *vidxd, uint32_t flags,
+ unsigned int index, unsigned int start,
+ unsigned int count, void *data)
+{
+ struct mdev_device *mdev = vidxd->mdev;
+
+ switch (index) {
+ case VFIO_PCI_INTX_IRQ_INDEX:
+ case VFIO_PCI_MSI_IRQ_INDEX:
+ break;
+ case VFIO_PCI_MSIX_IRQ_INDEX:
+ switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+ case VFIO_IRQ_SET_ACTION_MASK:
+ case VFIO_IRQ_SET_ACTION_UNMASK:
+ break;
+ case VFIO_IRQ_SET_ACTION_TRIGGER:
+ return mdev_set_msix_trigger(mdev, index, start, count, flags, data);
+ }
+ break;
+ }
+
+ return -ENOTTY;
+}
+
+static long idxd_vdcm_ioctl(struct vfio_device *vdev, unsigned int cmd, unsigned long arg)
+{
+ struct vdcm_idxd *vidxd = vdev_to_vidxd(vdev);
+ unsigned long minsz;
+ int rc = -EINVAL;
+ struct device *dev = vdev->dev;
+
+ dev_dbg(dev, "vidxd %p ioctl, cmd: %d\n", vidxd, cmd);
+
+ mutex_lock(&vidxd->dev_lock);
+ if (cmd == VFIO_DEVICE_GET_INFO) {
+ struct vfio_device_info info;
+
+ minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ if (info.argsz < minsz) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ info.flags = VFIO_DEVICE_FLAGS_PCI;
+ info.flags |= VFIO_DEVICE_FLAGS_RESET;
+ info.num_regions = VFIO_PCI_NUM_REGIONS;
+ info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+ if (copy_to_user((void __user *)arg, &info, minsz))
+ rc = -EFAULT;
+ else
+ rc = 0;
+ goto out;
+ } else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+ struct vfio_region_info info;
+ struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+ struct vfio_region_info_cap_sparse_mmap *sparse = NULL;
+ size_t size;
+ int nr_areas = 1;
+ int cap_type_id = 0;
+
+ minsz = offsetofend(struct vfio_region_info, offset);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ if (info.argsz < minsz) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ switch (info.index) {
+ case VFIO_PCI_CONFIG_REGION_INDEX:
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.size = VIDXD_MAX_CFG_SPACE_SZ;
+ info.flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE;
+ break;
+ case VFIO_PCI_BAR0_REGION_INDEX:
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.size = vidxd->bar_size[info.index];
+ if (!info.size) {
+ info.flags = 0;
+ break;
+ }
+
+ info.flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE;
+ break;
+ case VFIO_PCI_BAR1_REGION_INDEX:
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.size = 0;
+ info.flags = 0;
+ break;
+ case VFIO_PCI_BAR2_REGION_INDEX:
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.flags = VFIO_REGION_INFO_FLAG_CAPS | VFIO_REGION_INFO_FLAG_MMAP |
+ VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE;
+ info.size = vidxd->bar_size[1];
+
+ /*
+ * Every WQ has two areas for unlimited and limited
+ * MSI-X portals. IMS portals are not reported.
+ */
+ nr_areas = 2;
+
+ size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
+ sparse = kzalloc(size, GFP_KERNEL);
+ if (!sparse) {
+ rc = -ENOMEM;
+ goto out;
+ }
+
+ sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+ sparse->header.version = 1;
+ sparse->nr_areas = nr_areas;
+ cap_type_id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+
+ /* Unlimited portal */
+ sparse->areas[0].offset = 0;
+ sparse->areas[0].size = PAGE_SIZE;
+
+ /* Limited portal */
+ sparse->areas[1].offset = PAGE_SIZE;
+ sparse->areas[1].size = PAGE_SIZE;
+ break;
+
+ case VFIO_PCI_BAR3_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.size = 0;
+ info.flags = 0;
+ dev_dbg(dev, "get region info bar:%d\n", info.index);
+ break;
+
+ case VFIO_PCI_ROM_REGION_INDEX:
+ case VFIO_PCI_VGA_REGION_INDEX:
+ dev_dbg(dev, "get region info index:%d\n", info.index);
+ break;
+ default: {
+ if (info.index >= VFIO_PCI_NUM_REGIONS)
+ rc = -EINVAL;
+ else
+ rc = 0;
+ goto out;
+ } /* default */
+ } /* info.index switch */
+
+ if ((info.flags & VFIO_REGION_INFO_FLAG_CAPS) && sparse) {
+ if (cap_type_id == VFIO_REGION_INFO_CAP_SPARSE_MMAP) {
+ rc = vfio_info_add_capability(&caps, &sparse->header,
+ sizeof(*sparse) + (sparse->nr_areas *
+ sizeof(*sparse->areas)));
+ kfree(sparse);
+ if (rc)
+ goto out;
+ }
+ }
+
+ if (caps.size) {
+ if (info.argsz < sizeof(info) + caps.size) {
+ info.argsz = sizeof(info) + caps.size;
+ info.cap_offset = 0;
+ } else {
+ vfio_info_cap_shift(&caps, sizeof(info));
+ if (copy_to_user((void __user *)arg + sizeof(info),
+ caps.buf, caps.size)) {
+ kfree(caps.buf);
+ rc = -EFAULT;
+ goto out;
+ }
+ info.cap_offset = sizeof(info);
+ }
+
+ kfree(caps.buf);
+ }
+ if (copy_to_user((void __user *)arg, &info, minsz))
+ rc = -EFAULT;
+ else
+ rc = 0;
+ goto out;
+ } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
+ struct vfio_irq_info info;
+ u32 pasid;
+
+ rc = idxd_mdev_get_pasid(vidxd->mdev, vdev, &pasid);
+ if (rc < 0)
+ goto out;
+ mdev_irqs_set_pasid(vidxd->mdev, pasid);
+
+ minsz = offsetofend(struct vfio_irq_info, count);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ info.flags = VFIO_IRQ_INFO_EVENTFD;
+
+ switch (info.index) {
+ case VFIO_PCI_MSIX_IRQ_INDEX:
+ info.flags |= VFIO_IRQ_INFO_NORESIZE;
+ break;
+ default:
+ rc = -EINVAL;
+ goto out;
+ } /* switch(info.index) */
+
+ info.flags = VFIO_IRQ_INFO_EVENTFD | VFIO_IRQ_INFO_NORESIZE;
+ info.count = idxd_vdcm_get_irq_count(vdev, info.index);
+
+ if (copy_to_user((void __user *)arg, &info, minsz))
+ rc = -EFAULT;
+ else
+ rc = 0;
+ goto out;
+ } else if (cmd == VFIO_DEVICE_SET_IRQS) {
+ struct vfio_irq_set hdr;
+ u8 *data = NULL;
+ size_t data_size = 0;
+
+ minsz = offsetofend(struct vfio_irq_set, count);
+
+ if (copy_from_user(&hdr, (void __user *)arg, minsz)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+ int max = idxd_vdcm_get_irq_count(vdev, hdr.index);
+
+ rc = vfio_set_irqs_validate_and_prepare(&hdr, max, VFIO_PCI_NUM_IRQS,
+ &data_size);
+ if (rc) {
+ dev_err(dev, "intel:vfio_set_irqs_validate_and_prepare failed\n");
+ rc = -EINVAL;
+ goto out;
+ }
+
+ if (data_size) {
+ data = memdup_user((void __user *)(arg + minsz), data_size);
+ if (IS_ERR(data)) {
+ rc = PTR_ERR(data);
+ goto out;
+ }
+ }
+ }
+
+ if (!data) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ rc = idxd_vdcm_set_irqs(vidxd, hdr.flags, hdr.index, hdr.start, hdr.count, data);
+ kfree(data);
+ goto out;
+ } else if (cmd == VFIO_DEVICE_RESET) {
+ vidxd_vdcm_reset(vidxd);
+ }
+
+ out:
+ mutex_unlock(&vidxd->dev_lock);
+ return rc;
+}
+
+static const struct vfio_device_ops idxd_mdev_ops = {
+ .name = "vfio-mdev",
+ .open = idxd_vdcm_open,
+ .release = idxd_vdcm_close,
+ .read = idxd_vdcm_read,
+ .write = idxd_vdcm_write,
+ .mmap = idxd_vdcm_mmap,
+ .ioctl = idxd_vdcm_ioctl,
+};
+
static struct mdev_driver idxd_vdcm_driver = {
.driver = {
.name = "idxd-mdev",
.owner = THIS_MODULE,
.mod_name = KBUILD_MODNAME,
},
+ .probe = idxd_vdcm_probe,
+ .remove = idxd_vdcm_remove,
};
static int idxd_mdev_drv_probe(struct device *dev)
diff --git a/drivers/vfio/mdev/idxd/mdev.h b/drivers/vfio/mdev/idxd/mdev.h
index f696fe38e374..dd4290bce772 100644
--- a/drivers/vfio/mdev/idxd/mdev.h
+++ b/drivers/vfio/mdev/idxd/mdev.h
@@ -30,11 +30,26 @@
#define VIDXD_MAX_MSIX_ENTRIES VIDXD_MAX_MSIX_VECS
#define VIDXD_MAX_WQS 1
+#define IDXD_MDEV_NAME_LEN 64
+#define IDXD_MDEV_TYPES 2
+
+enum idxd_mdev_type {
+ IDXD_MDEV_TYPE_DSA_1_DWQ = 0,
+ IDXD_MDEV_TYPE_IAX_1_DWQ,
+};
+
+struct vdcm_idxd_type {
+ const char *name;
+ enum idxd_mdev_type type;
+ unsigned int avail_instance;
+};
+
struct vdcm_idxd {
struct vfio_device vdev;
struct idxd_device *idxd;
struct idxd_wq *wq;
struct mdev_device *mdev;
+ struct vdcm_idxd_type *type;
int num_wqs;
u64 bar_val[VIDXD_MAX_BARS];
@@ -44,6 +59,16 @@ struct vdcm_idxd {
struct mutex dev_lock; /* lock for vidxd resources */
};
+enum idxd_vdcm_rw {
+ IDXD_VDCM_READ = 0,
+ IDXD_VDCM_WRITE,
+};
+
+static inline struct vdcm_idxd *vdev_to_vidxd(struct vfio_device *vdev)
+{
+ return container_of(vdev, struct vdcm_idxd, vdev);
+}
+
static inline u64 get_reg_val(void *buf, int size)
{
u64 val = 0;
Add request interrupt support to the idxd-mdev driver so that userspace
can be asked to release the device.
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/vfio/mdev/idxd/mdev.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/vfio/mdev/idxd/mdev.c b/drivers/vfio/mdev/idxd/mdev.c
index 25d1ac67b0c9..6bf2ec43656c 100644
--- a/drivers/vfio/mdev/idxd/mdev.c
+++ b/drivers/vfio/mdev/idxd/mdev.c
@@ -52,6 +52,8 @@ static int idxd_vdcm_get_irq_count(struct vfio_device *vdev, int type)
{
if (type == VFIO_PCI_MSIX_IRQ_INDEX)
return VIDXD_MAX_MSIX_VECS;
+ else if (type == VFIO_PCI_REQ_IRQ_INDEX)
+ return 1;
return 0;
}
@@ -486,6 +488,12 @@ static int idxd_vdcm_set_irqs(struct vdcm_idxd *vidxd, uint32_t flags,
return mdev_set_msix_trigger(mdev, index, start, count, flags, data);
}
break;
+ case VFIO_PCI_REQ_IRQ_INDEX:
+ switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+ case VFIO_IRQ_SET_ACTION_TRIGGER:
+ return vfio_mdev_set_req_trigger(mdev, index, start, count, flags, data);
+ }
+ break;
}
return -ENOTTY;
@@ -678,6 +686,7 @@ static long idxd_vdcm_ioctl(struct vfio_device *vdev, unsigned int cmd, unsigned
switch (info.index) {
case VFIO_PCI_MSIX_IRQ_INDEX:
+ case VFIO_PCI_REQ_IRQ_INDEX:
info.flags |= VFIO_IRQ_INFO_NORESIZE;
break;
default:
@@ -750,6 +759,7 @@ static const struct vfio_device_ops idxd_mdev_ops = {
.write = idxd_vdcm_write,
.mmap = idxd_vdcm_mmap,
.ioctl = idxd_vdcm_ioctl,
+ .request = vfio_mdev_request,
};
static ssize_t name_show(struct mdev_type *mtype, struct mdev_type_attribute *attr, char *buf)
Add the basic registration and initialization for the mdev idxd driver.
To register with the mdev framework, the driver must register the
pci_dev. Add the registration as part of the idxd mdev driver probe.
The host init is set up to be called on the first wq device probe, and
the unregistration happens when the last wq device releases the driver.
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/Kconfig | 1
drivers/dma/idxd/device.c | 2 +
drivers/dma/idxd/idxd.h | 10 ++++
drivers/dma/idxd/init.c | 39 +++++++++++++++++
drivers/vfio/mdev/idxd/mdev.c | 95 +++++++++++++++++++++++++++++++++++++++++
drivers/vfio/mdev/idxd/vdev.c | 3 -
6 files changed, 147 insertions(+), 3 deletions(-)
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 84f996dd339d..390227027878 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -281,6 +281,7 @@ config INTEL_IDXD
depends on PCI && X86_64
depends on PCI_MSI
depends on SBITMAP
+ depends on IMS_MSI_ARRAY
select DMA_ENGINE
help
Enable support for the Intel(R) data accelerators present
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 2ea6015e0d53..cef41a273cc1 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -1257,6 +1257,7 @@ int drv_enable_wq(struct idxd_wq *wq)
mutex_unlock(&wq->wq_lock);
return rc;
}
+EXPORT_SYMBOL_GPL(drv_enable_wq);
void __drv_disable_wq(struct idxd_wq *wq)
{
@@ -1283,6 +1284,7 @@ void drv_disable_wq(struct idxd_wq *wq)
__drv_disable_wq(wq);
mutex_unlock(&wq->wq_lock);
}
+EXPORT_SYMBOL_GPL(drv_disable_wq);
int idxd_device_drv_probe(struct device *dev)
{
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index cbb046c2921f..4d2532175705 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -11,6 +11,7 @@
#include <linux/idr.h>
#include <linux/pci.h>
#include <linux/perf_event.h>
+#include <linux/mdev.h>
#include "registers.h"
#define IDXD_DRIVER_VERSION "1.00"
@@ -60,6 +61,7 @@ static const char idxd_dsa_drv_name[] = "dsa";
static const char idxd_dev_drv_name[] = "idxd";
static const char idxd_dmaengine_drv_name[] = "dmaengine";
static const char idxd_user_drv_name[] = "user";
+static const char idxd_mdev_drv_name[] = "mdev";
struct idxd_irq_entry {
struct idxd_device *idxd;
@@ -297,6 +299,10 @@ struct idxd_device {
int *int_handles;
struct idxd_pmu *idxd_pmu;
+
+ struct kref mdev_kref;
+ struct mutex kref_lock; /* lock for the mdev_kref */
+ bool mdev_host_init;
};
/* IDXD software descriptor */
@@ -587,6 +593,10 @@ int idxd_cdev_get_major(struct idxd_device *idxd);
int idxd_wq_add_cdev(struct idxd_wq *wq);
void idxd_wq_del_cdev(struct idxd_wq *wq);
+/* mdev host */
+int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv);
+void idxd_mdev_host_release(struct kref *kref);
+
/* perfmon */
#if IS_ENABLED(CONFIG_INTEL_IDXD_PERFMON)
int perfmon_pmu_init(struct idxd_device *idxd);
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 16ff37be2d26..809ca1827772 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -63,6 +63,41 @@ static struct pci_device_id idxd_pci_tbl[] = {
};
MODULE_DEVICE_TABLE(pci, idxd_pci_tbl);
+int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
+{
+ struct device *dev = &idxd->pdev->dev;
+ int rc;
+
+ if (!idxd->ims_size)
+ return -EOPNOTSUPP;
+
+ rc = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_AUX);
+ if (rc < 0) {
+ dev_warn(dev, "Failed to enable aux-domain: %d\n", rc);
+ return rc;
+ }
+
+ rc = mdev_register_device(dev, drv);
+ if (rc < 0) {
+ iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
+ return rc;
+ }
+
+ idxd->mdev_host_init = true;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(idxd_mdev_host_init);
+
+void idxd_mdev_host_release(struct kref *kref)
+{
+ struct idxd_device *idxd = container_of(kref, struct idxd_device, mdev_kref);
+ struct device *dev = &idxd->pdev->dev;
+
+ mdev_unregister_device(dev);
+ iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
+}
+EXPORT_SYMBOL_GPL(idxd_mdev_host_release);
+
static int idxd_setup_interrupts(struct idxd_device *idxd)
{
struct pci_dev *pdev = idxd->pdev;
@@ -352,6 +387,9 @@ static int idxd_setup_internals(struct idxd_device *idxd)
goto err_wkq_create;
}
+ kref_init(&idxd->mdev_kref);
+ mutex_init(&idxd->kref_lock);
+
return 0;
err_wkq_create:
@@ -741,6 +779,7 @@ static void idxd_remove(struct pci_dev *pdev)
dev_dbg(&pdev->dev, "%s called\n", __func__);
idxd_shutdown(pdev);
+ kref_put_mutex(&idxd->mdev_kref, idxd_mdev_host_release, &idxd->kref_lock);
if (device_pasid_enabled(idxd))
idxd_disable_system_pasid(idxd);
idxd_unregister_devices(idxd);
diff --git a/drivers/vfio/mdev/idxd/mdev.c b/drivers/vfio/mdev/idxd/mdev.c
index 90ff7cedb8b4..25cd62b803f8 100644
--- a/drivers/vfio/mdev/idxd/mdev.c
+++ b/drivers/vfio/mdev/idxd/mdev.c
@@ -16,6 +16,7 @@
#include <linux/intel-svm.h>
#include <linux/kvm_host.h>
#include <linux/eventfd.h>
+#include <linux/irqchip/irq-ims-msi.h>
#include <uapi/linux/idxd.h>
#include "registers.h"
#include "idxd.h"
@@ -40,5 +41,99 @@ int idxd_mdev_get_pasid(struct mdev_device *mdev, struct vfio_device *vdev, u32
return 0;
}
+static struct mdev_driver idxd_vdcm_driver = {
+ .driver = {
+ .name = "idxd-mdev",
+ .owner = THIS_MODULE,
+ .mod_name = KBUILD_MODNAME,
+ },
+};
+
+static int idxd_mdev_drv_probe(struct device *dev)
+{
+ struct idxd_wq *wq = confdev_to_wq(dev);
+ struct idxd_device *idxd = wq->idxd;
+ int rc;
+
+ if (!is_idxd_wq_mdev(wq))
+ return -ENODEV;
+
+ rc = drv_enable_wq(wq);
+ if (rc < 0)
+ return rc;
+
+ /*
+ * The kref count starts at 1 on initialization. When the first device is
+ * probed, the driver sets up the mdev and does the host initialization.
+ * Follow-on probes just take a kref. On the remove side, once the kref
+ * hits 0, the driver does the host cleanup and unregisters from the
+ * mdev framework.
+ */
+ mutex_lock(&idxd->kref_lock);
+ if (!idxd->mdev_host_init) {
+ rc = idxd_mdev_host_init(idxd, &idxd_vdcm_driver);
+ if (rc < 0) {
+ mutex_unlock(&idxd->kref_lock);
+ drv_disable_wq(wq);
+ dev_warn(dev, "mdev device init failed!\n");
+ return -ENXIO;
+ }
+ idxd->mdev_host_init = true;
+ } else {
+ kref_get(&idxd->mdev_kref);
+ }
+ mutex_unlock(&idxd->kref_lock);
+
+ get_device(dev);
+ dev_info(dev, "wq %s enabled\n", dev_name(dev));
+ return 0;
+}
+
+static void idxd_mdev_drv_remove(struct device *dev)
+{
+ struct idxd_wq *wq = confdev_to_wq(dev);
+ struct idxd_device *idxd = wq->idxd;
+
+ drv_disable_wq(wq);
+ dev_info(dev, "wq %s disabled\n", dev_name(dev));
+ kref_put_mutex(&idxd->mdev_kref, idxd_mdev_host_release, &idxd->kref_lock);
+ put_device(dev);
+}
+
+static struct idxd_device_driver idxd_mdev_driver = {
+ .probe = idxd_mdev_drv_probe,
+ .remove = idxd_mdev_drv_remove,
+ .name = idxd_mdev_drv_name,
+};
+
+static int __init idxd_mdev_init(void)
+{
+ int rc;
+
+ rc = idxd_driver_register(&idxd_mdev_driver);
+ if (rc < 0)
+ return rc;
+
+ rc = mdev_register_driver(&idxd_vdcm_driver);
+ if (rc < 0) {
+ idxd_driver_unregister(&idxd_mdev_driver);
+ return rc;
+ }
+
+ return 0;
+}
+
+static void __exit idxd_mdev_exit(void)
+{
+ mdev_unregister_driver(&idxd_vdcm_driver);
+ idxd_driver_unregister(&idxd_mdev_driver);
+}
+
+module_init(idxd_mdev_init);
+module_exit(idxd_mdev_exit);
+
MODULE_IMPORT_NS(IDXD);
+MODULE_SOFTDEP("pre: idxd");
MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
+MODULE_ALIAS_IDXD_DEVICE(0);
diff --git a/drivers/vfio/mdev/idxd/vdev.c b/drivers/vfio/mdev/idxd/vdev.c
index d2416765ce7e..67da4c122a96 100644
--- a/drivers/vfio/mdev/idxd/vdev.c
+++ b/drivers/vfio/mdev/idxd/vdev.c
@@ -989,6 +989,3 @@ static void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val)
break;
}
}
-
-MODULE_IMPORT_NS(IDXD);
-MODULE_LICENSE("GPL v2");
Add setup code for the IMS domain. This feeds the MSI subsystem the
relevant information for device IMS. The allocation of the IMS vectors is
done in common VFIO code if the correct domain is set for the
mdev device.
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/idxd.h | 1 +
drivers/dma/idxd/init.c | 14 ++++++++++++++
drivers/vfio/mdev/idxd/mdev.c | 2 ++
3 files changed, 17 insertions(+)
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 0d9e2710fc76..81c78add74dd 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -297,6 +297,7 @@ struct idxd_device {
struct workqueue_struct *wq;
struct work_struct work;
+ struct irq_domain *ims_domain;
int *int_handles;
struct idxd_pmu *idxd_pmu;
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 809ca1827772..ead46761b23e 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -16,6 +16,8 @@
#include <linux/idr.h>
#include <linux/intel-svm.h>
#include <linux/iommu.h>
+#include <linux/irqdomain.h>
+#include <linux/irqchip/irq-ims-msi.h>
#include <uapi/linux/idxd.h>
#include <linux/dmaengine.h>
#include "../dmaengine.h"
@@ -66,6 +68,7 @@ MODULE_DEVICE_TABLE(pci, idxd_pci_tbl);
int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
{
struct device *dev = &idxd->pdev->dev;
+ struct ims_array_info ims_info;
int rc;
if (!idxd->ims_size)
@@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
return rc;
}
+ ims_info.max_slots = idxd->ims_size;
+ ims_info.slots = idxd->reg_base + idxd->ims_offset;
+ idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
+ if (!idxd->ims_domain) {
+ dev_warn(dev, "Fail to acquire IMS domain\n");
+ iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
+ return -ENODEV;
+ }
+
rc = mdev_register_device(dev, drv);
if (rc < 0) {
+ irq_domain_remove(idxd->ims_domain);
iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
return rc;
}
@@ -93,6 +106,7 @@ void idxd_mdev_host_release(struct kref *kref)
struct idxd_device *idxd = container_of(kref, struct idxd_device, mdev_kref);
struct device *dev = &idxd->pdev->dev;
+ irq_domain_remove(idxd->ims_domain);
mdev_unregister_device(dev);
iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
}
diff --git a/drivers/vfio/mdev/idxd/mdev.c b/drivers/vfio/mdev/idxd/mdev.c
index 9f6c4997ec24..7dac024e2852 100644
--- a/drivers/vfio/mdev/idxd/mdev.c
+++ b/drivers/vfio/mdev/idxd/mdev.c
@@ -111,6 +111,7 @@ static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev
struct vdcm_idxd_type *type)
{
struct vdcm_idxd *vidxd;
+ struct device *dev = &mdev->dev;
struct idxd_wq *wq = NULL;
int rc;
@@ -129,6 +130,7 @@ static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev
vidxd->mdev = mdev;
vidxd->type = type;
vidxd->num_wqs = VIDXD_MAX_WQS;
+ dev_set_msi_domain(dev, idxd->ims_domain);
mutex_lock(&wq->wq_lock);
idxd_wq_get(wq);
Add mdev device type "1dwq-v1" support code. 1dwq-v1 is defined as a
single DSA gen1 dedicated WQ. This WQ cannot be shared between guests. The
guest also cannot change any WQ configuration.
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/vfio/mdev/idxd/mdev.c | 173 +++++++++++++++++++++++++++++++++++++++--
1 file changed, 166 insertions(+), 7 deletions(-)
diff --git a/drivers/vfio/mdev/idxd/mdev.c b/drivers/vfio/mdev/idxd/mdev.c
index e484095baeea..9f6c4997ec24 100644
--- a/drivers/vfio/mdev/idxd/mdev.c
+++ b/drivers/vfio/mdev/idxd/mdev.c
@@ -22,6 +22,13 @@
#include "idxd.h"
#include "mdev.h"
+static const char idxd_dsa_1dwq_name[] = "dsa-1dwq-v1";
+static const char idxd_iax_1dwq_name[] = "iax-1dwq-v1";
+
+static int idxd_vdcm_set_irqs(struct vdcm_idxd *vidxd, uint32_t flags,
+ unsigned int index, unsigned int start,
+ unsigned int count, void *data);
+
int idxd_mdev_get_pasid(struct mdev_device *mdev, struct vfio_device *vdev, u32 *pasid)
{
struct vfio_group *vfio_group = vdev->group;
@@ -41,10 +48,6 @@ int idxd_mdev_get_pasid(struct mdev_device *mdev, struct vfio_device *vdev, u32
return 0;
}
-static int idxd_vdcm_set_irqs(struct vdcm_idxd *vidxd, uint32_t flags,
- unsigned int index, unsigned int start,
- unsigned int count, void *data);
-
static int idxd_vdcm_get_irq_count(struct vfio_device *vdev, int type)
{
if (type == VFIO_PCI_MSIX_IRQ_INDEX)
@@ -53,18 +56,73 @@ static int idxd_vdcm_get_irq_count(struct vfio_device *vdev, int type)
return 0;
}
+static struct idxd_wq *find_any_dwq(struct idxd_device *idxd, struct vdcm_idxd_type *type)
+{
+ int i;
+ struct idxd_wq *wq;
+ unsigned long flags;
+
+ switch (type->type) {
+ case IDXD_MDEV_TYPE_DSA_1_DWQ:
+ if (idxd->data->type != IDXD_TYPE_DSA)
+ return NULL;
+ break;
+ case IDXD_MDEV_TYPE_IAX_1_DWQ:
+ if (idxd->data->type != IDXD_TYPE_IAX)
+ return NULL;
+ break;
+ default:
+ return NULL;
+ }
+
+ spin_lock_irqsave(&idxd->dev_lock, flags);
+ for (i = 0; i < idxd->max_wqs; i++) {
+ wq = idxd->wqs[i];
+
+ if (wq->state != IDXD_WQ_ENABLED)
+ continue;
+
+ if (!wq_dedicated(wq))
+ continue;
+
+ if (!is_idxd_wq_mdev(wq))
+ continue;
+
+ if (idxd_wq_refcount(wq) != 0)
+ continue;
+
+ spin_unlock_irqrestore(&idxd->dev_lock, flags);
+ mutex_lock(&wq->wq_lock);
+ if (idxd_wq_refcount(wq)) {
+ spin_lock_irqsave(&idxd->dev_lock, flags);
+ continue;
+ }
+
+ idxd_wq_get(wq);
+ mutex_unlock(&wq->wq_lock);
+ return wq;
+ }
+
+ spin_unlock_irqrestore(&idxd->dev_lock, flags);
+ return NULL;
+}
+
static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev_device *mdev,
struct vdcm_idxd_type *type)
{
struct vdcm_idxd *vidxd;
struct idxd_wq *wq = NULL;
+ int rc;
+ wq = find_any_dwq(idxd, type);
if (!wq)
return ERR_PTR(-ENODEV);
vidxd = kzalloc(sizeof(*vidxd), GFP_KERNEL);
- if (!vidxd)
- return ERR_PTR(-ENOMEM);
+ if (!vidxd) {
+ rc = -ENOMEM;
+ goto err;
+ }
mutex_init(&vidxd->dev_lock);
vidxd->idxd = idxd;
@@ -80,9 +138,24 @@ static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev
vidxd_init(vidxd);
return vidxd;
+
+ err:
+ mutex_lock(&wq->wq_lock);
+ idxd_wq_put(wq);
+ mutex_unlock(&wq->wq_lock);
+ return ERR_PTR(rc);
}
-static struct vdcm_idxd_type idxd_mdev_types[IDXD_MDEV_TYPES];
+static struct vdcm_idxd_type idxd_mdev_types[IDXD_MDEV_TYPES] = {
+ {
+ .name = idxd_dsa_1dwq_name,
+ .type = IDXD_MDEV_TYPE_DSA_1_DWQ,
+ },
+ {
+ .name = idxd_iax_1dwq_name,
+ .type = IDXD_MDEV_TYPE_IAX_1_DWQ,
+ },
+};
static struct vdcm_idxd_type *idxd_vdcm_get_type(struct mdev_device *mdev)
{
@@ -677,6 +750,91 @@ static const struct vfio_device_ops idxd_mdev_ops = {
.ioctl = idxd_vdcm_ioctl,
};
+static ssize_t name_show(struct mdev_type *mtype, struct mdev_type_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%s\n", idxd_mdev_types[mtype_get_type_group_id(mtype)].name);
+}
+static MDEV_TYPE_ATTR_RO(name);
+
+static int find_available_mdev_instances(struct idxd_device *idxd, struct vdcm_idxd_type *type)
+{
+ int count = 0, i;
+ unsigned long flags;
+
+ switch (type->type) {
+ case IDXD_MDEV_TYPE_DSA_1_DWQ:
+ if (idxd->data->type != IDXD_TYPE_DSA)
+ return 0;
+ break;
+ case IDXD_MDEV_TYPE_IAX_1_DWQ:
+ if (idxd->data->type != IDXD_TYPE_IAX)
+ return 0;
+ break;
+ default:
+ return 0;
+ }
+
+ spin_lock_irqsave(&idxd->dev_lock, flags);
+ for (i = 0; i < idxd->max_wqs; i++) {
+ struct idxd_wq *wq;
+
+ wq = idxd->wqs[i];
+ if (!is_idxd_wq_mdev(wq) || !wq_dedicated(wq) || idxd_wq_refcount(wq))
+ continue;
+
+ count++;
+ }
+ spin_unlock_irqrestore(&idxd->dev_lock, flags);
+
+ return count;
+}
+
+static ssize_t available_instances_show(struct mdev_type *mtype,
+ struct mdev_type_attribute *attr,
+ char *buf)
+{
+ struct device *dev = mtype_get_parent_dev(mtype);
+ struct idxd_device *idxd = dev_get_drvdata(dev);
+ int count;
+ struct vdcm_idxd_type *type;
+
+ type = &idxd_mdev_types[mtype_get_type_group_id(mtype)];
+ count = find_available_mdev_instances(idxd, type);
+
+ return sysfs_emit(buf, "%d\n", count);
+}
+static MDEV_TYPE_ATTR_RO(available_instances);
+
+static ssize_t device_api_show(struct mdev_type *mtype, struct mdev_type_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
+}
+static MDEV_TYPE_ATTR_RO(device_api);
+
+static struct attribute *idxd_mdev_types_attrs[] = {
+ &mdev_type_attr_name.attr,
+ &mdev_type_attr_device_api.attr,
+ &mdev_type_attr_available_instances.attr,
+ NULL,
+};
+
+static struct attribute_group idxd_mdev_type_dsa_group0 = {
+ .name = idxd_dsa_1dwq_name,
+ .attrs = idxd_mdev_types_attrs,
+};
+
+static struct attribute_group idxd_mdev_type_iax_group0 = {
+ .name = idxd_iax_1dwq_name,
+ .attrs = idxd_mdev_types_attrs,
+};
+
+static struct attribute_group *idxd_mdev_type_groups[] = {
+ &idxd_mdev_type_dsa_group0,
+ &idxd_mdev_type_iax_group0,
+ NULL,
+};
+
static struct mdev_driver idxd_vdcm_driver = {
.driver = {
.name = "idxd-mdev",
@@ -685,6 +843,7 @@ static struct mdev_driver idxd_vdcm_driver = {
},
.probe = idxd_vdcm_probe,
.remove = idxd_vdcm_remove,
+ .supported_type_groups = idxd_mdev_type_groups,
};
static int idxd_mdev_drv_probe(struct device *dev)
On Fri, May 21, 2021 at 05:21:03PM -0700, Dave Jiang wrote:
> Similar to commit 6140a8f56238 ("vfio-pci: Add device request interface").
> Add request interface for mdev to allow userspace to opt in to receive
> a device request notification, indicating that the device should be
> released.
>
> Signed-off-by: Dave Jiang <[email protected]>
> ---
> drivers/vfio/mdev/mdev_irqs.c | 23 +++++++++++++++++++++++
> include/linux/mdev.h | 15 +++++++++++++++
> 2 files changed, 38 insertions(+)
Please don't add new things to mdev, put the req_trigger in the vdcm_idxd
struct vfio_device class.
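
For instance, an untested sketch of what that might look like with the
eventfd kept in the driver's own structure (names assumed from this
series):

struct vdcm_idxd {
	struct vfio_device vdev;
	/* ... existing fields ... */
	struct eventfd_ctx *req_trigger;
};

static void idxd_vdcm_request(struct vfio_device *vdev, unsigned int count)
{
	struct vdcm_idxd *vidxd = vdev_to_vidxd(vdev);

	/* Signal userspace that it should release the device */
	if (vidxd->req_trigger) {
		dev_dbg(vdev->dev, "Requesting device from user\n");
		eventfd_signal(vidxd->req_trigger, 1);
	}
}

Then idxd_mdev_ops.request points at idxd_vdcm_request and nothing under
drivers/vfio/mdev has to know about the trigger.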
> diff --git a/drivers/vfio/mdev/mdev_irqs.c b/drivers/vfio/mdev/mdev_irqs.c
> index ed2d11a7c729..11b1f8df020c 100644
> --- a/drivers/vfio/mdev/mdev_irqs.c
> +++ b/drivers/vfio/mdev/mdev_irqs.c
and similarly this shouldn't be called mdev_irqs and the code in here
should have nothing to do with mdevs. Providing the special IRQ
emulation stuff is just generic vfio_device functionality with no
linkage to mdev.
> @@ -316,3 +316,26 @@ void mdev_irqs_free(struct mdev_device *mdev)
> memset(&mdev->mdev_irq, 0, sizeof(mdev->mdev_irq));
> }
> EXPORT_SYMBOL_GPL(mdev_irqs_free);
> +
> +void vfio_mdev_request(struct vfio_device *vdev, unsigned int count)
> +{
> + struct device *dev = vdev->dev;
> + struct mdev_device *mdev = to_mdev_device(dev);
Yuk, don't do stuff like that, if it needs a mdev then pass in a mdev.
Jason
On Fri, May 21, 2021 at 05:19:05PM -0700, Dave Jiang wrote:
> The code has dependency on DEV_MSI/IMS enabling code:
> https://lore.kernel.org/lkml/[email protected]/
>
> The code has dependency on idxd driver sub-driver cleanup series:
> https://lore.kernel.org/dmaengine/162163546245.260470.18336189072934823712.stgit@djiang5-desk3.ch.intel.com/T/#t
>
> The code has dependency on Jason's VFIO refactoring:
> https://lore.kernel.org/kvm/[email protected]/
That is quite an inter-tangled list, do you have a submission plan??
> Introducing mdev types “1dwq-v1” type. This mdev type allows
> allocation of a single dedicated wq from available dedicated wqs. After
> a workqueue (wq) is enabled, the user will generate an uuid. On mdev
> creation, the mdev driver code will find a dwq depending on the mdev
> type. When the create operation is successful, the user generated uuid
> can be passed to qemu. When the guest boots up, it should discover a
> DSA device when doing PCI discovery.
>
> For example of “1dwq-v1” type:
> 1. Enable wq with “mdev” wq type
> 2. A user generated uuid.
> 3. The uuid is written to the mdev class sysfs path:
> echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-1dwq-v1/create
> 4. Pass the following parameter to qemu:
> "-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"
So the idxd core driver knows to create a "vfio" wq with its own
machinery but you still want to involve the horrible mdev guid stuff?
Why??
Jason
On Fri, May 21, 2021 at 05:20:13PM -0700, Dave Jiang wrote:
> +int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
> +{
> + struct device *dev = &idxd->pdev->dev;
> + int rc;
> +
> + if (!idxd->ims_size)
> + return -EOPNOTSUPP;
> +
> + rc = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_AUX);
> + if (rc < 0) {
> + dev_warn(dev, "Failed to enable aux-domain: %d\n", rc);
> + return rc;
> + }
> +
> + rc = mdev_register_device(dev, drv);
> + if (rc < 0) {
> + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
> + return rc;
> + }
Don't call mdev_register_device from drivers/dma/idxd/init.c - vfio
stuff all belongs under drivers/vfio.
> +void idxd_mdev_host_release(struct kref *kref)
> +{
> + struct idxd_device *idxd = container_of(kref, struct idxd_device, mdev_kref);
> + struct device *dev = &idxd->pdev->dev;
> +
> + mdev_unregister_device(dev);
> + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
> +}
> +EXPORT_SYMBOL_GPL(idxd_mdev_host_release);
> +
> static int idxd_setup_interrupts(struct idxd_device *idxd)
> {
> struct pci_dev *pdev = idxd->pdev;
> @@ -352,6 +387,9 @@ static int idxd_setup_internals(struct idxd_device *idxd)
> goto err_wkq_create;
> }
>
> + kref_init(&idxd->mdev_kref);
> + mutex_init(&idxd->kref_lock);
> +
> return 0;
>
> err_wkq_create:
> @@ -741,6 +779,7 @@ static void idxd_remove(struct pci_dev *pdev)
>
> dev_dbg(&pdev->dev, "%s called\n", __func__);
> idxd_shutdown(pdev);
> + kref_put_mutex(&idxd->mdev_kref, idxd_mdev_host_release, &idxd->kref_lock);
I didn't look closely at why this is like this, but please try to
avoid kref_put_mutex(), it should only be needed in exceptional
cases and this shouldn't be exceptional.
If you need to lock a kref before using it, it isn't a kref anymore,
just use an 'int'.
Especially since the kref is calling mdev_unregister_device(),
something is really upside down to motivate refcounting that.
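
A hypothetical sketch with a plain counter (mdev_users is a made-up
field, and idxd_mdev_host_release() is assumed reworked to take the idxd
directly):

	mutex_lock(&idxd->kref_lock);
	/* Last wq driver going away tears down the mdev host */
	if (--idxd->mdev_users == 0)
		idxd_mdev_host_release(idxd);
	mutex_unlock(&idxd->kref_lock);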
Jason
On Fri, May 21, 2021 at 05:20:26PM -0700, Dave Jiang wrote:
> +static int idxd_vdcm_mmap(struct vfio_device *vdev, struct vm_area_struct *vma)
> +{
> + unsigned int wq_idx;
> + unsigned long req_size, pgoff = 0, offset;
> + pgprot_t pg_prot;
> + struct vdcm_idxd *vidxd = vdev_to_vidxd(vdev);
> + struct idxd_wq *wq = vidxd->wq;
> + struct idxd_device *idxd = vidxd->idxd;
> + enum idxd_portal_prot virt_portal, phys_portal;
> + phys_addr_t base = pci_resource_start(idxd->pdev, IDXD_WQ_BAR);
> + struct device *dev = vdev->dev;
> +
> + if (!(vma->vm_flags & VM_SHARED))
> + return -EINVAL;
> +
> + pg_prot = vma->vm_page_prot;
> + req_size = vma->vm_end - vma->vm_start;
> + if (req_size > PAGE_SIZE)
> + return -EINVAL;
> +
> + vma->vm_flags |= VM_DONTCOPY;
> +
> + offset = (vma->vm_pgoff << PAGE_SHIFT) &
> + ((1ULL << VFIO_PCI_OFFSET_SHIFT) - 1);
> +
> + wq_idx = offset >> (PAGE_SHIFT + 2);
> + if (wq_idx >= 1) {
> + dev_err(dev, "mapping invalid wq %d off %lx\n",
> + wq_idx, offset);
> + return -EINVAL;
> + }
This is a really wonky and leaky way to say that the vm_pgoff can only
be one of two values? It is uAPI, be thorough.
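
Something explicit would be clearer, e.g. this untested sketch using the
macros from this series:

	u64 off = (u64)vma->vm_pgoff << PAGE_SHIFT;

	if (VFIO_PCI_OFFSET_TO_INDEX(off) != VFIO_PCI_BAR2_REGION_INDEX)
		return -EINVAL;

	/* Only the unlimited (0) and limited (PAGE_SIZE) portals exist */
	off &= VFIO_PCI_OFFSET_MASK;
	if (off != 0 && off != PAGE_SIZE)
		return -EINVAL;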
> +static long idxd_vdcm_ioctl(struct vfio_device *vdev, unsigned int cmd, unsigned long arg)
> +{
> + struct vdcm_idxd *vidxd = vdev_to_vidxd(vdev);
> + unsigned long minsz;
> + int rc = -EINVAL;
> + struct device *dev = vdev->dev;
> +
> + dev_dbg(dev, "vidxd %p ioctl, cmd: %d\n", vidxd, cmd);
> +
> + mutex_lock(&vidxd->dev_lock);
> + if (cmd == VFIO_DEVICE_GET_INFO) {
Sigh.. This ioctl stuff really needs splitting into proper functions
called by the core code instead of cut&pasting all of this.
> static int idxd_mdev_drv_probe(struct device *dev)
> diff --git a/drivers/vfio/mdev/idxd/mdev.h b/drivers/vfio/mdev/idxd/mdev.h
> index f696fe38e374..dd4290bce772 100644
> +++ b/drivers/vfio/mdev/idxd/mdev.h
> @@ -30,11 +30,26 @@
> #define VIDXD_MAX_MSIX_ENTRIES VIDXD_MAX_MSIX_VECS
> #define VIDXD_MAX_WQS 1
>
> +#define IDXD_MDEV_NAME_LEN 64
This is never used. Check everything..
Jason
On Fri, May 21, 2021 at 05:20:31PM -0700, Dave Jiang wrote:
> Add mdev device type "1dwq-v1" support code. 1dwq-v1 is defined as a
> single DSA gen1 dedicated WQ. This WQ cannot be shared between guests. The
> guest also cannot change any WQ configuration.
>
> Signed-off-by: Dave Jiang <[email protected]>
> drivers/vfio/mdev/idxd/mdev.c | 173 +++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 166 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/vfio/mdev/idxd/mdev.c b/drivers/vfio/mdev/idxd/mdev.c
> index e484095baeea..9f6c4997ec24 100644
> +++ b/drivers/vfio/mdev/idxd/mdev.c
> @@ -22,6 +22,13 @@
> #include "idxd.h"
> #include "mdev.h"
>
> +static const char idxd_dsa_1dwq_name[] = "dsa-1dwq-v1";
> +static const char idxd_iax_1dwq_name[] = "iax-1dwq-v1";
Dare I ask why this is "v1"? If you need to significantly change
something you should make a whole new mdev.
Jason
On Fri, May 21, 2021 at 05:20:26PM -0700, Dave Jiang wrote:
> +static int idxd_vdcm_probe(struct mdev_device *mdev)
> +{
> + struct vdcm_idxd *vidxd;
> + struct vdcm_idxd_type *type;
> + struct device *dev, *parent;
> + struct idxd_device *idxd;
> + bool ims_map[VIDXD_MAX_MSIX_VECS];
> + int rc;
> +
> + parent = mdev_parent_dev(mdev);
> + idxd = dev_get_drvdata(parent);
> + dev = &mdev->dev;
> + mdev_set_iommu_device(mdev, parent);
> + type = idxd_vdcm_get_type(mdev);
This makes my head hurt. There is a kref guarding
mdev_unregister_device() but probe reaches into the parent idxd
device's drvdata? I'm skeptical any of this is locked right.
> +static void idxd_vdcm_remove(struct mdev_device *mdev)
> +{
> + struct vdcm_idxd *vidxd = dev_get_drvdata(&mdev->dev);
> + struct idxd_wq *wq = vidxd->wq;
> +
> + vfio_unregister_group_dev(&vidxd->vdev);
> + mdev_irqs_free(mdev);
> + mutex_lock(&wq->wq_lock);
> + idxd_wq_put(wq);
> + mutex_unlock(&wq->wq_lock);
It is also really weird to see something called put that requires the
caller to hold a mutex... Don't use refcount language for something
that is not acting like any sort of refcount.
> +static int idxd_vdcm_open(struct vfio_device *vdev)
> +{
> + return 0;
> +}
> +
> +static void idxd_vdcm_close(struct vfio_device *vdev)
> +{
> + struct vdcm_idxd *vidxd = vdev_to_vidxd(vdev);
> +
> + mutex_lock(&vidxd->dev_lock);
> + idxd_vdcm_set_irqs(vidxd, VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
> + VFIO_PCI_MSIX_IRQ_INDEX, 0, 0, NULL);
> +
> + /* Re-initialize the VIDXD to a pristine state for re-use */
> + vidxd_init(vidxd);
> + mutex_unlock(&vidxd->dev_lock);
This is split up weird. open should be doing basic init stuff and
close should just be doing the reset stuff..
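
i.e. roughly this untested sketch:

static int idxd_vdcm_open(struct vfio_device *vdev)
{
	struct vdcm_idxd *vidxd = vdev_to_vidxd(vdev);

	/* Start every open from a pristine device state */
	mutex_lock(&vidxd->dev_lock);
	vidxd_init(vidxd);
	mutex_unlock(&vidxd->dev_lock);
	return 0;
}

static void idxd_vdcm_close(struct vfio_device *vdev)
{
	struct vdcm_idxd *vidxd = vdev_to_vidxd(vdev);

	/* Only tear down the triggers; init happens on the next open */
	mutex_lock(&vidxd->dev_lock);
	idxd_vdcm_set_irqs(vidxd, VFIO_IRQ_SET_DATA_NONE |
			   VFIO_IRQ_SET_ACTION_TRIGGER,
			   VFIO_PCI_MSIX_IRQ_INDEX, 0, 0, NULL);
	mutex_unlock(&vidxd->dev_lock);
}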
Jason
On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
> @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
> return rc;
> }
>
> + ims_info.max_slots = idxd->ims_size;
> + ims_info.slots = idxd->reg_base + idxd->ims_offset;
> + idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
> + if (!idxd->ims_domain) {
> + dev_warn(dev, "Fail to acquire IMS domain\n");
> + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
> + return -ENODEV;
> + }
I'm quite surprised that every mdev doesn't create its own ims_domain
in its probe function.
This places a global total limit on the # of vectors which makes me
ask what was the point of using IMS in the first place ?
The entire idea for IMS was to make the whole allocation system fully
dynamic based on demand.
> rc = mdev_register_device(dev, drv);
> if (rc < 0) {
> + irq_domain_remove(idxd->ims_domain);
> iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
> return rc;
> }
This really wants a goto error unwind
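
e.g. an untested sketch:

	idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev,
							       &ims_info);
	if (!idxd->ims_domain) {
		rc = -ENODEV;
		goto err_ims;
	}

	rc = mdev_register_device(dev, drv);
	if (rc < 0)
		goto err_mdev;

	idxd->mdev_host_init = true;
	return 0;

err_mdev:
	irq_domain_remove(idxd->ims_domain);
err_ims:
	iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
	return rc;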
Jason
On Fri, May 21, 2021 at 05:20:13PM -0700, Dave Jiang wrote:
> +static int idxd_mdev_drv_probe(struct device *dev)
> +{
> + struct idxd_wq *wq = confdev_to_wq(dev);
> + struct idxd_device *idxd = wq->idxd;
Something has gone wrong that the probe function for an
idxd_device_driver has this generic signature..
Jason
On Fri, May 21, 2021 at 05:19:36PM -0700, Dave Jiang wrote:
> Add common helper code to setup IMS once the MSI domain has been
> setup by the device driver. The main helper function is
> mdev_ims_set_msix_trigger() that is called by the VFIO ioctl
> VFIO_DEVICE_SET_IRQS. The function deals with the setup and
> teardown of emulated and IMS backed eventfd that gets exported
> to the guest kernel via VFIO as MSIX vectors.
>
> Suggested-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Dave Jiang <[email protected]>
> ---
> drivers/vfio/mdev/Kconfig | 12 ++
> drivers/vfio/mdev/Makefile | 3
> drivers/vfio/mdev/mdev_irqs.c | 318 +++++++++++++++++++++++++++++++++++++++++
> include/linux/mdev.h | 51 +++++++
> 4 files changed, 384 insertions(+)
> create mode 100644 drivers/vfio/mdev/mdev_irqs.c
IMS is not mdev specific, do not entangle it with mdev code. This
should be generic VFIO stuff.
> +static int mdev_msix_set_vector_signal(struct mdev_irq *mdev_irq, int vector, int fd)
> +{
> + int rc, irq;
> + struct mdev_device *mdev = irq_to_mdev(mdev_irq);
> + struct mdev_irq_entry *entry;
> + struct device *dev = &mdev->dev;
> + struct eventfd_ctx *trigger;
> + char *name;
> + bool pasid_en;
> + u32 auxval;
> +
> + if (vector < 0 || vector >= mdev_irq->num)
> + return -EINVAL;
> +
> + entry = &mdev_irq->irq_entries[vector];
> +
> + if (entry->ims)
> + irq = dev_msi_irq_vector(dev, entry->ims_id);
> + else
> + irq = 0;
> +
> + pasid_en = mdev_irq->pasid != INVALID_IOASID ? true : false;
> +
> + /* IMS and invalid pasid is not a valid configuration */
> + if (entry->ims && !pasid_en)
> + return -EINVAL;
> +
> + if (entry->trigger) {
> + if (irq) {
> + irq_bypass_unregister_producer(&entry->producer);
> + free_irq(irq, entry->trigger);
> + if (pasid_en) {
> + auxval = ims_ctrl_pasid_aux(0, false);
> + irq_set_auxdata(irq, IMS_AUXDATA_CONTROL_WORD, auxval);
> + }
> + }
> + kfree(entry->name);
> + eventfd_ctx_put(entry->trigger);
> + entry->trigger = NULL;
> + }
> +
> + if (fd < 0)
> + return 0;
> +
> + name = kasprintf(GFP_KERNEL, "vfio-mdev-irq[%d](%s)", vector, dev_name(dev));
> + if (!name)
> + return -ENOMEM;
> +
> + trigger = eventfd_ctx_fdget(fd);
> + if (IS_ERR(trigger)) {
> + kfree(name);
> + return PTR_ERR(trigger);
> + }
> +
> + entry->name = name;
> + entry->trigger = trigger;
> +
> + if (!irq)
> + return 0;
> +
> + if (pasid_en) {
> + auxval = ims_ctrl_pasid_aux(mdev_irq->pasid, true);
> + rc = irq_set_auxdata(irq, IMS_AUXDATA_CONTROL_WORD, auxval);
> + if (rc < 0)
> + goto err;
Why is anything to do with PASID here? Something has gone wrong with
the layers I suspect..
Oh yes. drivers/irqchip/irq-ims-msi.c is idxd specific and shouldn't be
pretending to be common code.
The protocol to stuff the pasid and other stuff into the auxdata is
also completely idxd specific and is just a hacky way to communicate
from this code to the IDXD irq-chip.
So this doesn't belong here either. Pass in the auxdata from the idxd
code and I'd rename the irq-ims-msi to irq-ims-idxd
> +static int mdev_msix_enable(struct mdev_irq *mdev_irq, int nvec)
> +{
> + struct mdev_device *mdev = irq_to_mdev(mdev_irq);
> + struct device *dev;
> + int rc;
> +
> + if (nvec != mdev_irq->num)
> + return -EINVAL;
> +
> + if (mdev_irq->ims_num) {
> + dev = &mdev->dev;
> + rc = msi_domain_alloc_irqs(dev_get_msi_domain(dev), dev, mdev_irq->ims_num);
Huh? The PCI device should be the only device touching IRQ stuff. I'm
nervous to see you mix the mdev struct device into this function.
Isn't the msi_domain just idxd->ims_domain?
Jason
On 5/23/2021 4:50 PM, Jason Gunthorpe wrote:
> On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
>> @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
>> return rc;
>> }
>>
>> + ims_info.max_slots = idxd->ims_size;
>> + ims_info.slots = idxd->reg_base + idxd->ims_offset;
>> + idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
>> + if (!idxd->ims_domain) {
>> + dev_warn(dev, "Fail to acquire IMS domain\n");
>> + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
>> + return -ENODEV;
>> + }
> I'm quite surprised that every mdev doesn't create its own ims_domain
> in its probe function.
>
> This places a global total limit on the # of vectors which makes me
> ask what was the point of using IMS in the first place ?
>
> The entire idea for IMS was to make the whole allocation system fully
> dynamic based on demand.
Hi Jason, thank you for the review of the series.
My understanding is that the driver creates a single IMS domain for the
device and provides the address base and IMS numbers for the domain
based on device IMS resources. So the IMS region needs to be contiguous.
Each mdev can call msi_domain_alloc_irqs() and acquire the number of IMS
vectors it desires and the DEV MSI core code will keep track of which
vectors are being used. This allows the mdev devices to dynamically
allocate based on demand. If the driver allocates a domain per mdev,
it'll need to do internal accounting of the base and vector numbers for
each of those domains that the MSI core already provides. Isn't that
what we are trying to avoid? As mdevs come and go, that partitioning
will become fragmented.
For example, mdev 0 allocates 1 vector, mdev 1 allocates 2 vectors, and
mdev 3 allocates 3 vectors. You have 1 vector unallocated. When mdev 1
goes away and a new mdev shows up wanting 3 vectors, you won't be able
to allocate the domain because of fragmentation even though you have
enough vectors.
If all mdevs allocate the same number of IMS vectors, the fragmentation issue does
not exist. But the driver still has to keep track of which vectors are
free and which ones are used in order to provide the appropriate base.
And the dev-msi core already does this for us if we have a single domain.
Feels like we would just be duplicating code doing the same thing?
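
For reference, this is the per-mdev path the series uses against the
single shared domain (sketch; nvec stands in for the number of vectors
the mdev wants):

	/* At vidxd create time: point the mdev at the shared IMS domain */
	dev_set_msi_domain(&mdev->dev, idxd->ims_domain);

	/* When the guest enables MSI-X: the core tracks which slots are used */
	rc = msi_domain_alloc_irqs(dev_get_msi_domain(&mdev->dev),
				   &mdev->dev, nvec);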
>
>> rc = mdev_register_device(dev, drv);
>> if (rc < 0) {
>> + irq_domain_remove(idxd->ims_domain);
>> iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
>> return rc;
>> }
> This really wants a goto error unwind
>
> Jason
On Wed, May 26, 2021 at 05:22:22PM -0700, Dave Jiang wrote:
>
> On 5/23/2021 4:50 PM, Jason Gunthorpe wrote:
> > On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
> > > @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
> > > return rc;
> > > }
> > > + ims_info.max_slots = idxd->ims_size;
> > > + ims_info.slots = idxd->reg_base + idxd->ims_offset;
> > > + idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
> > > + if (!idxd->ims_domain) {
> > > + dev_warn(dev, "Fail to acquire IMS domain\n");
> > > + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
> > > + return -ENODEV;
> > > + }
> > I'm quite surprised that every mdev doesn't create its own ims_domain
> > in its probe function.
> >
> > This places a global total limit on the # of vectors which makes me
> > ask what was the point of using IMS in the first place ?
> >
> > The entire idea for IMS was to make the whole allocation system fully
> > dynamic based on demand.
>
> Hi Jason, thank you for the review of the series.
>
> My understanding is that the driver creates a single IMS domain for the
> device and provides the address base and IMS numbers for the domain based on
> device IMS resources. So the IMS region needs to be contiguous. Each mdev
> can call msi_domain_alloc_irqs() and acquire the number of IMS vectors it
> desires and the DEV MSI core code will keep track of which vectors are being
> used. This allows the mdev devices to dynamically allocate based on demand.
> If the driver allocates a domain per mdev, it'll need to do internal
> accounting of the base and vector numbers for each of those domains that the
> MSI core already provides. Isn't that what we are trying to avoid? As mdevs
> come and go, that partitioning will become fragmented.
I suppose it depends entirely on how the HW works.
If the HW has a fixed number of interrupt vectors organized in a
single table then by all means allocate a single domain that spans the
entire fixed HW vector space. But then why do we have an ims_size
variable here??
However, that really begs the question of why the HW is using IMS at
all? I'd expect needing 2x-10x the max MSI-X vector size before
reaching for IMS.
So does IDXD really have like a 4k - 40k entry linear IMS vector table
to wrap a shared domain around?
Basically, that isn't really "scalable" it is just "bigger".
Fully scalable would be for every mdev to point to its own 2k entry
IMS table that is allocated on the fly. Every mdev gets a domain and
every domain is fully utilized by the mdev in emulating
MSI-X. Basically for a device like idxd every PASID would have to map
to an IMS vector table array.
I suppose that was not what was done?
Jason
On 5/26/2021 5:54 PM, Jason Gunthorpe wrote:
> On Wed, May 26, 2021 at 05:22:22PM -0700, Dave Jiang wrote:
>> On 5/23/2021 4:50 PM, Jason Gunthorpe wrote:
>>> On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
>>>> @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
>>>> return rc;
>>>> }
>>>> + ims_info.max_slots = idxd->ims_size;
>>>> + ims_info.slots = idxd->reg_base + idxd->ims_offset;
>>>> + idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
>>>> + if (!idxd->ims_domain) {
>>>> + dev_warn(dev, "Fail to acquire IMS domain\n");
>>>> + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
>>>> + return -ENODEV;
>>>> + }
>>> I'm quite surprised that every mdev doesn't create its own ims_domain
>>> in its probe function.
>>>
>>> This places a global total limit on the # of vectors which makes me
>>> ask what was the point of using IMS in the first place ?
>>>
>>> The entire idea for IMS was to make the whole allocation system fully
>>> dynamic based on demand.
>> Hi Jason, thank you for the review of the series.
>>
>> My understanding is that the driver creates a single IMS domain for the
>> device and provides the address base and IMS numbers for the domain based on
>> device IMS resources. So the IMS region needs to be contiguous. Each mdev
>> can call msi_domain_alloc_irqs() and acquire the number of IMS vectors it
>> desires and the DEV MSI core code will keep track of which vectors are being
>> used. This allows the mdev devices to dynamically allocate based on demand.
>> If the driver allocates a domain per mdev, it'll need to do internal
>> accounting of the base and vector numbers for each of those domains that the
>> MSI core already provides. Isn't that what we are trying to avoid? As mdevs
>> come and go, that partitioning will become fragmented.
> I suppose it depends entirely on how the HW works.
>
> If the HW has a fixed number of interrupt vectors organized in a
> single table then by all means allocate a single domain that spans the
> entire fixed HW vector space. But then why do we have an ims_size
> variable here??
>
> However, that really begs the question of why the HW is using IMS at
> all? I'd expect needing 2x-10x the max MSI-X vector size before
> reaching for IMS.
>
> So does IDXD really have like a 4k - 40k entry linear IMS vector table
> to wrap a shared domain around?
>
> Basically, that isn't really "scalable" it is just "bigger".
>
> Fully scalable would be for every mdev to point to its own 2k entry
> IMS table that is allocated on the fly. Every mdev gets a domain and
> every domain is fully utilized by the mdev in emulating
> MSI-X. Basically for a device like idxd every PASID would have to map
> to an IMS vector table array.
>
> I suppose that was not what was done?
At least not for the first gen of hardware. DSA 1.0 has 2k IMS entries
total. ims_size is what is read from the device cap register. For MSI-X,
the device only has 1 misc vector and 8 I/O vectors. That's why IMS is
being used for mdevs. We will discuss your suggestion with our hardware
people.
>
> Jason
On Wed, May 26, 2021 at 09:54:44PM -0300, Jason Gunthorpe wrote:
> On Wed, May 26, 2021 at 05:22:22PM -0700, Dave Jiang wrote:
> >
> > On 5/23/2021 4:50 PM, Jason Gunthorpe wrote:
> > > On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
> > > > @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
> > > > return rc;
> > > > }
> > > > + ims_info.max_slots = idxd->ims_size;
> > > > + ims_info.slots = idxd->reg_base + idxd->ims_offset;
> > > > + idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
> > > > + if (!idxd->ims_domain) {
> > > > + dev_warn(dev, "Fail to acquire IMS domain\n");
> > > > + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
> > > > + return -ENODEV;
> > > > + }
> > > I'm quite surprised that every mdev doesn't create its own ims_domain
> > > in its probe function.
> > >
> > > This places a global total limit on the # of vectors which makes me
> > > ask what was the point of using IMS in the first place ?
> > >
> > > The entire idea for IMS was to make the whole allocation system fully
> > > dynamic based on demand.
> >
> > Hi Jason, thank you for the review of the series.
> >
> > My understanding is that the driver creates a single IMS domain for the
> > device and provides the address base and IMS numbers for the domain based on
> > device IMS resources. So the IMS region needs to be contiguous. Each mdev
> > can call msi_domain_alloc_irqs() and acquire the number of IMS vectors it
> > desires and the DEV MSI core code will keep track of which vectors are being
> > used. This allows the mdev devices to dynamically allocate based on demand.
> > If the driver allocates a domain per mdev, it'll need to do internal
> > accounting of the base and vector numbers for each of those domains that the
> > MSI core already provides. Isn't that what we are trying to avoid? As mdevs
> > come and go, that partitioning will become fragmented.
>
> I suppose it depends entirely on how the HW works.
>
> If the HW has a fixed number of interrupt vectors organized in a
> single table then by all means allocate a single domain that spans the
> > entire fixed HW vector space. But then why do we have an ims_size
> variable here??
>
> However, that really begs the question of why the HW is using IMS at
> all? I'd expect needing 2x-10x the max MSI-X vector size before
> reaching for IMS.
It's more than the number of vectors. Yes, that's one of the attributes.
IMS also has additional flexibility. I think we covered this a while
back, but it may have been lost since it's been a while.
- The format isn't just the standard MSI-X one; e.g. DSA has the pending
bits all merged and co-located together with the interrupt store.
- You might want the vector space to be on the device itself. (I think you
alluded that one of your devices can actually do that?)
- Remember we do handle validation when interrupts are requested from user
space. Interrupts are validated with the PASID of the requester. (I think
we also talked about whether we should make the interrupt message also
take a PASID, as opposed to a request without PASID as specified in PCIe.)
- For certain devices the interrupt might simply be in the user context
maintained by the kernel, e.g. graphics.
>
> So does IDXD really have like a 4k - 40k entry linear IMS vector table
> to wrap a shared domain around?
>
> Basically, that isn't really "scalable" it is just "bigger".
>
> Fully scalable would be for every mdev to point to its own 2k entry
> IMS table that is allocated on the fly. Every mdev gets a domain and
> every domain is fully utilized by the mdev in emulating
> MSI-X. Basically for a device like idxd every PASID would have to map
> to an IMS vector table array.
>
> I suppose that was not what was done?
>
> Jason
--
Cheers,
Ashok
On Wed, May 26, 2021 at 06:41:07PM -0700, Raj, Ashok wrote:
> On Wed, May 26, 2021 at 09:54:44PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 26, 2021 at 05:22:22PM -0700, Dave Jiang wrote:
> > >
> > > On 5/23/2021 4:50 PM, Jason Gunthorpe wrote:
> > > > On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
> > > > > @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
> > > > > return rc;
> > > > > }
> > > > > + ims_info.max_slots = idxd->ims_size;
> > > > > + ims_info.slots = idxd->reg_base + idxd->ims_offset;
> > > > > + idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
> > > > > + if (!idxd->ims_domain) {
> > > > > + dev_warn(dev, "Fail to acquire IMS domain\n");
> > > > > + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
> > > > > + return -ENODEV;
> > > > > + }
> > > > I'm quite surprised that every mdev doesn't create its own ims_domain
> > > > in its probe function.
> > > >
> > > > This places a global total limit on the # of vectors which makes me
> > > > ask what was the point of using IMS in the first place ?
> > > >
> > > > The entire idea for IMS was to make the whole allocation system fully
> > > > dynamic based on demand.
> > >
> > > Hi Jason, thank you for the review of the series.
> > >
> > > My understanding is that the driver creates a single IMS domain for the
> > > device and provides the address base and IMS numbers for the domain based on
> > > device IMS resources. So the IMS region needs to be contiguous. Each mdev
> > > can call msi_domain_alloc_irqs() and acquire the number of IMS vectors it
> > > desires and the DEV MSI core code will keep track of which vectors are being
> > > used. This allows the mdev devices to dynamically allocate based on demand.
> > > If the driver allocates a domain per mdev, it'll need to do internal
> > > accounting of the base and vector numbers for each of those domains that the
> > > MSI core already provides. Isn't that what we are trying to avoid? As mdevs
> > > come and go, that partitioning will become fragmented.
> >
> > I suppose it depends entirely on how the HW works.
> >
> > If the HW has a fixed number of interrupt vectors organized in a
> > single table then by all means allocate a single domain that spans the
> > entire fixed HW vector space. But then why do we have an ims_size
> > variable here??
> >
> > However, that really begs the question of why the HW is using IMS at
> > all? I'd expect needing 2x-10x the max MSI-X vector size before
> > reaching for IMS.
>
> It's more than the number of vectors. Yes, that's one of the attributes.
> IMS also has additional flexibility. I think we covered this a while
> back, but it may have been lost since it's been a while.
>
> - The format isn't just the standard MSI-X one; e.g. DSA has the pending
> bits all merged and co-located together with the interrupt store.
But this is just random hardware churn; there is nothing wrong with
keeping the pending bits in the standard format
> - You might want the vector space to be on the device itself. (I think you
> alluded that one of your devices can actually do that?)
Sure, but IDXD is not doing this
> - Remember we do handle validation when interrupts are requested from user
> space. Interrupts are validated with the PASID of the requester. (I think
> we also talked about whether we should make the interrupt message also
> take a PASID, as opposed to a request without PASID as specified in PCIe.)
Yes, but overall this doesn't really make sense to me, and doesn't in
and of itself require IMS. The PASID table could be an addendum to the
normal MSI-X table.
Besides, interrupts and PASID are not related concepts. Does every
user SVA process with a unique PASID get a unique interrupt? The
device and CPU don't have enough vectors to do this.
Frankly I expect interrupts to be multiplexed by queues not by PASID,
so that interrupts can be shared.
> - For certain devices the interrupt might simply be in the user context
> maintained by the kernel, e.g. graphics.
IDXD is also not doing this.
Jason
On 5/23/2021 5:02 PM, Jason Gunthorpe wrote:
> On Fri, May 21, 2021 at 05:19:36PM -0700, Dave Jiang wrote:
>> Add common helper code to setup IMS once the MSI domain has been
>> setup by the device driver. The main helper function is
>> mdev_ims_set_msix_trigger() that is called by the VFIO ioctl
>> VFIO_DEVICE_SET_IRQS. The function deals with the setup and
>> teardown of the emulated and IMS-backed eventfds that get exported
>> to the guest kernel via VFIO as MSI-X vectors.
>>
>> Suggested-by: Jason Gunthorpe <[email protected]>
>> Signed-off-by: Dave Jiang <[email protected]>
>> ---
>> drivers/vfio/mdev/Kconfig | 12 ++
>> drivers/vfio/mdev/Makefile | 3
>> drivers/vfio/mdev/mdev_irqs.c | 318 +++++++++++++++++++++++++++++++++++++++++
>> include/linux/mdev.h | 51 +++++++
>> 4 files changed, 384 insertions(+)
>> create mode 100644 drivers/vfio/mdev/mdev_irqs.c
> IMS is not mdev specific, do not entangle it with mdev code. This
> should be generic VFIO stuff.
>
>> +static int mdev_msix_set_vector_signal(struct mdev_irq *mdev_irq, int vector, int fd)
>> +{
>> + int rc, irq;
>> + struct mdev_device *mdev = irq_to_mdev(mdev_irq);
>> + struct mdev_irq_entry *entry;
>> + struct device *dev = &mdev->dev;
>> + struct eventfd_ctx *trigger;
>> + char *name;
>> + bool pasid_en;
>> + u32 auxval;
>> +
>> + if (vector < 0 || vector >= mdev_irq->num)
>> + return -EINVAL;
>> +
>> + entry = &mdev_irq->irq_entries[vector];
>> +
>> + if (entry->ims)
>> + irq = dev_msi_irq_vector(dev, entry->ims_id);
>> + else
>> + irq = 0;
>> +
>> + pasid_en = mdev_irq->pasid != INVALID_IOASID ? true : false;
>> +
>> + /* IMS and invalid pasid is not a valid configuration */
>> + if (entry->ims && !pasid_en)
>> + return -EINVAL;
>> +
>> + if (entry->trigger) {
>> + if (irq) {
>> + irq_bypass_unregister_producer(&entry->producer);
>> + free_irq(irq, entry->trigger);
>> + if (pasid_en) {
>> + auxval = ims_ctrl_pasid_aux(0, false);
>> + irq_set_auxdata(irq, IMS_AUXDATA_CONTROL_WORD, auxval);
>> + }
>> + }
>> + kfree(entry->name);
>> + eventfd_ctx_put(entry->trigger);
>> + entry->trigger = NULL;
>> + }
>> +
>> + if (fd < 0)
>> + return 0;
>> +
>> + name = kasprintf(GFP_KERNEL, "vfio-mdev-irq[%d](%s)", vector, dev_name(dev));
>> + if (!name)
>> + return -ENOMEM;
>> +
>> + trigger = eventfd_ctx_fdget(fd);
>> + if (IS_ERR(trigger)) {
>> + kfree(name);
>> + return PTR_ERR(trigger);
>> + }
>> +
>> + entry->name = name;
>> + entry->trigger = trigger;
>> +
>> + if (!irq)
>> + return 0;
>> +
>> + if (pasid_en) {
>> + auxval = ims_ctrl_pasid_aux(mdev_irq->pasid, true);
>> + rc = irq_set_auxdata(irq, IMS_AUXDATA_CONTROL_WORD, auxval);
>> + if (rc < 0)
>> + goto err;
> Why is anything to do with PASID here? Something has gone wrong with
> the layers I suspect..
>
> Oh yes. drivers/irqchip/irq-ims-msi.c is dxd specific and shouldn't be
> pretending to be common code.
>
> The protocol to stuff the pasid and other stuff into the auxdata is
> also completely idxd specific and is just a hacky way to communicate
> from this code to the IDXD irq-chip.
>
> So this doesn't belong here either. Pass in the auxdata from the idxd
> code and I'd rename the irq-ims-msi to irq-ims-idxd
>
>> +static int mdev_msix_enable(struct mdev_irq *mdev_irq, int nvec)
>> +{
>> + struct mdev_device *mdev = irq_to_mdev(mdev_irq);
>> + struct device *dev;
>> + int rc;
>> +
>> + if (nvec != mdev_irq->num)
>> + return -EINVAL;
>> +
>> + if (mdev_irq->ims_num) {
>> + dev = &mdev->dev;
>> + rc = msi_domain_alloc_irqs(dev_get_msi_domain(dev), dev, mdev_irq->ims_num);
> Huh? The PCI device should be the only device touching IRQ stuff. I'm
> nervous to see you mix in the mdev struct device into this function.
As we talked about in the other thread, we have a single IMS domain per
device. The domain is set on the mdev 'struct device' and we allocate
the vectors to each mdev 'struct device' so we can manage those IMS
vectors specifically for that mdev.
>
> Isn't the msi_domain just idxd->ims_domain?
Yes
>
> Jason
On Thu, May 27, 2021 at 06:49:59PM -0700, Dave Jiang wrote:
> > > +static int mdev_msix_enable(struct mdev_irq *mdev_irq, int nvec)
> > > +{
> > > + struct mdev_device *mdev = irq_to_mdev(mdev_irq);
> > > + struct device *dev;
> > > + int rc;
> > > +
> > > + if (nvec != mdev_irq->num)
> > > + return -EINVAL;
> > > +
> > > + if (mdev_irq->ims_num) {
> > > + dev = &mdev->dev;
> > > + rc = msi_domain_alloc_irqs(dev_get_msi_domain(dev), dev, mdev_irq->ims_num);
> > Huh? The PCI device should be the only device touching IRQ stuff. I'm
> > nervous to see you mix in the mdev struct device into this function.
>
> As we talked about in the other thread, we have a single IMS domain per
> device. The domain is set on the mdev 'struct device' and we allocate the
> vectors to each mdev 'struct device' so we can manage those IMS vectors
> specifically for that mdev.
That is not the point, I'm asking if you should be calling
dev_set_msi_domain(mdev) at all
Jason
On 5/28/2021 5:21 AM, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 06:49:59PM -0700, Dave Jiang wrote:
>>>> +static int mdev_msix_enable(struct mdev_irq *mdev_irq, int nvec)
>>>> +{
>>>> + struct mdev_device *mdev = irq_to_mdev(mdev_irq);
>>>> + struct device *dev;
>>>> + int rc;
>>>> +
>>>> + if (nvec != mdev_irq->num)
>>>> + return -EINVAL;
>>>> +
>>>> + if (mdev_irq->ims_num) {
>>>> + dev = &mdev->dev;
>>>> + rc = msi_domain_alloc_irqs(dev_get_msi_domain(dev), dev, mdev_irq->ims_num);
>>> Huh? The PCI device should be the only device touching IRQ stuff. I'm
>>> nervous to see you mix in the mdev struct device into this function.
>> As we talked about in the other thread, we have a single IMS domain per
>> device. The domain is set on the mdev 'struct device' and we allocate the
>> vectors to each mdev 'struct device' so we can manage those IMS vectors
>> specifically for that mdev.
> That is not the point, I'm asking if you should be calling
> dev_set_msi_domain(mdev) at all
I'm not familiar with the standard way of doing this. Should I not set
the domain on the mdev 'struct device' because I can have multiple mdevs
using the same domain? With the domain set, I am able to retrieve it and
call msi_domain_alloc_irqs() in common code. Alternatively we can
pass in the domain during init and not rely on dev->msi_domain.
>
> Jason
On Fri, May 28, 2021 at 09:37:56AM -0700, Dave Jiang wrote:
>
> On 5/28/2021 5:21 AM, Jason Gunthorpe wrote:
> > On Thu, May 27, 2021 at 06:49:59PM -0700, Dave Jiang wrote:
> > > > > +static int mdev_msix_enable(struct mdev_irq *mdev_irq, int nvec)
> > > > > +{
> > > > > + struct mdev_device *mdev = irq_to_mdev(mdev_irq);
> > > > > + struct device *dev;
> > > > > + int rc;
> > > > > +
> > > > > + if (nvec != mdev_irq->num)
> > > > > + return -EINVAL;
> > > > > +
> > > > > + if (mdev_irq->ims_num) {
> > > > > + dev = &mdev->dev;
> > > > > + rc = msi_domain_alloc_irqs(dev_get_msi_domain(dev), dev, mdev_irq->ims_num);
> > > > Huh? The PCI device should be the only device touching IRQ stuff. I'm
> > > > nervous to see you mix in the mdev struct device into this function.
> > > As we talked about in the other thread, we have a single IMS domain per
> > > device. The domain is set on the mdev 'struct device' and we allocate the
> > > vectors to each mdev 'struct device' so we can manage those IMS vectors
> > > specifically for that mdev.
> > That is not the point, I'm asking if you should be calling
> > dev_set_msi_domain(mdev) at all
>
> I'm not familiar with the standard way of doing this. Should I not set the
> domain on the mdev 'struct device' because I can have multiple mdevs using
> the same domain? With the domain set, I am able to retrieve it and call
> msi_domain_alloc_irqs() in common code. Alternatively we can pass in the
> domain during init and not rely on dev->msi_domain.
Honestly, I don't know. I would prefer Thomas confirm what is the
correct way to use the msi_domain as IDXD is going to be the reference
everyone copies.
Jason
On Fri, May 28 2021 at 13:39, Jason Gunthorpe wrote:
> On Fri, May 28, 2021 at 09:37:56AM -0700, Dave Jiang wrote:
>> On 5/28/2021 5:21 AM, Jason Gunthorpe wrote:
>> > On Thu, May 27, 2021 at 06:49:59PM -0700, Dave Jiang wrote:
>> > > > > +static int mdev_msix_enable(struct mdev_irq *mdev_irq, int nvec)
>> > > > > +{
>> > > > > + struct mdev_device *mdev = irq_to_mdev(mdev_irq);
>> > > > > + struct device *dev;
>> > > > > + int rc;
>> > > > > +
>> > > > > + if (nvec != mdev_irq->num)
>> > > > > + return -EINVAL;
>> > > > > +
>> > > > > + if (mdev_irq->ims_num) {
>> > > > > + dev = &mdev->dev;
>> > > > > + rc = msi_domain_alloc_irqs(dev_get_msi_domain(dev), dev, mdev_irq->ims_num);
>> > > >
>> > > > Huh? The PCI device should be the only device touching IRQ stuff. I'm
>> > > > nervous to see you mix in the mdev struct device into this function.
>> > >
>> > > As we talked about in the other thread, we have a single IMS domain per
>> > > device. The domain is set on the mdev 'struct device' and we allocate the
>> > > vectors to each mdev 'struct device' so we can manage those IMS vectors
>> > > specifically for that mdev.
>> >
>> > That is not the point, I'm asking if you should be calling
>> > dev_set_msi_domain(mdev) at all
>>
>> I'm not familiar with the standard way of doing this. Should I not set the
>> domain on the mdev 'struct device' because I can have multiple mdevs using
>> the same domain? With the domain set, I am able to retrieve it and call
>> msi_domain_alloc_irqs() in common code. Alternatively we can pass in the
>> domain during init and not rely on dev->msi_domain.
>
> Honestly, I don't know. I would prefer Thomas confirm what is the
> correct way to use the msi_domain as IDXD is going to be the reference
> everyone copies.
The general expectation is that the MSI irqdomain is retrievable from
struct device for any device which supports MSI.
Thanks,
tglx
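To make that concrete, a minimal sketch of the pattern as it would apply
here; only the dev_set_msi_domain()/dev_get_msi_domain() accessors and
msi_domain_alloc_irqs() are real APIs, the surrounding flow is assumed:

  /* In the parent (idxd) driver, when the mdev is created: */
  dev_set_msi_domain(&mdev->dev, idxd->ims_domain);

  /* Later, in common code that only sees the mdev: */
  struct irq_domain *domain = dev_get_msi_domain(&mdev->dev);

  if (!domain)
          return -ENODEV;
  return msi_domain_alloc_irqs(domain, &mdev->dev, nvec);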
On Fri, May 21 2021 at 17:19, Dave Jiang wrote:
> Add common helper code to setup IMS once the MSI domain has been
> setup by the device driver. The main helper function is
> mdev_ims_set_msix_trigger() that is called by the VFIO ioctl
> VFIO_DEVICE_SET_IRQS. The function deals with the setup and
> teardown of the emulated and IMS-backed eventfds that get exported
> to the guest kernel via VFIO as MSI-X vectors.
So this talks about IMS, but the functionality is all named mdev_msix*
and mdev_irqs*. Confused.
> +/*
> + * Mediate device IMS library code
Mediated?
> +static int mdev_msix_set_vector_signal(struct mdev_irq *mdev_irq, int vector, int fd)
> +{
> + int rc, irq;
> + struct mdev_device *mdev = irq_to_mdev(mdev_irq);
> + struct mdev_irq_entry *entry;
> + struct device *dev = &mdev->dev;
> + struct eventfd_ctx *trigger;
> + char *name;
> + bool pasid_en;
> + u32 auxval;
> +
> + if (vector < 0 || vector >= mdev_irq->num)
> + return -EINVAL;
> +
> + entry = &mdev_irq->irq_entries[vector];
> +
> + if (entry->ims)
> + irq = dev_msi_irq_vector(dev, entry->ims_id);
> + else
> + irq = 0;
I have no idea what this does. Comments are overrated...
Aside from that, dev_msi_irq_vector() seems to be a gross misnomer. AFAICT
it retrieves the Linux interrupt number and not some vector.
> + pasid_en = mdev_irq->pasid != INVALID_IOASID ? true : false;
> +
> + /* IMS and invalid pasid is not a valid configuration */
> + if (entry->ims && !pasid_en)
> + return -EINVAL;
Why is this not validated already?
> + if (entry->trigger) {
> + if (irq) {
> + irq_bypass_unregister_producer(&entry->producer);
> + free_irq(irq, entry->trigger);
> + if (pasid_en) {
> + auxval = ims_ctrl_pasid_aux(0, false);
> + irq_set_auxdata(irq, IMS_AUXDATA_CONTROL_WORD, auxval);
Why can't this be done in the irq chip when the interrupt is torn down?
Just because the irq chip driver, which is thankfully not merged yet,
has been implemented that way?
I did this aux dance because someone explained to me that this has to be
handled separately and has to be changed independently of all the
interrupt setup and whatever. But looking at the actual usage now that's
clearly not the case.
What's the exact order of all this? I assume so:
1) mdev_irqs_init()
2) mdev_irqs_set_pasid()
3) mdev_set_msix_trigger()
Right? See below.
> +}
> +EXPORT_SYMBOL_GPL(mdev_irqs_set_pasid);
> + if (fd < 0)
> + return 0;
> +
> + name = kasprintf(GFP_KERNEL, "vfio-mdev-irq[%d](%s)", vector, dev_name(dev));
> + if (!name)
> + return -ENOMEM;
> +
> + trigger = eventfd_ctx_fdget(fd);
> + if (IS_ERR(trigger)) {
> + kfree(name);
> + return PTR_ERR(trigger);
> + }
> +
> + entry->name = name;
> + entry->trigger = trigger;
> +
> + if (!irq)
> + return 0;
These exit conditions are completely confusing.
> + if (pasid_en) {
> + auxval = ims_ctrl_pasid_aux(mdev_irq->pasid, true);
> + rc = irq_set_auxdata(irq, IMS_AUXDATA_CONTROL_WORD, auxval);
> + if (rc < 0)
> + goto err;
Again. This can be handled in the interrupt chip when the interrupt is
set up through request_irq().
> +static int mdev_msix_enable(struct mdev_irq *mdev_irq, int nvec)
> +{
> + struct mdev_device *mdev = irq_to_mdev(mdev_irq);
> + struct device *dev;
> + int rc;
> +
> + if (nvec != mdev_irq->num)
> + return -EINVAL;
> +
> + if (mdev_irq->ims_num) {
> + dev = &mdev->dev;
> + rc = msi_domain_alloc_irqs(dev_get_msi_domain(dev), dev, mdev_irq->ims_num);
The allocation of the interrupts happens _after_ PASID has been
set and PASID is per device, right?
So the obvious place to store PASID is in struct device because the
device pointer is, for one, stored in the msi entry descriptor and it is
also handed down to the irq domain allocation function. So this can be
checked at allocation time already.
What's unclear to me is under which circumstances does the IMS interrupt
require a PASID.
1) Always
2) Use case dependent
Thanks,
tglx
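A rough sketch of the ordering Thomas suggests, with the PASID stored
against the device before the vectors are allocated, so the domain's
prepare callback can reject a bad configuration up front. The msi_prepare
signature is the real msi_domain_ops hook; vidxd_dev_pasid() is a
hypothetical lookup helper:

  static int idxd_ims_prepare(struct irq_domain *domain, struct device *dev,
                              int nvec, msi_alloc_info_t *arg)
  {
          ioasid_t pasid = vidxd_dev_pasid(dev);  /* hypothetical lookup */

          /* Reject the allocation here instead of fixing it up later */
          if (pasid == INVALID_IOASID)
                  return -EINVAL;

          return 0;
  }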
On Sun, May 23 2021 at 20:50, Jason Gunthorpe wrote:
> On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
>> @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
>> return rc;
>> }
>>
>> + ims_info.max_slots = idxd->ims_size;
>> + ims_info.slots = idxd->reg_base + idxd->ims_offset;
>> + idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
>> + if (!idxd->ims_domain) {
>> + dev_warn(dev, "Fail to acquire IMS domain\n");
>> + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
>> + return -ENODEV;
>> + }
>
> I'm quite surprised that every mdev doesn't create its own ims_domain
> in its probe function.
What for?
> This places a global total limit on the # of vectors which makes me
> ask what was the point of using IMS in the first place ?
That depends on how IMS is implemented. The IDXD variant has a
fixed-size message store which is shared between all subdevices, so yet
another domain would not provide any value.
For the case where the IMS store is separate, you still have one central
irqdomain per physical device. The domain allocation function can then
create storage on demand or reuse existing storage and just fill in the
pointers.
Thanks,
tglx
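A sketch of what that on-demand variant could look like for hardware
whose IMS storage is per subdevice. The irq_domain_ops alloc signature
and irq_domain_set_info() are real interfaces; struct ims_queue, the
ims_queue_get_or_create() helper and ims_queue_irq_chip are assumptions:

  struct ims_queue {                  /* hypothetical per-subdevice store */
          irq_hw_number_t base_slot;  /* first hw slot backing this queue */
  };

  static int ims_queue_domain_alloc(struct irq_domain *domain,
                                    unsigned int virq,
                                    unsigned int nr_irqs, void *arg)
  {
          /* create storage on demand, or reuse an existing queue */
          struct ims_queue *q = ims_queue_get_or_create(domain->host_data,
                                                        nr_irqs);
          unsigned int i;

          if (IS_ERR(q))
                  return PTR_ERR(q);

          /* fill in the per-irq pointers to the queue's slots */
          for (i = 0; i < nr_irqs; i++)
                  irq_domain_set_info(domain, virq + i, q->base_slot + i,
                                      &ims_queue_irq_chip, q,
                                      handle_edge_irq, NULL, NULL);
          return 0;
  }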
On Mon, May 31, 2021 at 03:48:35PM +0200, Thomas Gleixner wrote:
> What's unclear to me is under which circumstances does the IMS interrupt
> require a PASID.
>
> 1) Always
> 2) Use case dependent
It is just a weird IDXD thing. The PASID is serving as some VM
identifier in that HW, somehow, AFAIK.
Jason
On Mon, May 31, 2021 at 04:02:02PM +0200, Thomas Gleixner wrote:
> On Sun, May 23 2021 at 20:50, Jason Gunthorpe wrote:
> > On Fri, May 21, 2021 at 05:20:37PM -0700, Dave Jiang wrote:
> >> @@ -77,8 +80,18 @@ int idxd_mdev_host_init(struct idxd_device *idxd, struct mdev_driver *drv)
> >> return rc;
> >> }
> >>
> >> + ims_info.max_slots = idxd->ims_size;
> >> + ims_info.slots = idxd->reg_base + idxd->ims_offset;
> >> + idxd->ims_domain = pci_ims_array_create_msi_irq_domain(idxd->pdev, &ims_info);
> >> + if (!idxd->ims_domain) {
> >> + dev_warn(dev, "Fail to acquire IMS domain\n");
> >> + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
> >> + return -ENODEV;
> >> + }
> >
> > I'm quite surprised that every mdev doesn't create its own ims_domain
> > in its probe function.
>
> What for?
IDXD wouldn't need it, but proper IMS HW with no bound on the number of
vectors can't provide an ims_info.max_slots value here.
Instead each use site, like VFIO, would want to specify the number
of vectors to allocate for its own usage, then parcel them out one by
one in the normal way. Basically VFIO is emulating a normal MSI-X
table.
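For instance, a sketch of that parceling, reusing the series'
dev_msi_irq_vector() helper to map each emulated MSI-X vector to its
Linux irq (handler and names are illustrative):

  /* Wire each guest-visible MSI-X vector to one allocated IMS irq */
  for (i = 0; i < nvec; i++) {
          int irq = dev_msi_irq_vector(&mdev->dev, i);

          rc = request_irq(irq, vidxd_ims_handler, 0, "vidxd-ims",
                           &vidxd->irq_entries[i]);
          if (rc)
                  goto unwind;
  }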
> > This places a global total limit on the # of vectors which makes me
> > ask what was the point of using IMS in the first place ?
>
> That depends on how IMS is implemented. The IDXD variant has a
> fixed-size message store which is shared between all subdevices, so yet
> another domain would not provide any value.
Right, IDXD would have been perfectly happy to use the normal MSI-X
table from what I can see.
> For the case where the IMS store is separate, you still have one central
> irqdomain per physical device. The domain allocation function can then
> create storage on demand or reuse existing storage and just fill in the
> pointers.
I think it is philosophically backwards, and it is in part what is
motivating the pretense that this weird auxdomain and PASID stuff is generic.
The VFIO model is that the IRQ table is associated with a VM. When the
vfio_device is created it decides how big the MSI-X table will be and
it needs to allocate a block of interrupts to emulate it. For security
those interrupts need to be linked in the HW to the vfio_device and
the VM. ie VM A cannot trigger an interrupt that would deliver to VM
B.
IDXD chose to use the PASID, but other HW might use a generic VM_ID.
Further, IDXD chose to use a VM_ID per IMS entry, but other HW is
likely to use a VM_ID per block of IMS entries. I.e. the HW lookup starts
at a VM object, then locates the IMS table for that object, then triggers
the interrupt.
If we think about the latter sort of HW, I don't think the whole aux
data and domain per PCI function makes a lot of sense. You'd want a
domain per VM_ID, with all the IMS entries in that domain sharing the same
VM_ID. In this regard the irq domain will correspond to the security
boundary.
While IDXD is probably fine to organize its domains like this, I am
surprised to learn there is basically no reason for it to be using
IMS.
Jason
On Mon, May 31 2021 at 13:57, Jason Gunthorpe wrote:
> On Mon, May 31, 2021 at 04:02:02PM +0200, Thomas Gleixner wrote:
>> > I'm quite surprised that every mdev doesn't create its own ims_domain
>> > in its probe function.
>>
>> What for?
>
> IDXD wouldn't need it, but proper IMS HW with no bound on the number of
> vectors can't provide an ims_info.max_slots value here.
There is no need to do so:
https://lore.kernel.org/r/[email protected]
which has the IMS_MSI_QUEUE variant, which you looked at and said:
"I haven't looked through everything in detail, but this does look like
it is good for the mlx5 devices."
ims_info.max_slots is a property of the IMS_MSI_ARRAY and does not impose
any restrictions on other storage.
> Instead each use site, like VFIO, would want to specify the number
> of vectors to allocate for its own usage, then parcel them out one by
> one in the normal way. Basically VFIO is emulating a normal MSI-X
> table.
Just with a size which exceeds a normal MSI-X table, but that's an
implementation detail of the underlying physical device. It does not put
any restrictions on mdev at all.
>> > This places a global total limit on the # of vectors which makes me
>> > ask what was the point of using IMS in the first place ?
>>
>> That depends on how IMS is implemented. The IDXD variant has a
>> fixed-size message store which is shared between all subdevices, so yet
>> another domain would not provide any value.
>
> Right, IDXD would have been perfectly happy to use the normal MSI-X
> table from what I can see.
Again. No, it's a storage size problem and regular MSI-X does not
support auxiliary data.
>> For the case where the IMS store is separate, you still have one central
>> irqdomain per physical device. The domain allocation function can then
>> create storage on demand or reuse existing storage and just fill in the
>> pointers.
>
> I think it is philosophically backwards, and it is in part what is
> motivating the pretense that this weird auxdomain and PASID stuff is generic.
That's a different story, and as I explained to Dave already, hacking all
this into mdev is backwards, but that does not make your idea of an
irqdomain per mdev any more reasonable.
The mdev does not do anything irq chip/domain related. It uses what the
underlying physical device provides. If you think otherwise then please
provide me the hierarchical model which I explained here:
https://lore.kernel.org/r/[email protected]
https://lore.kernel.org/r/[email protected]
> The VFIO model is that the IRQ table is associated with a VM. When the
> vfio_device is created it decides how big the MSI-X table will be and
> it needs to allocate a block of interrupts to emulate it. For security
> those interrupts need to be linked in the HW to the vfio_device and
> the VM. ie VM A cannot trigger an interrupt that would deliver to VM
> B.
Fine.
> IDXD chose to use the PASID, but other HW might use a generic VM_ID.
So what?
> Further, IDXD chose to use a VM_ID per IMS entry, but other HW is
> likely to use a VM_ID per block of IMS entries. I.e. the HW lookup starts
> at a VM object, then locates the IMS table for that object, then triggers
> the interrupt.
If you read my other reply to Dave carefully then you might have noticed
that this is crap and irrelevant because the ID (whatever it is) is per
device and that ID has to be stored in the device. Whether the actual
irq chip/domain driver implementation uses it per associated irq or not
does not matter at all.
> If we think about the latter sort of HW, I don't think the whole aux
> data and domain per PCI function makes a lot of sense.
First of all, that is already debunked and will go nowhere, and second
there is no requirement to implement this for some other incarnation of
IMS when done correctly. That whole irq_set_auxdata() stuff is not going
anywhere simply because it's not needed at all.
All that's needed is a function to store some sort of ID per device
(mdev) and the underlying IMS driver takes care of what to do with it.
That has to happen before the interrupts are allocated and if that info
is invalid then the allocation function can reject it.
> You'd want a domain per VM_ID, with all the IMS entries in that domain
> sharing the same VM_ID. In this regard the irq domain will correspond to
> the security boundary.
The real problems are:
- Intel misled me with the requirement to set PASID after the fact
which is simply wrong and what caused me to come up with that
irq_set_auxdata() workaround.
- Their absolute ignorance of proper layering led to adding all that
irq_set_auxdata() muck to this mdev library.
Ergo, the proper thing to do is to fix this ID storage problem (PASID,
VM_ID or whatever) at the proper place, i.e. store it in struct device
(which is associated with that mdev) and let the individual drivers handle
it as they require.
It's that simple and this needs to be fixed and not some made up
philosophical question about irqdomains per mdev. Those would be even
worse than what Intel did here.
Thanks,
tglx
On Tue, Jun 01, 2021 at 01:55:22AM +0200, Thomas Gleixner wrote:
> On Mon, May 31 2021 at 13:57, Jason Gunthorpe wrote:
> > On Mon, May 31, 2021 at 04:02:02PM +0200, Thomas Gleixner wrote:
> >> > I'm quite surprised that every mdev doesn't create its own ims_domain
> >> > in its probe function.
> >>
> >> What for?
> >
> > IDXD wouldn't need it, but proper IMS HW with no bound of number of
> > vectors can't provide a ims_info.max_slots value here.
>
> There is no need to do so:
>
> https://lore.kernel.org/r/[email protected]
>
> which has the IMS_MSI_QUEUE variant, which you looked at and said:
>
> "I haven't looked through everything in detail, but this does look like
> it is good for the mlx5 devices."
>
> ims_info.max_slots is a property of the IMS_MSI_ARRAY and does not impose
> any restrictions on other storage.
Ok, it has been a while since then
> >> That depends on how IMS is implemented. The IDXD variant has a fixed
> >> sized message store which is shared between all subdevices, so yet
> >> another domain would not provide any value.
> >
> > Right, IDXD would have been perfectly happy to use the normal MSI-X
> > table from what I can see.
>
> Again. No, it's a storage size problem and regular MSI-X does not
> support auxiliary data.
I mean the IDXD HW could have been designed with a normal format MSI-X
table and a side table with the PASID.
> Ergo, the proper thing to do is to fix this ID storage problem (PASID,
> VM_ID or whatever) at the proper place, i.e. store it in struct device
> (which is associated to that mdev) and let the individual drivers handle
> it as they require.
If the struct device defines all the details of how to place the IRQ
into the HW, including what HW table to use, then it seems like it
could work.
I don't clearly remember all the details anymore, so let's look at how
non-IDXD devices might work when HW actually comes.
Jason
On 5/23/2021 4:22 PM, Jason Gunthorpe wrote:
> On Fri, May 21, 2021 at 05:19:05PM -0700, Dave Jiang wrote:
>> Introducing the mdev type “1dwq-v1”. This mdev type allows
>> allocation of a single dedicated wq from available dedicated wqs. After
>> a workqueue (wq) is enabled, the user will generate a uuid. On mdev
>> creation, the mdev driver code will find a dwq depending on the mdev
>> type. When the create operation is successful, the user-generated uuid
>> can be passed to qemu. When the guest boots up, it should discover a
>> DSA device when doing PCI discovery.
>>
>> For example, for the “1dwq-v1” type:
>> 1. Enable wq with “mdev” wq type
>> 2. Generate a uuid.
>> 3. The uuid is written to the mdev class sysfs path:
>> echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-1dwq-v1/create
>> 4. Pass the following parameter to qemu:
>> "-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"
> So the idxd core driver knows to create a "vfio" wq with its own
> machinery but you still want to involve the horrible mdev guid stuff?
>
> Why??
Are you referring to calling mdev_device_create() directly in the mdev
idxd_driver probe? I think this would work with our dedicated wq where a
single mdev can be assigned to a wq. However, later on when we need to
support shared wq where we can create multiple mdevs per wq, we'll need
an entry point to do so. In the name of making things consistent from a
user perspective, going through sysfs seems the way to do it.
On Wed, Jun 02, 2021 at 08:40:51AM -0700, Dave Jiang wrote:
>
> On 5/23/2021 4:22 PM, Jason Gunthorpe wrote:
> > On Fri, May 21, 2021 at 05:19:05PM -0700, Dave Jiang wrote:
> > > Introducing the mdev type “1dwq-v1”. This mdev type allows
> > > allocation of a single dedicated wq from available dedicated wqs. After
> > > a workqueue (wq) is enabled, the user will generate a uuid. On mdev
> > > creation, the mdev driver code will find a dwq depending on the mdev
> > > type. When the create operation is successful, the user-generated uuid
> > > can be passed to qemu. When the guest boots up, it should discover a
> > > DSA device when doing PCI discovery.
> > >
> > > For example, for the “1dwq-v1” type:
> > > 1. Enable wq with “mdev” wq type
> > > 2. Generate a uuid.
> > > 3. The uuid is written to the mdev class sysfs path:
> > > echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-1dwq-v1/create
> > > 4. Pass the following parameter to qemu:
> > > "-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"
> > So the idxd core driver knows to create a "vfio" wq with its own
> > machinery but you still want to involve the horrible mdev guid stuff?
> >
> > Why??
>
> Are you referring to calling mdev_device_create() directly in the mdev
> idxd_driver probe?
No, just call vfio_register_group_dev and forget about mdev.
> I think this would work with our dedicated wq where a single mdev
> can be assigned to a wq.
Ok, sounds great
> However, later on when we need to support shared wq where we can
> create multiple mdevs per wq, we'll need an entry point to do so. In
> the name of making things consistent from a user perspective, going
> through sysfs seems the way to do it.
Why not use your already very complicated idxd sysfs to do this?
Jason
> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, June 3, 2021 7:18 AM
>
> On Wed, Jun 02, 2021 at 08:40:51AM -0700, Dave Jiang wrote:
> >
> > On 5/23/2021 4:22 PM, Jason Gunthorpe wrote:
> > > On Fri, May 21, 2021 at 05:19:05PM -0700, Dave Jiang wrote:
> > > > Introducing the mdev type “1dwq-v1”. This mdev type allows
> > > > allocation of a single dedicated wq from available dedicated wqs. After
> > > > a workqueue (wq) is enabled, the user will generate a uuid. On mdev
> > > > creation, the mdev driver code will find a dwq depending on the mdev
> > > > type. When the create operation is successful, the user-generated uuid
> > > > can be passed to qemu. When the guest boots up, it should discover a
> > > > DSA device when doing PCI discovery.
> > > >
> > > > For example, for the “1dwq-v1” type:
> > > > 1. Enable wq with “mdev” wq type
> > > > 2. Generate a uuid.
> > > > 3. The uuid is written to the mdev class sysfs path:
> > > > echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-1dwq-v1/create
> > > > 4. Pass the following parameter to qemu:
> > > > "-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"
> > > So the idxd core driver knows to create a "vfio" wq with its own
> > > machinery but you still want to involve the horrible mdev guid stuff?
> > >
> > > Why??
> >
> > Are you referring to calling mdev_device_create() directly in the mdev
> > idxd_driver probe?
>
> No, just call vfio_register_group_dev and forget about mdev.
>
> > I think this would work with our dedicated wq where a single mdev
> > can be assigned to a wq.
>
> Ok, sounds great
>
> > However, later on when we need to support shared wq where we can
> > create multiple mdevs per wq, we'll need an entry point to do so. In
> > the name of making things consistent from a user perspective, going
> > through sysfs seems the way to do it.
>
> Why not use your already very complicated idxd sysfs to do this?
>
Jason, can you clarify your attitude on mdev guid stuff? Are you
completely against it or case-by-case? If the former, this is a big
decision thus it's better to have consensus with Alex/Kirti. If the
latter, would like to hear your criteria for when it can be used
and when not...
Thanks
Kevin
On Thu, Jun 03, 2021 at 01:11:37AM +0000, Tian, Kevin wrote:
> Jason, can you clarify your attitude on mdev guid stuff? Are you
> completely against it or case-by-case? If the former, this is a big
> decision thus it's better to have consensus with Alex/Kirti. If the
> latter, would like to hear your criteria for when it can be used
> and when not...
I dislike it generally, but it exists so <shrug>. I know others feel
more strongly about it being un-kernely and the wrong way to use sysfs.
Here I was remarking how the example in the cover letter made the mdev
part seem totally pointless. If it is pointless then don't do it.
Remember we have stripped away the actual special need to use
mdev. You don't *have* to use mdev anymore to use vfio. That is a
significant ideology change even from a few months ago.
Jason
> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, June 3, 2021 9:50 AM
>
> On Thu, Jun 03, 2021 at 01:11:37AM +0000, Tian, Kevin wrote:
>
> > Jason, can you clarify your attitude on mdev guid stuff? Are you
> > completely against it or case-by-case? If the former, this is a big
> > decision thus it's better to have consensus with Alex/Kirti. If the
> > latter, would like to hear your criteria for when it can be used
> > and when not...
>
> I dislike it generally, but it exists so <shrug>. I know others feel
> more strongly about it being un-kernely and the wrong way to use sysfs.
>
> Here I was remarking how the example in the cover letter made the mdev
> part seem totally pointless. If it is pointless then don't do it.
Is your point that, as long as an mdev requires pre-config
through driver-specific sysfs, it doesn't make sense to use
the mdev guid interface anymore?
The value of the mdev guid interface is that it provides a
vendor-agnostic interface for mdev life-cycle management, which allows
one-enable-fits-all in the upper management stack. Requiring
vendor-specific pre-config does blur the boundary here.
Alex/Kirti/Cornelia, what is your opinion here? It's better
if we can have a consensus on when and where the existing
mdev sysfs could be used, as this will affect every new mdev
implementation from now on.
>
> Remember we have stripped away the actual special need to use
> mdev. You don't *have* to use mdev anymore to use vfio. That is a
> significant ideology change even from a few months ago.
>
Yes, "don't have to" but if there is value of doing so it's
not necessary to blocking it? One point in my mind is that if
we should minimize vendor-specific contracts for user to
manage mdev or subdevice...
Thanks
Kevin
On Thu, Jun 03, 2021 at 05:52:58AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Thursday, June 3, 2021 9:50 AM
> >
> > On Thu, Jun 03, 2021 at 01:11:37AM +0000, Tian, Kevin wrote:
> >
> > > Jason, can you clarify your attitude on mdev guid stuff? Are you
> > > completely against it or case-by-case? If the former, this is a big
> > > decision thus it's better to have consensus with Alex/Kirti. If the
> > > latter, would like to hear your criteria for when it can be used
> > > and when not...
> >
> > I dislike it generally, but it exists so <shrug>. I know others feel
> > more strongly about it being un-kernely and the wrong way to use sysfs.
> >
> > Here I was remarking how the example in the cover letter made the mdev
> > part seem totally pointless. If it is pointless then don't do it.
>
> Is your point that, as long as an mdev requires pre-config
> through driver-specific sysfs, it doesn't make sense to use
> the mdev guid interface anymore?
Yes
> The value of the mdev guid interface is that it provides a
> vendor-agnostic interface for mdev life-cycle management, which allows
> one-enable-fits-all in the upper management stack. Requiring
> vendor-specific pre-config does blur the boundary here.
It isn't even vendor-agnostic - understanding the mdev_type
configuration stuff is still vendor specific.
Jason
On Thu, 3 Jun 2021 05:52:58 +0000
"Tian, Kevin" <[email protected]> wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Thursday, June 3, 2021 9:50 AM
> >
> > On Thu, Jun 03, 2021 at 01:11:37AM +0000, Tian, Kevin wrote:
> >
> > > Jason, can you clarify your attitude on mdev guid stuff? Are you
> > > completely against it or case-by-case? If the former, this is a big
> > > decision thus it's better to have consensus with Alex/Kirti. If the
> > > latter, would like to hear your criteria for when it can be used
> > > and when not...
> >
> > I dislike it generally, but it exists so <shrug>. I know others feel
> > more strongly about it being un-kernely and the wrong way to use sysfs.
> >
> > Here I was remarking how the example in the cover letter made the mdev
> > part seem totally pointless. If it is pointless then don't do it.
>
> Is your point that, as long as an mdev requires pre-config
> through driver-specific sysfs, it doesn't make sense to use
> the mdev guid interface anymore?
Can you describe exactly what step 1. is doing in this case from the
original cover letter ("Enable wq with "mdev" wq type")? That does
sound a bit like configuring something to use mdev then separately
going to the trouble of creating the mdev. As Jason suggests, if a wq
is tagged for mdev/vfio, it could just register itself as a vfio bus
driver.
But if we want to use mdev, why doesn't available_instances for your
mdev type simply report all unassigned wqs, with `echo $UUID > create`
grabbing a wq for mdev? That would remove this pre-config contention,
right?
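A sketch of that alternative, assuming the post-refactor mdev type
attribute signatures; idxd_count_unassigned_dwqs() is a hypothetical
helper:

  static ssize_t available_instances_show(struct mdev_type *mtype,
                                          struct mdev_type_attribute *attr,
                                          char *buf)
  {
          struct idxd_device *idxd =
                  dev_get_drvdata(mtype_get_parent_dev(mtype));

          /* report how many dedicated wqs are still free to grab */
          return sysfs_emit(buf, "%d\n", idxd_count_unassigned_dwqs(idxd));
  }
  static MDEV_TYPE_ATTR_RO(available_instances);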
> The value of the mdev guid interface is that it provides a
> vendor-agnostic interface for mdev life-cycle management, which allows
> one-enable-fits-all in the upper management stack. Requiring
> vendor-specific pre-config does blur the boundary here.
We need to be careful about using work-avoidance in the upper
management stack as a primary use case for an interface though.
> Alex/Kirti/Cornelia, what is your opinion here? It's better
> if we can have a consensus on when and where the existing
> mdev sysfs could be used, as this will affect every new mdev
> implementation from now on.
I have a hard time defining some fixed criteria for using mdev. It's
essentially always been true that vendors could write their own vfio
"bus driver", like vfio-pci or vfio-platform, specific to their device.
Mdevs were meant to be a way for the (non-vfio) driver of a device to
expose portions of the device through mediation for use with vfio. It
seems like that's largely being done here.
What I think has changed recently is this desire to make it easier to
create those vendor drivers and some promise of making module binding
work to avoid the messiness around picking a driver for the device. In
the auxiliary bus case that I think Jason is working on, it sounds like
the main device driver exposes portions of itself on an auxiliary bus
where drivers on that bus can integrate into the vfio subsystem. It
starts to get pretty fuzzy with what mdev already does, but it's also a
more versatile interface. Is it right for everyone? Probably not.
Is the pre-configuration issue here really a push vs pull problem? I
can see the requirement in step 1. is dedicating some resources to an
mdev use case, so at that point it seems like the argument is whether we
should just create aux bus devices that get automatically bound to a
vendor vfio-pci variant and we avoid the mdev lifecycle, which is both
convenient and ugly. On the other hand, mdev has more of a pull
interface, ie. here are a bunch of device types and how many of each we
can support, use create to pull what you need.
> > Remember we have stripped away the actual special need to use
> > mdev. You don't *have* to use mdev anymore to use vfio. That is a
> > significant ideology change even from a few months ago.
> >
>
> Yes, "don't have to" but if there is value of doing so it's
> not necessary to blocking it? One point in my mind is that if
> we should minimize vendor-specific contracts for user to
> manage mdev or subdevice...
Again, this in itself is not a great justification for using mdev,
we're creating vendor specific device types with vendor specific
additional features, that could all be done via some sort of netlink
interface too. The thing that pushes this more towards mdev for me is
that I don't think each of these wqs appear as devices to the host,
they're internal resources of the parent device and we want to compose
them in ways that are slightly more amenable to traditional mdevs... I
think. Thanks,
Alex
On Fri, 21 May 2021 17:20:19 -0700
Dave Jiang <[email protected]> wrote:
> Move some VFIO_PCI macros to a common header as they will be shared between
> mdev and vfio_pci.
No, this is the current implementation of vfio-pci, it's specifically
not meant to be a standard. Each vfio device driver is free to expose
regions on the device file descriptor as they wish. If you want to use
a 40-bit implementation as well, great, but it should not be imposed as
a standard. Thanks,
Alex
> Signed-off-by: Dave Jiang <[email protected]>
> ---
> drivers/vfio/pci/vfio_pci_private.h | 6 ------
> include/linux/vfio.h | 6 ++++++
> 2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index a17943911fcb..e644f981509c 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -18,12 +18,6 @@
> #ifndef VFIO_PCI_PRIVATE_H
> #define VFIO_PCI_PRIVATE_H
>
> -#define VFIO_PCI_OFFSET_SHIFT 40
> -
> -#define VFIO_PCI_OFFSET_TO_INDEX(off) (off >> VFIO_PCI_OFFSET_SHIFT)
> -#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> -#define VFIO_PCI_OFFSET_MASK (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> -
> /* Special capability IDs predefined access */
> #define PCI_CAP_ID_INVALID 0xFF /* default raw access */
> #define PCI_CAP_ID_INVALID_VIRT 0xFE /* default virt access */
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 3b372fa57ef4..ed5ca027eb49 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -15,6 +15,12 @@
> #include <linux/poll.h>
> #include <uapi/linux/vfio.h>
>
> +#define VFIO_PCI_OFFSET_SHIFT 40
> +
> +#define VFIO_PCI_OFFSET_TO_INDEX(off) ((off) >> VFIO_PCI_OFFSET_SHIFT)
> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> +#define VFIO_PCI_OFFSET_MASK (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> +
> struct vfio_device {
> struct device *dev;
> const struct vfio_device_ops *ops;
>
>
Hi, Alex,
Thanks for sharing your thoughts.
> From: Alex Williamson <[email protected]>
> Sent: Friday, June 4, 2021 11:40 AM
>
> On Thu, 3 Jun 2021 05:52:58 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Thursday, June 3, 2021 9:50 AM
> > >
> > > On Thu, Jun 03, 2021 at 01:11:37AM +0000, Tian, Kevin wrote:
> > >
> > > > Jason, can you clarify your attitude on mdev guid stuff? Are you
> > > > completely against it or case-by-case? If the former, this is a big
> > > > decision thus it's better to have consensus with Alex/Kirti. If the
> > > > latter, would like to hear your criteria for when it can be used
> > > > and when not...
> > >
> > > I dislike it generally, but it exists so <shrug>. I know others feel
> > > more strongly about it being un-kernely and the wrong way to use sysfs.
> > >
> > > Here I was remarking how the example in the cover letter made the mdev
> > > part seem totally pointless. If it is pointless then don't do it.
> >
> > Is your point that, as long as an mdev requires pre-config
> > through driver-specific sysfs, it doesn't make sense to use
> > the mdev guid interface anymore?
>
> Can you describe exactly what step 1. is doing in this case from the
> original cover letter ("Enable wq with "mdev" wq type")? That does
> sound a bit like configuring something to use mdev then separately
> going to the trouble of creating the mdev. As Jason suggests, if a wq
> is tagged for mdev/vfio, it could just register itself as a vfio bus
> driver.
I'll leave it to Dave to explain the exact details of step 1.
>
> But if we want to use mdev, why doesn't available_instances for your
> mdev type simply report all unassigned wqs, with `echo $UUID > create`
> grabbing a wq for mdev? That would remove this pre-config contention,
> right?
This way could also work. It sort of changes pre-config to post-config,
i.e. after an unassigned wq is grabbed for mdev, the admin then
configures additional vendor specific parameters (not initialized by
parent driver) before this mdev is assigned to a VM. Looks this is also
what NVIDIA is doing for their vGPU, with a cmdline tool (nvidia-smi)
and nvidia sysfs node for setting plugin parameters:
https://docs.nvidia.com/grid/latest/pdf/grid-vgpu-user-guide.pdf
But I'll leave to Dave again as there must be a reason why they choose
pre-config in the first place.
>
> > The value of mdev guid interface is providing a vendor-agnostic
> > interface for mdev life-cycle management which allows one-
> > enable-fit-all in upper management stack. Requiring vendor
> > specific pre-config does blur the boundary here.
>
> We need to be careful about using work-avoidance in the upper
> management stack as a primary use case for an interface though.
ok
>
> > Alex/Kirti/Cornelia, what about your opinion here? It would be better
> > if we can have a consensus on when and where the existing
> > mdev sysfs could be used, as this will affect every new mdev
> > implementation from now on.
>
> I have a hard time defining some fixed criteria for using mdev. It's
> essentially always been true that vendors could write their own vfio
> "bus driver", like vfio-pci or vfio-platform, specific to their device.
> Mdevs were meant to be a way for the (non-vfio) driver of a device to
> expose portions of the device through mediation for use with vfio. It
> seems like that's largely being done here.
>
> What I think has changed recently is this desire to make it easier to
> create those vendor drivers and some promise of making module binding
> work to avoid the messiness around picking a driver for the device. In
> the auxiliary bus case that I think Jason is working on, it sounds like
> the main device driver exposes portions of itself on an auxiliary bus
> where drivers on that bus can integrate into the vfio subsystem. It
> starts to get pretty fuzzy with what mdev already does, but it's also a
> more versatile interface. Is it right for everyone? Probably not.
idxd is also moving toward this model per Jason's suggestion. Although
the auxiliary bus is not directly used, the idxd driver has its own bus for
exposing a portion of its resources. From this angle, all the motivation
around the mdev bus does get fuzzy...
>
> Is the pre-configuration issue here really a push vs pull problem? I
> can see the requirement in step 1. is dedicating some resources to an
> mdev use case, so at that point it seems like the argument is whether we
> should just create aux bus devices that get automatically bound to a
> vendor vfio-pci variant and we avoid the mdev lifecycle, which is both
> convenient and ugly. On the other hand, mdev has more of a pull
> interface, ie. here are a bunch of device types and how many of each we
> can support, use create to pull what you need.
I see your point. It looks like what idxd is moving toward now is a mixed
model. The parent driver uses a push interface to initialize a pool of
instances which are then managed through mdev in a pull mode.
>
> > > Remember we have stripped away the actual special need to use
> > > mdev. You don't *have* to use mdev anymore to use vfio. That is a
> > > significant ideology change even from a few months ago.
> > >
> >
> > Yes, "don't have to" but if there is value of doing so it's
> > not necessary to blocking it? One point in my mind is that if
> > we should minimize vendor-specific contracts for user to
> > manage mdev or subdevice...
>
> Again, this in itself is not a great justification for using mdev,
> we're creating vendor specific device types with vendor specific
> additional features, that could all be done via some sort of netlink
> interface too. The thing that pushes this more towards mdev for me is
> that I don't think each of these wqs appear as devices to the host,
> they're internal resources of the parent device and we want to compose
> them in ways that are slightly more amenable to traditional mdevs... I
> think. Thanks,
>
Yes, this is one reason for going toward mdev.
btw I'm not clear on what the netlink interface will finally be, especially
whether any generic cmd should be defined across devices given
that subdevice management still has large generality. Jason, do you have
an example somewhere that we can take a look at regarding the mlx
netlink design?
Thanks
Kevin
On 6/6/2021 11:22 PM, Tian, Kevin wrote:
> Hi, Alex,
>
> Thanks for sharing your thoughts.
>
>> From: Alex Williamson <[email protected]>
>> Sent: Friday, June 4, 2021 11:40 AM
>>
>> On Thu, 3 Jun 2021 05:52:58 +0000
>> "Tian, Kevin" <[email protected]> wrote:
>>
>>>> From: Jason Gunthorpe <[email protected]>
>>>> Sent: Thursday, June 3, 2021 9:50 AM
>>>>
>>>> On Thu, Jun 03, 2021 at 01:11:37AM +0000, Tian, Kevin wrote:
>>>>
>>>>> Jason, can you clarify your attitude on mdev guid stuff? Are you
>>>>> completely against it or case-by-case? If the former, this is a big
>>>>> decision thus it's better to have consensus with Alex/Kirti. If the
>>>>> latter, would like to hear your criteria for when it can be used
>>>>> and when not...
>>>> I dislike it generally, but it exists so <shrug>. I know others feel
>>>> more strongly about it being un-kernely and the wrong way to use sysfs.
>>>>
>>>> Here I was remarking how the example in the cover letter made the mdev
>>>> part seem totally pointless. If it is pointless then don't do it.
>>> Is your point that as long as an mdev requires pre-config
>>> through driver-specific sysfs, then it doesn't make sense to use
>>> the mdev guid interface anymore?
>> Can you describe exactly what step 1. is doing in this case from the
>> original cover letter ("Enable wq with "mdev" wq type")? That does
>> sound a bit like configuring something to use mdev then separately
>> going to the trouble of creating the mdev. As Jason suggests, if a wq
>> is tagged for mdev/vfio, it could just register itself as a vfio bus
>> driver.
> I'll leave it to Dave to explain the exact details of step 1.
So in step 1, we 'tag' the wq to be dedicated to guest usage and put the
hardware wq into the enabled state. For a dedicated mode wq, we can
definitely just register directly and skip the mdev step. For a shared
wq mode, we can have multiple mdevs running on top of a single wq. So we
need some way to create more mdevs. We can either go with the existing
established creation path by mdev, or invent something custom for the
driver as Jason suggested to accommodate additional virtual devices for
guests. We implemented the mdev path originally with the consideration
that mdev is established and has a known interface already.
>
>> But if we want to use mdev, why doesn't available_instances for your
>> mdev type simply report all unassigned wq and the `echo $UUID > create`
>> grabs a wq for mdev? That would remove this pre-config contention,
>> right?
> This way could also work. It sort of changes pre-config to post-config,
> i.e. after an unassigned wq is grabbed for mdev, the admin then
> configures additional vendor-specific parameters (not initialized by the
> parent driver) before this mdev is assigned to a VM. It looks like this is
> also what NVIDIA is doing for their vGPU, with a cmdline tool (nvidia-smi)
> and an nvidia sysfs node for setting plugin parameters:
>
> https://docs.nvidia.com/grid/latest/pdf/grid-vgpu-user-guide.pdf
>
> But I'll leave it to Dave again as there must be a reason why they chose
> pre-config in the first place.
I think things become more complicated when we go from a dedicated wq to
a shared wq, where the wq : mdev relationship goes from 1 : 1 to 1 : N.
Keeping a consistent user config experience is also desired, especially
as we already have such behavior since kernel 5.6 for host usages. So we
really need to try to avoid doing wq configuration differently just for
"mdev" wqs. In the case suggested above, we basically just flipped the
configuration steps. The mdev is first created through the mdev sysfs
interface, and then the device parameters are configured. Whereas for us,
we configure the device parameters first, and then create the mdev. But
in the end, it's still the hybrid mdev setup, right?
>>> The value of mdev guid interface is providing a vendor-agnostic
>>> interface for mdev life-cycle management which allows one-
>>> enable-fit-all in upper management stack. Requiring vendor
>>> specific pre-config does blur the boundary here.
>> We need to be careful about using work-avoidance in the upper
>> management stack as a primary use case for an interface though.
> ok
>
>>> Alex/Kirti/Cornelia, what about your opinion here? It would be better
>>> if we can have a consensus on when and where the existing
>>> mdev sysfs could be used, as this will affect every new mdev
>>> implementation from now on.
>> I have a hard time defining some fixed criteria for using mdev. It's
>> essentially always been true that vendors could write their own vfio
>> "bus driver", like vfio-pci or vfio-platform, specific to their device.
>> Mdevs were meant to be a way for the (non-vfio) driver of a device to
>> expose portions of the device through mediation for use with vfio. It
>> seems like that's largely being done here.
>>
>> What I think has changed recently is this desire to make it easier to
>> create those vendor drivers and some promise of making module binding
>> work to avoid the messiness around picking a driver for the device. In
>> the auxiliary bus case that I think Jason is working on, it sounds like
>> the main device driver exposes portions of itself on an auxiliary bus
>> where drivers on that bus can integrate into the vfio subsystem. It
>> starts to get pretty fuzzy with what mdev already does, but it's also a
>> more versatile interface. Is it right for everyone? Probably not.
> idxd is also moving toward this model per Jason's suggestion. Although
> the auxiliary bus is not directly used, the idxd driver has its own bus for
> exposing a portion of its resources. From this angle, all the motivation
> around the mdev bus does get fuzzy...
>
>> Is the pre-configuration issue here really a push vs pull problem? I
>> can see the requirement in step 1. is dedicating some resources to an
>> mdev use case, so at that point it seems like the argument is whether we
>> should just create aux bus devices that get automatically bound to a
>> vendor vfio-pci variant and we avoid the mdev lifecycle, which is both
>> convenient and ugly. On the other hand, mdev has more of a pull
>> interface, ie. here are a bunch of device types and how many of each we
>> can support, use create to pull what you need.
> I see your point. It looks like what idxd is moving toward now is a mixed
> model. The parent driver uses a push interface to initialize a pool of
> instances which are then managed through mdev in a pull mode.
>
>>>> Remember we have stripped away the actual special need to use
>>>> mdev. You don't *have* to use mdev anymore to use vfio. That is a
>>>> significant ideology change even from a few months ago.
>>>>
>>> Yes, "don't have to" but if there is value of doing so it's
>>> not necessary to blocking it? One point in my mind is that if
>>> we should minimize vendor-specific contracts for user to
>>> manage mdev or subdevice...
>> Again, this in itself is not a great justification for using mdev,
>> we're creating vendor specific device types with vendor specific
>> additional features, that could all be done via some sort of netlink
>> interface too. The thing that pushes this more towards mdev for me is
>> that I don't think each of these wqs appear as devices to the host,
>> they're internal resources of the parent device and we want to compose
>> them in ways that are slightly more amenable to traditional mdevs... I
>> think. Thanks,
>>
> Yes, this is one reason for going toward mdev.
>
> btw I'm not clear on what the netlink interface will finally be, especially
> whether any generic cmd should be defined across devices given
> that subdevice management still has large generality. Jason, do you have
> an example somewhere that we can take a look at regarding the mlx
> netlink design?
>
> Thanks
> Kevin
On Mon, Jun 07, 2021 at 06:22:08AM +0000, Tian, Kevin wrote:
>
> btw I'm not clear on what the netlink interface will finally be, especially
> whether any generic cmd should be defined across devices given
> that subdevice management still has large generality. Jason, do you have
> an example somewhere that we can take a look at regarding the mlx
> netlink design?
Start here:
https://lore.kernel.org/netdev/[email protected]/
devlink is some more generic way to control PCI/etc devices as some side
band to their usual uAPI interfaces like netlink/rdma/etc
Jason
On Mon, Jun 07, 2021 at 11:13:04AM -0700, Dave Jiang wrote:
> So in step 1, we 'tag' the wq to be dedicated to guest usage and put the
> hardware wq into the enabled state. For a dedicated mode wq, we can
> definitely just register directly and skip the mdev step. For a shared wq
> mode, we can have multiple mdevs running on top of a single wq. So we need
> some way to create more mdevs. We can either go with the existing
> established creation path by mdev, or invent something custom for the
> driver as Jason suggested to accommodate additional virtual devices for
> guests. We implemented the mdev path originally with the consideration
> that mdev is established and has a known interface already.
It sounds like you could just as easily have a 'create new vfio'
file under the idxd sysfs.. Especially since you already have a bus
and dynamic vfio specific things being created on this bus.
Have you gone over this with Dan?
> I think things become more complicated when we go from a dedicated wq to a
> shared wq, where the wq : mdev relationship goes from 1 : 1 to 1 : N.
> Keeping a consistent user config experience is also desired, especially as
> we already have such behavior since kernel 5.6 for host usages. So we
> really need to try to avoid doing wq configuration differently just for
> "mdev" wqs. In the case suggested above, we basically just flipped the
> configuration steps. The mdev is first created through the mdev sysfs
> interface, and then the device parameters are configured. Whereas for us,
> we configure the device parameters first, and then create the mdev. But in
> the end, it's still the hybrid mdev setup, right?
So you don't even use mdev to configure anything? Yuk.
Jason
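As a rough sketch of that suggestion (purely hypothetical; it assumes the
vfio_init_group_dev()/vfio_register_group_dev() interface from the VFIO
refactoring this series is rebased on, and idxd_vdev/idxd_vfio_ops are
made-up names), a write-only sysfs attribute could create and register a
vfio_device directly:

	static ssize_t create_vfio_store(struct device *dev,
					 struct device_attribute *attr,
					 const char *buf, size_t count)
	{
		struct idxd_vdev *vdev;	/* hypothetical driver-private wrapper */
		int rc;

		vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
		if (!vdev)
			return -ENOMEM;

		/* embed the vfio_device directly, no mdev lifecycle involved */
		vfio_init_group_dev(&vdev->vfio_dev, dev, &idxd_vfio_ops);
		rc = vfio_register_group_dev(&vdev->vfio_dev);
		if (rc) {
			kfree(vdev);
			return rc;
		}
		return count;
	}
	static DEVICE_ATTR_WO(create_vfio);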
On 6/7/2021 12:11 PM, Jason Gunthorpe wrote:
> On Mon, Jun 07, 2021 at 11:13:04AM -0700, Dave Jiang wrote:
>
>> So in step 1, we 'tag' the wq to be dedicated to guest usage and put the
>> hardware wq into the enabled state. For a dedicated mode wq, we can
>> definitely just register directly and skip the mdev step. For a shared wq
>> mode, we can have multiple mdevs running on top of a single wq. So we need
>> some way to create more mdevs. We can either go with the existing
>> established creation path by mdev, or invent something custom for the
>> driver as Jason suggested to accommodate additional virtual devices for
>> guests. We implemented the mdev path originally with the consideration
>> that mdev is established and has a known interface already.
>> It sounds like you could just as easily have a 'create new vfio'
> file under the idxd sysfs.. Especially since you already have a bus
> and dynamic vfio specific things being created on this bus.
Will explore this and the use of 'struct vfio_device' without mdev.
>
> Have you gone over this with Dan?
>
>> I think things become more complicated when we go from a dedicated wq to a
>> shared wq, where the wq : mdev relationship goes from 1 : 1 to 1 : N.
>> Keeping a consistent user config experience is also desired, especially as
>> we already have such behavior since kernel 5.6 for host usages. So we
>> really need to try to avoid doing wq configuration differently just for
>> "mdev" wqs. In the case suggested above, we basically just flipped the
>> configuration steps. The mdev is first created through the mdev sysfs
>> interface, and then the device parameters are configured. Whereas for us,
>> we configure the device parameters first, and then create the mdev. But in
>> the end, it's still the hybrid mdev setup, right?
> So you don't even use mdev to configure anything? Yuk.
>
> Jason
On 5/31/2021 6:48 AM, Thomas Gleixner wrote:
> On Fri, May 21 2021 at 17:19, Dave Jiang wrote:
>> Add common helper code to set up IMS once the MSI domain has been
>> set up by the device driver. The main helper function is
>> mdev_ims_set_msix_trigger() that is called by the VFIO ioctl
>> VFIO_DEVICE_SET_IRQS. The function deals with the setup and
>> teardown of emulated and IMS-backed eventfds that get exported
>> to the guest kernel via VFIO as MSIX vectors.
> So this talks about IMS, but the functionality is all named mdev_msix*
> and mdev_irqs*. Confused.
Jason mentioned this as well. Will move to vfio_ims* common code.
>> +/*
>> + * Mediate device IMS library code
> Mediated?
>
>> +static int mdev_msix_set_vector_signal(struct mdev_irq *mdev_irq, int vector, int fd)
>> +{
>> + int rc, irq;
>> + struct mdev_device *mdev = irq_to_mdev(mdev_irq);
>> + struct mdev_irq_entry *entry;
>> + struct device *dev = &mdev->dev;
>> + struct eventfd_ctx *trigger;
>> + char *name;
>> + bool pasid_en;
>> + u32 auxval;
>> +
>> + if (vector < 0 || vector >= mdev_irq->num)
>> + return -EINVAL;
>> +
>> + entry = &mdev_irq->irq_entries[vector];
>> +
>> + if (entry->ims)
>> + irq = dev_msi_irq_vector(dev, entry->ims_id);
>> + else
>> + irq = 0;
> I have no idea what this does. Comments are overrated...
>
> Aside of that dev_msi_irq_vector() seems to be a gross misnomer. AFAICT
> it retrieves the Linux interrupt number and not some vector.
Will change function name to dev_msi_irq().
>
>> + pasid_en = mdev_irq->pasid != INVALID_IOASID ? true : false;
>> +
>> + /* IMS and invalid pasid is not a valid configuration */
>> + if (entry->ims && !pasid_en)
>> + return -EINVAL;
> Why is this not validated already?
Will remove.
>
>> + if (entry->trigger) {
>> + if (irq) {
>> + irq_bypass_unregister_producer(&entry->producer);
>> + free_irq(irq, entry->trigger);
>> + if (pasid_en) {
>> + auxval = ims_ctrl_pasid_aux(0, false);
>> + irq_set_auxdata(irq, IMS_AUXDATA_CONTROL_WORD, auxval);
> Why can't this be done in the irq chip when the interrupt is torn down?
> Just because the irq chip driver, which is thankfully not merged yet,
> has been implemented that way?
>
> I did this aux dance because someone explained to me that this has to be
> handled separately and has to be changed independently of all the
> interrupt setup and whatever. But looking at the actual usage now that's
> clearly not the case.
>
> What's the exact order of all this? I assume so:
>
> 1) mdev_irqs_init()
> 2) mdev_irqs_set_pasid()
> 3) mdev_set_msix_trigger()
>
> Right? See below.
I'll provide more info below. But yes, we can add the pasid to 'struct
device'. The work on auxdata is appreciated and is still needed.
>
>> +}
>> +EXPORT_SYMBOL_GPL(mdev_irqs_set_pasid);
>> + if (fd < 0)
>> + return 0;
>> +
>> + name = kasprintf(GFP_KERNEL, "vfio-mdev-irq[%d](%s)", vector, dev_name(dev));
>> + if (!name)
>> + return -ENOMEM;
>> +
>> + trigger = eventfd_ctx_fdget(fd);
>> + if (IS_ERR(trigger)) {
>> + kfree(name);
>> + return PTR_ERR(trigger);
>> + }
>> +
>> + entry->name = name;
>> + entry->trigger = trigger;
>> +
>> + if (!irq)
>> + return 0;
> These exit conditions are completely confusing.
I will add more comments. For the IMS path, some vectors are emulated
and some are backed by IMS. Thus the early exit.
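To illustrate, an annotated copy of the patch excerpt above (the comments
are mine):

	/*
	 * Mixed vector layout: emulated vectors have no backing host IRQ,
	 * while IMS-backed vectors map to a Linux interrupt number.
	 */
	if (entry->ims)
		irq = dev_msi_irq_vector(dev, entry->ims_id);
	else
		irq = 0;	/* emulated: eventfd is signaled by emulation code */
	...
	if (!irq)
		return 0;	/* nothing to request_irq() for an emulated vector */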
>
>> + if (pasid_en) {
>> + auxval = ims_ctrl_pasid_aux(mdev_irq->pasid, true);
>> + rc = irq_set_auxdata(irq, IMS_AUXDATA_CONTROL_WORD, auxval);
>> + if (rc < 0)
>> + goto err;
> Again. This can be handled in the interrupt chip when the interrupt is
> set up through request_irq().
>
>> +static int mdev_msix_enable(struct mdev_irq *mdev_irq, int nvec)
>> +{
>> + struct mdev_device *mdev = irq_to_mdev(mdev_irq);
>> + struct device *dev;
>> + int rc;
>> +
>> + if (nvec != mdev_irq->num)
>> + return -EINVAL;
>> +
>> + if (mdev_irq->ims_num) {
>> + dev = &mdev->dev;
>> + rc = msi_domain_alloc_irqs(dev_get_msi_domain(dev), dev, mdev_irq->ims_num);
> The allocation of the interrupts happens _after_ PASID has been
> set and PASID is per device, right?
>
> So the obvious place to store PASID is in struct device because the
> device pointer is for one stored in the msi entry descriptor and it is
> also handed down to the irq domain allocation function. So this can be
> checked at allocation time already.
>
> What's unclear to me is under which circumstances does the IMS interrupt
> require a PASID.
>
> 1) Always
> 2) Use case dependent
>
Thomas, thank you for the review. I'll try to provide a summary below of what's going on with IMS after taking in your and Jason's comments.
Interrupt Message Store (IMS) was designed to provide a more scalable means of interrupt storage compared to industry-standard PCI MSI-X.
IMS entries can be defined in a device-specific way per the SIOV spec.
* Not limited to the 2048 vectors specified by the PCIe spec.
* Not limited in where the different parts of the interrupt, such as masking and the pending bit array, are located, which facilitates hardware configurations
that allow devices to lay out interrupt storage in a more device-friendly way.
https://software.intel.com/content/www/us/en/develop/download/intel-scalable-io-virtualization-technical-specification.html
Two variants were created by your code:
https://lore.kernel.org/linux-hyperv/[email protected]/
* IMS array – mimics how DSA lays out its IMS in hardware.
* IMS queue – free-format and memory-based; devices such as graphics could, for example, locate the IMS store in context maintained by system memory for the
interrupt message and data.
Devices such as Intel DSA provide the ability for unprivileged software, host user space, or guests to submit work. The request to be notified when work is
complete is part of the work descriptor submitted to the DSA hardware.
Generic Descriptor Format (only the fields relevant here are labeled)
+----------------------------------------------------------------+
|                                             |      PASID       |
+----------------------------------------------------------------+
|                                                                |
+----------------------------------------------------------------+
|                                                                |
+----------------------------------------------------------------+
|                  | Interrupt handle |                          |
+----------------------------------------------------------------+
|                              ...                               |
+----------------------------------------------------------------+
DSA provides the ability to allocate interrupt handles that are internally tied to a HW IRQ. For IMS, the interrupt handle is the index into the IMS entries
table. This handle is submitted via work descriptors from unprivileged software, either from host user space or from other virtual machines. This gives ultimate
flexibility to the software submitting the work, so hardware can notify completion via one of the interrupt handles. In order to ensure unprivileged software
doesn’t use a handle that doesn’t belong to it, DSA provides a facility for system software to associate a PASID with an interrupt handle, and DSA will ensure
the entity submitting the work is authorized to generate an interrupt via this interrupt handle (the PASID stored in the IMS array should match the PASID in the
descriptor). The fact that the interrupt handle is tied to a PASID is implementation specific. The consumer of this interface has no need to allocate a PASID
explicitly; the PASID is managed by privileged software.
DSA provides a way to skip PASID validation for IMS handles. This can be used if the host kernel is the *only* agent generating work. Host usages without IOMMU
scalable mode are not currently implemented.
The PASID field in the IMS entry is used to verify against the PASID that is associated with the submitted descriptor. The combination of the interrupt handle
(device IMS index) and the PASID determines whether the descriptor may generate the interrupt. On a mismatch, an invalid interrupt handle error (0x19) is
generated by the device in the software error register.
For a dedicated wq (dwq), the PASID is programmed into the WQ config register. When the descriptor is submitted to the WQ portal, the PASID from WQCFG is
compared with the IMS entry as well as with the interrupt handle that is programmed in the descriptor.
For a shared wq (swq), the PASID is either programmed in the descriptor for ENQCMDS or retrieved from the MSR in the case of ENQCMD. That PASID and the
interrupt handle are compared with what is in the IMS entry.
With a match, the IMS interrupt is generated.
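Conceptually, the device-side check looks like this (hardware behavior
expressed as pseudocode; the names are illustrative, not driver code):

	/* on work submission that requests an interrupt */
	if (ims_entry[handle].pasid_en &&
	    ims_entry[handle].pasid != descriptor_pasid)
		report_swerror(INVALID_INT_HANDLE);	/* error 0x19 */
	else
		generate_interrupt(handle);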
The following is the call flow for mdev without vSVM support:
1. idxd host driver sets PASID from iommu_aux_get_pasid() to ‘struct device’
2. idxd guest driver calls request_irq()
3. VFIO calls VFIO_DEVICE_SET_IRQS ioctl
4. idxd host driver calls vfio_set_ims_trigger() (newly created common helper function)
a. VFIO calls msi_domain_alloc_irqs() and programs valid 'struct device' PASID as auxdata to IMS entry
b. Host driver calls request_irq() for IMS interrupts
With a default pasid programmed to 'struct device', for this use case above we shouldn't have the need to program the pasid outside of the irqchip.
For the use case of mdev with vSVM enabled, the code is staged for upstream submission after the current "mdev" series. When the guest idxd driver binds a
supervisor PASID, this guest PASID maps to a different host PASID and is not the default host PASID that is programmed into the ‘struct device’.
The guest PASID is passed to the host driver through vendor-specific means. The host driver needs to retrieve the host PASID that is mapped to this guest PASID
and program that host PASID into the IMS entry. This is where the auxdata helper is needed and the PASID cannot be set via the common code path. The idxd
driver does this by having the guest driver fill the virtual MSIX permission table (device specific), which contains a PASID entry for each of the MSIX vectors
when SVA is turned on. The MMIO write to the guest vMSIXPERM table allows the host driver MMIO emulation code to retrieve the guest PASID and attempt to match
that with the host PASID. That host PASID is programmed into the IMS entry that is backing the guest MSIX vector. This cannot be done via the common path and
therefore requires the auxdata helper function to program the IMS PASID fields.
The following is the call flow for mdev with vSVM support:
1. idxd host driver sets PASID to the mdev ‘struct device’ via iommu_aux_get_pasid()
2. idxd guest driver binds supervisor pasid
3. idxd guest driver calls request_irq()
4. VFIO calls VFIO_DEVICE_SET_IRQS ioctl
5. idxd host driver calls vfio_set_ims_trigger()
a. VFIO calls msi_domain_alloc_irqs() and programs PASID as auxdata to IMS entry
b. Host driver calls request_irq() for IMS interrupts
6. idxd guest driver programs virtual device MSIX permission table with guest PASID.
7. Host driver mdev MMIO emulation retrieves guest PASID from vdev MSIXPERM table and matches it to the host PASID via ioasid_find_by_spid().
a. Host driver calls irq_set_auxdata() to change to the new PASID for IMS entry.
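Step 7a presumably reduces to the auxdata helpers from the patch excerpts
above, e.g. (a sketch; host_pasid is the PASID found via
ioasid_find_by_spid()):

	auxval = ims_ctrl_pasid_aux(host_pasid, true);	/* enable PASID check */
	rc = irq_set_auxdata(irq, IMS_AUXDATA_CONTROL_WORD, auxval);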
On Tue, Jun 08, 2021 at 08:57:35AM -0700, Dave Jiang wrote:
> In order to ensure unprivileged software doesn’t use a handle that
> doesn’t belong to it, DSA provides a facility for system software to
> associate a PASID with an interrupt handle, and DSA will ensure the
> entity submitting the work is authorized to generate an interrupt via
> this interrupt handle (the PASID stored in the IMS array should match the
> PASID in the descriptor).
How does a SVA userspace allocate interrupt handles and make them
affine to the proper CPU(s)?
IIRC interrupts are quite limited per-CPU due to the x86 IDT,
generally I would expect a kernel driver to allocate at most one IRQ
per CPU.
However here you say each process using SVA needs a unique interrupt
handle with its PASID encoded in it. Since the IMS irqchip you are
using can't share IRQs between interrupt handles does this mean that
every time userspace creates a SVA it triggers consumption of an IMS
and IDT entry on some CPU? How is this secure against DOS of limited
kernel resources?
> driver does this by having the guest driver fill the virtual MSIX
> permission table (device specific), which contains a PASID entry for
> each of the MSIX vectors when SVA is turned on. The MMIO write to
> the guest vMSIXPERM table allows the host driver MMIO emulation code
> to retrieve the guest PASID and attempt to match that with the host
> PASID. That host PASID is programmed to the IMS entry that is
> backing the guest MSIX vector. This cannot be done via the common
> path and therefore requires the auxdata helper function to program
> the IMS PASID fields.
So a VM guest gets a SW emulated vMSIXPERM table alongside MSI-X, but
the physical driver went down this IMS adventure?
And you had to do this because, as discussed earlier, true IMS is not
usable in the guest due to other platform problems?
Jason
On Tue, Jun 08 2021 at 08:57, Dave Jiang wrote:
> On 5/31/2021 6:48 AM, Thomas Gleixner wrote:
>> What's unclear to me is under which circumstances does the IMS interrupt
>> require a PASID.
>>
>> 1) Always
>> 2) Use case dependent
>>
> Thomas, thank you for the review. I'll try to provide a summary below
> with what's going on with IMS after taking in yours and Jason's
> comments.
<snip>
No need to paste the manuals into mail.
</snip>
> DSA provides a way to skip PASID validation for IMS handles. This can
> be used if the host kernel is the *only* agent generating work. Host
> usages without IOMMU scalable mode are not currently implemented.
So the IMS irq chip driver can do:
ims_array_alloc_msi_store(domain, dev)
{
	struct msi_domain_info *info = domain->host_data;
	struct ims_array_data *ims = info->data;

	if (ims->flags & VALIDATE_PASID) {
		if (!valid_pasid(dev))
			return -EINVAL;
	}
	return 0;
}
or something like that.
> The following is the call flow for mdev without vSVM support:
> 1. idxd host driver sets PASID from iommu_aux_get_pasid() to ‘struct device’
Why does every driver need to implement that?
That should be part of the iommu management to store that.
> 2. idxd guest driver calls request_irq()
> 3. VFIO calls VFIO_DEVICE_SET_IRQS ioctl
How does the guest driver request_irq() end up in the VFIO ioctl on the
host?
> 4. idxd host driver calls vfio_set_ims_trigger() (newly created common helper function)
> a. VFIO calls msi_domain_alloc_irqs() and programs valid 'struct device' PASID as auxdata to IMS entry
VFIO does not program anything into the IMS entry.
The IMS irq chip driver retrieves PASID from struct device and does
that. That can be part of the domain allocation function, but there is
no requirement to do so. It can be done later, e.g. when the interrupt
is started up.
> b. Host driver calls request_irq() for IMS interrupts
>
> With a default pasid programmed to 'struct device', for this use case
> above we shouldn't have the need to program the pasid outside of
> the irqchip.
s/shouldn't/do not/
> The following is the call flow for mdev with vSVM support:
> 1. idxd host driver sets PASID to the mdev ‘struct device’ via iommu_aux_get_pasid()
> 2. idxd guest driver binds supervisor pasid
> 3. idxd guest driver calls request_irq()
> 4. VFIO calls VFIO_DEVICE_SET_IRQS ioctl
> 5. idxd host driver calls vfio_set_ims_trigger()
> a. VFIO calls msi_domain_alloc_irqs() and programs PASID as auxdata to IMS entry
> b. Host driver calls request_irq() for IMS interrupts
> 6. idxd guest driver programs virtual device MSIX permission table with guest PASID.
> 7. Host driver mdev MMIO emulation retrieves guest PASID from vdev
> MSIXPERM table and matches it to the host PASID via ioasid_find_by_spid().
> a. Host driver calls irq_set_auxdata() to change to the new PASID
> for IMS entry.
What enforces this ordering? Certainly not the hardware.
The guest driver knows the guest PASID _before_ interrupts are allocated
or requested for the device. So it can store the guest PASID _before_ it
triggers the mechanism which makes vfio/host initialize the interrupts.
So no. It's not needed at all. It's pretty much the same as the host
side driver except for that MSIXPERM stuff.
And just for the record. Setting MSIXPERM _after_ request_irq()
completed is just wrong because if an interrupt is raised _before_ that
MSIXPERM muck is set up, then it will fire with the host PASID and not
with the guest's.
This whole IDXD stuff has been a monstrous layering violation from the
very beginning and unfortunately this hasn't changed much since then.
Thanks,
tglx
On 6/8/2021 9:02 AM, Dave Jiang wrote:
>
> On 6/7/2021 12:11 PM, Jason Gunthorpe wrote:
>> On Mon, Jun 07, 2021 at 11:13:04AM -0700, Dave Jiang wrote:
>>
>>> So in step 1, we 'tag' the wq to be dedicated to guest usage and put
>>> the hardware wq into the enabled state. For a dedicated mode wq, we can
>>> definitely just register directly and skip the mdev step. For a shared
>>> wq mode, we can have multiple mdevs running on top of a single wq. So
>>> we need some way to create more mdevs. We can either go with the
>>> existing established creation path by mdev, or invent something custom
>>> for the driver as Jason suggested to accommodate additional virtual
>>> devices for guests. We implemented the mdev path originally with the
>>> consideration that mdev is established and has a known interface
>>> already.
>> It sounds like you could just as easily have a 'create new vfio'
>> file under the idxd sysfs.. Especially since you already have a bus
>> and dynamic vfio specific things being created on this bus.
>
> Will explore this and the use of 'struct vfio_device' without mdev.
>
Hi Jason. I hacked the idxd driver to remove the mdev association and use
vfio_device directly. Ran into some issues. Specifically, mdev does some
special handling when it comes to the iommu domain. When we hit
vfio_iommu_type1_attach_group(), there's a branch in there for
mdev_bus_type. It sets the group with the mdev_group flag, which later has
the effect of special handling for iommu_attach_group(). In addition, it
ends up switching the bus to pci_bus_type before iommu_domain_alloc() is
called. Do we need to provide a similar type of handling for vfio_devices
that are not backed by an entire PCI device like vfio_pci? Not sure it's
the right thing to do to attach these devices to pci_bus_type directly.
On Fri, Jun 11, 2021 at 11:21:42AM -0700, Dave Jiang wrote:
>
> On 6/8/2021 9:02 AM, Dave Jiang wrote:
> >
> > On 6/7/2021 12:11 PM, Jason Gunthorpe wrote:
> > > On Mon, Jun 07, 2021 at 11:13:04AM -0700, Dave Jiang wrote:
> > >
> > > > So in step 1, we 'tag' the wq to be dedicated to guest usage and
> > > > put the hardware wq into the enabled state. For a dedicated mode
> > > > wq, we can definitely just register directly and skip the mdev
> > > > step. For a shared wq mode, we can have multiple mdevs running on
> > > > top of a single wq. So we need some way to create more mdevs. We
> > > > can either go with the existing established creation path by mdev,
> > > > or invent something custom for the driver as Jason suggested to
> > > > accommodate additional virtual devices for guests. We implemented
> > > > the mdev path originally with the consideration that mdev is
> > > > established and has a known interface already.
> > > It sounds like you could just as easily have a 'create new vfio'
> > > file under the idxd sysfs.. Especially since you already have a bus
> > > and dynamic vfio specific things being created on this bus.
> >
> > Will explore this and the use of 'struct vfio_device' without mdev.
> >
> Hi Jason. I hacked the idxd driver to remove the mdev association and use
> vfio_device directly. Ran into some issues. Specifically, mdev does some
> special handling when it comes to the iommu domain.
Yes, I know of this, and it needs fixing.
> the effect of special handling for iommu_attach_group(). In addition, it
> ends up switching the bus to pci_bus_type before iommu_domain_alloc() is
> called. Do we need to provide a similar type of handling for vfio_devices
> that are not backed by an entire PCI device like vfio_pci? Not sure it's
> the right thing to do to attach these devices to pci_bus_type directly.
Yes, type1 needs to be changed so it somehow knows that the struct
device is 'sw only', for instance because it has no direct HW IOMMU
connection; then all the mdev hackery should be deleted from type1.
I haven't investigated closely exactly how to do this.
Jason