v2:
IMS (now dev-msi):
With recommendations from Jason/Thomas/Dan on making IMS more generic:
Pass a generic, non-PCI device (struct device) for IMS management instead of the mdev
Remove all references to mdev and symbol_get/put
Remove all references to IMS in common code and replace with dev-msi
Remove dynamic allocation of platform-msi interrupts: no groups, no new msi list or list helpers
Create a generic dev-msi domain with and without interrupt remapping enabled.
Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs APIs
mdev:
Removed unrelated SVA enabling bits that are not necessary for this submission. (Kevin)
Restructured entire mdev driver series to make reviewing easier (Kevin)
Made rw emulation more robust (Kevin)
Removed uuid wq type and added single dedicated wq type (Kevin)
Locking fixes for vdev (Yan Zhao)
VFIO MSIX trigger fixes (Yan Zhao)
Link to previous discussions with Jason:
https://lore.kernel.org/lkml/[email protected]/
The emulation part that can be moved to user space is very small: the majority of the emulation
deals with control bits and needs to reside in the kernel. We can revisit moving the small
remaining emulation to user space, and the architectural changes that would require, at a later time.
This RFC series has been reviewed by Dan Williams <[email protected]>
The actual code is independent of the stage 2 driver submission that adds support for SVM,
ENQCMD(S), PASID, and shared workqueues. This series matches the support of the 5.6 kernel
(stage 1) driver, but on the guest. The code depends on Baolu's IOMMU aux-domain API extension
patches, which are still under review:
https://lkml.org/lkml/2020/7/14/48
Stage 1 of the driver was accepted in the v5.6 kernel. It supports dedicated workqueues (wq)
without Shared Virtual Memory (SVM). Stage 2 adds shared wq and SVM support; it is pending
upstream review and targets kernel v5.9.
The VFIO mediated device framework allows vendor drivers to wrap a portion of a device's resources
into virtual devices (mdevs). Each mdev can be assigned to a different guest using the same set of
VFIO uAPIs as assigning a physical device. Access to mdev resources is served with mixed policies.
For example, vendor drivers typically mark the data-path interface as pass-through for fast guest
operations, and trap-and-mediate the control-path interface to avoid undesired interference
between mdevs. Some level of emulation is necessary behind the vfio mdev to compose the virtual
device interface.
This series brings mdev to the idxd driver to enable Intel Scalable IOV (SIOV), a hardware-assisted
mediated pass-through technology. SIOV makes each DSA wq independently assignable through
PASID-granular resource/DMA isolation. This improves scalability and reduces mediation complexity
compared to purely software-based mdev implementations. Each assigned wq is configured by the
host and exposed to the guest in a read-only configuration mode, which allows the guest to use the
wq without additional setup. This design greatly reduces the emulation required, focusing it on
handling commands from guests.
This series introduces the "1dwq" mdev type. This mdev type allocates a single dedicated wq from
the available dedicated wqs. After a workqueue (wq) is enabled, the user generates a uuid. On mdev
creation, the mdev driver code finds a dwq according to the mdev type. When the create operation
succeeds, the user-generated uuid can be passed to qemu. When the guest boots, it should
discover a DSA device during PCI discovery.
Example for the "1dwq" type:
1. Enable a wq with the "mdev" wq type.
2. Generate a uuid.
3. Write the uuid to the mdev class sysfs path:
echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-wq/create
4. Pass the following parameter to qemu:
"-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"
The wq exported through mdev will have the read-only config bit set. This means the device does
not require the typical configuration. After enabling the device, the user must set the wq type
and name; that is all that is necessary to enable the wq and start using it. The single-wq
configuration is not the only way to create an mdev; support for multiple wqs per mdev is planned
as future work.
The mdev utilizes Interrupt Message Store (IMS) [3], a device-specific MSI implementation, instead
of MSI-X for guest interrupts. This preserves MSI-X for host usage and also allows a
significantly larger number of interrupt vectors for guest usage.
The idxd driver implements IMS as on-device memory mapped unified storage. Each interrupt message
is stored as a DWORD size data payload and a 64-bit address (same as MSI-X). Access to the IMS is
through the host idxd driver.
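Purely as an illustration (the actual entry layout is defined by the DSA/SIOV hardware specs
[3][4], not by this series), an IMS entry conceptually carries the same information as an MSI-X
table entry:

struct ims_entry {                      /* hypothetical sketch, not the real register layout */
        u64 address;                    /* 64-bit interrupt message address */
        u32 data;                       /* DWORD message data payload */
        u32 ctrl;                       /* per-entry control, e.g. mask bit */
};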
This patchset extends the existing platform-msi framework (which provides a generic mechanism to
support non-PCI compliant MSI interrupts) to allow any driver to allocate MSI-like (dev-msi)
interrupts and provide its own ops functions (mask/unmask, etc.).
Callback functions defined by the kernel and implemented by the driver are used to:
1. program the interrupt addr/data values, instead of the kernel programming them directly
2. mask/unmask the interrupt source
The kernel can specify requirements for these callback functions (e.g., the driver is not
expected to block, or not expected to take a lock in the callback).
Support for two new IRQ chips/domains is added (with and without IRQ_REMAP support: DEV-MSI and IR-DEV-MSI).
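As a rough usage sketch of the new interfaces (hedged: the exact dev_msi_domain_alloc_irqs() /
dev_msi_domain_free_irqs() signatures are assumed to mirror platform_msi_domain_alloc_irqs(), and
the foo_* names are purely illustrative), a driver providing its own interrupt storage would look
roughly like:

static void foo_ims_write_msg(struct msi_desc *desc, struct msi_msg *msg)
{
        /* program addr/data into the device-specific interrupt storage */
}

static unsigned int foo_ims_irq_mask(struct msi_desc *desc)
{
        /* set the per-vector mask bit in the device storage */
        return 0;
}

static unsigned int foo_ims_irq_unmask(struct msi_desc *desc)
{
        /* clear the per-vector mask bit in the device storage */
        return 0;
}

static const struct platform_msi_ops foo_msi_ops = {
        .irq_mask       = foo_ims_irq_mask,
        .irq_unmask     = foo_ims_irq_unmask,
        .write_msg      = foo_ims_write_msg,
};

static int foo_alloc_dev_msi(struct device *dev, unsigned int nvec)
{
        /* dev is a generic (non-PCI) struct device handed to the dev-msi core */
        return dev_msi_domain_alloc_irqs(dev, nvec, &foo_msi_ops);
}

Freeing would then go through dev_msi_domain_free_irqs(dev).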
[1]: https://lore.kernel.org/lkml/157965011794.73301.15960052071729101309.stgit@djiang5-desk3.ch.intel.com/
[2]: https://software.intel.com/en-us/articles/intel-sdm
[3]: https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
[4]: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
[5]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
[6]: https://intel.github.io/idxd/
[7]: https://github.com/intel/idxd-driver idxd-stage2.5
---
Dave Jiang (13):
dmaengine: idxd: add support for readonly config devices
dmaengine: idxd: add interrupt handle request support
dmaengine: idxd: add DEV-MSI support in base driver
dmaengine: idxd: add device support functions in prep for mdev
dmaengine: idxd: add basic mdev registration and helper functions
dmaengine: idxd: add emulation rw routines
dmaengine: idxd: prep for virtual device commands
dmaengine: idxd: virtual device commands emulation
dmaengine: idxd: ims setup for the vdcm
dmaengine: idxd: add mdev type as a new wq type
dmaengine: idxd: add dedicated wq mdev type
dmaengine: idxd: add new wq state for mdev
dmaengine: idxd: add error notification from host driver to mediated device
Jing Lin (1):
dmaengine: idxd: add ABI documentation for mediated device support
Megha Dey (4):
platform-msi: Introduce platform_msi_ops
irq/dev-msi: Add support for a new DEV_MSI irq domain
irq/dev-msi: Create IR-DEV-MSI irq domain
irq/dev-msi: Introduce APIs to allocate/free dev-msi interrupts
Documentation/ABI/stable/sysfs-driver-dma-idxd | 15
arch/x86/include/asm/hw_irq.h | 6
arch/x86/kernel/apic/msi.c | 12
drivers/base/Kconfig | 7
drivers/base/Makefile | 1
drivers/base/dev-msi.c | 170 ++++
drivers/base/platform-msi.c | 62 +
drivers/base/platform-msi.h | 23
drivers/dma/Kconfig | 7
drivers/dma/idxd/Makefile | 2
drivers/dma/idxd/cdev.c | 6
drivers/dma/idxd/device.c | 266 +++++-
drivers/dma/idxd/idxd.h | 62 +
drivers/dma/idxd/ims.c | 174 ++++
drivers/dma/idxd/ims.h | 17
drivers/dma/idxd/init.c | 100 ++
drivers/dma/idxd/irq.c | 6
drivers/dma/idxd/mdev.c | 1106 ++++++++++++++++++++++++
drivers/dma/idxd/mdev.h | 118 +++
drivers/dma/idxd/registers.h | 24 -
drivers/dma/idxd/submit.c | 37 +
drivers/dma/idxd/sysfs.c | 55 +
drivers/dma/idxd/vdev.c | 962 +++++++++++++++++++++
drivers/dma/idxd/vdev.h | 28 +
drivers/dma/mv_xor_v2.c | 6
drivers/dma/qcom/hidma.c | 6
drivers/iommu/arm-smmu-v3.c | 6
drivers/iommu/intel/irq_remapping.c | 11
drivers/irqchip/irq-mbigen.c | 8
drivers/irqchip/irq-mvebu-icu.c | 6
drivers/mailbox/bcm-flexrm-mailbox.c | 6
drivers/perf/arm_smmuv3_pmu.c | 6
include/linux/intel-iommu.h | 1
include/linux/irqdomain.h | 11
include/linux/msi.h | 35 +
include/uapi/linux/idxd.h | 2
36 files changed, 3270 insertions(+), 100 deletions(-)
create mode 100644 drivers/base/dev-msi.c
create mode 100644 drivers/base/platform-msi.h
create mode 100644 drivers/dma/idxd/ims.c
create mode 100644 drivers/dma/idxd/ims.h
create mode 100644 drivers/dma/idxd/mdev.c
create mode 100644 drivers/dma/idxd/mdev.h
create mode 100644 drivers/dma/idxd/vdev.c
create mode 100644 drivers/dma/idxd/vdev.h
--
From: Megha Dey <[email protected]>
platform-msi.c provides a generic way to handle non-PCI message
signaled interrupts. However, it assumes that only the message
needs to be customized. Given that an MSI is just a write
transaction, some devices may need custom callbacks to
mask/unmask their interrupts.
Hence, introduce a new structure, platform_msi_ops, which provides a
device-specific write function for now; device-specific callbacks
(mask/unmask) will be introduced in subsequent patches.
Devices may find more efficient ways to store addr/data pairs
than what is recommended by the PCI SIG. (For example, the storage of
the vector might not be resident on the device. Consider a GPGPU,
where the vector could be part of the execution context instead of
being stored on the device.)
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Megha Dey <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/base/platform-msi.c | 29 +++++++++++++++--------------
drivers/dma/mv_xor_v2.c | 6 +++++-
drivers/dma/qcom/hidma.c | 6 +++++-
drivers/iommu/arm-smmu-v3.c | 6 +++++-
drivers/irqchip/irq-mbigen.c | 8 ++++++--
drivers/irqchip/irq-mvebu-icu.c | 6 +++++-
drivers/mailbox/bcm-flexrm-mailbox.c | 6 +++++-
drivers/perf/arm_smmuv3_pmu.c | 6 +++++-
include/linux/msi.h | 20 ++++++++++++++------
9 files changed, 65 insertions(+), 28 deletions(-)
diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
index c4a17e5edf8b..9d94cd699468 100644
--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -18,14 +18,14 @@
/*
* Internal data structure containing a (made up, but unique) devid
- * and the callback to write the MSI message.
+ * and the platform-msi ops
*/
struct platform_msi_priv_data {
- struct device *dev;
- void *host_data;
- msi_alloc_info_t arg;
- irq_write_msi_msg_t write_msg;
- int devid;
+ struct device *dev;
+ void *host_data;
+ msi_alloc_info_t arg;
+ const struct platform_msi_ops *ops;
+ int devid;
};
/* The devid allocator */
@@ -83,7 +83,7 @@ static void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
priv_data = desc->platform.msi_priv_data;
- priv_data->write_msg(desc, msg);
+ priv_data->ops->write_msg(desc, msg);
}
static void platform_msi_update_chip_ops(struct msi_domain_info *info)
@@ -194,16 +194,17 @@ struct irq_domain *platform_msi_create_irq_domain(struct fwnode_handle *fwnode,
static struct platform_msi_priv_data *
platform_msi_alloc_priv_data(struct device *dev, unsigned int nvec,
- irq_write_msi_msg_t write_msi_msg)
+ const struct platform_msi_ops *platform_ops)
{
struct platform_msi_priv_data *datap;
+
/*
* Limit the number of interrupts to 2048 per device. Should we
* need to bump this up, DEV_ID_SHIFT should be adjusted
* accordingly (which would impact the max number of MSI
* capable devices).
*/
- if (!dev->msi_domain || !write_msi_msg || !nvec || nvec > MAX_DEV_MSIS)
+ if (!dev->msi_domain || !platform_ops->write_msg || !nvec || nvec > MAX_DEV_MSIS)
return ERR_PTR(-EINVAL);
if (dev->msi_domain->bus_token != DOMAIN_BUS_PLATFORM_MSI) {
@@ -227,7 +228,7 @@ platform_msi_alloc_priv_data(struct device *dev, unsigned int nvec,
return ERR_PTR(err);
}
- datap->write_msg = write_msi_msg;
+ datap->ops = platform_ops;
datap->dev = dev;
return datap;
@@ -249,12 +250,12 @@ static void platform_msi_free_priv_data(struct platform_msi_priv_data *data)
* Zero for success, or an error code in case of failure
*/
int platform_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
- irq_write_msi_msg_t write_msi_msg)
+ const struct platform_msi_ops *platform_ops)
{
struct platform_msi_priv_data *priv_data;
int err;
- priv_data = platform_msi_alloc_priv_data(dev, nvec, write_msi_msg);
+ priv_data = platform_msi_alloc_priv_data(dev, nvec, platform_ops);
if (IS_ERR(priv_data))
return PTR_ERR(priv_data);
@@ -324,7 +325,7 @@ struct irq_domain *
__platform_msi_create_device_domain(struct device *dev,
unsigned int nvec,
bool is_tree,
- irq_write_msi_msg_t write_msi_msg,
+ const struct platform_msi_ops *platform_ops,
const struct irq_domain_ops *ops,
void *host_data)
{
@@ -332,7 +333,7 @@ __platform_msi_create_device_domain(struct device *dev,
struct irq_domain *domain;
int err;
- data = platform_msi_alloc_priv_data(dev, nvec, write_msi_msg);
+ data = platform_msi_alloc_priv_data(dev, nvec, platform_ops);
if (IS_ERR(data))
return NULL;
diff --git a/drivers/dma/mv_xor_v2.c b/drivers/dma/mv_xor_v2.c
index 9225f08dfee9..c0033c4f8ee5 100644
--- a/drivers/dma/mv_xor_v2.c
+++ b/drivers/dma/mv_xor_v2.c
@@ -710,6 +710,10 @@ static int mv_xor_v2_resume(struct platform_device *dev)
return 0;
}
+static const struct platform_msi_ops mv_xor_v2_msi_ops = {
+ .write_msg = mv_xor_v2_set_msi_msg,
+};
+
static int mv_xor_v2_probe(struct platform_device *pdev)
{
struct mv_xor_v2_device *xor_dev;
@@ -765,7 +769,7 @@ static int mv_xor_v2_probe(struct platform_device *pdev)
}
ret = platform_msi_domain_alloc_irqs(&pdev->dev, 1,
- mv_xor_v2_set_msi_msg);
+ &mv_xor_v2_msi_ops);
if (ret)
goto disable_clk;
diff --git a/drivers/dma/qcom/hidma.c b/drivers/dma/qcom/hidma.c
index 0a6d3ea08c78..c3ee63159ff8 100644
--- a/drivers/dma/qcom/hidma.c
+++ b/drivers/dma/qcom/hidma.c
@@ -678,6 +678,10 @@ static void hidma_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
writel(msg->data, dmadev->dev_evca + 0x120);
}
}
+
+static const struct platform_msi_ops hidma_msi_ops = {
+ .write_msg = hidma_write_msi_msg,
+};
#endif
static void hidma_free_msis(struct hidma_dev *dmadev)
@@ -703,7 +707,7 @@ static int hidma_request_msi(struct hidma_dev *dmadev,
struct msi_desc *failed_desc = NULL;
rc = platform_msi_domain_alloc_irqs(&pdev->dev, HIDMA_MSI_INTS,
- hidma_write_msi_msg);
+ &hidma_msi_ops);
if (rc)
return rc;
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..655e7987d8af 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -3410,6 +3410,10 @@ static void arm_smmu_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
writel_relaxed(ARM_SMMU_MEMATTR_DEVICE_nGnRE, smmu->base + cfg[2]);
}
+static const struct platform_msi_ops arm_smmu_msi_ops = {
+ .write_msg = arm_smmu_write_msi_msg,
+};
+
static void arm_smmu_setup_msis(struct arm_smmu_device *smmu)
{
struct msi_desc *desc;
@@ -3434,7 +3438,7 @@ static void arm_smmu_setup_msis(struct arm_smmu_device *smmu)
}
/* Allocate MSIs for evtq, gerror and priq. Ignore cmdq */
- ret = platform_msi_domain_alloc_irqs(dev, nvec, arm_smmu_write_msi_msg);
+ ret = platform_msi_domain_alloc_irqs(dev, nvec, &arm_smmu_msi_ops);
if (ret) {
dev_warn(dev, "failed to allocate MSIs - falling back to wired irqs\n");
return;
diff --git a/drivers/irqchip/irq-mbigen.c b/drivers/irqchip/irq-mbigen.c
index ff7627b57772..6619e2eadbce 100644
--- a/drivers/irqchip/irq-mbigen.c
+++ b/drivers/irqchip/irq-mbigen.c
@@ -232,6 +232,10 @@ static const struct irq_domain_ops mbigen_domain_ops = {
.free = mbigen_irq_domain_free,
};
+static const struct platform_msi_ops mbigen_msi_ops = {
+ .write_msg = mbigen_write_msg,
+};
+
static int mbigen_of_create_domain(struct platform_device *pdev,
struct mbigen_device *mgn_chip)
{
@@ -260,7 +264,7 @@ static int mbigen_of_create_domain(struct platform_device *pdev,
}
domain = platform_msi_create_device_domain(&child->dev, num_pins,
- mbigen_write_msg,
+ &mbigen_msi_ops,
&mbigen_domain_ops,
mgn_chip);
if (!domain) {
@@ -308,7 +312,7 @@ static int mbigen_acpi_create_domain(struct platform_device *pdev,
return -EINVAL;
domain = platform_msi_create_device_domain(&pdev->dev, num_pins,
- mbigen_write_msg,
+ &mbigen_msi_ops,
&mbigen_domain_ops,
mgn_chip);
if (!domain)
diff --git a/drivers/irqchip/irq-mvebu-icu.c b/drivers/irqchip/irq-mvebu-icu.c
index 91adf771f185..927d8ebc68cb 100644
--- a/drivers/irqchip/irq-mvebu-icu.c
+++ b/drivers/irqchip/irq-mvebu-icu.c
@@ -295,6 +295,10 @@ static const struct of_device_id mvebu_icu_subset_of_match[] = {
{},
};
+static const struct platform_msi_ops mvebu_icu_msi_ops = {
+ .write_msg = mvebu_icu_write_msg,
+};
+
static int mvebu_icu_subset_probe(struct platform_device *pdev)
{
struct mvebu_icu_msi_data *msi_data;
@@ -324,7 +328,7 @@ static int mvebu_icu_subset_probe(struct platform_device *pdev)
return -ENODEV;
irq_domain = platform_msi_create_device_tree_domain(dev, ICU_MAX_IRQS,
- mvebu_icu_write_msg,
+ &mvebu_icu_msi_ops,
&mvebu_icu_domain_ops,
msi_data);
if (!irq_domain) {
diff --git a/drivers/mailbox/bcm-flexrm-mailbox.c b/drivers/mailbox/bcm-flexrm-mailbox.c
index bee33abb5308..0268337e08e3 100644
--- a/drivers/mailbox/bcm-flexrm-mailbox.c
+++ b/drivers/mailbox/bcm-flexrm-mailbox.c
@@ -1492,6 +1492,10 @@ static void flexrm_mbox_msi_write(struct msi_desc *desc, struct msi_msg *msg)
writel_relaxed(msg->data, ring->regs + RING_MSI_DATA_VALUE);
}
+static const struct platform_msi_ops flexrm_mbox_msi_ops = {
+ .write_msg = flexrm_mbox_msi_write,
+};
+
static int flexrm_mbox_probe(struct platform_device *pdev)
{
int index, ret = 0;
@@ -1604,7 +1608,7 @@ static int flexrm_mbox_probe(struct platform_device *pdev)
/* Allocate platform MSIs for each ring */
ret = platform_msi_domain_alloc_irqs(dev, mbox->num_rings,
- flexrm_mbox_msi_write);
+ &flexrm_mbox_msi_ops);
if (ret)
goto fail_destroy_cmpl_pool;
diff --git a/drivers/perf/arm_smmuv3_pmu.c b/drivers/perf/arm_smmuv3_pmu.c
index 48e28ef93a70..f1dec2bcd2a1 100644
--- a/drivers/perf/arm_smmuv3_pmu.c
+++ b/drivers/perf/arm_smmuv3_pmu.c
@@ -652,6 +652,10 @@ static void smmu_pmu_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
pmu->reg_base + SMMU_PMCG_IRQ_CFG2);
}
+static const struct platform_msi_ops smmu_pmu_msi_ops = {
+ .write_msg = smmu_pmu_write_msi_msg,
+};
+
static void smmu_pmu_setup_msi(struct smmu_pmu *pmu)
{
struct msi_desc *desc;
@@ -665,7 +669,7 @@ static void smmu_pmu_setup_msi(struct smmu_pmu *pmu)
if (!(readl(pmu->reg_base + SMMU_PMCG_CFGR) & SMMU_PMCG_CFGR_MSI))
return;
- ret = platform_msi_domain_alloc_irqs(dev, 1, smmu_pmu_write_msi_msg);
+ ret = platform_msi_domain_alloc_irqs(dev, 1, &smmu_pmu_msi_ops);
if (ret) {
dev_warn(dev, "failed to allocate MSIs\n");
return;
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 8ad679e9d9c0..7f6a8eb51aca 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -321,6 +321,14 @@ enum {
MSI_FLAG_LEVEL_CAPABLE = (1 << 6),
};
+/*
+ * platform_msi_ops - Callbacks for platform MSI ops
+ * @write_msg: write message content
+ */
+struct platform_msi_ops {
+ irq_write_msi_msg_t write_msg;
+};
+
int msi_domain_set_affinity(struct irq_data *data, const struct cpumask *mask,
bool force);
@@ -336,7 +344,7 @@ struct irq_domain *platform_msi_create_irq_domain(struct fwnode_handle *fwnode,
struct msi_domain_info *info,
struct irq_domain *parent);
int platform_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
- irq_write_msi_msg_t write_msi_msg);
+ const struct platform_msi_ops *platform_ops);
void platform_msi_domain_free_irqs(struct device *dev);
/* When an MSI domain is used as an intermediate domain */
@@ -348,14 +356,14 @@ struct irq_domain *
__platform_msi_create_device_domain(struct device *dev,
unsigned int nvec,
bool is_tree,
- irq_write_msi_msg_t write_msi_msg,
+ const struct platform_msi_ops *platform_ops,
const struct irq_domain_ops *ops,
void *host_data);
-#define platform_msi_create_device_domain(dev, nvec, write, ops, data) \
- __platform_msi_create_device_domain(dev, nvec, false, write, ops, data)
-#define platform_msi_create_device_tree_domain(dev, nvec, write, ops, data) \
- __platform_msi_create_device_domain(dev, nvec, true, write, ops, data)
+#define platform_msi_create_device_domain(dev, nvec, p_ops, ops, data) \
+ __platform_msi_create_device_domain(dev, nvec, false, p_ops, ops, data)
+#define platform_msi_create_device_tree_domain(dev, nvec, p_ops, ops, data) \
+ __platform_msi_create_device_domain(dev, nvec, true, p_ops, ops, data)
int platform_msi_domain_alloc(struct irq_domain *domain, unsigned int virq,
unsigned int nr_irqs);
The VFIO mediated device for the idxd driver will provide a virtual DSA
device backed by a workqueue. The virtual device will be limited, with the
wq configuration registers set to read-only. Add support and helper
functions for handling a DSA device whose configuration registers are
marked read-only.
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
---
drivers/dma/idxd/device.c | 116 +++++++++++++++++++++++++++++++++++++++++++++
drivers/dma/idxd/idxd.h | 1
drivers/dma/idxd/init.c | 8 +++
drivers/dma/idxd/sysfs.c | 21 +++++---
4 files changed, 137 insertions(+), 9 deletions(-)
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 9032b50b31af..7531ed9c1b81 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -750,3 +750,119 @@ int idxd_device_config(struct idxd_device *idxd)
return 0;
}
+
+static int idxd_wq_load_config(struct idxd_wq *wq)
+{
+ struct idxd_device *idxd = wq->idxd;
+ struct device *dev = &idxd->pdev->dev;
+ int wqcfg_offset;
+ int i;
+
+ wqcfg_offset = idxd->wqcfg_offset + wq->id * 32;
+ memcpy_fromio(&wq->wqcfg, idxd->reg_base + wqcfg_offset, sizeof(union wqcfg));
+
+ wq->size = wq->wqcfg.wq_size;
+ wq->threshold = wq->wqcfg.wq_thresh;
+ if (wq->wqcfg.priv)
+ wq->type = IDXD_WQT_KERNEL;
+
+ /* The driver does not support shared WQ mode in read-only config yet */
+ if (wq->wqcfg.mode == 0 || wq->wqcfg.pasid_en)
+ return -EOPNOTSUPP;
+
+ set_bit(WQ_FLAG_DEDICATED, &wq->flags);
+
+ wq->priority = wq->wqcfg.priority;
+
+ for (i = 0; i < 8; i++) {
+ wqcfg_offset = idxd->wqcfg_offset + wq->id * 32 + i * sizeof(u32);
+ dev_dbg(dev, "WQ[%d][%d][%#x]: %#x\n",
+ wq->id, i, wqcfg_offset, wq->wqcfg.bits[i]);
+ }
+
+ return 0;
+}
+
+static void idxd_group_load_config(struct idxd_group *group)
+{
+ struct idxd_device *idxd = group->idxd;
+ struct device *dev = &idxd->pdev->dev;
+ int i, j, grpcfg_offset;
+
+ /*
+ * Load WQS bit fields
+ * Iterate through all 256 bits 64 bits at a time
+ */
+ for (i = 0; i < 4; i++) {
+ struct idxd_wq *wq;
+
+ grpcfg_offset = idxd->grpcfg_offset + group->id * 64 + i * sizeof(u64);
+ group->grpcfg.wqs[i] = ioread64(idxd->reg_base + grpcfg_offset);
+ dev_dbg(dev, "GRPCFG wq[%d:%d: %#x]: %#llx\n",
+ group->id, i, grpcfg_offset, group->grpcfg.wqs[i]);
+
+ if (i * 64 >= idxd->max_wqs)
+ break;
+
+ /* Iterate through all 64 bits and check for wq set */
+ for (j = 0; j < 64; j++) {
+ int id = i * 64 + j;
+
+ /* No need to check beyond max wqs */
+ if (id >= idxd->max_wqs)
+ break;
+
+ /* Set group assignment for wq if wq bit is set */
+ if (group->grpcfg.wqs[i] & BIT(j)) {
+ wq = &idxd->wqs[id];
+ wq->group = group;
+ }
+ }
+ }
+
+ grpcfg_offset = idxd->grpcfg_offset + group->id * 64 + 32;
+ group->grpcfg.engines = ioread64(idxd->reg_base + grpcfg_offset);
+ dev_dbg(dev, "GRPCFG engs[%d: %#x]: %#llx\n", group->id,
+ grpcfg_offset, group->grpcfg.engines);
+
+ for (i = 0; i < 64; i++) {
+ if (i >= idxd->max_engines)
+ break;
+
+ if (group->grpcfg.engines & BIT(i)) {
+ struct idxd_engine *engine = &idxd->engines[i];
+
+ engine->group = group;
+ }
+ }
+
+ grpcfg_offset = idxd->grpcfg_offset + group->id * 64 + 40;
+ group->grpcfg.flags.bits = ioread32(idxd->reg_base + grpcfg_offset);
+ dev_dbg(dev, "GRPFLAGS flags[%d: %#x]: %#x\n",
+ group->id, grpcfg_offset, group->grpcfg.flags.bits);
+}
+
+int idxd_device_load_config(struct idxd_device *idxd)
+{
+ union gencfg_reg reg;
+ int i, rc;
+
+ reg.bits = ioread32(idxd->reg_base + IDXD_GENCFG_OFFSET);
+ idxd->token_limit = reg.token_limit;
+
+ for (i = 0; i < idxd->max_groups; i++) {
+ struct idxd_group *group = &idxd->groups[i];
+
+ idxd_group_load_config(group);
+ }
+
+ for (i = 0; i < idxd->max_wqs; i++) {
+ struct idxd_wq *wq = &idxd->wqs[i];
+
+ rc = idxd_wq_load_config(wq);
+ if (rc < 0)
+ return rc;
+ }
+
+ return 0;
+}
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 518768885d7b..fcea8bc060f5 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -306,6 +306,7 @@ void idxd_device_cleanup(struct idxd_device *idxd);
int idxd_device_config(struct idxd_device *idxd);
void idxd_device_wqs_clear_state(struct idxd_device *idxd);
void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid);
+int idxd_device_load_config(struct idxd_device *idxd);
/* work queue control */
int idxd_wq_alloc_resources(struct idxd_wq *wq);
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index a7e1dbfcd173..50c68de6b4ab 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -344,6 +344,14 @@ static int idxd_probe(struct idxd_device *idxd)
if (rc)
goto err_setup;
+ /* If the configs are readonly, then load them from device */
+ if (!test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags)) {
+ dev_dbg(dev, "Loading RO device config\n");
+ rc = idxd_device_load_config(idxd);
+ if (rc < 0)
+ goto err_setup;
+ }
+
rc = idxd_setup_interrupts(idxd);
if (rc)
goto err_setup;
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index 57ea94a0dc51..fa1abdf503c2 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -102,7 +102,7 @@ static int idxd_config_bus_match(struct device *dev,
static int idxd_config_bus_probe(struct device *dev)
{
- int rc;
+ int rc = 0;
unsigned long flags;
dev_dbg(dev, "%s called\n", __func__);
@@ -120,7 +120,8 @@ static int idxd_config_bus_probe(struct device *dev)
/* Perform IDXD configuration and enabling */
spin_lock_irqsave(&idxd->dev_lock, flags);
- rc = idxd_device_config(idxd);
+ if (test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+ rc = idxd_device_config(idxd);
spin_unlock_irqrestore(&idxd->dev_lock, flags);
if (rc < 0) {
module_put(THIS_MODULE);
@@ -207,7 +208,8 @@ static int idxd_config_bus_probe(struct device *dev)
}
spin_lock_irqsave(&idxd->dev_lock, flags);
- rc = idxd_device_config(idxd);
+ if (test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+ rc = idxd_device_config(idxd);
spin_unlock_irqrestore(&idxd->dev_lock, flags);
if (rc < 0) {
mutex_unlock(&wq->wq_lock);
@@ -328,13 +330,14 @@ static int idxd_config_bus_remove(struct device *dev)
idxd_unregister_dma_device(idxd);
rc = idxd_device_disable(idxd);
+ if (test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags)) {
+ for (i = 0; i < idxd->max_wqs; i++) {
+ struct idxd_wq *wq = &idxd->wqs[i];
- for (i = 0; i < idxd->max_wqs; i++) {
- struct idxd_wq *wq = &idxd->wqs[i];
-
- mutex_lock(&wq->wq_lock);
- idxd_wq_disable_cleanup(wq);
- mutex_unlock(&wq->wq_lock);
+ mutex_lock(&wq->wq_lock);
+ idxd_wq_disable_cleanup(wq);
+ mutex_unlock(&wq->wq_lock);
+ }
}
module_put(THIS_MODULE);
Add support for requesting an interrupt handle from the device. The interrupt
handle is placed in the interrupt handle field of a descriptor so the device
can determine which interrupt vector to use, be it MSI-X or IMS. On the host
device, the interrupt handle indexes the MSI-X table. This allows a descriptor
to program the interrupt handle 1:1 with the MSI-X index without obtaining it
from the request-interrupt-handle device command. For a guest device, the
index can be any index that the host assigned in the IMS table, and therefore
it must be requested from the virtual device during MSI-X setup by the driver
running on the guest.
On the actual hardware, MSI-X vector 0 is the misc interrupt and handles
events such as administrative command completion, error reporting, and
performance monitor overflow. MSI-X vectors 1...N are used for descriptor
completion interrupts. On the guest kernel, the MSI-X interrupts are backed
by the mediated device through emulation or IMS vectors. Vector 0 is handled
through emulation by the host vdcm; it only requires the host driver to send
the signal to qemu. Vector 1 (and more may be supported later) is backed by
IMS.
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
---
drivers/dma/idxd/device.c | 30 ++++++++++++++++++++++++++++++
drivers/dma/idxd/idxd.h | 11 +++++++++++
drivers/dma/idxd/init.c | 29 +++++++++++++++++++++++++++++
drivers/dma/idxd/registers.h | 4 +++-
drivers/dma/idxd/submit.c | 29 ++++++++++++++++++++++-------
5 files changed, 95 insertions(+), 8 deletions(-)
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 7531ed9c1b81..2b4e8ab99ebd 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -502,6 +502,36 @@ void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid)
dev_dbg(dev, "pasid %d drained\n", pasid);
}
+#define INT_HANDLE_IMS_TABLE 0x10000
+int idxd_device_request_int_handle(struct idxd_device *idxd, int idx,
+ int *handle, enum idxd_interrupt_type irq_type)
+{
+ struct device *dev = &idxd->pdev->dev;
+ u32 operand, status;
+
+ if (!(idxd->hw.cmd_cap & BIT(IDXD_CMD_REQUEST_INT_HANDLE)))
+ return -EOPNOTSUPP;
+
+ dev_dbg(dev, "get int handle, idx %d\n", idx);
+
+ operand = idx & 0xffff;
+ if (irq_type == IDXD_IRQ_IMS)
+ operand |= INT_HANDLE_IMS_TABLE;
+ dev_dbg(dev, "cmd: %u operand: %#x\n",
+ IDXD_CMD_REQUEST_INT_HANDLE, operand);
+ idxd_cmd_exec(idxd, IDXD_CMD_REQUEST_INT_HANDLE, operand, &status);
+
+ if ((status & 0xff) != IDXD_CMDSTS_SUCCESS) {
+ dev_dbg(dev, "request int handle failed: %#x\n", status);
+ return -ENXIO;
+ }
+
+ *handle = (status >> 8) & 0xffff;
+
+ dev_dbg(dev, "int handle acquired: %u\n", *handle);
+ return 0;
+}
+
/* Device configuration bits */
static void idxd_group_config_write(struct idxd_group *group)
{
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index fcea8bc060f5..2cd190a3da73 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -138,6 +138,7 @@ struct idxd_hw {
union group_cap_reg group_cap;
union engine_cap_reg engine_cap;
struct opcap opcap;
+ u32 cmd_cap;
};
enum idxd_device_state {
@@ -201,6 +202,8 @@ struct idxd_device {
struct dma_device dma_dev;
struct workqueue_struct *wq;
struct work_struct work;
+
+ int *int_handles;
};
/* IDXD software descriptor */
@@ -214,6 +217,7 @@ struct idxd_desc {
struct list_head list;
int id;
int cpu;
+ unsigned int vec_ptr;
struct idxd_wq *wq;
};
@@ -242,6 +246,11 @@ enum idxd_portal_prot {
IDXD_PORTAL_LIMITED,
};
+enum idxd_interrupt_type {
+ IDXD_IRQ_MSIX = 0,
+ IDXD_IRQ_IMS,
+};
+
static inline int idxd_get_wq_portal_offset(enum idxd_portal_prot prot)
{
return prot * 0x1000;
@@ -307,6 +316,8 @@ int idxd_device_config(struct idxd_device *idxd);
void idxd_device_wqs_clear_state(struct idxd_device *idxd);
void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid);
int idxd_device_load_config(struct idxd_device *idxd);
+int idxd_device_request_int_handle(struct idxd_device *idxd, int idx, int *handle,
+ enum idxd_interrupt_type irq_type);
/* work queue control */
int idxd_wq_alloc_resources(struct idxd_wq *wq);
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 50c68de6b4ab..9fd505a03444 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -132,6 +132,22 @@ static int idxd_setup_interrupts(struct idxd_device *idxd)
}
dev_dbg(dev, "Allocated idxd-msix %d for vector %d\n",
i, msix->vector);
+
+ if (idxd->hw.cmd_cap & BIT(IDXD_CMD_REQUEST_INT_HANDLE)) {
+ /*
+ * The MSIX vector enumeration starts at 1 with vector 0 being the
+ * misc interrupt that handles non I/O completion events. The
+ * interrupt handles are for IMS enumeration on guest. The misc
+ * interrupt vector does not require a handle and therefore we start
+ * the int_handles at index 0. Since 'i' starts at 1, the first
+ * int_handles index will be 0.
+ */
+ rc = idxd_device_request_int_handle(idxd, i, &idxd->int_handles[i - 1],
+ IDXD_IRQ_MSIX);
+ if (rc < 0)
+ goto err_no_irq;
+ dev_dbg(dev, "int handle requested: %u\n", idxd->int_handles[i - 1]);
+ }
}
idxd_unmask_error_interrupts(idxd);
@@ -159,6 +175,13 @@ static int idxd_setup_internals(struct idxd_device *idxd)
int i;
init_waitqueue_head(&idxd->cmd_waitq);
+
+ if (idxd->hw.cmd_cap & BIT(IDXD_CMD_REQUEST_INT_HANDLE)) {
+ idxd->int_handles = devm_kcalloc(dev, idxd->max_wqs, sizeof(int), GFP_KERNEL);
+ if (!idxd->int_handles)
+ return -ENOMEM;
+ }
+
idxd->groups = devm_kcalloc(dev, idxd->max_groups,
sizeof(struct idxd_group), GFP_KERNEL);
if (!idxd->groups)
@@ -230,6 +253,12 @@ static void idxd_read_caps(struct idxd_device *idxd)
/* reading generic capabilities */
idxd->hw.gen_cap.bits = ioread64(idxd->reg_base + IDXD_GENCAP_OFFSET);
dev_dbg(dev, "gen_cap: %#llx\n", idxd->hw.gen_cap.bits);
+
+ if (idxd->hw.gen_cap.cmd_cap) {
+ idxd->hw.cmd_cap = ioread32(idxd->reg_base + IDXD_CMDCAP_OFFSET);
+ dev_dbg(dev, "cmd_cap: %#x\n", idxd->hw.cmd_cap);
+ }
+
idxd->max_xfer_bytes = 1ULL << idxd->hw.gen_cap.max_xfer_shift;
dev_dbg(dev, "max xfer size: %llu bytes\n", idxd->max_xfer_bytes);
idxd->max_batch_size = 1U << idxd->hw.gen_cap.max_batch_shift;
diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index a0df4f3fe1fb..ace7248ee195 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -23,8 +23,8 @@ union gen_cap_reg {
u64 overlap_copy:1;
u64 cache_control_mem:1;
u64 cache_control_cache:1;
+ u64 cmd_cap:1;
u64 rsvd:3;
- u64 int_handle_req:1;
u64 dest_readback:1;
u64 drain_readback:1;
u64 rsvd2:6;
@@ -223,6 +223,8 @@ enum idxd_cmdsts_err {
IDXD_CMDSTS_ERR_NO_HANDLE,
};
+#define IDXD_CMDCAP_OFFSET 0xb0
+
#define IDXD_SWERR_OFFSET 0xc0
#define IDXD_SWERR_VALID 0x00000001
#define IDXD_SWERR_OVERFLOW 0x00000002
diff --git a/drivers/dma/idxd/submit.c b/drivers/dma/idxd/submit.c
index 3e63d820a98e..70c7703a4495 100644
--- a/drivers/dma/idxd/submit.c
+++ b/drivers/dma/idxd/submit.c
@@ -22,11 +22,17 @@ static struct idxd_desc *__get_desc(struct idxd_wq *wq, int idx, int cpu)
desc->hw->pasid = idxd->pasid;
/*
- * Descriptor completion vectors are 1-8 for MSIX. We will round
- * robin through the 8 vectors.
+ * Descriptor completion vectors are 1...N for MSIX. We will round
+ * robin through the N vectors.
*/
wq->vec_ptr = (wq->vec_ptr % idxd->num_wq_irqs) + 1;
- desc->hw->int_handle = wq->vec_ptr;
+ if (!idxd->int_handles) {
+ desc->hw->int_handle = wq->vec_ptr;
+ } else {
+ desc->vec_ptr = wq->vec_ptr;
+ desc->hw->int_handle = idxd->int_handles[desc->vec_ptr];
+ }
+
return desc;
}
@@ -79,7 +85,6 @@ void idxd_free_desc(struct idxd_wq *wq, struct idxd_desc *desc)
int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc)
{
struct idxd_device *idxd = wq->idxd;
- int vec = desc->hw->int_handle;
void __iomem *portal;
if (idxd->state != IDXD_DEV_ENABLED)
@@ -110,9 +115,19 @@ int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc)
* Pending the descriptor to the lockless list for the irq_entry
* that we designated the descriptor to.
*/
- if (desc->hw->flags & IDXD_OP_FLAG_RCI)
- llist_add(&desc->llnode,
- &idxd->irq_entries[vec].pending_llist);
+ if (desc->hw->flags & IDXD_OP_FLAG_RCI) {
+ int vec;
+
+ /*
+ * If the driver is on host kernel, it would be the value
+ * assigned to interrupt handle, which is index for MSIX
+ * vector. If it's guest then can't use the int_handle since
+ * that is the index to IMS for the entire device. The guest
+ * device local index will be used.
+ */
+ vec = !idxd->int_handles ? desc->hw->int_handle : desc->vec_ptr;
+ llist_add(&desc->llnode, &idxd->irq_entries[vec].pending_llist);
+ }
return 0;
}
From: Megha Dey <[email protected]>
When DEV_MSI is enabled, the dev_msi_default_domain is updated to the
base DEV-MSI irq domain. If interrupt remapping is enabled, we create
a new IR-DEV-MSI irq domain and update dev_msi_default_domain to
point to it.
For X86, introduce a new irq_alloc_type which will be used by the
interrupt remapping driver.
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Megha Dey <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
---
arch/x86/include/asm/hw_irq.h | 1 +
arch/x86/kernel/apic/msi.c | 12 ++++++
drivers/base/dev-msi.c | 66 +++++++++++++++++++++++++++++++----
drivers/iommu/intel/irq_remapping.c | 11 +++++-
include/linux/intel-iommu.h | 1 +
include/linux/irqdomain.h | 11 ++++++
include/linux/msi.h | 3 ++
7 files changed, 96 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 8ecd7570589d..bdddd63add41 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -40,6 +40,7 @@ enum irq_alloc_type {
X86_IRQ_ALLOC_TYPE_MSIX,
X86_IRQ_ALLOC_TYPE_DMAR,
X86_IRQ_ALLOC_TYPE_UV,
+ X86_IRQ_ALLOC_TYPE_DEV_MSI,
};
struct irq_alloc_info {
diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 5cbaca58af95..8b25cadbae09 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -507,3 +507,15 @@ int hpet_assign_irq(struct irq_domain *domain, struct hpet_channel *hc,
return irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, &info);
}
#endif
+
+#ifdef CONFIG_DEV_MSI
+int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
+ int nvec, msi_alloc_info_t *arg)
+{
+ memset(arg, 0, sizeof(*arg));
+
+ arg->type = X86_IRQ_ALLOC_TYPE_DEV_MSI;
+
+ return 0;
+}
+#endif
diff --git a/drivers/base/dev-msi.c b/drivers/base/dev-msi.c
index 240ccc353933..43d6ed3ba10f 100644
--- a/drivers/base/dev-msi.c
+++ b/drivers/base/dev-msi.c
@@ -5,6 +5,7 @@
* Author: Megha Dey <[email protected]>
*/
+#include <linux/device.h>
#include <linux/irq.h>
#include <linux/irqdomain.h>
#include <linux/msi.h>
@@ -32,7 +33,7 @@ static void dev_msi_set_desc(msi_alloc_info_t *arg, struct msi_desc *desc)
arg->hwirq = dev_msi_calc_hwirq(desc);
}
-static int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
+int __weak dev_msi_prepare(struct irq_domain *domain, struct device *dev,
int nvec, msi_alloc_info_t *arg)
{
memset(arg, 0, sizeof(*arg));
@@ -81,15 +82,66 @@ static int __init create_dev_msi_domain(void)
if (!fn)
return -ENXIO;
- dev_msi_default_domain = msi_create_irq_domain(fn, &dev_msi_domain_info, parent);
+ /*
+ * This initcall may come after remap code is initialized. Ensure that
+ * dev_msi_default domain is updated correctly.
+ */
if (!dev_msi_default_domain) {
- pr_warn("failed to initialize irqdomain for DEV-MSI.\n");
- return -ENXIO;
+ dev_msi_default_domain = msi_create_irq_domain(fn, &dev_msi_domain_info, parent);
+ if (!dev_msi_default_domain) {
+ pr_warn("failed to initialize irqdomain for DEV-MSI.\n");
+ return -ENXIO;
+ }
+
+ irq_domain_update_bus_token(dev_msi_default_domain, DOMAIN_BUS_PLATFORM_MSI);
+ irq_domain_free_fwnode(fn);
}
- irq_domain_update_bus_token(dev_msi_default_domain, DOMAIN_BUS_PLATFORM_MSI);
- irq_domain_free_fwnode(fn);
-
return 0;
}
device_initcall(create_dev_msi_domain);
+
+#ifdef CONFIG_IRQ_REMAP
+static struct irq_chip dev_msi_ir_controller = {
+ .name = "IR-DEV-MSI",
+ .irq_unmask = platform_msi_unmask_irq,
+ .irq_mask = platform_msi_mask_irq,
+ .irq_write_msi_msg = platform_msi_write_msg,
+ .irq_ack = irq_chip_ack_parent,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .irq_set_vcpu_affinity = irq_chip_set_vcpu_affinity_parent,
+ .flags = IRQCHIP_SKIP_SET_WAKE,
+};
+
+static struct msi_domain_info dev_msi_ir_domain_info = {
+ .flags = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
+ .ops = &dev_msi_domain_ops,
+ .chip = &dev_msi_ir_controller,
+ .handler = handle_edge_irq,
+ .handler_name = "edge",
+};
+
+struct irq_domain *create_remap_dev_msi_irq_domain(struct irq_domain *parent,
+ const char *name)
+{
+ struct fwnode_handle *fn;
+ struct irq_domain *domain;
+
+ fn = irq_domain_alloc_named_fwnode(name);
+ if (!fn)
+ return NULL;
+
+ domain = msi_create_irq_domain(fn, &dev_msi_ir_domain_info, parent);
+ if (!domain) {
+ pr_warn("failed to initialize irqdomain for IR-DEV-MSI.\n");
+ return ERR_PTR(-ENXIO);
+ }
+
+ irq_domain_update_bus_token(domain, DOMAIN_BUS_PLATFORM_MSI);
+
+ if (!dev_msi_default_domain)
+ dev_msi_default_domain = domain;
+
+ return domain;
+}
+#endif
diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 7f8769800815..51872aabe5f8 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -573,6 +573,10 @@ static int intel_setup_irq_remapping(struct intel_iommu *iommu)
"INTEL-IR-MSI",
iommu->seq_id);
+ iommu->ir_dev_msi_domain =
+ create_remap_dev_msi_irq_domain(iommu->ir_domain,
+ "INTEL-IR-DEV-MSI");
+
ir_table->base = page_address(pages);
ir_table->bitmap = bitmap;
iommu->ir_table = ir_table;
@@ -1299,9 +1303,10 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
case X86_IRQ_ALLOC_TYPE_HPET:
case X86_IRQ_ALLOC_TYPE_MSI:
case X86_IRQ_ALLOC_TYPE_MSIX:
+ case X86_IRQ_ALLOC_TYPE_DEV_MSI:
if (info->type == X86_IRQ_ALLOC_TYPE_HPET)
set_hpet_sid(irte, info->hpet_id);
- else
+ else if (info->type != X86_IRQ_ALLOC_TYPE_DEV_MSI)
set_msi_sid(irte, info->msi_dev);
msg->address_hi = MSI_ADDR_BASE_HI;
@@ -1353,8 +1358,10 @@ static int intel_irq_remapping_alloc(struct irq_domain *domain,
if (!info || !iommu)
return -EINVAL;
+
if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_MSI &&
- info->type != X86_IRQ_ALLOC_TYPE_MSIX)
+ info->type != X86_IRQ_ALLOC_TYPE_MSIX &&
+ info->type != X86_IRQ_ALLOC_TYPE_DEV_MSI)
return -EINVAL;
/*
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index d129baf7e0b8..3b868d1c43df 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -596,6 +596,7 @@ struct intel_iommu {
struct ir_table *ir_table; /* Interrupt remapping info */
struct irq_domain *ir_domain;
struct irq_domain *ir_msi_domain;
+ struct irq_domain *ir_dev_msi_domain;
#endif
struct iommu_device iommu; /* IOMMU core code handle */
int node;
diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index b37350c4fe37..e537d7b50cee 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -589,6 +589,17 @@ irq_domain_hierarchical_is_msi_remap(struct irq_domain *domain)
}
#endif /* CONFIG_IRQ_DOMAIN_HIERARCHY */
+#if defined(CONFIG_DEV_MSI) && defined(CONFIG_IRQ_REMAP)
+extern struct irq_domain *create_remap_dev_msi_irq_domain(struct irq_domain *parent,
+ const char *name);
+#else
+static inline struct irq_domain *create_remap_dev_msi_irq_domain(struct irq_domain *parent,
+ const char *name)
+{
+ return NULL;
+}
+#endif
+
#else /* CONFIG_IRQ_DOMAIN */
static inline void irq_dispose_mapping(unsigned int virq) { }
static inline struct irq_domain *irq_find_matching_fwnode(
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 1da97f905720..7098ba566bcd 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -378,6 +378,9 @@ void *platform_msi_get_host_data(struct irq_domain *domain);
void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg);
void platform_msi_unmask_irq(struct irq_data *data);
void platform_msi_mask_irq(struct irq_data *data);
+
+int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
+ int nvec, msi_alloc_info_t *arg);
#endif /* CONFIG_GENERIC_MSI_IRQ_DOMAIN */
#ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
In preparation for VFIO mediated device support in the idxd driver, enabling
of Interrupt Message Store (IMS) interrupts is added to the idxd base driver.
DEV-MSI is the generic kernel support that mechanisms like IMS can use to get
their interrupts enabled. With IMS support the idxd driver can dynamically
allocate interrupts on a per-mdev basis, based on how many IMS vectors are
mapped to the mdev device. This commit only provides the support functions in
the base driver, not their use by the VFIO mdev code.
The commit also has some portal-related changes. A "portal" is a special
location within MMIO BAR2 of the DSA device where descriptors are submitted
via the CPU instruction MOVDIR64B or ENQCMD(S). The offset of the portal
address determines whether the submitted descriptor is for MSI-X or IMS
notification.
See Intel SIOV spec for more details:
https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
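For reference, with the helpers modified below (prot * 0x1000 + irq_type * 0x2000 on top of the
per-wq 16KB slot), the portal selection works out as in this small worked example
(PAGE_SHIFT == 12 on x86):

/* Worked example using the helpers changed in this patch */
int msix_portal = idxd_get_wq_portal_full_offset(0, IDXD_PORTAL_LIMITED, IDXD_IRQ_MSIX);
/* wq 0: ((0 * 4) << PAGE_SHIFT) + 1 * 0x1000 + 0 * 0x2000 = 0x1000 */

int ims_portal = idxd_get_wq_portal_full_offset(1, IDXD_PORTAL_LIMITED, IDXD_IRQ_IMS);
/* wq 1: ((1 * 4) << PAGE_SHIFT) + 1 * 0x1000 + 1 * 0x2000 = 0x7000 */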
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
---
drivers/dma/idxd/Makefile | 2 +-
drivers/dma/idxd/cdev.c | 4 ++-
drivers/dma/idxd/idxd.h | 14 ++++++++----
drivers/dma/idxd/ims.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
drivers/dma/idxd/init.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
drivers/dma/idxd/submit.c | 10 +++++++-
drivers/dma/idxd/sysfs.c | 11 +++++++++
7 files changed, 137 insertions(+), 9 deletions(-)
create mode 100644 drivers/dma/idxd/ims.c
diff --git a/drivers/dma/idxd/Makefile b/drivers/dma/idxd/Makefile
index 8978b898d777..d1519b9d1dd0 100644
--- a/drivers/dma/idxd/Makefile
+++ b/drivers/dma/idxd/Makefile
@@ -1,2 +1,2 @@
obj-$(CONFIG_INTEL_IDXD) += idxd.o
-idxd-y := init.o irq.o device.o sysfs.o submit.o dma.o cdev.o
+idxd-y := init.o irq.o device.o sysfs.o submit.o dma.o cdev.o ims.o
diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c
index d4841d8c0928..ffeae646f947 100644
--- a/drivers/dma/idxd/cdev.c
+++ b/drivers/dma/idxd/cdev.c
@@ -202,8 +202,8 @@ static int idxd_cdev_mmap(struct file *filp, struct vm_area_struct *vma)
return rc;
vma->vm_flags |= VM_DONTCOPY;
- pfn = (base + idxd_get_wq_portal_full_offset(wq->id,
- IDXD_PORTAL_LIMITED)) >> PAGE_SHIFT;
+ pfn = (base + idxd_get_wq_portal_full_offset(wq->id, IDXD_PORTAL_LIMITED,
+ IDXD_IRQ_MSIX)) >> PAGE_SHIFT;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_private_data = ctx;
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 2cd190a3da73..c65fb6dcb7e0 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -152,6 +152,7 @@ enum idxd_device_flag {
IDXD_FLAG_CONFIGURABLE = 0,
IDXD_FLAG_CMD_RUNNING,
IDXD_FLAG_PASID_ENABLED,
+ IDXD_FLAG_SIOV_SUPPORTED,
};
struct idxd_device {
@@ -178,6 +179,7 @@ struct idxd_device {
int num_groups;
+ u32 ims_offset;
u32 msix_perm_offset;
u32 wqcfg_offset;
u32 grpcfg_offset;
@@ -185,6 +187,7 @@ struct idxd_device {
u64 max_xfer_bytes;
u32 max_batch_size;
+ int ims_size;
int max_groups;
int max_engines;
int max_tokens;
@@ -204,6 +207,7 @@ struct idxd_device {
struct work_struct work;
int *int_handles;
+ struct sbitmap ims_sbmap;
};
/* IDXD software descriptor */
@@ -251,15 +255,17 @@ enum idxd_interrupt_type {
IDXD_IRQ_IMS,
};
-static inline int idxd_get_wq_portal_offset(enum idxd_portal_prot prot)
+static inline int idxd_get_wq_portal_offset(enum idxd_portal_prot prot,
+ enum idxd_interrupt_type irq_type)
{
- return prot * 0x1000;
+ return prot * 0x1000 + irq_type * 0x2000;
}
static inline int idxd_get_wq_portal_full_offset(int wq_id,
- enum idxd_portal_prot prot)
+ enum idxd_portal_prot prot,
+ enum idxd_interrupt_type irq_type)
{
- return ((wq_id * 4) << PAGE_SHIFT) + idxd_get_wq_portal_offset(prot);
+ return ((wq_id * 4) << PAGE_SHIFT) + idxd_get_wq_portal_offset(prot, irq_type);
}
static inline void idxd_set_type(struct idxd_device *idxd)
diff --git a/drivers/dma/idxd/ims.c b/drivers/dma/idxd/ims.c
new file mode 100644
index 000000000000..5fece66122a2
--- /dev/null
+++ b/drivers/dma/idxd/ims.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/device.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/msi.h>
+#include <uapi/linux/idxd.h>
+#include "registers.h"
+#include "idxd.h"
+
+static void idxd_free_ims_index(struct idxd_device *idxd,
+ unsigned long ims_idx)
+{
+ sbitmap_clear_bit(&idxd->ims_sbmap, ims_idx);
+}
+
+static int idxd_alloc_ims_index(struct idxd_device *idxd)
+{
+ int index;
+
+ index = sbitmap_get(&idxd->ims_sbmap, 0, false);
+ if (index < 0)
+ return -ENOSPC;
+ return index;
+}
+
+static unsigned int idxd_ims_irq_mask(struct msi_desc *desc)
+{
+ // Filled out later when VDCM is introduced.
+
+ return 0;
+}
+
+static unsigned int idxd_ims_irq_unmask(struct msi_desc *desc)
+{
+ // Filled out later when VDCM is introduced.
+
+ return 0;
+}
+
+static void idxd_ims_write_msg(struct msi_desc *desc, struct msi_msg *msg)
+{
+ // Filled out later when VDCM is introduced.
+}
+
+static struct platform_msi_ops idxd_ims_ops = {
+ .irq_mask = idxd_ims_irq_mask,
+ .irq_unmask = idxd_ims_irq_unmask,
+ .write_msg = idxd_ims_write_msg,
+};
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 9fd505a03444..3e2c7ac83daf 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -241,10 +241,51 @@ static void idxd_read_table_offsets(struct idxd_device *idxd)
idxd->msix_perm_offset = offsets.msix_perm * 0x100;
dev_dbg(dev, "IDXD MSIX Permission Offset: %#x\n",
idxd->msix_perm_offset);
+ idxd->ims_offset = offsets.ims * 0x100;
+ dev_dbg(dev, "IDXD IMS Offset: %#x\n", idxd->ims_offset);
idxd->perfmon_offset = offsets.perfmon * 0x100;
dev_dbg(dev, "IDXD Perfmon Offset: %#x\n", idxd->perfmon_offset);
}
+#define PCI_DEVSEC_CAP 0x23
+#define SIOVDVSEC1(offset) ((offset) + 0x4)
+#define SIOVDVSEC2(offset) ((offset) + 0x8)
+#define DVSECID 0x5
+#define SIOVCAP(offset) ((offset) + 0x14)
+
+static void idxd_check_siov(struct idxd_device *idxd)
+{
+ struct pci_dev *pdev = idxd->pdev;
+ struct device *dev = &pdev->dev;
+ int dvsec;
+ u16 val16;
+ u32 val32;
+
+ dvsec = pci_find_ext_capability(pdev, PCI_DEVSEC_CAP);
+ pci_read_config_word(pdev, SIOVDVSEC1(dvsec), &val16);
+ if (val16 != PCI_VENDOR_ID_INTEL) {
+ dev_dbg(&pdev->dev, "DVSEC vendor id is not Intel\n");
+ return;
+ }
+
+ pci_read_config_word(pdev, SIOVDVSEC2(dvsec), &val16);
+ if (val16 != DVSECID) {
+ dev_dbg(&pdev->dev, "DVSEC ID is not SIOV\n");
+ return;
+ }
+
+ pci_read_config_dword(pdev, SIOVCAP(dvsec), &val32);
+ if ((val32 & 0x1) && idxd->hw.gen_cap.max_ims_mult) {
+ idxd->ims_size = idxd->hw.gen_cap.max_ims_mult * 256ULL;
+ dev_dbg(dev, "IMS size: %u\n", idxd->ims_size);
+ set_bit(IDXD_FLAG_SIOV_SUPPORTED, &idxd->flags);
+ dev_dbg(&pdev->dev, "IMS supported for device\n");
+ return;
+ }
+
+ dev_dbg(&pdev->dev, "SIOV unsupported for device\n");
+}
+
static void idxd_read_caps(struct idxd_device *idxd)
{
struct device *dev = &idxd->pdev->dev;
@@ -263,6 +304,7 @@ static void idxd_read_caps(struct idxd_device *idxd)
dev_dbg(dev, "max xfer size: %llu bytes\n", idxd->max_xfer_bytes);
idxd->max_batch_size = 1U << idxd->hw.gen_cap.max_batch_shift;
dev_dbg(dev, "max batch size: %u\n", idxd->max_batch_size);
+ idxd_check_siov(idxd);
if (idxd->hw.gen_cap.config_en)
set_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags);
@@ -397,9 +439,19 @@ static int idxd_probe(struct idxd_device *idxd)
idxd->major = idxd_cdev_get_major(idxd);
+ if (idxd->ims_size) {
+ rc = sbitmap_init_node(&idxd->ims_sbmap, idxd->ims_size, -1,
+ GFP_KERNEL, dev_to_node(dev));
+ if (rc < 0)
+ goto sbitmap_fail;
+ }
dev_dbg(dev, "IDXD device %d probed successfully\n", idxd->id);
return 0;
+ sbitmap_fail:
+ mutex_lock(&idxd_idr_lock);
+ idr_remove(&idxd_idrs[idxd->type], idxd->id);
+ mutex_unlock(&idxd_idr_lock);
err_idr_fail:
idxd_mask_error_interrupts(idxd);
idxd_mask_msix_vectors(idxd);
diff --git a/drivers/dma/idxd/submit.c b/drivers/dma/idxd/submit.c
index 70c7703a4495..f0a0a0d21a7a 100644
--- a/drivers/dma/idxd/submit.c
+++ b/drivers/dma/idxd/submit.c
@@ -30,7 +30,13 @@ static struct idxd_desc *__get_desc(struct idxd_wq *wq, int idx, int cpu)
desc->hw->int_handle = wq->vec_ptr;
} else {
desc->vec_ptr = wq->vec_ptr;
- desc->hw->int_handle = idxd->int_handles[desc->vec_ptr];
+ /*
+ * int_handles are only for descriptor completion. However for device
+ * MSIX enumeration, vec 0 is used for misc interrupts. Therefore even
+ * though we are rotating through 1...N for descriptor interrupts, we
+ * need to acquire the int_handles from 0..N-1.
+ */
+ desc->hw->int_handle = idxd->int_handles[desc->vec_ptr - 1];
}
return desc;
@@ -90,7 +96,7 @@ int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc)
if (idxd->state != IDXD_DEV_ENABLED)
return -EIO;
- portal = wq->portal + idxd_get_wq_portal_offset(IDXD_PORTAL_LIMITED);
+ portal = wq->portal + idxd_get_wq_portal_offset(IDXD_PORTAL_LIMITED, IDXD_IRQ_MSIX);
/*
* The wmb() flushes writes to coherent DMA data before
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index fa1abdf503c2..501a1d489ce3 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -1269,6 +1269,16 @@ static ssize_t numa_node_show(struct device *dev,
}
static DEVICE_ATTR_RO(numa_node);
+static ssize_t ims_size_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct idxd_device *idxd =
+ container_of(dev, struct idxd_device, conf_dev);
+
+ return sprintf(buf, "%u\n", idxd->ims_size);
+}
+static DEVICE_ATTR_RO(ims_size);
+
static ssize_t max_batch_size_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -1455,6 +1465,7 @@ static struct attribute *idxd_device_attributes[] = {
&dev_attr_max_work_queues_size.attr,
&dev_attr_max_engines.attr,
&dev_attr_numa_node.attr,
+ &dev_attr_ims_size.attr,
&dev_attr_max_batch_size.attr,
&dev_attr_max_transfer_size.attr,
&dev_attr_op_cap.attr,
Create a mediated device through the VFIO mediated device framework. The
mdev framework allows the driver to create a mediated device from a
portion of the device's resources. The driver emulates the slow path
such as the PCI config space, MMIO bar, and the command registers. The
descriptor submission portal(s) are mmapped to the guest so that
descriptors can be submitted directly by the guest kernel or apps. The
mediated device support code in idxd is referred to as the Virtual
Device Composition Module (vdcm). Add basic plumbing to fill out the
mdev_parent_ops struct that VFIO mdev requires to support a mediated
device.
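As a hedged sketch of that plumbing (the mdev_parent_ops field names come from the VFIO mdev
framework of this kernel generation; the idxd_vdcm_* callbacks, idxd_mdev_type_groups, and the
body of idxd_mdev_host_init() here are illustrative only — the real code in mdev.c below also
sets up PASID/aux-domain support):

static const struct mdev_parent_ops idxd_vdcm_ops = {
        .owner                 = THIS_MODULE,
        .supported_type_groups = idxd_mdev_type_groups, /* e.g. the "1dwq" type */
        .create                = idxd_vdcm_create,
        .remove                = idxd_vdcm_remove,
        .open                  = idxd_vdcm_open,
        .release               = idxd_vdcm_release,
        .read                  = idxd_vdcm_read,        /* emulated config/MMIO reads */
        .write                 = idxd_vdcm_write,       /* emulated config/MMIO writes */
        .ioctl                 = idxd_vdcm_ioctl,       /* VFIO uAPI ioctls */
        .mmap                  = idxd_vdcm_mmap,        /* map wq portal into the guest */
};

int idxd_mdev_host_init(struct idxd_device *idxd)
{
        /* register the parent (host) device with the mdev core */
        return mdev_register_device(&idxd->pdev->dev, &idxd_vdcm_ops);
}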
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
---
drivers/dma/Kconfig | 6
drivers/dma/idxd/Makefile | 4
drivers/dma/idxd/idxd.h | 11 +
drivers/dma/idxd/ims.c | 13 +
drivers/dma/idxd/ims.h | 10
drivers/dma/idxd/init.c | 11 +
drivers/dma/idxd/mdev.c | 980 +++++++++++++++++++++++++++++++++++++++++++++
drivers/dma/idxd/mdev.h | 118 +++++
drivers/dma/idxd/vdev.c | 76 +++
drivers/dma/idxd/vdev.h | 19 +
10 files changed, 1247 insertions(+), 1 deletion(-)
create mode 100644 drivers/dma/idxd/ims.h
create mode 100644 drivers/dma/idxd/mdev.c
create mode 100644 drivers/dma/idxd/mdev.h
create mode 100644 drivers/dma/idxd/vdev.c
create mode 100644 drivers/dma/idxd/vdev.h
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 6a908785a5f7..69c1ae72df86 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -306,6 +306,12 @@ config INTEL_IDXD_SVM
depends on PCI_PASID
depends on PCI_IOV
+config INTEL_IDXD_MDEV
+ bool "IDXD VFIO Mediated Device Support"
+ depends on INTEL_IDXD
+ depends on VFIO_MDEV
+ depends on VFIO_MDEV_DEVICE
+
config INTEL_IOATDMA
tristate "Intel I/OAT DMA support"
depends on PCI && X86_64
diff --git a/drivers/dma/idxd/Makefile b/drivers/dma/idxd/Makefile
index d1519b9d1dd0..18622f81eb3f 100644
--- a/drivers/dma/idxd/Makefile
+++ b/drivers/dma/idxd/Makefile
@@ -1,2 +1,4 @@
obj-$(CONFIG_INTEL_IDXD) += idxd.o
-idxd-y := init.o irq.o device.o sysfs.o submit.o dma.o cdev.o ims.o
+idxd-y := init.o irq.o device.o sysfs.o submit.o dma.o cdev.o
+
+idxd-$(CONFIG_INTEL_IDXD_MDEV) += ims.o mdev.o vdev.o
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 438d6478a3f8..9588872cd273 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -121,6 +121,7 @@ struct idxd_wq {
struct sbitmap_queue sbq;
struct dma_chan dma_chan;
char name[WQ_NAME_SIZE + 1];
+ struct list_head vdcm_list;
};
struct idxd_engine {
@@ -153,6 +154,7 @@ enum idxd_device_flag {
IDXD_FLAG_CMD_RUNNING,
IDXD_FLAG_PASID_ENABLED,
IDXD_FLAG_SIOV_SUPPORTED,
+ IDXD_FLAG_MDEV_ENABLED,
};
struct idxd_device {
@@ -245,6 +247,11 @@ static inline bool device_pasid_enabled(struct idxd_device *idxd)
return test_bit(IDXD_FLAG_PASID_ENABLED, &idxd->flags);
}
+static inline bool device_mdev_enabled(struct idxd_device *idxd)
+{
+ return test_bit(IDXD_FLAG_MDEV_ENABLED, &idxd->flags);
+}
+
enum idxd_portal_prot {
IDXD_PORTAL_UNLIMITED = 0,
IDXD_PORTAL_LIMITED,
@@ -363,4 +370,8 @@ int idxd_cdev_get_major(struct idxd_device *idxd);
int idxd_wq_add_cdev(struct idxd_wq *wq);
void idxd_wq_del_cdev(struct idxd_wq *wq);
+/* mdev */
+int idxd_mdev_host_init(struct idxd_device *idxd);
+void idxd_mdev_host_release(struct idxd_device *idxd);
+
#endif
diff --git a/drivers/dma/idxd/ims.c b/drivers/dma/idxd/ims.c
index 5fece66122a2..bffc74c2b305 100644
--- a/drivers/dma/idxd/ims.c
+++ b/drivers/dma/idxd/ims.c
@@ -10,6 +10,19 @@
#include <uapi/linux/idxd.h>
#include "registers.h"
#include "idxd.h"
+#include "mdev.h"
+
+int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd)
+{
+ /* PLACEHOLDER */
+ return 0;
+}
+
+int vidxd_free_ims_entries(struct vdcm_idxd *vidxd)
+{
+ /* PLACEHOLDER */
+ return 0;
+}
static void idxd_free_ims_index(struct idxd_device *idxd,
unsigned long ims_idx)
diff --git a/drivers/dma/idxd/ims.h b/drivers/dma/idxd/ims.h
new file mode 100644
index 000000000000..3d823606e3a3
--- /dev/null
+++ b/drivers/dma/idxd/ims.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+
+#ifndef _IDXD_IMS_H_
+#define _IDXD_IMS_H_
+
+int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd);
+int vidxd_free_ims_entries(struct vdcm_idxd *vidxd);
+
+#endif
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 3e2c7ac83daf..639ca74ae1f8 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -211,6 +211,7 @@ static int idxd_setup_internals(struct idxd_device *idxd)
wq->idxd = idxd;
mutex_init(&wq->wq_lock);
wq->idxd_cdev.minor = -1;
+ INIT_LIST_HEAD(&wq->vdcm_list);
}
for (i = 0; i < idxd->max_engines; i++) {
@@ -507,6 +508,14 @@ static int idxd_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
return -ENODEV;
}
+ if (IS_ENABLED(CONFIG_INTEL_IDXD_MDEV)) {
+ rc = idxd_mdev_host_init(idxd);
+ if (rc < 0)
+ dev_warn(dev, "VFIO mdev not setup: %d\n", rc);
+ else
+ set_bit(IDXD_FLAG_MDEV_ENABLED, &idxd->flags);
+ }
+
rc = idxd_setup_sysfs(idxd);
if (rc) {
dev_err(dev, "IDXD sysfs setup failed\n");
@@ -581,6 +590,8 @@ static void idxd_remove(struct pci_dev *pdev)
dev_dbg(&pdev->dev, "%s called\n", __func__);
idxd_cleanup_sysfs(idxd);
idxd_shutdown(pdev);
+ if (IS_ENABLED(CONFIG_INTEL_IDXD_MDEV) && device_mdev_enabled(idxd))
+ idxd_mdev_host_release(idxd);
if (device_pasid_enabled(idxd))
idxd_disable_system_pasid(idxd);
mutex_lock(&idxd_idr_lock);
diff --git a/drivers/dma/idxd/mdev.c b/drivers/dma/idxd/mdev.c
new file mode 100644
index 000000000000..f9cc2909b1cf
--- /dev/null
+++ b/drivers/dma/idxd/mdev.c
@@ -0,0 +1,980 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/device.h>
+#include <linux/sched/task.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/mm.h>
+#include <linux/mmu_context.h>
+#include <linux/vfio.h>
+#include <linux/mdev.h>
+#include <linux/msi.h>
+#include <linux/intel-iommu.h>
+#include <linux/intel-svm.h>
+#include <linux/kvm_host.h>
+#include <linux/eventfd.h>
+#include <linux/circ_buf.h>
+#include <uapi/linux/idxd.h>
+#include "registers.h"
+#include "idxd.h"
+#include "../../vfio/pci/vfio_pci_private.h"
+#include "mdev.h"
+#include "vdev.h"
+#include "ims.h"
+
+static u64 idxd_pci_config[] = {
+ 0x001000000b258086ULL,
+ 0x0080000008800000ULL,
+ 0x000000000000000cULL,
+ 0x000000000000000cULL,
+ 0x0000000000000000ULL,
+ 0x2010808600000000ULL,
+ 0x0000004000000000ULL,
+ 0x000000ff00000000ULL,
+ 0x0000060000015011ULL, /* MSI-X capability, hardcoded 2 entries, Encoded as N-1 */
+ 0x0000070000000000ULL,
+ 0x0000000000920010ULL, /* PCIe capability */
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+};
+
+static inline void reset_vconfig(struct vdcm_idxd *vidxd)
+{
+ memset(vidxd->cfg, 0, VIDXD_MAX_CFG_SPACE_SZ);
+ memcpy(vidxd->cfg, idxd_pci_config, sizeof(idxd_pci_config));
+}
+
+static inline void reset_vmmio(struct vdcm_idxd *vidxd)
+{
+ memset(&vidxd->bar0, 0, VIDXD_MAX_MMIO_SPACE_SZ);
+}
+
+static void idxd_vdcm_init(struct vdcm_idxd *vidxd)
+{
+ struct idxd_wq *wq = vidxd->wq;
+
+ reset_vconfig(vidxd);
+ reset_vmmio(vidxd);
+
+ vidxd->bar_size[0] = VIDXD_BAR0_SIZE;
+ vidxd->bar_size[1] = VIDXD_BAR2_SIZE;
+
+ vidxd_mmio_init(vidxd);
+
+ if (wq_dedicated(wq) && wq->state == IDXD_WQ_ENABLED)
+ idxd_wq_disable(wq);
+}
+
+static void __idxd_vdcm_release(struct vdcm_idxd *vidxd)
+{
+ int rc;
+ struct device *dev = &vidxd->idxd->pdev->dev;
+
+ mutex_lock(&vidxd->dev_lock);
+ if (atomic_cmpxchg(&vidxd->vdev.released, 0, 1)) {
+ mutex_unlock(&vidxd->dev_lock);
+ return;
+ }
+
+ rc = vfio_unregister_notifier(mdev_dev(vidxd->vdev.mdev),
+ VFIO_GROUP_NOTIFY,
+ &vidxd->vdev.group_notifier);
+ if (rc < 0)
+ dev_warn(dev, "vfio_unregister_notifier group failed: %d\n", rc);
+
+ vidxd_free_ims_entries(vidxd);
+ /* Re-initialize the VIDXD to a pristine state for re-use */
+ idxd_vdcm_init(vidxd);
+ mutex_unlock(&vidxd->dev_lock);
+}
+
+static void idxd_vdcm_release(struct mdev_device *mdev)
+{
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ struct device *dev = mdev_dev(mdev);
+
+ dev_dbg(dev, "vdcm_idxd_release %d\n", vidxd->type->type);
+ __idxd_vdcm_release(vidxd);
+}
+
+static void idxd_vdcm_release_work(struct work_struct *work)
+{
+ struct vdcm_idxd *vidxd = container_of(work, struct vdcm_idxd,
+ vdev.release_work);
+
+ __idxd_vdcm_release(vidxd);
+}
+
+static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev_device *mdev,
+ struct vdcm_idxd_type *type)
+{
+ struct vdcm_idxd *vidxd;
+ struct idxd_wq *wq = NULL;
+ int i;
+
+ /* PLACEHOLDER, wq matching comes later */
+
+ if (!wq)
+ return ERR_PTR(-ENODEV);
+
+ vidxd = kzalloc(sizeof(*vidxd), GFP_KERNEL);
+ if (!vidxd)
+ return ERR_PTR(-ENOMEM);
+
+ mutex_init(&vidxd->dev_lock);
+ vidxd->idxd = idxd;
+ vidxd->vdev.mdev = mdev;
+ vidxd->wq = wq;
+ mdev_set_drvdata(mdev, vidxd);
+ vidxd->type = type;
+ vidxd->num_wqs = VIDXD_MAX_WQS;
+
+ for (i = 0; i < VIDXD_MAX_MSIX_VECS - 1; i++)
+ vidxd->ims_index[i] = -1;
+
+ INIT_WORK(&vidxd->vdev.release_work, idxd_vdcm_release_work);
+ idxd_vdcm_init(vidxd);
+ mutex_lock(&wq->wq_lock);
+ idxd_wq_get(wq);
+ mutex_unlock(&wq->wq_lock);
+
+ return vidxd;
+}
+
+static struct vdcm_idxd_type idxd_mdev_types[IDXD_MDEV_TYPES];
+
+static struct vdcm_idxd_type *idxd_vdcm_find_vidxd_type(struct device *dev,
+ const char *name)
+{
+ int i;
+ char dev_name[IDXD_MDEV_NAME_LEN];
+
+ for (i = 0; i < IDXD_MDEV_TYPES; i++) {
+ snprintf(dev_name, IDXD_MDEV_NAME_LEN, "idxd-%s",
+ idxd_mdev_types[i].name);
+
+ if (!strncmp(name, dev_name, IDXD_MDEV_NAME_LEN))
+ return &idxd_mdev_types[i];
+ }
+
+ return NULL;
+}
+
+static int idxd_vdcm_create(struct kobject *kobj, struct mdev_device *mdev)
+{
+ struct vdcm_idxd *vidxd;
+ struct vdcm_idxd_type *type;
+ struct device *dev, *parent;
+ struct idxd_device *idxd;
+ struct idxd_wq *wq;
+
+ parent = mdev_parent_dev(mdev);
+ idxd = dev_get_drvdata(parent);
+ dev = mdev_dev(mdev);
+
+ mdev_set_iommu_device(dev, parent);
+ type = idxd_vdcm_find_vidxd_type(dev, kobject_name(kobj));
+ if (!type) {
+ dev_err(dev, "failed to find type %s to create\n",
+ kobject_name(kobj));
+ return -EINVAL;
+ }
+
+ vidxd = vdcm_vidxd_create(idxd, mdev, type);
+ if (IS_ERR(vidxd)) {
+ dev_err(dev, "failed to create vidxd: %ld\n", PTR_ERR(vidxd));
+ return PTR_ERR(vidxd);
+ }
+
+ wq = vidxd->wq;
+ mutex_lock(&wq->wq_lock);
+ list_add(&vidxd->list, &wq->vdcm_list);
+ mutex_unlock(&wq->wq_lock);
+ dev_dbg(dev, "mdev creation success: %s\n", dev_name(mdev_dev(mdev)));
+
+ return 0;
+}
+
+static int idxd_vdcm_remove(struct mdev_device *mdev)
+{
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ struct idxd_device *idxd = vidxd->idxd;
+ struct device *dev = &idxd->pdev->dev;
+ struct idxd_wq *wq = vidxd->wq;
+
+ dev_dbg(dev, "%s: removing for wq %d\n", __func__, vidxd->wq->id);
+
+ mutex_lock(&wq->wq_lock);
+ list_del(&vidxd->list);
+ idxd_wq_put(wq);
+ mutex_unlock(&wq->wq_lock);
+
+ kfree(vidxd);
+ return 0;
+}
+
+static int idxd_vdcm_group_notifier(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ struct vdcm_idxd *vidxd = container_of(nb, struct vdcm_idxd,
+ vdev.group_notifier);
+
+ /* The only action we care about */
+ if (action == VFIO_GROUP_NOTIFY_SET_KVM)
+ if (!data)
+ schedule_work(&vidxd->vdev.release_work);
+
+ return NOTIFY_OK;
+}
+
+static int idxd_vdcm_open(struct mdev_device *mdev)
+{
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ unsigned long events;
+ int rc;
+ struct vdcm_idxd_type *type = vidxd->type;
+ struct device *dev = mdev_dev(mdev);
+
+ dev_dbg(dev, "%s: type: %d\n", __func__, type->type);
+
+ mutex_lock(&vidxd->dev_lock);
+ vidxd->vdev.group_notifier.notifier_call = idxd_vdcm_group_notifier;
+ events = VFIO_GROUP_NOTIFY_SET_KVM;
+ rc = vfio_register_notifier(mdev_dev(mdev), VFIO_GROUP_NOTIFY, &events,
+ &vidxd->vdev.group_notifier);
+ if (rc < 0) {
+ mutex_unlock(&vidxd->dev_lock);
+ dev_err(dev, "vfio_register_notifier for group failed: %d\n", rc);
+ return rc;
+ }
+
+ /* allocate and setup IMS entries */
+ rc = vidxd_setup_ims_entries(vidxd);
+ if (rc < 0)
+ goto undo_group;
+
+ atomic_set(&vidxd->vdev.released, 0);
+ mutex_unlock(&vidxd->dev_lock);
+
+ return rc;
+
+ undo_group:
+ mutex_unlock(&vidxd->dev_lock);
+ vfio_unregister_notifier(mdev_dev(mdev), VFIO_GROUP_NOTIFY, &vidxd->vdev.group_notifier);
+ return rc;
+}
+
+static ssize_t idxd_vdcm_rw(struct mdev_device *mdev, char *buf, size_t count, loff_t *ppos,
+ enum idxd_vdcm_rw mode)
+{
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ struct device *dev = mdev_dev(mdev);
+ int rc = -EINVAL;
+
+ if (index >= VFIO_PCI_NUM_REGIONS) {
+ dev_err(dev, "invalid index: %u\n", index);
+ return -EINVAL;
+ }
+
+ switch (index) {
+ case VFIO_PCI_CONFIG_REGION_INDEX:
+ if (mode == IDXD_VDCM_WRITE)
+ rc = vidxd_cfg_write(vidxd, pos, buf, count);
+ else
+ rc = vidxd_cfg_read(vidxd, pos, buf, count);
+ break;
+ case VFIO_PCI_BAR0_REGION_INDEX:
+ case VFIO_PCI_BAR1_REGION_INDEX:
+ if (mode == IDXD_VDCM_WRITE)
+ rc = vidxd_mmio_write(vidxd, vidxd->bar_val[0] + pos, buf, count);
+ else
+ rc = vidxd_mmio_read(vidxd, vidxd->bar_val[0] + pos, buf, count);
+ break;
+ case VFIO_PCI_BAR2_REGION_INDEX:
+ case VFIO_PCI_BAR3_REGION_INDEX:
+ case VFIO_PCI_BAR4_REGION_INDEX:
+ case VFIO_PCI_BAR5_REGION_INDEX:
+ case VFIO_PCI_VGA_REGION_INDEX:
+ case VFIO_PCI_ROM_REGION_INDEX:
+ default:
+ dev_err(dev, "unsupported region: %u\n", index);
+ }
+
+ return rc == 0 ? count : rc;
+}
+
+static ssize_t idxd_vdcm_read(struct mdev_device *mdev, char __user *buf, size_t count,
+ loff_t *ppos)
+{
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ unsigned int done = 0;
+ int rc;
+
+ mutex_lock(&vidxd->dev_lock);
+ while (count) {
+ size_t filled;
+
+ if (count >= 4 && !(*ppos % 4)) {
+ u32 val;
+
+ rc = idxd_vdcm_rw(mdev, (char *)&val, sizeof(val),
+ ppos, IDXD_VDCM_READ);
+ if (rc <= 0)
+ goto read_err;
+
+ if (copy_to_user(buf, &val, sizeof(val)))
+ goto read_err;
+
+ filled = 4;
+ } else if (count >= 2 && !(*ppos % 2)) {
+ u16 val;
+
+ rc = idxd_vdcm_rw(mdev, (char *)&val, sizeof(val),
+ ppos, IDXD_VDCM_READ);
+ if (rc <= 0)
+ goto read_err;
+
+ if (copy_to_user(buf, &val, sizeof(val)))
+ goto read_err;
+
+ filled = 2;
+ } else {
+ u8 val;
+
+ rc = idxd_vdcm_rw(mdev, &val, sizeof(val), ppos,
+ IDXD_VDCM_READ);
+ if (rc <= 0)
+ goto read_err;
+
+ if (copy_to_user(buf, &val, sizeof(val)))
+ goto read_err;
+
+ filled = 1;
+ }
+
+ count -= filled;
+ done += filled;
+ *ppos += filled;
+ buf += filled;
+ }
+
+ mutex_unlock(&vidxd->dev_lock);
+ return done;
+
+ read_err:
+ mutex_unlock(&vidxd->dev_lock);
+ return -EFAULT;
+}
+
+static ssize_t idxd_vdcm_write(struct mdev_device *mdev, const char __user *buf, size_t count,
+ loff_t *ppos)
+{
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ unsigned int done = 0;
+ int rc;
+
+ mutex_lock(&vidxd->dev_lock);
+ while (count) {
+ size_t filled;
+
+ if (count >= 4 && !(*ppos % 4)) {
+ u32 val;
+
+ if (copy_from_user(&val, buf, sizeof(val)))
+ goto write_err;
+
+ rc = idxd_vdcm_rw(mdev, (char *)&val, sizeof(val),
+ ppos, IDXD_VDCM_WRITE);
+ if (rc <= 0)
+ goto write_err;
+
+ filled = 4;
+ } else if (count >= 2 && !(*ppos % 2)) {
+ u16 val;
+
+ if (copy_from_user(&val, buf, sizeof(val)))
+ goto write_err;
+
+ rc = idxd_vdcm_rw(mdev, (char *)&val,
+ sizeof(val), ppos, IDXD_VDCM_WRITE);
+ if (rc <= 0)
+ goto write_err;
+
+ filled = 2;
+ } else {
+ u8 val;
+
+ if (copy_from_user(&val, buf, sizeof(val)))
+ goto write_err;
+
+ rc = idxd_vdcm_rw(mdev, &val, sizeof(val),
+ ppos, IDXD_VDCM_WRITE);
+ if (rc <= 0)
+ goto write_err;
+
+ filled = 1;
+ }
+
+ count -= filled;
+ done += filled;
+ *ppos += filled;
+ buf += filled;
+ }
+
+ mutex_unlock(&vidxd->dev_lock);
+ return done;
+
+write_err:
+ mutex_unlock(&vidxd->dev_lock);
+ return -EFAULT;
+}
+
+static int check_vma(struct idxd_wq *wq, struct vm_area_struct *vma)
+{
+ if (vma->vm_end < vma->vm_start)
+ return -EINVAL;
+ if (!(vma->vm_flags & VM_SHARED))
+ return -EINVAL;
+
+ return 0;
+}
+
+static int idxd_vdcm_mmap(struct mdev_device *mdev, struct vm_area_struct *vma)
+{
+ unsigned int wq_idx, rc;
+ unsigned long req_size, pgoff = 0, offset;
+ pgprot_t pg_prot;
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ struct idxd_wq *wq = vidxd->wq;
+ struct idxd_device *idxd = vidxd->idxd;
+ enum idxd_portal_prot virt_portal, phys_portal;
+ phys_addr_t base = pci_resource_start(idxd->pdev, IDXD_WQ_BAR);
+ struct device *dev = mdev_dev(mdev);
+
+ rc = check_vma(wq, vma);
+ if (rc)
+ return rc;
+
+ pg_prot = vma->vm_page_prot;
+ req_size = vma->vm_end - vma->vm_start;
+ vma->vm_flags |= VM_DONTCOPY;
+
+ offset = (vma->vm_pgoff << PAGE_SHIFT) &
+ ((1ULL << VFIO_PCI_OFFSET_SHIFT) - 1);
+
+ wq_idx = offset >> (PAGE_SHIFT + 2);
+ if (wq_idx >= 1) {
+ dev_err(dev, "mapping invalid wq %d off %lx\n",
+ wq_idx, offset);
+ return -EINVAL;
+ }
+
+ /*
+ * Check and see if the guest wants to map to the limited or unlimited portal.
+ * The driver will allow mapping to the unlimited portal only if the wq is a
+ * dedicated wq. Otherwise, it goes to limited.
+ */
+ virt_portal = ((offset >> PAGE_SHIFT) & 0x3) == 1;
+ phys_portal = IDXD_PORTAL_LIMITED;
+ if (virt_portal == IDXD_PORTAL_UNLIMITED && wq_dedicated(wq))
+ phys_portal = IDXD_PORTAL_UNLIMITED;
+
+ /* We always map IMS portals to the guest */
+ pgoff = (base + idxd_get_wq_portal_full_offset(wq->id, phys_portal,
+ IDXD_IRQ_IMS)) >> PAGE_SHIFT;
+
+ dev_dbg(dev, "mmap %lx %lx %lx %lx\n", vma->vm_start, pgoff, req_size,
+ pgprot_val(pg_prot));
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vma->vm_private_data = mdev;
+ vma->vm_pgoff = pgoff;
+
+ return remap_pfn_range(vma, vma->vm_start, pgoff, req_size, pg_prot);
+}
+
+static int idxd_vdcm_get_irq_count(struct vdcm_idxd *vidxd, int type)
+{
+ /*
+ * Even though the number of MSIX vectors supported are not tied to number of
+ * wqs being exported, the current design is to allow 1 vector per WQ for guest.
+ * So here we end up with num of wqs plus 1 that handles the misc interrupts.
+ */
+ if (type == VFIO_PCI_MSI_IRQ_INDEX || type == VFIO_PCI_MSIX_IRQ_INDEX)
+ return VIDXD_MAX_MSIX_VECS;
+
+ return 0;
+}
+
+static irqreturn_t idxd_guest_wq_completion(int irq, void *data)
+{
+ struct ims_irq_entry *irq_entry = data;
+ struct vdcm_idxd *vidxd = irq_entry->vidxd;
+ int msix_idx = irq_entry->int_src;
+
+ vidxd_send_interrupt(vidxd, msix_idx + 1);
+ return IRQ_HANDLED;
+}
+
+static int msix_trigger_unregister(struct vdcm_idxd *vidxd, int index)
+{
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ struct ims_irq_entry *irq_entry;
+ int rc;
+
+ if (!vidxd->vdev.msix_trigger[index])
+ return 0;
+
+ dev_dbg(dev, "disable MSIX trigger %d\n", index);
+ if (index) {
+ irq_entry = &vidxd->irq_entries[index - 1];
+ if (irq_entry->irq_set) {
+ free_irq(irq_entry->irq, irq_entry);
+ irq_entry->irq_set = false;
+ }
+ rc = vidxd_disable_host_ims_pasid(vidxd, index - 1);
+ if (rc)
+ return rc;
+ }
+ eventfd_ctx_put(vidxd->vdev.msix_trigger[index]);
+ vidxd->vdev.msix_trigger[index] = NULL;
+
+ return 0;
+}
+
+static int msix_trigger_register(struct vdcm_idxd *vidxd, u32 fd, int index)
+{
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ struct ims_irq_entry *irq_entry;
+ struct eventfd_ctx *trigger;
+ int rc;
+
+ rc = msix_trigger_unregister(vidxd, index);
+ if (rc < 0)
+ return rc;
+
+ dev_dbg(dev, "enable MSIX trigger %d\n", index);
+ trigger = eventfd_ctx_fdget(fd);
+ if (IS_ERR(trigger)) {
+ dev_warn(dev, "eventfd_ctx_fdget failed %d\n", index);
+ return PTR_ERR(trigger);
+ }
+
+ /*
+ * The MSIX vector 0 is emulated by the mdev. Starting with vector 1
+ * the interrupt is backed by IMS and needs to be set up, but we
+ * will be setting up entry 0 of the IMS vectors. So here we pass
+ * in i - 1 to the host setup and irq_entries.
+ */
+ if (index) {
+ irq_entry = &vidxd->irq_entries[index - 1];
+ rc = vidxd_enable_host_ims_pasid(vidxd, index - 1);
+ if (rc) {
+ dev_warn(dev, "failed to enable host ims pasid\n");
+ eventfd_ctx_put(trigger);
+ return rc;
+ }
+
+ rc = request_irq(irq_entry->irq, idxd_guest_wq_completion, 0, "idxd-ims", irq_entry);
+ if (rc) {
+ dev_warn(dev, "failed to request ims irq\n");
+ eventfd_ctx_put(trigger);
+ vidxd_disable_host_ims_pasid(vidxd, index - 1);
+ return rc;
+ }
+ irq_entry->irq_set = true;
+ }
+
+ vidxd->vdev.msix_trigger[index] = trigger;
+ return 0;
+}
+
+static int vdcm_idxd_set_msix_trigger(struct vdcm_idxd *vidxd,
+ unsigned int index, unsigned int start,
+ unsigned int count, uint32_t flags,
+ void *data)
+{
+ int i, rc = 0;
+
+ if (count > VIDXD_MAX_MSIX_ENTRIES - 1)
+ count = VIDXD_MAX_MSIX_ENTRIES - 1;
+
+ /*
+ * The MSIX vector 0 is emulated by the mdev. Starting with vector 1
+ * the interrupt is backed by IMS and needs to be set up, but we
+ * will be setting up entry 0 of the IMS vectors. So here we pass
+ * in i - 1 to the host setup and irq_entries.
+ */
+ if (count == 0 && (flags & VFIO_IRQ_SET_DATA_NONE)) {
+ /* Disable all MSIX entries */
+ for (i = 0; i < VIDXD_MAX_MSIX_ENTRIES; i++) {
+ rc = msix_trigger_unregister(vidxd, i);
+ if (rc < 0)
+ return rc;
+ }
+ return 0;
+ }
+
+ for (i = 0; i < count; i++) {
+ if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+ u32 fd = *(u32 *)(data + i * sizeof(u32));
+
+ rc = msix_trigger_register(vidxd, fd, i);
+ if (rc < 0)
+ return rc;
+ } else if (flags & VFIO_IRQ_SET_DATA_NONE) {
+ rc = msix_trigger_unregister(vidxd, i);
+ if (rc < 0)
+ return rc;
+ }
+ }
+ return rc;
+}
+
+static int idxd_vdcm_set_irqs(struct vdcm_idxd *vidxd, uint32_t flags,
+ unsigned int index, unsigned int start,
+ unsigned int count, void *data)
+{
+ int (*func)(struct vdcm_idxd *vidxd, unsigned int index,
+ unsigned int start, unsigned int count, uint32_t flags,
+ void *data) = NULL;
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+
+ switch (index) {
+ case VFIO_PCI_INTX_IRQ_INDEX:
+ dev_warn(dev, "intx interrupts not supported.\n");
+ break;
+ case VFIO_PCI_MSI_IRQ_INDEX:
+ dev_dbg(dev, "msi interrupt.\n");
+ switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+ case VFIO_IRQ_SET_ACTION_MASK:
+ case VFIO_IRQ_SET_ACTION_UNMASK:
+ break;
+ case VFIO_IRQ_SET_ACTION_TRIGGER:
+ func = vdcm_idxd_set_msix_trigger;
+ break;
+ }
+ break;
+ case VFIO_PCI_MSIX_IRQ_INDEX:
+ switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+ case VFIO_IRQ_SET_ACTION_MASK:
+ case VFIO_IRQ_SET_ACTION_UNMASK:
+ break;
+ case VFIO_IRQ_SET_ACTION_TRIGGER:
+ func = vdcm_idxd_set_msix_trigger;
+ break;
+ }
+ break;
+ default:
+ return -ENOTTY;
+ }
+
+ if (!func)
+ return -ENOTTY;
+
+ return func(vidxd, index, start, count, flags, data);
+}
+
+static void vidxd_vdcm_reset(struct vdcm_idxd *vidxd)
+{
+ vidxd_reset(vidxd);
+}
+
+static long idxd_vdcm_ioctl(struct mdev_device *mdev, unsigned int cmd,
+ unsigned long arg)
+{
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ unsigned long minsz;
+ int rc = -EINVAL;
+ struct device *dev = mdev_dev(mdev);
+
+ dev_dbg(dev, "vidxd %p ioctl, cmd: %d\n", vidxd, cmd);
+
+ mutex_lock(&vidxd->dev_lock);
+ if (cmd == VFIO_DEVICE_GET_INFO) {
+ struct vfio_device_info info;
+
+ minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ if (info.argsz < minsz) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ info.flags = VFIO_DEVICE_FLAGS_PCI;
+ info.flags |= VFIO_DEVICE_FLAGS_RESET;
+ info.num_regions = VFIO_PCI_NUM_REGIONS;
+ info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+ if (copy_to_user((void __user *)arg, &info, minsz))
+ rc = -EFAULT;
+ else
+ rc = 0;
+ goto out;
+ } else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+ struct vfio_region_info info;
+ struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+ struct vfio_region_info_cap_sparse_mmap *sparse = NULL;
+ size_t size;
+ int nr_areas = 1;
+ int cap_type_id = 0;
+
+ minsz = offsetofend(struct vfio_region_info, offset);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ if (info.argsz < minsz) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ switch (info.index) {
+ case VFIO_PCI_CONFIG_REGION_INDEX:
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.size = VIDXD_MAX_CFG_SPACE_SZ;
+ info.flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE;
+ break;
+ case VFIO_PCI_BAR0_REGION_INDEX:
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.size = vidxd->bar_size[info.index];
+ if (!info.size) {
+ info.flags = 0;
+ break;
+ }
+
+ info.flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE;
+ break;
+ case VFIO_PCI_BAR1_REGION_INDEX:
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.size = 0;
+ info.flags = 0;
+ break;
+ case VFIO_PCI_BAR2_REGION_INDEX:
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.flags = VFIO_REGION_INFO_FLAG_CAPS | VFIO_REGION_INFO_FLAG_MMAP |
+ VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE;
+ info.size = vidxd->bar_size[1];
+
+ /*
+ * Every WQ has two areas for unlimited and limited
+ * MSI-X portals. IMS portals are not reported
+ */
+ nr_areas = 2;
+
+ size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
+ sparse = kzalloc(size, GFP_KERNEL);
+ if (!sparse) {
+ rc = -ENOMEM;
+ goto out;
+ }
+
+ sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+ sparse->header.version = 1;
+ sparse->nr_areas = nr_areas;
+ cap_type_id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+
+ sparse->areas[0].offset = 0;
+ sparse->areas[0].size = PAGE_SIZE;
+
+ sparse->areas[1].offset = PAGE_SIZE;
+ sparse->areas[1].size = PAGE_SIZE;
+ break;
+
+ case VFIO_PCI_BAR3_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.size = 0;
+ info.flags = 0;
+ dev_dbg(dev, "get region info bar:%d\n", info.index);
+ break;
+
+ case VFIO_PCI_ROM_REGION_INDEX:
+ case VFIO_PCI_VGA_REGION_INDEX:
+ dev_dbg(dev, "get region info index:%d\n", info.index);
+ break;
+ default: {
+ if (info.index >= VFIO_PCI_NUM_REGIONS)
+ rc = -EINVAL;
+ else
+ rc = 0;
+ goto out;
+ } /* default */
+ } /* info.index switch */
+
+ if ((info.flags & VFIO_REGION_INFO_FLAG_CAPS) && sparse) {
+ if (cap_type_id == VFIO_REGION_INFO_CAP_SPARSE_MMAP) {
+ rc = vfio_info_add_capability(&caps, &sparse->header,
+ sizeof(*sparse) + (sparse->nr_areas *
+ sizeof(*sparse->areas)));
+ kfree(sparse);
+ if (rc)
+ goto out;
+ }
+ }
+
+ if (caps.size) {
+ if (info.argsz < sizeof(info) + caps.size) {
+ info.argsz = sizeof(info) + caps.size;
+ info.cap_offset = 0;
+ } else {
+ vfio_info_cap_shift(&caps, sizeof(info));
+ if (copy_to_user((void __user *)arg + sizeof(info),
+ caps.buf, caps.size)) {
+ kfree(caps.buf);
+ rc = -EFAULT;
+ goto out;
+ }
+ info.cap_offset = sizeof(info);
+ }
+
+ kfree(caps.buf);
+ }
+ if (copy_to_user((void __user *)arg, &info, minsz))
+ rc = -EFAULT;
+ else
+ rc = 0;
+ goto out;
+ } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
+ struct vfio_irq_info info;
+
+ minsz = offsetofend(struct vfio_irq_info, count);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ switch (info.index) {
+ case VFIO_PCI_MSI_IRQ_INDEX:
+ case VFIO_PCI_MSIX_IRQ_INDEX:
+ break;
+ default:
+ rc = -EINVAL;
+ goto out;
+ } /* switch(info.index) */
+
+ info.flags = VFIO_IRQ_INFO_EVENTFD | VFIO_IRQ_INFO_NORESIZE;
+ info.count = idxd_vdcm_get_irq_count(vidxd, info.index);
+
+ if (copy_to_user((void __user *)arg, &info, minsz))
+ rc = -EFAULT;
+ else
+ rc = 0;
+ goto out;
+ } else if (cmd == VFIO_DEVICE_SET_IRQS) {
+ struct vfio_irq_set hdr;
+ u8 *data = NULL;
+ size_t data_size = 0;
+
+ minsz = offsetofend(struct vfio_irq_set, count);
+
+ if (copy_from_user(&hdr, (void __user *)arg, minsz)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+ int max = idxd_vdcm_get_irq_count(vidxd, hdr.index);
+
+ rc = vfio_set_irqs_validate_and_prepare(&hdr, max, VFIO_PCI_NUM_IRQS,
+ &data_size);
+ if (rc) {
+ dev_err(dev, "intel:vfio_set_irqs_validate_and_prepare failed\n");
+ rc = -EINVAL;
+ goto out;
+ }
+ if (data_size) {
+ data = memdup_user((void __user *)(arg + minsz), data_size);
+ if (IS_ERR(data)) {
+ rc = PTR_ERR(data);
+ goto out;
+ }
+ }
+ }
+
+ if (!data) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ rc = idxd_vdcm_set_irqs(vidxd, hdr.flags, hdr.index, hdr.start, hdr.count, data);
+ kfree(data);
+ goto out;
+ } else if (cmd == VFIO_DEVICE_RESET) {
+ vidxd_vdcm_reset(vidxd);
+ }
+
+ out:
+ mutex_unlock(&vidxd->dev_lock);
+ return rc;
+}
+
+static const struct mdev_parent_ops idxd_vdcm_ops = {
+ .create = idxd_vdcm_create,
+ .remove = idxd_vdcm_remove,
+ .open = idxd_vdcm_open,
+ .release = idxd_vdcm_release,
+ .read = idxd_vdcm_read,
+ .write = idxd_vdcm_write,
+ .mmap = idxd_vdcm_mmap,
+ .ioctl = idxd_vdcm_ioctl,
+};
+
+int idxd_mdev_host_init(struct idxd_device *idxd)
+{
+ struct device *dev = &idxd->pdev->dev;
+ int rc;
+
+ if (!test_bit(IDXD_FLAG_SIOV_SUPPORTED, &idxd->flags))
+ return -EOPNOTSUPP;
+
+ if (iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_AUX)) {
+ rc = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_AUX);
+ if (rc < 0) {
+ dev_warn(dev, "Failed to enable aux-domain: %d\n", rc);
+ return rc;
+ }
+ } else {
+ dev_warn(dev, "No aux-domain feature.\n");
+ return -EOPNOTSUPP;
+ }
+
+ return mdev_register_device(dev, &idxd_vdcm_ops);
+}
+
+void idxd_mdev_host_release(struct idxd_device *idxd)
+{
+ struct device *dev = &idxd->pdev->dev;
+ int rc;
+
+ mdev_unregister_device(dev);
+ if (iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_AUX)) {
+ rc = iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
+ if (rc < 0)
+ dev_warn(dev, "Failed to disable aux-domain: %d\n",
+ rc);
+ }
+}
diff --git a/drivers/dma/idxd/mdev.h b/drivers/dma/idxd/mdev.h
new file mode 100644
index 000000000000..328055435cea
--- /dev/null
+++ b/drivers/dma/idxd/mdev.h
@@ -0,0 +1,118 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+
+#ifndef _IDXD_MDEV_H_
+#define _IDXD_MDEV_H_
+
+/* two 64-bit BARs implemented */
+#define VIDXD_MAX_BARS 2
+#define VIDXD_MAX_CFG_SPACE_SZ 4096
+#define VIDXD_MAX_MMIO_SPACE_SZ 8192
+#define VIDXD_MSIX_TBL_SZ_OFFSET 0x42
+#define VIDXD_CAP_CTRL_SZ 0x100
+#define VIDXD_GRP_CTRL_SZ 0x100
+#define VIDXD_WQ_CTRL_SZ 0x100
+#define VIDXD_WQ_OCPY_INT_SZ 0x20
+#define VIDXD_MSIX_TBL_SZ 0x90
+#define VIDXD_MSIX_PERM_TBL_SZ 0x48
+
+#define VIDXD_MSIX_TABLE_OFFSET 0x600
+#define VIDXD_MSIX_PERM_OFFSET 0x300
+#define VIDXD_GRPCFG_OFFSET 0x400
+#define VIDXD_WQCFG_OFFSET 0x500
+#define VIDXD_IMS_OFFSET 0x1000
+
+#define VIDXD_BAR0_SIZE 0x2000
+#define VIDXD_BAR2_SIZE 0x20000
+#define VIDXD_MAX_MSIX_ENTRIES (VIDXD_MSIX_TBL_SZ / 0x10)
+#define VIDXD_MAX_WQS 1
+#define VIDXD_MAX_MSIX_VECS 2
+
+#define VIDXD_ATS_OFFSET 0x100
+#define VIDXD_PRS_OFFSET 0x110
+#define VIDXD_PASID_OFFSET 0x120
+#define VIDXD_MSIX_PBA_OFFSET 0x700
+
+struct ims_irq_entry {
+ struct vdcm_idxd *vidxd;
+ int int_src;
+ unsigned int irq;
+ bool irq_set;
+};
+
+struct idxd_vdev {
+ struct mdev_device *mdev;
+ struct eventfd_ctx *msix_trigger[VIDXD_MAX_MSIX_ENTRIES];
+ struct notifier_block group_notifier;
+ struct work_struct release_work;
+ atomic_t released;
+};
+
+struct vdcm_idxd {
+ struct idxd_device *idxd;
+ struct idxd_wq *wq;
+ struct idxd_vdev vdev;
+ struct vdcm_idxd_type *type;
+ int num_wqs;
+ u64 ims_index[VIDXD_MAX_MSIX_VECS - 1];
+ struct msix_entry ims_entry;
+ struct ims_irq_entry irq_entries[VIDXD_MAX_WQS];
+
+ /* For VM use case */
+ u64 bar_val[VIDXD_MAX_BARS];
+ u64 bar_size[VIDXD_MAX_BARS];
+ u8 cfg[VIDXD_MAX_CFG_SPACE_SZ];
+ u8 bar0[VIDXD_MAX_MMIO_SPACE_SZ];
+ struct list_head list;
+ struct mutex dev_lock; /* lock for vidxd resources */
+};
+
+static inline struct vdcm_idxd *to_vidxd(struct idxd_vdev *vdev)
+{
+ return container_of(vdev, struct vdcm_idxd, vdev);
+}
+
+#define IDXD_MDEV_NAME_LEN 16
+#define IDXD_MDEV_DESCRIPTION_LEN 64
+
+enum idxd_mdev_type {
+ IDXD_MDEV_TYPE_1_DWQ = 0,
+};
+
+#define IDXD_MDEV_TYPES 1
+
+struct vdcm_idxd_type {
+ char name[IDXD_MDEV_NAME_LEN];
+ char description[IDXD_MDEV_DESCRIPTION_LEN];
+ enum idxd_mdev_type type;
+ unsigned int avail_instance;
+};
+
+enum idxd_vdcm_rw {
+ IDXD_VDCM_READ = 0,
+ IDXD_VDCM_WRITE,
+};
+
+static inline u64 get_reg_val(void *buf, int size)
+{
+ u64 val = 0;
+
+ switch (size) {
+ case 8:
+ val = *(uint64_t *)buf;
+ break;
+ case 4:
+ val = *(uint32_t *)buf;
+ break;
+ case 2:
+ val = *(uint16_t *)buf;
+ break;
+ case 1:
+ val = *(uint8_t *)buf;
+ break;
+ }
+
+ return val;
+}
+
+#endif
diff --git a/drivers/dma/idxd/vdev.c b/drivers/dma/idxd/vdev.c
new file mode 100644
index 000000000000..af421852cc51
--- /dev/null
+++ b/drivers/dma/idxd/vdev.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/device.h>
+#include <linux/sched/task.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/mm.h>
+#include <linux/mmu_context.h>
+#include <linux/vfio.h>
+#include <linux/mdev.h>
+#include <linux/msi.h>
+#include <linux/intel-iommu.h>
+#include <linux/intel-svm.h>
+#include <linux/kvm_host.h>
+#include <linux/eventfd.h>
+#include <uapi/linux/idxd.h>
+#include "registers.h"
+#include "idxd.h"
+#include "../../vfio/pci/vfio_pci_private.h"
+#include "mdev.h"
+#include "vdev.h"
+
+int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx)
+{
+ /* PLACEHOLDER */
+ return 0;
+}
+
+int vidxd_disable_host_ims_pasid(struct vdcm_idxd *vidxd, int ims_idx)
+{
+ /* PLACEHOLDER */
+ return 0;
+}
+
+int vidxd_enable_host_ims_pasid(struct vdcm_idxd *vidxd, int ims_idx)
+{
+ /* PLACEHOLDER */
+ return 0;
+}
+
+int vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
+{
+ /* PLACEHOLDER */
+ return 0;
+}
+
+int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
+{
+ /* PLACEHOLDER */
+ return 0;
+}
+
+int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count)
+{
+ /* PLACEHOLDER */
+ return 0;
+}
+
+int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int size)
+{
+ /* PLACEHOLDER */
+ return 0;
+}
+
+void vidxd_mmio_init(struct vdcm_idxd *vidxd)
+{
+ /* PLACEHOLDER */
+}
+
+void vidxd_reset(struct vdcm_idxd *vidxd)
+{
+ /* PLACEHOLDER */
+}
diff --git a/drivers/dma/idxd/vdev.h b/drivers/dma/idxd/vdev.h
new file mode 100644
index 000000000000..1a2fdda271e8
--- /dev/null
+++ b/drivers/dma/idxd/vdev.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+
+#ifndef _IDXD_VDEV_H_
+#define _IDXD_VDEV_H_
+
+#include "mdev.h"
+
+int vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size);
+int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size);
+int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count);
+int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int size);
+void vidxd_mmio_init(struct vdcm_idxd *vidxd);
+void vidxd_reset(struct vdcm_idxd *vidxd);
+int vidxd_disable_host_ims_pasid(struct vdcm_idxd *vidxd, int ims_idx);
+int vidxd_enable_host_ims_pasid(struct vdcm_idxd *vidxd, int ims_idx);
+int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx);
+
+#endif
Add device support helper functions in preparation for adding VFIO
mdev support.
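For illustration, a sketch (not part of this patch) of how the vdcm is
expected to use the new WQCFG helpers when programming a guest PASID
into a dedicated wq; the wrapper function is hypothetical, but the
locking mirrors the lockdep assertion in the helpers:

  static void example_program_guest_pasid(struct idxd_wq *wq, int pasid)
  {
          struct idxd_device *idxd = wq->idxd;
          unsigned long flags;

          spin_lock_irqsave(&idxd->dev_lock, flags);
          idxd_wq_setup_pasid(wq, pasid); /* patch guest PASID into WQCFG */
          idxd_wq_setup_priv(wq, 1);      /* run the wq with the priv bit set */
          spin_unlock_irqrestore(&idxd->dev_lock, flags);
  }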
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
---
drivers/dma/idxd/device.c | 61 +++++++++++++++++++++++++++++++++++++++++++++
drivers/dma/idxd/idxd.h | 4 +++
2 files changed, 65 insertions(+)
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 2b4e8ab99ebd..7443ffb5d3c3 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -287,6 +287,30 @@ void idxd_wq_unmap_portal(struct idxd_wq *wq)
devm_iounmap(dev, wq->portal);
}
+int idxd_wq_abort(struct idxd_wq *wq)
+{
+ struct idxd_device *idxd = wq->idxd;
+ struct device *dev = &idxd->pdev->dev;
+ u32 operand, status;
+
+ dev_dbg(dev, "Abort WQ %d\n", wq->id);
+ if (wq->state != IDXD_WQ_ENABLED) {
+ dev_dbg(dev, "WQ %d not active\n", wq->id);
+ return -ENXIO;
+ }
+
+ operand = BIT(wq->id % 16) | ((wq->id / 16) << 16);
+ dev_dbg(dev, "cmd: %u operand: %#x\n", IDXD_CMD_ABORT_WQ, operand);
+ idxd_cmd_exec(idxd, IDXD_CMD_ABORT_WQ, operand, &status);
+ if (status != IDXD_CMDSTS_SUCCESS) {
+ dev_dbg(dev, "WQ abort failed: %#x\n", status);
+ return -ENXIO;
+ }
+
+ dev_dbg(dev, "WQ %d aborted\n", wq->id);
+ return 0;
+}
+
int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid)
{
struct idxd_device *idxd = wq->idxd;
@@ -366,6 +390,32 @@ void idxd_wq_disable_cleanup(struct idxd_wq *wq)
}
}
+void idxd_wq_setup_pasid(struct idxd_wq *wq, int pasid)
+{
+ struct idxd_device *idxd = wq->idxd;
+ int offset;
+
+ lockdep_assert_held(&idxd->dev_lock);
+
+ /* PASID fields are 8 bytes into the WQCFG register */
+ offset = idxd->wqcfg_offset + wq->id * 32 + 8;
+ wq->wqcfg.pasid = pasid;
+ iowrite32(wq->wqcfg.bits[2], idxd->reg_base + offset);
+}
+
+void idxd_wq_setup_priv(struct idxd_wq *wq, int priv)
+{
+ struct idxd_device *idxd = wq->idxd;
+ int offset;
+
+ lockdep_assert_held(&idxd->dev_lock);
+
+ /* priv field is 8 bytes into the WQCFG register */
+ offset = idxd->wqcfg_offset + wq->id * 32 + 8;
+ wq->wqcfg.priv = !!priv;
+ iowrite32(wq->wqcfg.bits[2], idxd->reg_base + offset);
+}
+
/* Device control bits */
static inline bool idxd_is_enabled(struct idxd_device *idxd)
{
@@ -502,6 +552,17 @@ void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid)
dev_dbg(dev, "pasid %d drained\n", pasid);
}
+void idxd_device_abort_pasid(struct idxd_device *idxd, int pasid)
+{
+ struct device *dev = &idxd->pdev->dev;
+ u32 operand;
+
+ operand = pasid;
+ dev_dbg(dev, "cmd: %u operand: %#x\n", IDXD_CMD_ABORT_PASID, operand);
+ idxd_cmd_exec(idxd, IDXD_CMD_ABORT_PASID, operand, NULL);
+ dev_dbg(dev, "pasid %d aborted\n", pasid);
+}
+
#define INT_HANDLE_IMS_TABLE 0x10000
int idxd_device_request_int_handle(struct idxd_device *idxd, int idx,
int *handle, enum idxd_interrupt_type irq_type)
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index c65fb6dcb7e0..438d6478a3f8 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -321,6 +321,7 @@ void idxd_device_cleanup(struct idxd_device *idxd);
int idxd_device_config(struct idxd_device *idxd);
void idxd_device_wqs_clear_state(struct idxd_device *idxd);
void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid);
+void idxd_device_abort_pasid(struct idxd_device *idxd, int pasid);
int idxd_device_load_config(struct idxd_device *idxd);
int idxd_device_request_int_handle(struct idxd_device *idxd, int idx, int *handle,
enum idxd_interrupt_type irq_type);
@@ -336,6 +337,9 @@ void idxd_wq_unmap_portal(struct idxd_wq *wq);
void idxd_wq_disable_cleanup(struct idxd_wq *wq);
int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid);
int idxd_wq_disable_pasid(struct idxd_wq *wq);
+int idxd_wq_abort(struct idxd_wq *wq);
+void idxd_wq_setup_pasid(struct idxd_wq *wq, int pasid);
+void idxd_wq_setup_priv(struct idxd_wq *wq, int priv);
/* submission */
int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc);
Update some of the device commands so they can be used by the virtual
device commands emulated by the vdcm. Expose the commands' raw status so
the virtual command emulation can relay it to the guest.
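As a sketch (not part of the diff) of the two calling conventions: the
vdcm passes a status pointer so it can act on the raw CMDSTS value,
while existing in-kernel callers pass NULL and keep relying on the errno
return:

  u32 status;

  /* vdcm path: capture the raw CMDSTS value for the emulated command */
  idxd_wq_disable(wq, &status);
  if (status != IDXD_CMDSTS_SUCCESS)
          dev_warn(dev, "wq disable failed: %#x\n", status);

  /* existing in-kernel path: raw status not needed */
  idxd_wq_disable(wq, NULL);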
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
---
drivers/dma/idxd/cdev.c | 2 +
drivers/dma/idxd/device.c | 69 +++++++++++++++++++++++++++++----------------
drivers/dma/idxd/idxd.h | 8 +++--
drivers/dma/idxd/irq.c | 2 +
drivers/dma/idxd/mdev.c | 2 +
drivers/dma/idxd/sysfs.c | 8 +++--
6 files changed, 56 insertions(+), 35 deletions(-)
diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c
index ffeae646f947..90266c0bdef3 100644
--- a/drivers/dma/idxd/cdev.c
+++ b/drivers/dma/idxd/cdev.c
@@ -158,7 +158,7 @@ static int idxd_cdev_release(struct inode *node, struct file *filep)
if (rc < 0)
dev_err(dev, "wq disable pasid failed.\n");
}
- idxd_wq_drain(wq);
+ idxd_wq_drain(wq, NULL);
}
if (ctx->sva)
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 7443ffb5d3c3..6e9d27f59638 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -197,22 +197,25 @@ void idxd_wq_free_resources(struct idxd_wq *wq)
sbitmap_queue_free(&wq->sbq);
}
-int idxd_wq_enable(struct idxd_wq *wq)
+int idxd_wq_enable(struct idxd_wq *wq, u32 *status)
{
struct idxd_device *idxd = wq->idxd;
struct device *dev = &idxd->pdev->dev;
- u32 status;
+ u32 stat;
if (wq->state == IDXD_WQ_ENABLED) {
dev_dbg(dev, "WQ %d already enabled\n", wq->id);
return -ENXIO;
}
- idxd_cmd_exec(idxd, IDXD_CMD_ENABLE_WQ, wq->id, &status);
+ idxd_cmd_exec(idxd, IDXD_CMD_ENABLE_WQ, wq->id, &stat);
- if (status != IDXD_CMDSTS_SUCCESS &&
- status != IDXD_CMDSTS_ERR_WQ_ENABLED) {
- dev_dbg(dev, "WQ enable failed: %#x\n", status);
+ if (status)
+ *status = stat;
+
+ if (stat != IDXD_CMDSTS_SUCCESS &&
+ stat != IDXD_CMDSTS_ERR_WQ_ENABLED) {
+ dev_dbg(dev, "WQ enable failed: %#x\n", stat);
return -ENXIO;
}
@@ -221,11 +224,11 @@ int idxd_wq_enable(struct idxd_wq *wq)
return 0;
}
-int idxd_wq_disable(struct idxd_wq *wq)
+int idxd_wq_disable(struct idxd_wq *wq, u32 *status)
{
struct idxd_device *idxd = wq->idxd;
struct device *dev = &idxd->pdev->dev;
- u32 status, operand;
+ u32 stat, operand;
dev_dbg(dev, "Disabling WQ %d\n", wq->id);
@@ -235,10 +238,13 @@ int idxd_wq_disable(struct idxd_wq *wq)
}
operand = BIT(wq->id % 16) | ((wq->id / 16) << 16);
- idxd_cmd_exec(idxd, IDXD_CMD_DISABLE_WQ, operand, &status);
+ idxd_cmd_exec(idxd, IDXD_CMD_DISABLE_WQ, operand, &stat);
+
+ if (status)
+ *status = stat;
- if (status != IDXD_CMDSTS_SUCCESS) {
- dev_dbg(dev, "WQ disable failed: %#x\n", status);
+ if (stat != IDXD_CMDSTS_SUCCESS) {
+ dev_dbg(dev, "WQ disable failed: %#x\n", stat);
return -ENXIO;
}
@@ -247,20 +253,31 @@ int idxd_wq_disable(struct idxd_wq *wq)
return 0;
}
-void idxd_wq_drain(struct idxd_wq *wq)
+int idxd_wq_drain(struct idxd_wq *wq, u32 *status)
{
struct idxd_device *idxd = wq->idxd;
struct device *dev = &idxd->pdev->dev;
- u32 operand;
+ u32 operand, stat;
if (wq->state != IDXD_WQ_ENABLED) {
dev_dbg(dev, "WQ %d in wrong state: %d\n", wq->id, wq->state);
- return;
+ return 0;
}
dev_dbg(dev, "Draining WQ %d\n", wq->id);
operand = BIT(wq->id % 16) | ((wq->id / 16) << 16);
- idxd_cmd_exec(idxd, IDXD_CMD_DRAIN_WQ, operand, NULL);
+ idxd_cmd_exec(idxd, IDXD_CMD_DRAIN_WQ, operand, &stat);
+
+ if (status)
+ *status = stat;
+
+ if (stat != IDXD_CMDSTS_SUCCESS) {
+ dev_dbg(dev, "WQ drain failed: %#x\n", stat);
+ return -ENXIO;
+ }
+
+ dev_dbg(dev, "WQ %d drained\n", wq->id);
+ return 0;
}
int idxd_wq_map_portal(struct idxd_wq *wq)
@@ -287,11 +304,11 @@ void idxd_wq_unmap_portal(struct idxd_wq *wq)
devm_iounmap(dev, wq->portal);
}
-int idxd_wq_abort(struct idxd_wq *wq)
+int idxd_wq_abort(struct idxd_wq *wq, u32 *status)
{
struct idxd_device *idxd = wq->idxd;
struct device *dev = &idxd->pdev->dev;
- u32 operand, status;
+ u32 operand, stat;
dev_dbg(dev, "Abort WQ %d\n", wq->id);
if (wq->state != IDXD_WQ_ENABLED) {
@@ -301,9 +318,13 @@ int idxd_wq_abort(struct idxd_wq *wq)
operand = BIT(wq->id % 16) | ((wq->id / 16) << 16);
dev_dbg(dev, "cmd: %u operand: %#x\n", IDXD_CMD_ABORT_WQ, operand);
- idxd_cmd_exec(idxd, IDXD_CMD_ABORT_WQ, operand, &status);
- if (status != IDXD_CMDSTS_SUCCESS) {
- dev_dbg(dev, "WQ abort failed: %#x\n", status);
+ idxd_cmd_exec(idxd, IDXD_CMD_ABORT_WQ, operand, &stat);
+
+ if (status)
+ *status = stat;
+
+ if (stat != IDXD_CMDSTS_SUCCESS) {
+ dev_dbg(dev, "WQ abort failed: %#x\n", stat);
return -ENXIO;
}
@@ -319,7 +340,7 @@ int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid)
unsigned int offset;
unsigned long flags;
- rc = idxd_wq_disable(wq);
+ rc = idxd_wq_disable(wq, NULL);
if (rc < 0)
return rc;
@@ -331,7 +352,7 @@ int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid)
iowrite32(wqcfg.bits[2], idxd->reg_base + offset);
spin_unlock_irqrestore(&idxd->dev_lock, flags);
- rc = idxd_wq_enable(wq);
+ rc = idxd_wq_enable(wq, NULL);
if (rc < 0)
return rc;
@@ -346,7 +367,7 @@ int idxd_wq_disable_pasid(struct idxd_wq *wq)
unsigned int offset;
unsigned long flags;
- rc = idxd_wq_disable(wq);
+ rc = idxd_wq_disable(wq, NULL);
if (rc < 0)
return rc;
@@ -358,7 +379,7 @@ int idxd_wq_disable_pasid(struct idxd_wq *wq)
iowrite32(wqcfg.bits[2], idxd->reg_base + offset);
spin_unlock_irqrestore(&idxd->dev_lock, flags);
- rc = idxd_wq_enable(wq);
+ rc = idxd_wq_enable(wq, NULL);
if (rc < 0)
return rc;
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 9588872cd273..1e4d9ec9b00d 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -336,15 +336,15 @@ int idxd_device_request_int_handle(struct idxd_device *idxd, int idx, int *handl
/* work queue control */
int idxd_wq_alloc_resources(struct idxd_wq *wq);
void idxd_wq_free_resources(struct idxd_wq *wq);
-int idxd_wq_enable(struct idxd_wq *wq);
-int idxd_wq_disable(struct idxd_wq *wq);
-void idxd_wq_drain(struct idxd_wq *wq);
+int idxd_wq_enable(struct idxd_wq *wq, u32 *status);
+int idxd_wq_disable(struct idxd_wq *wq, u32 *status);
+int idxd_wq_drain(struct idxd_wq *wq, u32 *status);
int idxd_wq_map_portal(struct idxd_wq *wq);
void idxd_wq_unmap_portal(struct idxd_wq *wq);
void idxd_wq_disable_cleanup(struct idxd_wq *wq);
int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid);
int idxd_wq_disable_pasid(struct idxd_wq *wq);
-int idxd_wq_abort(struct idxd_wq *wq);
+int idxd_wq_abort(struct idxd_wq *wq, u32 *status);
void idxd_wq_setup_pasid(struct idxd_wq *wq, int pasid);
void idxd_wq_setup_priv(struct idxd_wq *wq, int priv);
diff --git a/drivers/dma/idxd/irq.c b/drivers/dma/idxd/irq.c
index c97e08480323..c818acb34a14 100644
--- a/drivers/dma/idxd/irq.c
+++ b/drivers/dma/idxd/irq.c
@@ -60,7 +60,7 @@ static void idxd_device_reinit(struct work_struct *work)
struct idxd_wq *wq = &idxd->wqs[i];
if (wq->state == IDXD_WQ_ENABLED) {
- rc = idxd_wq_enable(wq);
+ rc = idxd_wq_enable(wq, NULL);
if (rc < 0) {
dev_warn(dev, "Unable to re-enable wq %s\n",
dev_name(&wq->conf_dev));
diff --git a/drivers/dma/idxd/mdev.c b/drivers/dma/idxd/mdev.c
index f9cc2909b1cf..01207ca42a79 100644
--- a/drivers/dma/idxd/mdev.c
+++ b/drivers/dma/idxd/mdev.c
@@ -70,7 +70,7 @@ static void idxd_vdcm_init(struct vdcm_idxd *vidxd)
vidxd_mmio_init(vidxd);
if (wq_dedicated(wq) && wq->state == IDXD_WQ_ENABLED)
- idxd_wq_disable(wq);
+ idxd_wq_disable(wq, NULL);
}
static void __idxd_vdcm_release(struct vdcm_idxd *vidxd)
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index 501a1d489ce3..cdf2b5aac314 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -218,7 +218,7 @@ static int idxd_config_bus_probe(struct device *dev)
return rc;
}
- rc = idxd_wq_enable(wq);
+ rc = idxd_wq_enable(wq, NULL);
if (rc < 0) {
mutex_unlock(&wq->wq_lock);
dev_warn(dev, "WQ %d enabling failed: %d\n",
@@ -229,7 +229,7 @@ static int idxd_config_bus_probe(struct device *dev)
rc = idxd_wq_map_portal(wq);
if (rc < 0) {
dev_warn(dev, "wq portal mapping failed: %d\n", rc);
- rc = idxd_wq_disable(wq);
+ rc = idxd_wq_disable(wq, NULL);
if (rc < 0)
dev_warn(dev, "IDXD wq disable failed\n");
mutex_unlock(&wq->wq_lock);
@@ -287,8 +287,8 @@ static void disable_wq(struct idxd_wq *wq)
idxd_wq_unmap_portal(wq);
- idxd_wq_drain(wq);
- rc = idxd_wq_disable(wq);
+ idxd_wq_drain(wq, NULL);
+ rc = idxd_wq_disable(wq, NULL);
idxd_wq_free_resources(wq);
wq->client_count = 0;
Add all the helper functions that support the emulation of the commands
submitted to the device command register.
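For context, a hypothetical sketch of how a guest write to the emulated
command register reaches vidxd_do_command(); in the actual patch this
dispatch happens inside vidxd_mmio_write(), and the example function
below exists only for illustration:

  static void example_bar0_cmd_write(struct vdcm_idxd *vidxd, u32 offset,
                                     void *buf, unsigned int size)
  {
          /* a 4-byte write at IDXD_CMD_OFFSET kicks off command emulation */
          if (offset == IDXD_CMD_OFFSET && size == 4)
                  vidxd_do_command(vidxd, get_reg_val(buf, size));
  }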
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
---
drivers/dma/idxd/registers.h | 16 +-
drivers/dma/idxd/vdev.c | 398 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 409 insertions(+), 5 deletions(-)
diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index f8e4dd10a738..6531a40fad2e 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -115,7 +115,8 @@ union gencfg_reg {
union genctrl_reg {
struct {
u32 softerr_int_en:1;
- u32 rsvd:31;
+ u32 halt_state_int_en:1;
+ u32 rsvd:30;
};
u32 bits;
} __packed;
@@ -137,6 +138,8 @@ enum idxd_device_status_state {
IDXD_DEVICE_STATE_HALT,
};
+#define IDXD_GENSTATS_MASK 0x03
+
enum idxd_device_reset_type {
IDXD_DEVICE_RESET_SOFTWARE = 0,
IDXD_DEVICE_RESET_FLR,
@@ -149,6 +152,7 @@ enum idxd_device_reset_type {
#define IDXD_INTC_CMD 0x02
#define IDXD_INTC_OCCUPY 0x04
#define IDXD_INTC_PERFMON_OVFL 0x08
+#define IDXD_INTC_HALT_STATE 0x10
#define IDXD_CMD_OFFSET 0xa0
union idxd_command_reg {
@@ -160,6 +164,7 @@ union idxd_command_reg {
};
u32 bits;
} __packed;
+#define IDXD_CMD_INT_MASK 0x80000000
enum idxd_cmd {
IDXD_CMD_ENABLE_DEVICE = 1,
@@ -217,7 +222,7 @@ enum idxd_cmdsts_err {
/* disable device errors */
IDXD_CMDSTS_ERR_DIS_DEV_EN = 0x31,
/* disable WQ, drain WQ, abort WQ, reset WQ */
- IDXD_CMDSTS_ERR_DEV_NOT_EN,
+ IDXD_CMDSTS_ERR_WQ_NOT_EN,
/* request interrupt handle */
IDXD_CMDSTS_ERR_INVAL_INT_IDX = 0x41,
IDXD_CMDSTS_ERR_NO_HANDLE,
@@ -353,4 +358,11 @@ union wqcfg {
#define WQCFG_OFFSET(idxd_dev, n, ofs) ((idxd_dev)->wqcfg_offset +\
(n) * sizeof(union wqcfg) +\
sizeof(u32) * (ofs))
+
+enum idxd_wq_hw_state {
+ IDXD_WQ_DEV_DISABLED = 0,
+ IDXD_WQ_DEV_ENABLED,
+ IDXD_WQ_DEV_BUSY,
+};
+
#endif
diff --git a/drivers/dma/idxd/vdev.c b/drivers/dma/idxd/vdev.c
index b4eace02199e..df99d0bce5e9 100644
--- a/drivers/dma/idxd/vdev.c
+++ b/drivers/dma/idxd/vdev.c
@@ -81,6 +81,18 @@ static void vidxd_report_error(struct vdcm_idxd *vidxd, unsigned int error)
}
}
+static int idxd_get_mdev_pasid(struct mdev_device *mdev)
+{
+ struct iommu_domain *domain;
+ struct device *dev = mdev_dev(mdev);
+
+ domain = mdev_get_iommu_domain(dev);
+ if (!domain)
+ return -EINVAL;
+
+ return iommu_aux_get_pasid(domain, dev->parent);
+}
+
int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
{
u32 offset = pos & (vidxd->bar_size[0] - 1);
@@ -474,15 +486,395 @@ void vidxd_mmio_init(struct vdcm_idxd *vidxd)
static void idxd_complete_command(struct vdcm_idxd *vidxd, enum idxd_cmdsts_err val)
{
- /* PLACEHOLDER */
+ u8 *bar0 = vidxd->bar0;
+ u32 *cmd = (u32 *)(bar0 + IDXD_CMD_OFFSET);
+ u32 *cmdsts = (u32 *)(bar0 + IDXD_CMDSTS_OFFSET);
+ u32 *intcause = (u32 *)(bar0 + IDXD_INTCAUSE_OFFSET);
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+
+ *cmdsts = val;
+ dev_dbg(dev, "%s: cmd: %#x status: %#x\n", __func__, *cmd, val);
+
+ if (*cmd & IDXD_CMD_INT_MASK) {
+ *intcause |= IDXD_INTC_CMD;
+ vidxd_send_interrupt(vidxd, 0);
+ }
+}
+
+static void vidxd_enable(struct vdcm_idxd *vidxd)
+{
+ u8 *bar0 = vidxd->bar0;
+ union gensts_reg *gensts = (union gensts_reg *)(bar0 + IDXD_GENSTATS_OFFSET);
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+
+ dev_dbg(dev, "%s\n", __func__);
+ if (gensts->state == IDXD_DEVICE_STATE_ENABLED)
+ return idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_ENABLED);
+
+ /* Check PCI configuration */
+ if (!(vidxd->cfg[PCI_COMMAND] & PCI_COMMAND_MASTER))
+ return idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_BUSMASTER_EN);
+
+ gensts->state = IDXD_DEVICE_STATE_ENABLED;
+
+ return idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_disable(struct vdcm_idxd *vidxd)
+{
+ struct idxd_wq *wq;
+ union wqcfg *wqcfg;
+ u8 *bar0 = vidxd->bar0;
+ union gensts_reg *gensts = (union gensts_reg *)(bar0 + IDXD_GENSTATS_OFFSET);
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ u32 status;
+
+ dev_dbg(dev, "%s\n", __func__);
+ if (gensts->state == IDXD_DEVICE_STATE_DISABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DIS_DEV_EN);
+ return;
+ }
+
+ wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ wq = vidxd->wq;
+
+ /* If it is a DWQ, need to disable the DWQ as well */
+ if (wq_dedicated(wq)) {
+ idxd_wq_disable(wq, &status);
+ if (status) {
+ dev_warn(dev, "vidxd disable (wq disable) failed: %#x\n", status);
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DIS_DEV_EN);
+ return;
+ }
+ } else {
+ idxd_wq_drain(wq, &status);
+ if (status)
+ dev_warn(dev, "vidxd disable (wq drain) failed: %#x\n", status);
+ }
+
+ wqcfg->wq_state = 0;
+ gensts->state = IDXD_DEVICE_STATE_DISABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_drain_all(struct vdcm_idxd *vidxd)
+{
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ struct idxd_wq *wq = vidxd->wq;
+
+ dev_dbg(dev, "%s\n", __func__);
+
+ idxd_wq_drain(wq, NULL);
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_drain(struct vdcm_idxd *vidxd, int val)
+{
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ u8 *bar0 = vidxd->bar0;
+ union wqcfg *wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ struct idxd_wq *wq = vidxd->wq;
+ u32 status;
+
+ dev_dbg(dev, "%s\n", __func__);
+ if (wqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_NOT_EN);
+ return;
+ }
+
+ idxd_wq_drain(wq, &status);
+ if (status) {
+ dev_dbg(dev, "wq drain failed: %#x\n", status);
+ idxd_complete_command(vidxd, status);
+ return;
+ }
+
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_abort_all(struct vdcm_idxd *vidxd)
+{
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ struct idxd_wq *wq = vidxd->wq;
+
+ dev_dbg(dev, "%s\n", __func__);
+ idxd_wq_abort(wq, NULL);
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_abort(struct vdcm_idxd *vidxd, int val)
+{
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ u8 *bar0 = vidxd->bar0;
+ union wqcfg *wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ struct idxd_wq *wq = vidxd->wq;
+ u32 status;
+
+ dev_dbg(dev, "%s\n", __func__);
+ if (wqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_NOT_EN);
+ return;
+ }
+
+ idxd_wq_abort(wq, &status);
+ if (status) {
+ dev_dbg(dev, "wq abort failed: %#x\n", status);
+ idxd_complete_command(vidxd, status);
+ return;
+ }
+
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
}
void vidxd_reset(struct vdcm_idxd *vidxd)
{
- /* PLACEHOLDER */
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ u8 *bar0 = vidxd->bar0;
+ union gensts_reg *gensts = (union gensts_reg *)(bar0 + IDXD_GENSTATS_OFFSET);
+ struct idxd_wq *wq;
+
+ dev_dbg(dev, "%s\n", __func__);
+ gensts->state = IDXD_DEVICE_STATE_DRAIN;
+ wq = vidxd->wq;
+
+ if (wq->state == IDXD_WQ_ENABLED) {
+ idxd_wq_abort(wq, NULL);
+ idxd_wq_disable(wq, NULL);
+ }
+
+ vidxd_mmio_init(vidxd);
+ gensts->state = IDXD_DEVICE_STATE_DISABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_reset(struct vdcm_idxd *vidxd, int wq_id_mask)
+{
+ struct idxd_wq *wq;
+ u8 *bar0 = vidxd->bar0;
+ union wqcfg *wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ u32 status;
+
+ wq = vidxd->wq;
+ dev_dbg(dev, "vidxd reset wq %u:%u\n", 0, wq->id);
+
+ if (wqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_NOT_EN);
+ return;
+ }
+
+ idxd_wq_abort(wq, &status);
+ if (status) {
+ dev_dbg(dev, "vidxd reset wq failed to abort: %#x\n", status);
+ idxd_complete_command(vidxd, status);
+ return;
+ }
+
+ idxd_wq_disable(wq, &status);
+ if (status) {
+ dev_dbg(dev, "vidxd reset wq failed to disable: %#x\n", status);
+ idxd_complete_command(vidxd, status);
+ return;
+ }
+
+ wqcfg->wq_state = IDXD_WQ_DEV_DISABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_alloc_int_handle(struct vdcm_idxd *vidxd, int vidx)
+{
+ bool ims = (vidx >> 16) & 1;
+ u32 cmdsts;
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+
+ vidx = vidx & 0xffff;
+
+ dev_dbg(dev, "allocating int handle for %x\n", vidx);
+
+ if (vidx != 1) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_INVAL_INT_IDX);
+ return;
+ }
+
+ if (ims) {
+ dev_warn(dev, "IMS allocation is not implemented yet\n");
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_NO_HANDLE);
+ } else {
+ vidx--; /* MSIX idx 0 is a slow path interrupt */
+ cmdsts = vidxd->ims_index[vidx] << 8;
+ dev_dbg(dev, "int handle %d:%lld\n", vidx, vidxd->ims_index[vidx]);
+ idxd_complete_command(vidxd, cmdsts);
+ }
+}
+
+static void vidxd_wq_enable(struct vdcm_idxd *vidxd, int wq_id)
+{
+ struct idxd_wq *wq;
+ u8 *bar0 = vidxd->bar0;
+ union wq_cap_reg *wqcap;
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ struct idxd_device *idxd;
+ union wqcfg *vwqcfg, *wqcfg;
+ unsigned long flags;
+ int wq_pasid;
+ u32 status;
+ int priv;
+
+ if (wq_id >= VIDXD_MAX_WQS) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_INVAL_WQIDX);
+ return;
+ }
+
+ idxd = vidxd->idxd;
+ wq = vidxd->wq;
+
+ dev_dbg(dev, "%s: wq %u:%u\n", __func__, wq_id, wq->id);
+
+ vwqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET + wq_id * 32);
+ wqcap = (union wq_cap_reg *)(bar0 + IDXD_WQCAP_OFFSET);
+ wqcfg = &wq->wqcfg;
+
+ if (vidxd_state(vidxd) != IDXD_DEVICE_STATE_ENABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_NOTEN);
+ return;
+ }
+
+ if (vwqcfg->wq_state != IDXD_WQ_DEV_DISABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_ENABLED);
+ return;
+ }
+
+ if (wq_dedicated(wq) && wqcap->dedicated_mode == 0) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_MODE);
+ return;
+ }
+
+ wq_pasid = idxd_get_mdev_pasid(mdev);
+ priv = 1;
+
+ if (wq_pasid >= 0) {
+ wqcfg->bits[2] &= ~0x3fffff00;
+ wqcfg->priv = priv;
+ wqcfg->pasid_en = 1;
+ wqcfg->pasid = wq_pasid;
+ dev_dbg(dev, "program pasid %d in wq %d\n", wq_pasid, wq->id);
+ spin_lock_irqsave(&idxd->dev_lock, flags);
+ idxd_wq_setup_pasid(wq, wq_pasid);
+ idxd_wq_setup_priv(wq, priv);
+ spin_unlock_irqrestore(&idxd->dev_lock, flags);
+ idxd_wq_enable(wq, &status);
+ if (status) {
+ dev_err(dev, "vidxd enable wq %d failed\n", wq->id);
+ idxd_complete_command(vidxd, status);
+ return;
+ }
+ } else {
+ dev_err(dev, "idxd pasid setup failed wq %d wq_pasid %d\n", wq->id, wq_pasid);
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_PASID_EN);
+ return;
+ }
+
+ vwqcfg->wq_state = IDXD_WQ_DEV_ENABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_disable(struct vdcm_idxd *vidxd, int wq_id_mask)
+{
+ struct idxd_wq *wq;
+ union wqcfg *wqcfg;
+ u8 *bar0 = vidxd->bar0;
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ u32 status;
+
+ wq = vidxd->wq;
+
+ dev_dbg(dev, "vidxd disable wq %u:%u\n", 0, wq->id);
+
+ wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ if (wqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_NOT_EN);
+ return;
+ }
+
+ /* If it is a DWQ, need to disable the DWQ as well */
+ if (wq_dedicated(wq)) {
+ idxd_wq_disable(wq, &status);
+ if (status) {
+ dev_warn(dev, "vidxd disable wq failed: %#x\n", status);
+ idxd_complete_command(vidxd, status);
+ return;
+ }
+ } else {
+ idxd_wq_drain(wq, &status);
+ if (status) {
+ dev_warn(dev, "vidxd disable drain wq failed: %#x\n", status);
+ idxd_complete_command(vidxd, status);
+ return;
+ }
+ }
+
+ wqcfg->wq_state = IDXD_WQ_DEV_DISABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
}
void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val)
{
- /* PLACEHOLDER */
+ union idxd_command_reg *reg = (union idxd_command_reg *)(vidxd->bar0 + IDXD_CMD_OFFSET);
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+
+ reg->bits = val;
+
+ dev_dbg(dev, "%s: cmd code: %u reg: %x\n", __func__, reg->cmd, reg->bits);
+
+ switch (reg->cmd) {
+ case IDXD_CMD_ENABLE_DEVICE:
+ vidxd_enable(vidxd);
+ break;
+ case IDXD_CMD_DISABLE_DEVICE:
+ vidxd_disable(vidxd);
+ break;
+ case IDXD_CMD_DRAIN_ALL:
+ vidxd_drain_all(vidxd);
+ break;
+ case IDXD_CMD_ABORT_ALL:
+ vidxd_abort_all(vidxd);
+ break;
+ case IDXD_CMD_RESET_DEVICE:
+ vidxd_reset(vidxd);
+ break;
+ case IDXD_CMD_ENABLE_WQ:
+ vidxd_wq_enable(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_DISABLE_WQ:
+ vidxd_wq_disable(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_DRAIN_WQ:
+ vidxd_wq_drain(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_ABORT_WQ:
+ vidxd_wq_abort(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_RESET_WQ:
+ vidxd_wq_reset(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_REQUEST_INT_HANDLE:
+ vidxd_alloc_int_handle(vidxd, reg->operand);
+ break;
+ default:
+ idxd_complete_command(vidxd, IDXD_CMDSTS_INVAL_CMD);
+ break;
+ }
}
Add "mdev" wq type and support helpers. The mdev wq type marks the wq
to be utilized as a VFIO mediated device.
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
---
drivers/dma/idxd/idxd.h | 2 ++
drivers/dma/idxd/sysfs.c | 13 +++++++++++--
2 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 1e4d9ec9b00d..1f03019bb45d 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -71,6 +71,7 @@ enum idxd_wq_type {
IDXD_WQT_NONE = 0,
IDXD_WQT_KERNEL,
IDXD_WQT_USER,
+ IDXD_WQT_MDEV,
};
struct idxd_cdev {
@@ -308,6 +309,7 @@ void idxd_cleanup_sysfs(struct idxd_device *idxd);
int idxd_register_driver(void);
void idxd_unregister_driver(void);
struct bus_type *idxd_get_bus_type(struct idxd_device *idxd);
+bool is_idxd_wq_mdev(struct idxd_wq *wq);
/* device interrupt control */
irqreturn_t idxd_irq_handler(int vec, void *data);
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index cdf2b5aac314..d3b0a95b0d1d 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -14,6 +14,7 @@ static char *idxd_wq_type_names[] = {
[IDXD_WQT_NONE] = "none",
[IDXD_WQT_KERNEL] = "kernel",
[IDXD_WQT_USER] = "user",
+ [IDXD_WQT_MDEV] = "mdev",
};
static void idxd_conf_device_release(struct device *dev)
@@ -69,6 +70,11 @@ static inline bool is_idxd_wq_cdev(struct idxd_wq *wq)
return wq->type == IDXD_WQT_USER;
}
+inline bool is_idxd_wq_mdev(struct idxd_wq *wq)
+{
+ return wq->type == IDXD_WQT_MDEV;
+}
+
static int idxd_config_bus_match(struct device *dev,
struct device_driver *drv)
{
@@ -1095,8 +1101,9 @@ static ssize_t wq_type_show(struct device *dev,
return sprintf(buf, "%s\n",
idxd_wq_type_names[IDXD_WQT_KERNEL]);
case IDXD_WQT_USER:
- return sprintf(buf, "%s\n",
- idxd_wq_type_names[IDXD_WQT_USER]);
+ return sprintf(buf, "%s\n", idxd_wq_type_names[IDXD_WQT_USER]);
+ case IDXD_WQT_MDEV:
+ return sprintf(buf, "%s\n", idxd_wq_type_names[IDXD_WQT_MDEV]);
case IDXD_WQT_NONE:
default:
return sprintf(buf, "%s\n",
@@ -1123,6 +1130,8 @@ static ssize_t wq_type_store(struct device *dev,
wq->type = IDXD_WQT_KERNEL;
else if (sysfs_streq(buf, idxd_wq_type_names[IDXD_WQT_USER]))
wq->type = IDXD_WQT_USER;
+ else if (sysfs_streq(buf, idxd_wq_type_names[IDXD_WQT_MDEV]))
+ wq->type = IDXD_WQT_MDEV;
else
return -EINVAL;
Add support for IMS enabling on the mediated device.
On the actual hardware, MSIX vector 0 is the misc interrupt and handles
events such as administrative command completion, error reporting, and
performance monitor overflow. MSIX vectors 1...N are used for descriptor
completion interrupts. On the guest kernel, the MSIX interrupts are backed
by the mediated device through emulation or IMS vectors. Vector 0 is
handled through emulation by the host vdcm. Vector 1 (and possibly more in
the future) is backed by IMS. Once the relevant IRQ domain is set up with
dev_msi_domain_alloc_irqs(), IMS vectors can be hooked up to interrupt
handlers via request_irq() just like MSIX interrupts.
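As a rough illustration of the flow described above (not part of this patch;
vidxd_ims_handler() and vidxd_request_ims_irqs() are made up for this sketch,
and in the actual series the request_irq() happens later through the VFIO
MSIX trigger path):

static irqreturn_t vidxd_ims_handler(int irq, void *data)
{
	struct ims_irq_entry *entry = data;

	/* Forward the completion to the guest's MSIX vector (IMS index + 1) */
	vidxd_send_interrupt(entry->vidxd, entry->int_src + 1);
	return IRQ_HANDLED;
}

static int vidxd_request_ims_irqs(struct vdcm_idxd *vidxd)
{
	struct device *dev = mdev_dev(vidxd->vdev.mdev);
	struct msi_desc *desc;
	int rc, i = 0;

	/* Allocate one IMS-backed vector per guest MSIX vector 1..N */
	rc = dev_msi_domain_alloc_irqs(dev, VIDXD_MAX_MSIX_VECS - 1, &idxd_ims_ops);
	if (rc < 0)
		return rc;

	for_each_msi_entry(desc, dev) {
		struct ims_irq_entry *entry = &vidxd->irq_entries[i];

		/* desc->irq is a regular Linux irq, usable like an MSIX vector */
		rc = request_irq(desc->irq, vidxd_ims_handler, 0, "vidxd-ims", entry);
		if (rc)
			goto err;
		entry->irq = desc->irq;
		entry->irq_set = true;
		i++;
	}
	return 0;

err:
	while (--i >= 0)
		free_irq(vidxd->irq_entries[i].irq, &vidxd->irq_entries[i]);
	dev_msi_domain_free_irqs(dev);
	return rc;
}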
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
---
drivers/dma/Kconfig | 1
drivers/dma/idxd/ims.c | 142 +++++++++++++++++++++++++++++++++++++++++------
drivers/dma/idxd/ims.h | 7 ++
drivers/dma/idxd/vdev.c | 76 +++++++++++++++++++++----
4 files changed, 195 insertions(+), 31 deletions(-)
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 69c1ae72df86..a19e5dbeab9b 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -311,6 +311,7 @@ config INTEL_IDXD_MDEV
depends on INTEL_IDXD
depends on VFIO_MDEV
depends on VFIO_MDEV_DEVICE
+ depends on DEV_MSI
config INTEL_IOATDMA
tristate "Intel I/OAT DMA support"
diff --git a/drivers/dma/idxd/ims.c b/drivers/dma/idxd/ims.c
index bffc74c2b305..f9b7fbcb61df 100644
--- a/drivers/dma/idxd/ims.c
+++ b/drivers/dma/idxd/ims.c
@@ -7,22 +7,13 @@
#include <linux/device.h>
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/msi.h>
+#include <linux/mdev.h>
#include <uapi/linux/idxd.h>
#include "registers.h"
#include "idxd.h"
#include "mdev.h"
-
-int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd)
-{
- /* PLACEHOLDER */
- return 0;
-}
-
-int vidxd_free_ims_entries(struct vdcm_idxd *vidxd)
-{
- /* PLACEHOLDER */
- return 0;
-}
+#include "ims.h"
+#include "vdev.h"
static void idxd_free_ims_index(struct idxd_device *idxd,
unsigned long ims_idx)
@@ -42,21 +33,65 @@ static int idxd_alloc_ims_index(struct idxd_device *idxd)
static unsigned int idxd_ims_irq_mask(struct msi_desc *desc)
{
- // Filled out later when VDCM is introduced.
+ int ims_offset;
+ u32 mask_bits;
+ struct device *dev = desc->dev;
+ struct mdev_device *mdev = mdev_from_dev(dev);
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ struct idxd_device *idxd = vidxd->idxd;
+ void __iomem *base;
+ int ims_id = desc->platform.msi_index;
- return 0;
+ dev_dbg(dev, "idxd irq mask: %d\n", ims_id);
+
+ ims_offset = idxd->ims_offset + vidxd->ims_index[ims_id] * 0x10;
+ base = idxd->reg_base + ims_offset;
+ mask_bits = ioread32(base + IMS_ENTRY_VECTOR_CTRL);
+ mask_bits |= IMS_ENTRY_CTRL_MASKBIT;
+ iowrite32(mask_bits, base + IMS_ENTRY_VECTOR_CTRL);
+
+ return mask_bits;
}
static unsigned int idxd_ims_irq_unmask(struct msi_desc *desc)
{
- // Filled out later when VDCM is introduced.
+ int ims_offset;
+ u32 mask_bits;
+ struct device *dev = desc->dev;
+ struct mdev_device *mdev = mdev_from_dev(dev);
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ struct idxd_device *idxd = vidxd->idxd;
+ void __iomem *base;
+ int ims_id = desc->platform.msi_index;
- return 0;
+ dev_dbg(dev, "idxd irq unmask: %d\n", ims_id);
+
+ ims_offset = idxd->ims_offset + vidxd->ims_index[ims_id] * 0x10;
+ base = idxd->reg_base + ims_offset;
+ mask_bits = ioread32(base + IMS_ENTRY_VECTOR_CTRL);
+ mask_bits &= ~IMS_ENTRY_CTRL_MASKBIT;
+ iowrite32(mask_bits, base + IMS_ENTRY_VECTOR_CTRL);
+
+ return mask_bits;
}
static void idxd_ims_write_msg(struct msi_desc *desc, struct msi_msg *msg)
{
- // Filled out later when VDCM is introduced.
+ int ims_offset;
+ struct device *dev = desc->dev;
+ struct mdev_device *mdev = mdev_from_dev(dev);
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ struct idxd_device *idxd = vidxd->idxd;
+ void __iomem *base;
+ int ims_id = desc->platform.msi_index;
+
+ dev_dbg(dev, "ims_write: %d %x\n", ims_id, msg->address_lo);
+
+ ims_offset = idxd->ims_offset + vidxd->ims_index[ims_id] * 0x10;
+ base = idxd->reg_base + ims_offset;
+ iowrite32(msg->address_lo, base + IMS_ENTRY_LOWER_ADDR);
+ iowrite32(msg->address_hi, base + IMS_ENTRY_UPPER_ADDR);
+ iowrite32(msg->data, base + IMS_ENTRY_DATA);
}
static struct platform_msi_ops idxd_ims_ops = {
@@ -64,3 +99,76 @@ static struct platform_msi_ops idxd_ims_ops = {
.irq_unmask = idxd_ims_irq_unmask,
.write_msg = idxd_ims_write_msg,
};
+
+int vidxd_free_ims_entries(struct vdcm_idxd *vidxd)
+{
+ struct idxd_device *idxd = vidxd->idxd;
+ struct ims_irq_entry *irq_entry;
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ struct msi_desc *desc;
+ int i = 0;
+
+ for_each_msi_entry(desc, dev) {
+ irq_entry = &vidxd->irq_entries[i];
+ /*
+ * When qemu dies unexpectedly, it does not call VFIO_IRQ_SET_DATA_NONE ioctl
+ * to free up the interrupts. We need to free the interrupts here as clean up
+ * if they haven't been freed.
+ */
+ if (irq_entry->irq_set)
+ free_irq(irq_entry->irq, irq_entry);
+ idxd_free_ims_index(idxd, vidxd->ims_index[i]);
+ vidxd->ims_index[i] = -1;
+ memset(irq_entry, 0, sizeof(*irq_entry));
+ i++;
+ }
+
+ dev_msi_domain_free_irqs(dev);
+ return 0;
+}
+
+int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd)
+{
+ struct idxd_device *idxd = vidxd->idxd;
+ struct ims_irq_entry *irq_entry;
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ struct msi_desc *desc;
+ int err, i = 0;
+ int index;
+
+ /*
+ * MSIX vec 0 is emulated by the vdcm and does not take up an IMS. The total MSIX vecs used
+ * by the mdev will be total IMS + 1. vec 0 is used for misc interrupts such as command
+ * completion, error notification, PMU, etc. The other vectors are used for descriptor
+ * completion. Thus only VIDXD_MAX_MSIX_VECS - 1 IMS vectors need to be
+ * allocated.
+ */
+ err = dev_msi_domain_alloc_irqs(dev, VIDXD_MAX_MSIX_VECS - 1, &idxd_ims_ops);
+ if (err < 0) {
+ dev_dbg(dev, "Enabling IMS entry! %d\n", err);
+ return err;
+ }
+
+ i = 0;
+ for_each_msi_entry(desc, dev) {
+ index = idxd_alloc_ims_index(idxd);
+ if (index < 0) {
+ err = index;
+ break;
+ }
+ vidxd->ims_index[i] = index;
+
+ irq_entry = &vidxd->irq_entries[i];
+ irq_entry->vidxd = vidxd;
+ irq_entry->int_src = i;
+ irq_entry->irq = desc->irq;
+ i++;
+ }
+
+ if (err) {
+ vidxd_free_ims_entries(vidxd);
+ return err;
+ }
+
+ return 0;
+}
diff --git a/drivers/dma/idxd/ims.h b/drivers/dma/idxd/ims.h
index 3d823606e3a3..97826abf1163 100644
--- a/drivers/dma/idxd/ims.h
+++ b/drivers/dma/idxd/ims.h
@@ -4,6 +4,13 @@
#ifndef _IDXD_IMS_H_
#define _IDXD_IMS_H_
+/* IMS entry format */
+#define IMS_ENTRY_LOWER_ADDR 0 /* Message Address */
+#define IMS_ENTRY_UPPER_ADDR 4 /* Message Upper Address */
+#define IMS_ENTRY_DATA 8 /* Message Data */
+#define IMS_ENTRY_VECTOR_CTRL 12 /* Vector Control */
+#define IMS_ENTRY_CTRL_MASKBIT 0x00000001
+
int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd);
int vidxd_free_ims_entries(struct vdcm_idxd *vidxd);
diff --git a/drivers/dma/idxd/vdev.c b/drivers/dma/idxd/vdev.c
index df99d0bce5e9..66e59cb02635 100644
--- a/drivers/dma/idxd/vdev.c
+++ b/drivers/dma/idxd/vdev.c
@@ -44,15 +44,75 @@ int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx)
return rc;
}
+static int idxd_get_mdev_pasid(struct mdev_device *mdev)
+{
+ struct iommu_domain *domain;
+ struct device *dev = mdev_dev(mdev);
+
+ domain = mdev_get_iommu_domain(dev);
+ if (!domain)
+ return -EINVAL;
+
+ return iommu_aux_get_pasid(domain, dev->parent);
+}
+
+#define IMS_PASID_ENABLE 0x8
int vidxd_disable_host_ims_pasid(struct vdcm_idxd *vidxd, int ims_idx)
{
- /* PLACEHOLDER */
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ unsigned int ims_offset;
+ struct idxd_device *idxd = vidxd->idxd;
+ u32 val;
+
+ /*
+ * Current implementation limits to 1 WQ for the vdev and therefore
+ * also only 1 IMS interrupt for that vdev.
+ */
+ if (ims_idx >= VIDXD_MAX_WQS) {
+ dev_warn(dev, "ims_idx greater than vidxd allowed: %d\n", ims_idx);
+ return -EINVAL;
+ }
+
+ ims_offset = idxd->ims_offset + vidxd->ims_index[ims_idx] * 0x10;
+ val = ioread32(idxd->reg_base + ims_offset + 12);
+ val &= ~IMS_PASID_ENABLE;
+ iowrite32(val, idxd->reg_base + ims_offset + 12);
+
return 0;
}
int vidxd_enable_host_ims_pasid(struct vdcm_idxd *vidxd, int ims_idx)
{
- /* PLACEHOLDER */
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ int pasid;
+ unsigned int ims_offset;
+ struct idxd_device *idxd = vidxd->idxd;
+ u32 val;
+
+ /*
+ * Current implementation limits to 1 WQ for the vdev and therefore
+ * also only 1 IMS interrupt for that vdev.
+ */
+ if (ims_idx >= VIDXD_MAX_WQS) {
+ dev_warn(dev, "ims_idx greater than vidxd allowed: %d\n", ims_idx);
+ return -EINVAL;
+ }
+
+ /* Setup the PASID filtering */
+ pasid = idxd_get_mdev_pasid(mdev);
+
+ if (pasid >= 0) {
+ ims_offset = idxd->ims_offset + vidxd->ims_index[ims_idx] * 0x10;
+ val = ioread32(idxd->reg_base + ims_offset + 12);
+ val = IMS_PASID_ENABLE | (pasid << 12) | (val & 0x7);
+ iowrite32(val, idxd->reg_base + ims_offset + 12);
+ } else {
+ dev_warn(dev, "pasid setup failed for ims entry %lld\n", vidxd->ims_index[ims_idx]);
+ return -ENXIO;
+ }
+
return 0;
}
@@ -81,18 +141,6 @@ static void vidxd_report_error(struct vdcm_idxd *vidxd, unsigned int error)
}
}
-static int idxd_get_mdev_pasid(struct mdev_device *mdev)
-{
- struct iommu_domain *domain;
- struct device *dev = mdev_dev(mdev);
-
- domain = mdev_get_iommu_domain(dev);
- if (!domain)
- return -EINVAL;
-
- return iommu_aux_get_pasid(domain, dev->parent);
-}
-
int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
{
u32 offset = pos & (vidxd->bar_size[0] - 1);
When a dedicated wq is exported as an mdev, the wq must be disabled on the
device so that the PASID can be programmed into the wq. Introduce a
software-only wq state, IDXD_WQ_LOCKED, to prevent the user from modifying
the configuration while the mdev wq sits in this intermediate state. A
locked wq is not in the DISABLED state, so configuration changes are
rejected, and it is not in the ENABLED state either, so actions that
require an enabled wq are also rejected.
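As an informal sketch of how the locked state protects the configuration
(wq_example_store() is a made-up attribute handler, not part of this patch;
the real sysfs store handlers follow the same pattern of requiring
IDXD_WQ_DISABLED):

static ssize_t wq_example_store(struct device *dev, struct device_attribute *attr,
				const char *buf, size_t count)
{
	struct idxd_wq *wq = container_of(dev, struct idxd_wq, conf_dev);

	/* IDXD_WQ_LOCKED is neither DISABLED nor ENABLED, so it is rejected here */
	if (wq->state != IDXD_WQ_DISABLED)
		return -EPERM;

	/* ... apply the new setting ... */
	return count;
}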
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
---
drivers/dma/idxd/idxd.h | 1 +
drivers/dma/idxd/mdev.c | 4 +++-
drivers/dma/idxd/sysfs.c | 2 ++
3 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 1f03019bb45d..cc0665335aee 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -60,6 +60,7 @@ struct idxd_group {
enum idxd_wq_state {
IDXD_WQ_DISABLED = 0,
IDXD_WQ_ENABLED,
+ IDXD_WQ_LOCKED,
};
enum idxd_wq_flag {
diff --git a/drivers/dma/idxd/mdev.c b/drivers/dma/idxd/mdev.c
index 744adfdc06cd..e3c32f9566b5 100644
--- a/drivers/dma/idxd/mdev.c
+++ b/drivers/dma/idxd/mdev.c
@@ -69,8 +69,10 @@ static void idxd_vdcm_init(struct vdcm_idxd *vidxd)
vidxd_mmio_init(vidxd);
- if (wq_dedicated(wq) && wq->state == IDXD_WQ_ENABLED)
+ if (wq_dedicated(wq) && wq->state == IDXD_WQ_ENABLED) {
idxd_wq_disable(wq, NULL);
+ wq->state = IDXD_WQ_LOCKED;
+ }
}
static void __idxd_vdcm_release(struct vdcm_idxd *vidxd)
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index d3b0a95b0d1d..6344cc719897 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -822,6 +822,8 @@ static ssize_t wq_state_show(struct device *dev,
return sprintf(buf, "disabled\n");
case IDXD_WQ_ENABLED:
return sprintf(buf, "enabled\n");
+ case IDXD_WQ_LOCKED:
+ return sprintf(buf, "locked\n");
}
return sprintf(buf, "unknown\n");
From: Megha Dey <[email protected]>
Add support for the creation of a new DEV_MSI irq domain. It creates a
new irq chip associated with the DEV_MSI domain and adds the necessary
domain operations to it.
Add a new config option, DEV_MSI, which must be enabled by any driver that
wants to support device-specific message-signaled interrupts outside of
PCI-MSI(-X).
Lastly, add device-specific mask/unmask callbacks, in addition to a write
message function, to platform_msi_ops.
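For orientation, a consumer of the extended platform_msi_ops might look
roughly like this (the foo_* names, struct foo_dev, and the register layout
are made up for illustration; the idxd IMS code elsewhere in this series is
the real consumer):

/* Hypothetical per-vector entry: 16 bytes, control word at offset 12, bit 0 masks */
#define FOO_IMS_CTRL(idx)	((idx) * 0x10 + 12)
#define FOO_IMS_MASKBIT		BIT(0)

struct foo_dev {
	struct device dev;
	void __iomem *ims_base;
};

static inline struct foo_dev *to_foo_dev(struct device *dev)
{
	return container_of(dev, struct foo_dev, dev);
}

static unsigned int foo_irq_mask(struct msi_desc *desc)
{
	struct foo_dev *fdev = to_foo_dev(desc->dev);
	u32 ctrl = ioread32(fdev->ims_base + FOO_IMS_CTRL(desc->platform.msi_index));

	ctrl |= FOO_IMS_MASKBIT;
	iowrite32(ctrl, fdev->ims_base + FOO_IMS_CTRL(desc->platform.msi_index));
	return ctrl;
}

static unsigned int foo_irq_unmask(struct msi_desc *desc)
{
	struct foo_dev *fdev = to_foo_dev(desc->dev);
	u32 ctrl = ioread32(fdev->ims_base + FOO_IMS_CTRL(desc->platform.msi_index));

	ctrl &= ~FOO_IMS_MASKBIT;
	iowrite32(ctrl, fdev->ims_base + FOO_IMS_CTRL(desc->platform.msi_index));
	return ctrl;
}

static void foo_write_msg(struct msi_desc *desc, struct msi_msg *msg)
{
	struct foo_dev *fdev = to_foo_dev(desc->dev);
	void __iomem *entry = fdev->ims_base + desc->platform.msi_index * 0x10;

	iowrite32(msg->address_lo, entry + 0);
	iowrite32(msg->address_hi, entry + 4);
	iowrite32(msg->data, entry + 8);
}

static const struct platform_msi_ops foo_msi_ops = {
	.irq_mask	= foo_irq_mask,
	.irq_unmask	= foo_irq_unmask,
	.write_msg	= foo_write_msg,
};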
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Megha Dey <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
---
arch/x86/include/asm/hw_irq.h | 5 ++
drivers/base/Kconfig | 7 +++
drivers/base/Makefile | 1
drivers/base/dev-msi.c | 95 +++++++++++++++++++++++++++++++++++++++++
drivers/base/platform-msi.c | 45 +++++++++++++------
drivers/base/platform-msi.h | 23 ++++++++++
include/linux/msi.h | 8 +++
7 files changed, 168 insertions(+), 16 deletions(-)
create mode 100644 drivers/base/dev-msi.c
create mode 100644 drivers/base/platform-msi.h
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 74c12437401e..8ecd7570589d 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -61,6 +61,11 @@ struct irq_alloc_info {
irq_hw_number_t msi_hwirq;
};
#endif
+#ifdef CONFIG_DEV_MSI
+ struct {
+ irq_hw_number_t hwirq;
+ };
+#endif
#ifdef CONFIG_X86_IO_APIC
struct {
int ioapic_id;
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 8d7001712062..f00901bac056 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -210,4 +210,11 @@ config GENERIC_ARCH_TOPOLOGY
appropriate scaling, sysfs interface for reading capacity values at
runtime.
+config DEV_MSI
+ bool "Device Specific Interrupt Messages"
+ select IRQ_DOMAIN_HIERARCHY
+ select GENERIC_MSI_IRQ_DOMAIN
+ help
+ Allow device drivers to generate device-specific interrupt messages
+ for devices independent of PCI MSI/-X.
endmenu
diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index 157452080f3d..ca1e4d39164e 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -21,6 +21,7 @@ obj-$(CONFIG_REGMAP) += regmap/
obj-$(CONFIG_SOC_BUS) += soc.o
obj-$(CONFIG_PINCTRL) += pinctrl.o
obj-$(CONFIG_DEV_COREDUMP) += devcoredump.o
+obj-$(CONFIG_DEV_MSI) += dev-msi.o
obj-$(CONFIG_GENERIC_MSI_IRQ_DOMAIN) += platform-msi.o
obj-$(CONFIG_GENERIC_ARCH_TOPOLOGY) += arch_topology.o
diff --git a/drivers/base/dev-msi.c b/drivers/base/dev-msi.c
new file mode 100644
index 000000000000..240ccc353933
--- /dev/null
+++ b/drivers/base/dev-msi.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2020 Intel Corporation.
+ *
+ * Author: Megha Dey <[email protected]>
+ */
+
+#include <linux/irq.h>
+#include <linux/irqdomain.h>
+#include <linux/msi.h>
+#include "platform-msi.h"
+
+struct irq_domain *dev_msi_default_domain;
+
+static irq_hw_number_t dev_msi_get_hwirq(struct msi_domain_info *info,
+ msi_alloc_info_t *arg)
+{
+ return arg->hwirq;
+}
+
+static irq_hw_number_t dev_msi_calc_hwirq(struct msi_desc *desc)
+{
+ u32 devid;
+
+ devid = desc->platform.msi_priv_data->devid;
+
+ return (devid << (32 - DEV_ID_SHIFT)) | desc->platform.msi_index;
+}
+
+static void dev_msi_set_desc(msi_alloc_info_t *arg, struct msi_desc *desc)
+{
+ arg->hwirq = dev_msi_calc_hwirq(desc);
+}
+
+static int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
+ int nvec, msi_alloc_info_t *arg)
+{
+ memset(arg, 0, sizeof(*arg));
+
+ return 0;
+}
+
+static struct msi_domain_ops dev_msi_domain_ops = {
+ .get_hwirq = dev_msi_get_hwirq,
+ .set_desc = dev_msi_set_desc,
+ .msi_prepare = dev_msi_prepare,
+};
+
+static struct irq_chip dev_msi_controller = {
+ .name = "DEV-MSI",
+ .irq_unmask = platform_msi_unmask_irq,
+ .irq_mask = platform_msi_mask_irq,
+ .irq_write_msi_msg = platform_msi_write_msg,
+ .irq_ack = irq_chip_ack_parent,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .flags = IRQCHIP_SKIP_SET_WAKE,
+};
+
+static struct msi_domain_info dev_msi_domain_info = {
+ .flags = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
+ .ops = &dev_msi_domain_ops,
+ .chip = &dev_msi_controller,
+ .handler = handle_edge_irq,
+ .handler_name = "edge",
+};
+
+static int __init create_dev_msi_domain(void)
+{
+ struct irq_domain *parent = NULL;
+ struct fwnode_handle *fn;
+
+ /*
+ * Modern code should never have to use irq_get_default_host. But since
+ * dev-msi is invisible to DT/ACPI, this is an exception case.
+ */
+ parent = irq_get_default_host();
+ if (!parent)
+ return -ENXIO;
+
+ fn = irq_domain_alloc_named_fwnode("DEV_MSI");
+ if (!fn)
+ return -ENXIO;
+
+ dev_msi_default_domain = msi_create_irq_domain(fn, &dev_msi_domain_info, parent);
+ if (!dev_msi_default_domain) {
+ pr_warn("failed to initialize irqdomain for DEV-MSI.\n");
+ return -ENXIO;
+ }
+
+ irq_domain_update_bus_token(dev_msi_default_domain, DOMAIN_BUS_PLATFORM_MSI);
+ irq_domain_free_fwnode(fn);
+
+ return 0;
+}
+device_initcall(create_dev_msi_domain);
diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
index 9d94cd699468..5e1f210d65ee 100644
--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -12,21 +12,7 @@
#include <linux/irqdomain.h>
#include <linux/msi.h>
#include <linux/slab.h>
-
-#define DEV_ID_SHIFT 21
-#define MAX_DEV_MSIS (1 << (32 - DEV_ID_SHIFT))
-
-/*
- * Internal data structure containing a (made up, but unique) devid
- * and the platform-msi ops
- */
-struct platform_msi_priv_data {
- struct device *dev;
- void *host_data;
- msi_alloc_info_t arg;
- const struct platform_msi_ops *ops;
- int devid;
-};
+#include "platform-msi.h"
/* The devid allocator */
static DEFINE_IDA(platform_msi_devid_ida);
@@ -76,7 +62,7 @@ static void platform_msi_update_dom_ops(struct msi_domain_info *info)
ops->set_desc = platform_msi_set_desc;
}
-static void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
+void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
{
struct msi_desc *desc = irq_data_get_msi_desc(data);
struct platform_msi_priv_data *priv_data;
@@ -86,6 +72,33 @@ static void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
priv_data->ops->write_msg(desc, msg);
}
+static void __platform_msi_desc_mask_unmask_irq(struct msi_desc *desc, u32 mask)
+{
+ const struct platform_msi_ops *ops;
+
+ ops = desc->platform.msi_priv_data->ops;
+ if (!ops)
+ return;
+
+ if (mask) {
+ if (ops->irq_mask)
+ ops->irq_mask(desc);
+ } else {
+ if (ops->irq_unmask)
+ ops->irq_unmask(desc);
+ }
+}
+
+void platform_msi_mask_irq(struct irq_data *data)
+{
+ __platform_msi_desc_mask_unmask_irq(irq_data_get_msi_desc(data), 1);
+}
+
+void platform_msi_unmask_irq(struct irq_data *data)
+{
+ __platform_msi_desc_mask_unmask_irq(irq_data_get_msi_desc(data), 0);
+}
+
static void platform_msi_update_chip_ops(struct msi_domain_info *info)
{
struct irq_chip *chip = info->chip;
diff --git a/drivers/base/platform-msi.h b/drivers/base/platform-msi.h
new file mode 100644
index 000000000000..1de8c2874218
--- /dev/null
+++ b/drivers/base/platform-msi.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright © 2020 Intel Corporation.
+ *
+ * Author: Megha Dey <[email protected]>
+ */
+
+#include <linux/msi.h>
+
+#define DEV_ID_SHIFT 21
+#define MAX_DEV_MSIS (1 << (32 - DEV_ID_SHIFT))
+
+/*
+ * Data structure containing a (made up, but unique) devid
+ * and the platform-msi ops.
+ */
+struct platform_msi_priv_data {
+ struct device *dev;
+ void *host_data;
+ msi_alloc_info_t arg;
+ const struct platform_msi_ops *ops;
+ int devid;
+};
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 7f6a8eb51aca..1da97f905720 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -323,9 +323,13 @@ enum {
/*
* platform_msi_ops - Callbacks for platform MSI ops
+ * @irq_mask: mask an interrupt source
+ * @irq_unmask: unmask an interrupt source
* @write_msg: write message content
*/
struct platform_msi_ops {
+ unsigned int (*irq_mask)(struct msi_desc *desc);
+ unsigned int (*irq_unmask)(struct msi_desc *desc);
irq_write_msi_msg_t write_msg;
};
@@ -370,6 +374,10 @@ int platform_msi_domain_alloc(struct irq_domain *domain, unsigned int virq,
void platform_msi_domain_free(struct irq_domain *domain, unsigned int virq,
unsigned int nvec);
void *platform_msi_get_host_data(struct irq_domain *domain);
+
+void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg);
+void platform_msi_unmask_irq(struct irq_data *data);
+void platform_msi_mask_irq(struct irq_data *data);
#endif /* CONFIG_GENERIC_MSI_IRQ_DOMAIN */
#ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
From: Megha Dey <[email protected]>
The dev-msi interrupts are to be allocated/freed only for custom devices,
not standard PCI-MSIX devices.
These interrupts are device-defined and they are distinct from the already
existing msi interrupts:
pci-msi: Standard PCI MSI/MSI-X setup format
platform-msi: Platform custom, but device-driver opaque MSI setup/control
arch-msi: fallback for devices not assigned to the generic PCI domain
dev-msi: device-defined IRQ domain for ancillary devices, e.g. DSA
portal devices use device-specific IMS (Interrupt Message Store) interrupts.
dev-msi interrupts are represented by their own device type. That means
dev->msi_list is never contended between different interrupt types; it
will either be all PCI-MSI or all device-defined.
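A rough sketch of the intended allocation/free lifecycle on a generic
(non-PCI) struct device; the foo_* names are illustrative only and
foo_msi_ops is assumed to be a platform_msi_ops provided by the driver:

static int foo_enable_dev_msi(struct device *dev, unsigned int nvec,
			      unsigned int *irqs)
{
	struct msi_desc *desc;
	int rc, i = 0;

	/*
	 * Fails with -ENXIO if dev->msi_domain is already bound to a
	 * different (e.g. PCI-MSI) domain, which is what keeps a single
	 * msi_list from mixing interrupt types.
	 */
	rc = dev_msi_domain_alloc_irqs(dev, nvec, &foo_msi_ops);
	if (rc < 0)
		return rc;

	/* Record the Linux irq numbers; caller provides room for nvec entries */
	for_each_msi_entry(desc, dev)
		irqs[i++] = desc->irq;

	return 0;
}

static void foo_disable_dev_msi(struct device *dev)
{
	dev_msi_domain_free_irqs(dev);
}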
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Megha Dey <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/base/dev-msi.c | 23 +++++++++++++++++++++++
include/linux/msi.h | 4 ++++
2 files changed, 27 insertions(+)
diff --git a/drivers/base/dev-msi.c b/drivers/base/dev-msi.c
index 43d6ed3ba10f..4cc75bfd62da 100644
--- a/drivers/base/dev-msi.c
+++ b/drivers/base/dev-msi.c
@@ -145,3 +145,26 @@ struct irq_domain *create_remap_dev_msi_irq_domain(struct irq_domain *parent,
return domain;
}
#endif
+
+int dev_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
+ const struct platform_msi_ops *platform_ops)
+{
+ if (!dev->msi_domain) {
+ dev->msi_domain = dev_msi_default_domain;
+ } else if (dev->msi_domain != dev_msi_default_domain) {
+ dev_WARN_ONCE(dev, 1, "already registered to another irq domain?\n");
+ return -ENXIO;
+ }
+
+ return platform_msi_domain_alloc_irqs(dev, nvec, platform_ops);
+}
+EXPORT_SYMBOL_GPL(dev_msi_domain_alloc_irqs);
+
+void dev_msi_domain_free_irqs(struct device *dev)
+{
+ if (dev->msi_domain != dev_msi_default_domain)
+ dev_WARN_ONCE(dev, 1, "registered to incorrect irq domain?\n");
+
+ platform_msi_domain_free_irqs(dev);
+}
+EXPORT_SYMBOL_GPL(dev_msi_domain_free_irqs);
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 7098ba566bcd..9dde8a43a0f7 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -381,6 +381,10 @@ void platform_msi_mask_irq(struct irq_data *data);
int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
int nvec, msi_alloc_info_t *arg);
+
+int dev_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
+ const struct platform_msi_ops *platform_ops);
+void dev_msi_domain_free_irqs(struct device *dev);
#endif /* CONFIG_GENERIC_MSI_IRQ_DOMAIN */
#ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
When a device error occurs, the mediated device needs to be notified so
that it can forward the error to the guest. Add support to notify the
specific mdev when an error is wq specific, and to broadcast errors to all
mdevs when it is a generic device error.
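A hedged sketch of what the device-wide broadcast helper declared below
(idxd_vidxd_send_errors()) could look like; only the wq-specific variant
appears in the diff in this patch, and the actual implementation may differ:

void idxd_vidxd_send_errors(struct idxd_device *idxd)
{
	int i;

	for (i = 0; i < idxd->max_wqs; i++) {
		struct idxd_wq *wq = &idxd->wqs[i];

		/* Only wqs exported as mdevs have a vdcm list to walk */
		if (wq->type != IDXD_WQT_MDEV)
			continue;

		idxd_wq_vidxd_send_errors(wq);
	}
}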
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
---
drivers/dma/idxd/idxd.h | 12 ++++++++++++
drivers/dma/idxd/irq.c | 4 ++++
drivers/dma/idxd/vdev.c | 34 ++++++++++++++++++++++++++++++++++
drivers/dma/idxd/vdev.h | 1 +
4 files changed, 51 insertions(+)
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index cc0665335aee..4a3947b84f20 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -377,4 +377,16 @@ void idxd_wq_del_cdev(struct idxd_wq *wq);
int idxd_mdev_host_init(struct idxd_device *idxd);
void idxd_mdev_host_release(struct idxd_device *idxd);
+#ifdef CONFIG_INTEL_IDXD_MDEV
+void idxd_vidxd_send_errors(struct idxd_device *idxd);
+void idxd_wq_vidxd_send_errors(struct idxd_wq *wq);
+#else
+static inline void idxd_vidxd_send_errors(struct idxd_device *idxd)
+{
+}
+static inline void idxd_wq_vidxd_send_errors(struct idxd_wq *wq)
+{
+}
+#endif /* CONFIG_INTEL_IDXD_MDEV */
+
#endif
diff --git a/drivers/dma/idxd/irq.c b/drivers/dma/idxd/irq.c
index c818acb34a14..32e59ba85cd0 100644
--- a/drivers/dma/idxd/irq.c
+++ b/drivers/dma/idxd/irq.c
@@ -148,6 +148,8 @@ irqreturn_t idxd_misc_thread(int vec, void *data)
if (wq->type == IDXD_WQT_USER)
wake_up_interruptible(&wq->idxd_cdev.err_queue);
+ else if (wq->type == IDXD_WQT_MDEV)
+ idxd_wq_vidxd_send_errors(wq);
} else {
int i;
@@ -156,6 +158,8 @@ irqreturn_t idxd_misc_thread(int vec, void *data)
if (wq->type == IDXD_WQT_USER)
wake_up_interruptible(&wq->idxd_cdev.err_queue);
+ else if (wq->type == IDXD_WQT_MDEV)
+ idxd_wq_vidxd_send_errors(wq);
}
}
diff --git a/drivers/dma/idxd/vdev.c b/drivers/dma/idxd/vdev.c
index 66e59cb02635..d87c5355e0cb 100644
--- a/drivers/dma/idxd/vdev.c
+++ b/drivers/dma/idxd/vdev.c
@@ -926,3 +926,37 @@ void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val)
break;
}
}
+
+static void vidxd_send_errors(struct vdcm_idxd *vidxd)
+{
+ struct idxd_device *idxd = vidxd->idxd;
+ u8 *bar0 = vidxd->bar0;
+ union sw_err_reg *swerr = (union sw_err_reg *)(bar0 + IDXD_SWERR_OFFSET);
+ union genctrl_reg *genctrl = (union genctrl_reg *)(bar0 + IDXD_GENCTRL_OFFSET);
+ int i;
+ unsigned long flags;
+
+ if (swerr->valid) {
+ if (!swerr->overflow)
+ swerr->overflow = 1;
+ return;
+ }
+
+ spin_lock_irqsave(&idxd->dev_lock, flags);
+ for (i = 0; i < 4; i++) {
+ swerr->bits[i] = idxd->sw_err.bits[i];
+ swerr++;
+ }
+ spin_unlock_irqrestore(&idxd->dev_lock, flags);
+
+ if (genctrl->softerr_int_en)
+ vidxd_send_interrupt(vidxd, 0);
+}
+
+void idxd_wq_vidxd_send_errors(struct idxd_wq *wq)
+{
+ struct vdcm_idxd *vidxd;
+
+ list_for_each_entry(vidxd, &wq->vdcm_list, list)
+ vidxd_send_errors(vidxd);
+}
diff --git a/drivers/dma/idxd/vdev.h b/drivers/dma/idxd/vdev.h
index 2dc8d22d3ea7..8ba7cbdb7e8b 100644
--- a/drivers/dma/idxd/vdev.h
+++ b/drivers/dma/idxd/vdev.h
@@ -23,5 +23,6 @@ int vidxd_disable_host_ims_pasid(struct vdcm_idxd *vidxd, int ims_idx);
int vidxd_enable_host_ims_pasid(struct vdcm_idxd *vidxd, int ims_idx);
int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx);
void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val);
+void idxd_wq_vidxd_send_errors(struct idxd_wq *wq);
#endif
Add the support code for the "1dwq" mdev type. This mdev type follows the
standard VFIO mdev flow. The "1dwq" type exports a single dedicated wq to
the mdev. The dwq has a read-only configuration that is set up by the
host. The mdev type does not support PASID or SVA, matching the stage 1
driver in functional support. For backward compatibility, the mdev will
maintain the DSA spec definition of this mdev type once the commit goes
upstream.
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
---
drivers/dma/idxd/mdev.c | 142 ++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 133 insertions(+), 9 deletions(-)
diff --git a/drivers/dma/idxd/mdev.c b/drivers/dma/idxd/mdev.c
index 01207ca42a79..744adfdc06cd 100644
--- a/drivers/dma/idxd/mdev.c
+++ b/drivers/dma/idxd/mdev.c
@@ -113,21 +113,58 @@ static void idxd_vdcm_release_work(struct work_struct *work)
__idxd_vdcm_release(vidxd);
}
+static struct idxd_wq *find_any_dwq(struct idxd_device *idxd)
+{
+ int i;
+ struct idxd_wq *wq;
+ unsigned long flags;
+
+ spin_lock_irqsave(&idxd->dev_lock, flags);
+ for (i = 0; i < idxd->max_wqs; i++) {
+ wq = &idxd->wqs[i];
+
+ if (wq->state != IDXD_WQ_ENABLED)
+ continue;
+
+ if (!wq_dedicated(wq))
+ continue;
+
+ if (idxd_wq_refcount(wq) != 0)
+ continue;
+
+ spin_unlock_irqrestore(&idxd->dev_lock, flags);
+ mutex_lock(&wq->wq_lock);
+ if (idxd_wq_refcount(wq)) {
+ mutex_unlock(&wq->wq_lock);
+ spin_lock_irqsave(&idxd->dev_lock, flags);
+ continue;
+ }
+
+ idxd_wq_get(wq);
+ mutex_unlock(&wq->wq_lock);
+ return wq;
+ }
+
+ spin_unlock_irqrestore(&idxd->dev_lock, flags);
+ return NULL;
+}
+
static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev_device *mdev,
struct vdcm_idxd_type *type)
{
struct vdcm_idxd *vidxd;
struct idxd_wq *wq = NULL;
- int i;
-
- /* PLACEHOLDER, wq matching comes later */
+ int i, rc;
+ if (type->type == IDXD_MDEV_TYPE_1_DWQ)
+ wq = find_any_dwq(idxd);
if (!wq)
return ERR_PTR(-ENODEV);
vidxd = kzalloc(sizeof(*vidxd), GFP_KERNEL);
- if (!vidxd)
- return ERR_PTR(-ENOMEM);
+ if (!vidxd) {
+ rc = -ENOMEM;
+ goto err;
+ }
mutex_init(&vidxd->dev_lock);
vidxd->idxd = idxd;
@@ -142,14 +179,23 @@ static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev
INIT_WORK(&vidxd->vdev.release_work, idxd_vdcm_release_work);
idxd_vdcm_init(vidxd);
- mutex_lock(&wq->wq_lock);
- idxd_wq_get(wq);
- mutex_unlock(&wq->wq_lock);
return vidxd;
+
+ err:
+ mutex_lock(&wq->wq_lock);
+ idxd_wq_put(wq);
+ mutex_unlock(&wq->wq_lock);
+ return ERR_PTR(rc);
}
-static struct vdcm_idxd_type idxd_mdev_types[IDXD_MDEV_TYPES];
+static struct vdcm_idxd_type idxd_mdev_types[IDXD_MDEV_TYPES] = {
+ {
+ .name = "1dwq",
+ .description = "IDXD MDEV with 1 dedicated workqueue",
+ .type = IDXD_MDEV_TYPE_1_DWQ,
+ },
+};
static struct vdcm_idxd_type *idxd_vdcm_find_vidxd_type(struct device *dev,
const char *name)
@@ -932,7 +978,85 @@ static long idxd_vdcm_ioctl(struct mdev_device *mdev, unsigned int cmd,
return rc;
}
+static ssize_t name_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+ struct vdcm_idxd_type *type;
+
+ type = idxd_vdcm_find_vidxd_type(dev, kobject_name(kobj));
+
+ if (type)
+ return sprintf(buf, "%s\n", type->description);
+
+ return -EINVAL;
+}
+static MDEV_TYPE_ATTR_RO(name);
+
+static int find_available_mdev_instances(struct idxd_device *idxd, struct vdcm_idxd_type *type)
+{
+ int count = 0, i;
+ unsigned long flags;
+
+ if (type->type != IDXD_MDEV_TYPE_1_DWQ)
+ return 0;
+
+ spin_lock_irqsave(&idxd->dev_lock, flags);
+ for (i = 0; i < idxd->max_wqs; i++) {
+ struct idxd_wq *wq;
+
+ wq = &idxd->wqs[i];
+ if (!is_idxd_wq_mdev(wq) || !wq_dedicated(wq) || idxd_wq_refcount(wq))
+ continue;
+
+ count++;
+ }
+ spin_unlock_irqrestore(&idxd->dev_lock, flags);
+
+ return count;
+}
+
+static ssize_t available_instances_show(struct kobject *kobj,
+ struct device *dev, char *buf)
+{
+ int count;
+ struct idxd_device *idxd = dev_get_drvdata(dev);
+ struct vdcm_idxd_type *type;
+
+ type = idxd_vdcm_find_vidxd_type(dev, kobject_name(kobj));
+ if (!type)
+ return -EINVAL;
+
+ count = find_available_mdev_instances(idxd, type);
+
+ return sprintf(buf, "%d\n", count);
+}
+static MDEV_TYPE_ATTR_RO(available_instances);
+
+static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
+ char *buf)
+{
+ return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
+}
+static MDEV_TYPE_ATTR_RO(device_api);
+
+static struct attribute *idxd_mdev_types_attrs[] = {
+ &mdev_type_attr_name.attr,
+ &mdev_type_attr_device_api.attr,
+ &mdev_type_attr_available_instances.attr,
+ NULL,
+};
+
+static struct attribute_group idxd_mdev_type_group0 = {
+ .name = "1dwq",
+ .attrs = idxd_mdev_types_attrs,
+};
+
+static struct attribute_group *idxd_mdev_type_groups[] = {
+ &idxd_mdev_type_group0,
+ NULL,
+};
+
static const struct mdev_parent_ops idxd_vdcm_ops = {
+ .supported_type_groups = idxd_mdev_type_groups,
.create = idxd_vdcm_create,
.remove = idxd_vdcm_remove,
.open = idxd_vdcm_open,
From: Jing Lin <[email protected]>
Add the sysfs attribute bits in ABI/stable for mediated device and guest
support.
Signed-off-by: Jing Lin <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
---
Documentation/ABI/stable/sysfs-driver-dma-idxd | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/Documentation/ABI/stable/sysfs-driver-dma-idxd b/Documentation/ABI/stable/sysfs-driver-dma-idxd
index a336d2b2009a..edbcf8a8796c 100644
--- a/Documentation/ABI/stable/sysfs-driver-dma-idxd
+++ b/Documentation/ABI/stable/sysfs-driver-dma-idxd
@@ -1,6 +1,6 @@
What: /sys/bus/dsa/devices/dsa<m>/version
-Date: Apr 15, 2020
-KernelVersion: 5.8.0
+Date: Jul 5, 2020
+KernelVersion: 5.9.0
Contact: [email protected]
Description: The hardware version number.
@@ -84,6 +84,12 @@ Contact: [email protected]
Description: To indicate if PASID (process address space identifier) is
enabled or not for this device.
+What: /sys/bus/dsa/devices/dsa<m>/ims_size
+Date: Jul 5, 2020
+KernelVersion: 5.9.0
+Contact: [email protected]
+Description: Number of entries in the interrupt message storage table.
+
What: /sys/bus/dsa/devices/dsa<m>/state
Date: Oct 25, 2019
KernelVersion: 5.6.0
@@ -147,8 +153,9 @@ Date: Oct 25, 2019
KernelVersion: 5.6.0
Contact: [email protected]
Description: The type of this work queue, it can be "kernel" type for work
- queue usages in the kernel space or "user" type for work queue
- usages by applications in user space.
+ queue usages in the kernel space, "user" type for work queue
+ usages by applications in user space, or "mdev" type for
+ VFIO mediated devices.
What: /sys/bus/dsa/devices/wq<m>.<n>/cdev_minor
Date: Oct 25, 2019
Add emulation routines for PCI config read/write, MMIO read/write, and an
interrupt handling routine for the emulated device. The rw routines are
called when PCI config or BAR0 MMIO reads/writes are issued by the guest
kernel through KVM/qemu.
Because the configuration is read-only, most of the MMIO emulation is a
simple memory copy, except for cases such as handling device commands and
interrupts.
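For reference, the get_reg_val() helper used throughout the emulation below
is not shown in this hunk; a plausible definition (an assumption for
readability, the actual helper lives elsewhere in the series and may differ)
is simply:

static inline u64 get_reg_val(void *buf, int size)
{
	u64 val = 0;

	/* Gather up to 8 register bytes into a u64 for logging/decoding */
	memcpy(&val, buf, size);
	return val;
}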
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
---
drivers/dma/idxd/registers.h | 4
drivers/dma/idxd/vdev.c | 428 +++++++++++++++++++++++++++++++++++++++++-
drivers/dma/idxd/vdev.h | 8 +
include/uapi/linux/idxd.h | 2
4 files changed, 434 insertions(+), 8 deletions(-)
diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index ace7248ee195..f8e4dd10a738 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -268,6 +268,10 @@ union msix_perm {
u32 bits;
} __packed;
+#define IDXD_MSIX_PERM_MASK 0xfffff00c
+#define IDXD_MSIX_PERM_IGNORE 0x3
+#define MSIX_ENTRY_MASK_INT 0x1
+
union group_flags {
struct {
u32 tc_a:3;
diff --git a/drivers/dma/idxd/vdev.c b/drivers/dma/idxd/vdev.c
index af421852cc51..b4eace02199e 100644
--- a/drivers/dma/idxd/vdev.c
+++ b/drivers/dma/idxd/vdev.c
@@ -25,8 +25,23 @@
int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx)
{
- /* PLACE HOLDER */
- return 0;
+ int rc = -1;
+ struct device *dev = &vidxd->idxd->pdev->dev;
+
+ dev_dbg(dev, "%s interrput %d\n", __func__, msix_idx);
+
+ if (!vidxd->vdev.msix_trigger[msix_idx]) {
+ dev_warn(dev, "%s: intr evtfd not found %d\n", __func__, msix_idx);
+ return -EINVAL;
+ }
+
+ rc = eventfd_signal(vidxd->vdev.msix_trigger[msix_idx], 1);
+ if (rc != 1)
+ dev_err(dev, "eventfd signal failed (%d)\n", rc);
+ else
+ dev_dbg(dev, "vidxd interrupt triggered wq(%d) %d\n", vidxd->wq->id, msix_idx);
+
+ return rc;
}
int vidxd_disable_host_ims_pasid(struct vdcm_idxd *vidxd, int ims_idx)
@@ -41,31 +56,423 @@ int vidxd_enable_host_ims_pasid(struct vdcm_idxd *vidxd, int ims_idx)
return 0;
}
-int vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
+static void vidxd_report_error(struct vdcm_idxd *vidxd, unsigned int error)
{
- /* PLACEHOLDER */
- return 0;
+ u8 *bar0 = vidxd->bar0;
+ union sw_err_reg *swerr = (union sw_err_reg *)(bar0 + IDXD_SWERR_OFFSET);
+ union genctrl_reg *genctrl;
+ bool send = false;
+
+ if (!swerr->valid) {
+ memset(swerr, 0, sizeof(*swerr));
+ swerr->valid = 1;
+ swerr->error = error;
+ send = true;
+ } else if (swerr->valid && !swerr->overflow) {
+ swerr->overflow = 1;
+ }
+
+ genctrl = (union genctrl_reg *)(bar0 + IDXD_GENCTRL_OFFSET);
+ if (send && genctrl->softerr_int_en) {
+ u32 *intcause = (u32 *)(bar0 + IDXD_INTCAUSE_OFFSET);
+
+ *intcause |= IDXD_INTC_ERR;
+ vidxd_send_interrupt(vidxd, 0);
+ }
}
int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
{
- /* PLACEHOLDER */
+ u32 offset = pos & (vidxd->bar_size[0] - 1);
+ u8 *bar0 = vidxd->bar0;
+ struct device *dev = mdev_dev(vidxd->vdev.mdev);
+
+ dev_dbg(dev, "vidxd mmio W %d %x %x: %llx\n", vidxd->wq->id, size,
+ offset, get_reg_val(buf, size));
+
+ if (((size & (size - 1)) != 0) || (offset & (size - 1)) != 0)
+ return -EINVAL;
+
+ /* If we don't limit this, we potentially can write out of bound */
+ if (size > 4)
+ return -EINVAL;
+
+ switch (offset) {
+ case IDXD_GENCFG_OFFSET ... IDXD_GENCFG_OFFSET + 3:
+ /* Write only when device is disabled. */
+ if (vidxd_state(vidxd) == IDXD_DEVICE_STATE_DISABLED)
+ memcpy(bar0 + offset, buf, size);
+ break;
+
+ case IDXD_GENCTRL_OFFSET:
+ memcpy(bar0 + offset, buf, size);
+ break;
+
+ case IDXD_INTCAUSE_OFFSET:
+ bar0[offset] &= ~(get_reg_val(buf, 1) & 0x1f);
+ break;
+
+ case IDXD_CMD_OFFSET: {
+ u32 *cmdsts = (u32 *)(bar0 + IDXD_CMDSTS_OFFSET);
+ u32 val = get_reg_val(buf, size);
+
+ if (size != 4)
+ return -EINVAL;
+
+ /* Check and set command in progress */
+ if (test_and_set_bit(31, (unsigned long *)cmdsts) == 0)
+ vidxd_do_command(vidxd, val);
+ else
+ vidxd_report_error(vidxd, DSA_ERR_CMD_REG);
+ break;
+ }
+
+ case IDXD_SWERR_OFFSET:
+ /* W1C */
+ bar0[offset] &= ~(get_reg_val(buf, 1) & 3);
+ break;
+
+ case VIDXD_WQCFG_OFFSET ... VIDXD_WQCFG_OFFSET + VIDXD_WQ_CTRL_SZ - 1:
+ case VIDXD_GRPCFG_OFFSET ... VIDXD_GRPCFG_OFFSET + VIDXD_GRP_CTRL_SZ - 1:
+ /* Nothing is written. Should be all RO */
+ break;
+
+ case VIDXD_MSIX_TABLE_OFFSET ... VIDXD_MSIX_TABLE_OFFSET + VIDXD_MSIX_TBL_SZ - 1: {
+ int index = (offset - VIDXD_MSIX_TABLE_OFFSET) / 0x10;
+ u8 *msix_entry = &bar0[VIDXD_MSIX_TABLE_OFFSET + index * 0x10];
+ u64 *pba = (u64 *)(bar0 + VIDXD_MSIX_PBA_OFFSET);
+ u8 cvec_byte;
+
+ cvec_byte = msix_entry[12];
+ memcpy(bar0 + offset, buf, size);
+ /* Handle clearing of UNMASK bit */
+ if (!(msix_entry[12] & MSIX_ENTRY_MASK_INT) && cvec_byte & MSIX_ENTRY_MASK_INT)
+ if (test_and_clear_bit(index, (unsigned long *)pba))
+ vidxd_send_interrupt(vidxd, index);
+ break;
+ }
+
+ case VIDXD_MSIX_PERM_OFFSET ... VIDXD_MSIX_PERM_OFFSET + VIDXD_MSIX_PERM_TBL_SZ - 1:
+ memcpy(bar0 + offset, buf, size);
+ break;
+ } /* offset */
+
+ return 0;
+}
+
+int vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
+{
+ u32 offset = pos & (vidxd->bar_size[0] - 1);
+ struct device *dev = mdev_dev(vidxd->vdev.mdev);
+
+ memcpy(buf, vidxd->bar0 + offset, size);
+
+ dev_dbg(dev, "vidxd mmio R %d %x %x: %llx\n",
+ vidxd->wq->id, size, offset, get_reg_val(buf, size));
return 0;
}
int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count)
{
- /* PLACEHOLDER */
+ u32 offset = pos & 0xfff;
+ struct device *dev = mdev_dev(vidxd->vdev.mdev);
+
+ memcpy(buf, &vidxd->cfg[offset], count);
+
+ dev_dbg(dev, "vidxd pci R %d %x %x: %llx\n",
+ vidxd->wq->id, count, offset, get_reg_val(buf, count));
+
+ return 0;
+}
+
+/*
+ * Much of the emulation code has been borrowed from Intel i915 cfg space
+ * emulation code.
+ * drivers/gpu/drm/i915/gvt/cfg_space.c:
+ */
+
+/*
+ * Bitmap for writable bits (RW or RW1C bits, but cannot co-exist in one
+ * byte) byte by byte in standard pci configuration space. (not the full
+ * 256 bytes.)
+ */
+static const u8 pci_cfg_space_rw_bmp[PCI_INTERRUPT_LINE + 4] = {
+ [PCI_COMMAND] = 0xff, 0x07,
+ [PCI_STATUS] = 0x00, 0xf9, /* the only one RW1C byte */
+ [PCI_CACHE_LINE_SIZE] = 0xff,
+ [PCI_BASE_ADDRESS_0 ... PCI_CARDBUS_CIS - 1] = 0xff,
+ [PCI_ROM_ADDRESS] = 0x01, 0xf8, 0xff, 0xff,
+ [PCI_INTERRUPT_LINE] = 0xff,
+};
+
+static void _pci_cfg_mem_write(struct vdcm_idxd *vidxd, unsigned int off, u8 *src,
+ unsigned int bytes)
+{
+ u8 *cfg_base = vidxd->cfg;
+ u8 mask, new, old;
+ int i = 0;
+
+ for (; i < bytes && (off + i < sizeof(pci_cfg_space_rw_bmp)); i++) {
+ mask = pci_cfg_space_rw_bmp[off + i];
+ old = cfg_base[off + i];
+ new = src[i] & mask;
+
+ /**
+ * The PCI_STATUS high byte has RW1C bits, here
+ * emulates clear by writing 1 for these bits.
+ * Writing a 0b to RW1C bits has no effect.
+ */
+ if (off + i == PCI_STATUS + 1)
+ new = (~new & old) & mask;
+
+ cfg_base[off + i] = (old & ~mask) | new;
+ }
+
+ /* For other configuration space directly copy as it is. */
+ if (i < bytes)
+ memcpy(cfg_base + off + i, src + i, bytes - i);
+}
+
+static inline void _write_pci_bar(struct vdcm_idxd *vidxd, u32 offset, u32 val, bool low)
+{
+ u32 *pval;
+
+ /* BAR offset should be 32-bit aligned */
+ offset = rounddown(offset, 4);
+ pval = (u32 *)(vidxd->cfg + offset);
+
+ if (low) {
+ /*
+ * only update bit 31 - bit 4,
+ * leave the bit 3 - bit 0 unchanged.
+ */
+ *pval = (val & GENMASK(31, 4)) | (*pval & GENMASK(3, 0));
+ } else {
+ *pval = val;
+ }
+}
+
+static int _pci_cfg_bar_write(struct vdcm_idxd *vidxd, unsigned int offset, void *p_data,
+ unsigned int bytes)
+{
+ u32 new = *(u32 *)(p_data);
+ bool lo = IS_ALIGNED(offset, 8);
+ u64 size;
+ unsigned int bar_id;
+
+ /*
+ * Power-up software can determine how much address
+ * space the device requires by writing a value of
+ * all 1's to the register and then reading the value
+ * back. The device will return 0's in all don't-care
+ * address bits.
+ */
+ if (new == 0xffffffff) {
+ switch (offset) {
+ case PCI_BASE_ADDRESS_0:
+ case PCI_BASE_ADDRESS_1:
+ case PCI_BASE_ADDRESS_2:
+ case PCI_BASE_ADDRESS_3:
+ bar_id = (offset - PCI_BASE_ADDRESS_0) / 8;
+ size = vidxd->bar_size[bar_id];
+ _write_pci_bar(vidxd, offset, size >> (lo ? 0 : 32), lo);
+ break;
+ default:
+ /* Unimplemented BARs */
+ _write_pci_bar(vidxd, offset, 0x0, false);
+ }
+ } else {
+ switch (offset) {
+ case PCI_BASE_ADDRESS_0:
+ case PCI_BASE_ADDRESS_1:
+ case PCI_BASE_ADDRESS_2:
+ case PCI_BASE_ADDRESS_3:
+ _write_pci_bar(vidxd, offset, new, lo);
+ break;
+ default:
+ break;
+ }
+ }
return 0;
}
int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int size)
{
- /* PLACEHOLDER */
+ struct device *dev = &vidxd->idxd->pdev->dev;
+
+ if (size > 4)
+ return -EINVAL;
+
+ if (pos + size > VIDXD_MAX_CFG_SPACE_SZ)
+ return -EINVAL;
+
+ dev_dbg(dev, "vidxd pci W %d %x %x: %llx\n", vidxd->wq->id, size, pos,
+ get_reg_val(buf, size));
+
+ /* First check if it's PCI_COMMAND */
+ if (IS_ALIGNED(pos, 2) && pos == PCI_COMMAND) {
+ bool new_bme;
+ bool bme;
+
+ if (size > 2)
+ return -EINVAL;
+
+ new_bme = !!(get_reg_val(buf, 2) & PCI_COMMAND_MASTER);
+ bme = !!(vidxd->cfg[pos] & PCI_COMMAND_MASTER);
+ _pci_cfg_mem_write(vidxd, pos, buf, size);
+
+ /* Flag error if turning off BME while device is enabled */
+ if ((bme && !new_bme) && vidxd_state(vidxd) == IDXD_DEVICE_STATE_ENABLED)
+ vidxd_report_error(vidxd, DSA_ERR_PCI_CFG);
+ return 0;
+ }
+
+ switch (rounddown(pos, 4)) {
+ case PCI_BASE_ADDRESS_0 ... PCI_BASE_ADDRESS_5:
+ if (!IS_ALIGNED(pos, 4))
+ return -EINVAL;
+ return _pci_cfg_bar_write(vidxd, pos, buf, size);
+
+ default:
+ _pci_cfg_mem_write(vidxd, pos, buf, size);
+ }
return 0;
}
+static void vidxd_mmio_init_grpcap(struct vdcm_idxd *vidxd)
+{
+ u8 *bar0 = vidxd->bar0;
+ union group_cap_reg *grp_cap = (union group_cap_reg *)(bar0 + IDXD_GRPCAP_OFFSET);
+
+ /* single group for current implementation */
+ grp_cap->token_en = 0;
+ grp_cap->token_limit = 0;
+ grp_cap->num_groups = 1;
+}
+
+static void vidxd_mmio_init_grpcfg(struct vdcm_idxd *vidxd)
+{
+ u8 *bar0 = vidxd->bar0;
+ struct grpcfg *grpcfg = (struct grpcfg *)(bar0 + VIDXD_GRPCFG_OFFSET);
+ struct idxd_wq *wq = vidxd->wq;
+ struct idxd_group *group = wq->group;
+ int i;
+
+ /*
+ * At this point, we are only exporting a single workqueue for
+ * each mdev. So we need to just fake it as first workqueue
+ * and also mark the available engines in this group.
+ */
+
+ /* Set single workqueue and the first one */
+ grpcfg->wqs[0] = 0x1;
+ grpcfg->engines = 0;
+ for (i = 0; i < group->num_engines; i++)
+ grpcfg->engines |= BIT(i);
+ grpcfg->flags.bits = group->grpcfg.flags.bits;
+}
+
+static void vidxd_mmio_init_wqcap(struct vdcm_idxd *vidxd)
+{
+ u8 *bar0 = vidxd->bar0;
+ struct idxd_wq *wq = vidxd->wq;
+ union wq_cap_reg *wq_cap = (union wq_cap_reg *)(bar0 + IDXD_WQCAP_OFFSET);
+
+ wq_cap->occupancy_int = 0;
+ wq_cap->occupancy = 0;
+ wq_cap->priority = 0;
+ wq_cap->total_wq_size = wq->size;
+ wq_cap->num_wqs = VIDXD_MAX_WQS;
+ if (wq_dedicated(wq))
+ wq_cap->dedicated_mode = 1;
+}
+
+static void vidxd_mmio_init_wqcfg(struct vdcm_idxd *vidxd)
+{
+ struct idxd_device *idxd = vidxd->idxd;
+ struct idxd_wq *wq = vidxd->wq;
+ u8 *bar0 = vidxd->bar0;
+ union wqcfg *wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+
+ wqcfg->wq_size = wq->size;
+ wqcfg->wq_thresh = wq->threshold;
+
+ if (wq_dedicated(wq))
+ wqcfg->mode = 1;
+
+ if (idxd->hw.gen_cap.block_on_fault &&
+ test_bit(WQ_FLAG_BLOCK_ON_FAULT, &wq->flags))
+ wqcfg->bof = 1;
+
+ wqcfg->priority = wq->priority;
+ wqcfg->max_xfer_shift = idxd->hw.gen_cap.max_xfer_shift;
+ wqcfg->max_batch_shift = idxd->hw.gen_cap.max_batch_shift;
+ /* make mode change read-only */
+ wqcfg->mode_support = 0;
+}
+
+static void vidxd_mmio_init_engcap(struct vdcm_idxd *vidxd)
+{
+ u8 *bar0 = vidxd->bar0;
+ union engine_cap_reg *engcap = (union engine_cap_reg *)(bar0 + IDXD_ENGCAP_OFFSET);
+ struct idxd_wq *wq = vidxd->wq;
+ struct idxd_group *group = wq->group;
+
+ engcap->num_engines = group->num_engines;
+}
+
+static void vidxd_mmio_init_gencap(struct vdcm_idxd *vidxd)
+{
+ struct idxd_device *idxd = vidxd->idxd;
+ u8 *bar0 = vidxd->bar0;
+ union gen_cap_reg *gencap = (union gen_cap_reg *)(bar0 + IDXD_GENCAP_OFFSET);
+
+ gencap->bits = idxd->hw.gen_cap.bits;
+ gencap->config_en = 0;
+ gencap->max_ims_mult = 0;
+ gencap->cmd_cap = 1;
+}
+
+static void vidxd_mmio_init_cmdcap(struct vdcm_idxd *vidxd)
+{
+ struct idxd_device *idxd = vidxd->idxd;
+ u8 *bar0 = vidxd->bar0;
+ u32 *cmdcap = (u32 *)(bar0 + IDXD_CMDCAP_OFFSET);
+
+ if (idxd->hw.cmd_cap)
+ *cmdcap = idxd->hw.cmd_cap;
+ else
+ *cmdcap = 0x1ffe;
+
+ *cmdcap |= BIT(IDXD_CMD_REQUEST_INT_HANDLE);
+}
+
void vidxd_mmio_init(struct vdcm_idxd *vidxd)
+{
+ struct idxd_device *idxd = vidxd->idxd;
+ u8 *bar0 = vidxd->bar0;
+ union offsets_reg *offsets;
+
+ /* Copy up to where table offset is */
+ memcpy_fromio(vidxd->bar0, idxd->reg_base, IDXD_TABLE_OFFSET);
+
+ vidxd_mmio_init_gencap(vidxd);
+ vidxd_mmio_init_cmdcap(vidxd);
+ vidxd_mmio_init_wqcap(vidxd);
+ vidxd_mmio_init_wqcfg(vidxd);
+ vidxd_mmio_init_grpcap(vidxd);
+ vidxd_mmio_init_grpcfg(vidxd);
+ vidxd_mmio_init_engcap(vidxd);
+
+ offsets = (union offsets_reg *)(bar0 + IDXD_TABLE_OFFSET);
+ offsets->grpcfg = VIDXD_GRPCFG_OFFSET / 0x100;
+ offsets->wqcfg = VIDXD_WQCFG_OFFSET / 0x100;
+ offsets->msix_perm = VIDXD_MSIX_PERM_OFFSET / 0x100;
+
+ memset(bar0 + VIDXD_MSIX_PERM_OFFSET, 0, VIDXD_MSIX_PERM_TBL_SZ);
+}
+
+static void idxd_complete_command(struct vdcm_idxd *vidxd, enum idxd_cmdsts_err val)
{
/* PLACEHOLDER */
}
@@ -74,3 +481,8 @@ void vidxd_reset(struct vdcm_idxd *vidxd)
{
/* PLACEHOLDER */
}
+
+void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val)
+{
+ /* PLACEHOLDER */
+}
diff --git a/drivers/dma/idxd/vdev.h b/drivers/dma/idxd/vdev.h
index 1a2fdda271e8..2dc8d22d3ea7 100644
--- a/drivers/dma/idxd/vdev.h
+++ b/drivers/dma/idxd/vdev.h
@@ -6,6 +6,13 @@
#include "mdev.h"
+static inline u8 vidxd_state(struct vdcm_idxd *vidxd)
+{
+ union gensts_reg *gensts = (union gensts_reg *)(vidxd->bar0 + IDXD_GENSTATS_OFFSET);
+
+ return gensts->state;
+}
+
int vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size);
int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size);
int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count);
@@ -15,5 +22,6 @@ void vidxd_reset(struct vdcm_idxd *vidxd);
int vidxd_disable_host_ims_pasid(struct vdcm_idxd *vidxd, int ims_idx);
int vidxd_enable_host_ims_pasid(struct vdcm_idxd *vidxd, int ims_idx);
int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx);
+void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val);
#endif
diff --git a/include/uapi/linux/idxd.h b/include/uapi/linux/idxd.h
index fdcdfe414223..a0c0475a4626 100644
--- a/include/uapi/linux/idxd.h
+++ b/include/uapi/linux/idxd.h
@@ -78,6 +78,8 @@ enum dsa_completion_status {
DSA_COMP_HW_ERR1,
DSA_COMP_HW_ERR_DRB,
DSA_COMP_TRANSLATION_FAIL,
+ DSA_ERR_PCI_CFG = 0x51,
+ DSA_ERR_CMD_REG,
};
#define DSA_COMP_STATUS_MASK 0x7f
On Tue, Jul 21, 2020 at 09:02:28AM -0700, Dave Jiang wrote:
> From: Megha Dey <[email protected]>
>
> Add support for the creation of a new DEV_MSI irq domain. It creates a
> new irq chip associated with the DEV_MSI domain and adds the necessary
> domain operations to it.
>
> Add a new config option DEV_MSI which must be enabled by any
> driver that wants to support device-specific message-signaled-interrupts
> outside of PCI-MSI(-X).
>
> Lastly, add device specific mask/unmask callbacks in addition to a write
> function to the platform_msi_ops.
>
> Reviewed-by: Dan Williams <[email protected]>
> Signed-off-by: Megha Dey <[email protected]>
> Signed-off-by: Dave Jiang <[email protected]>
> arch/x86/include/asm/hw_irq.h | 5 ++
> drivers/base/Kconfig | 7 +++
> drivers/base/Makefile | 1
> drivers/base/dev-msi.c | 95 +++++++++++++++++++++++++++++++++++++++++
> drivers/base/platform-msi.c | 45 +++++++++++++------
> drivers/base/platform-msi.h | 23 ++++++++++
> include/linux/msi.h | 8 +++
> 7 files changed, 168 insertions(+), 16 deletions(-)
> create mode 100644 drivers/base/dev-msi.c
> create mode 100644 drivers/base/platform-msi.h
>
> diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
> index 74c12437401e..8ecd7570589d 100644
> +++ b/arch/x86/include/asm/hw_irq.h
> @@ -61,6 +61,11 @@ struct irq_alloc_info {
> irq_hw_number_t msi_hwirq;
> };
> #endif
> +#ifdef CONFIG_DEV_MSI
> + struct {
> + irq_hw_number_t hwirq;
> + };
> +#endif
Why is this in this patch? I didn't see an obvious place where it is
used?
>
> +static void __platform_msi_desc_mask_unmask_irq(struct msi_desc *desc, u32 mask)
> +{
> + const struct platform_msi_ops *ops;
> +
> + ops = desc->platform.msi_priv_data->ops;
> + if (!ops)
> + return;
> +
> + if (mask) {
> + if (ops->irq_mask)
> + ops->irq_mask(desc);
> + } else {
> + if (ops->irq_unmask)
> + ops->irq_unmask(desc);
> + }
> +}
> +
> +void platform_msi_mask_irq(struct irq_data *data)
> +{
> + __platform_msi_desc_mask_unmask_irq(irq_data_get_msi_desc(data), 1);
> +}
> +
> +void platform_msi_unmask_irq(struct irq_data *data)
> +{
> + __platform_msi_desc_mask_unmask_irq(irq_data_get_msi_desc(data), 0);
> +}
This is a bit convoluted, just call the op directly:
void platform_msi_unmask_irq(struct irq_data *data)
{
const struct platform_msi_ops *ops = desc->platform.msi_priv_data->ops;
if (ops->irq_unmask)
ops->irq_unmask(desc);
}
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index 7f6a8eb51aca..1da97f905720 100644
> +++ b/include/linux/msi.h
> @@ -323,9 +323,13 @@ enum {
>
> /*
> * platform_msi_ops - Callbacks for platform MSI ops
> + * @irq_mask: mask an interrupt source
> + * @irq_unmask: unmask an interrupt source
> * @write_msg: write message content
> */
> struct platform_msi_ops {
> + unsigned int (*irq_mask)(struct msi_desc *desc);
> + unsigned int (*irq_unmask)(struct msi_desc *desc);
Why do these functions return things if the only call site throws it
away?
Jason
On Tue, Jul 21, 2020 at 09:02:35AM -0700, Dave Jiang wrote:
> From: Megha Dey <[email protected]>
>
> When DEV_MSI is enabled, the dev_msi_default_domain is updated to the
> base DEV-MSI irq domain. If interrupt remapping is enabled, we create
> a new IR-DEV-MSI irq domain and update the dev_msi_default domain to
> the same.
>
> For X86, introduce a new irq_alloc_type which will be used by the
> interrupt remapping driver.
Why? Shouldn't this be symmetrical with normal MSI? Does MSI do this?
I would have thought you'd want to switch to this remapping mode as
part of vfio or something like current cases.
> +struct irq_domain *create_remap_dev_msi_irq_domain(struct irq_domain *parent,
> + const char *name)
> +{
> + struct fwnode_handle *fn;
> + struct irq_domain *domain;
> +
> + fn = irq_domain_alloc_named_fwnode(name);
> + if (!fn)
> + return NULL;
> +
> + domain = msi_create_irq_domain(fn, &dev_msi_ir_domain_info, parent);
> + if (!domain) {
> + pr_warn("failed to initialize irqdomain for IR-DEV-MSI.\n");
> + return ERR_PTR(-ENXIO);
> + }
> +
> + irq_domain_update_bus_token(domain, DOMAIN_BUS_PLATFORM_MSI);
> +
> + if (!dev_msi_default_domain)
> + dev_msi_default_domain = domain;
> +
> + return domain;
> +}
What about this code creates a "remap"? i.e., why is the function called
"create_remap"?
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index 1da97f905720..7098ba566bcd 100644
> +++ b/include/linux/msi.h
> @@ -378,6 +378,9 @@ void *platform_msi_get_host_data(struct irq_domain *domain);
> void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg);
> void platform_msi_unmask_irq(struct irq_data *data);
> void platform_msi_mask_irq(struct irq_data *data);
> +
> +int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
> + int nvec, msi_alloc_info_t *arg);
I wonder if this should use the popular #ifdef dev_msi_prepare scheme
instead of a weak symbol?
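i.e. something like the following sketch (placement and names are only
illustrative of the scheme, not taken from this series):
/* arch header provides the override and advertises it */
int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
                    int nvec, msi_alloc_info_t *arg);
#define dev_msi_prepare dev_msi_prepare
/* generic header falls back to a stub when no arch override exists */
#ifndef dev_msi_prepare
static inline int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
                                  int nvec, msi_alloc_info_t *arg)
{
        memset(arg, 0, sizeof(*arg));
        return 0;
}
#endif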
Jason
On Tue, Jul 21, 2020 at 09:02:41AM -0700, Dave Jiang wrote:
> From: Megha Dey <[email protected]>
>
> The dev-msi interrupts are to be allocated/freed only for custom devices,
> not standard PCI-MSIX devices.
>
> These interrupts are device-defined and they are distinct from the already
> existing msi interrupts:
> pci-msi: Standard PCI MSI/MSI-X setup format
> platform-msi: Platform custom, but device-driver opaque MSI setup/control
> arch-msi: fallback for devices not assigned to the generic PCI domain
> dev-msi: device defined IRQ domain for ancillary devices. For e.g. DSA
> portal devices use device specific IMS(Interrupt message store) interrupts.
>
> dev-msi interrupts are represented by their own device-type. That means
> dev->msi_list is never contended for different interrupt types. It
> will either be all PCI-MSI or all device-defined.
Not sure I follow this, where is the enforcement that only dev-msi or
normal MSI is being used at one time on a single struct device?
Jason
On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
> v2:
"RFC" to me means "I don't really think this is mergable, so I'm
throwing it out there." Which implies you know it needs more work
before others should review it as you are not comfortable with it :(
So, back-of-the-queue you go...
greg k-h
On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
> v2:
> IMS (now dev-msi):
> With recommendations from Jason/Thomas/Dan on making IMS more generic:
> Pass a non-pci generic device(struct device) for IMS management instead of mdev
> Remove all references to mdev and symbol_get/put
> Remove all references to IMS in common code and replace with dev-msi
> remove dynamic allocation of platform-msi interrupts: no groups,no new msi list or list helpers
> Create a generic dev-msi domain with and without interrupt remapping enabled.
> Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs apis
I didn't dig into the details of irq handling to really check this,
but the big picture of this is much more in line with what I would
expect for this kind of ability.
> Link to previous discussions with Jason:
> https://lore.kernel.org/lkml/[email protected]/
> The emulation part that can be moved to user space is very small due to the majority of the
> emulations being control bits and need to reside in the kernel. We can revisit the necessity of
> moving the small emulation part to userspace and required architectural changes at a later time.
The point here is that you already have a user space interface for
these queues that already has kernel support to twiddle the control
bits. Generally I'd expect extending that existing kernel code to do
the small bit more needed for mapping the queue through to PCI
emulation to be smaller than the 2kloc of new code here to put all the
emulation and support framework in the kernel, and exposes a lower
attack surface of kernel code to the guest.
> The kernel can specify the requirements for these callback functions
> (e.g., the driver is not expected to block, or not expected to take
> a lock in the callback function).
I didn't notice any of this in the patch series? What is the calling
context for the platform_msi_ops ? I think I already mentioned that
ideally we'd need blocking/sleeping. The big selling point is that IMS
allows this data to move off-chip, which means accessing it is no
longer just an atomic write to some on-chip memory.
These details should be documented in the comment on top of
platform_msi_ops
I'm actually a little confused how idxd_ims_irq_mask() manages this -
I thought IRQ masking should be synchronous, shouldn't there at least be a
flushing read to ensure that new MSI's are stopped and any in flight
are flushed to the APIC?
Jason
On 7/21/2020 9:28 AM, Greg KH wrote:
> On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
>> v2:
>
> "RFC" to me means "I don't really think this is mergable, so I'm
> throwing it out there." Which implies you know it needs more work
> before others should review it as you are not comfortable with it :(
>
> So, back-of-the-queue you go...
>
> greg k-h
>
Hi Greg! Yes this absolutely needs more work! I think it's in pretty good shape,
but it has reached the point where it needs the talented eyes of reviewers from
outside of Intel. I was really hoping to get feedback from folks like Jason
(Thanks Jason!!) and KVM and VFIO experts like Alex, Paolo, Eric, and Kirti.
I can understand that you are quite busy and cannot necessarily provide a
detailed review at this phase. Would you prefer to be cc'd on code at this phase
in the future? Or should we reserve putting you on the cc for times when we
know it's ready for merge?
On 7/21/2020 9:45 AM, Jason Gunthorpe wrote:
> On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
>> v2:
>> IMS (now dev-msi):
>> With recommendations from Jason/Thomas/Dan on making IMS more generic:
>> Pass a non-pci generic device(struct device) for IMS management instead of mdev
>> Remove all references to mdev and symbol_get/put
>> Remove all references to IMS in common code and replace with dev-msi
>> remove dynamic allocation of platform-msi interrupts: no groups,no new msi list or list helpers
>> Create a generic dev-msi domain with and without interrupt remapping enabled.
>> Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs apis
>
> I didn't dig into the details of irq handling to really check this,
> but the big picture of this is much more in line with what I would
> expect for this kind of ability.
>
>> Link to previous discussions with Jason:
>> https://lore.kernel.org/lkml/[email protected]/
>> The emulation part that can be moved to user space is very small due to the majority of the
>> emulations being control bits and need to reside in the kernel. We can revisit the necessity of
>> moving the small emulation part to userspace and required architectural changes at a later time.
>
> The point here is that you already have a user space interface for
> these queues that already has kernel support to twiddle the control
> bits. Generally I'd expect extending that existing kernel code to do
> the small bit more needed for mapping the queue through to PCI
> emulation to be smaller than the 2kloc of new code here to put all the
> emulation and support framework in the kernel, and exposes a lower
> attack surface of kernel code to the guest.
>
>> The kernel can specify the requirements for these callback functions
>> (e.g., the driver is not expected to block, or not expected to take
>> a lock in the callback function).
>
> I didn't notice any of this in the patch series? What is the calling
> context for the platform_msi_ops ? I think I already mentioned that
> ideally we'd need blocking/sleeping. The big selling point is that IMS
> allows this data to move off-chip, which means accessing it is no
> longer just an atomic write to some on-chip memory.
>
> These details should be documented in the comment on top of
> platform_msi_ops
>
> I'm actually a little confused how idxd_ims_irq_mask() manages this -
> I thought IRQ masking should be synchronous, shouldn't there at least be a
> flushing read to ensure that new MSI's are stopped and any in flight
> are flushed to the APIC?
You are right Jason. It's missing a flushing read.
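Something along these lines should do it (just a sketch; the IMS slot
accessor and the mask bit name below are made up, the real idxd register
layout will differ):
static unsigned int idxd_ims_irq_mask(struct msi_desc *desc)
{
        void __iomem *ctrl = idxd_ims_slot_ctrl(desc);  /* hypothetical helper */
        /* set the (hypothetical) per-vector mask bit ... */
        iowrite32(ioread32(ctrl) | IMS_CTRL_VECTOR_MASKBIT, ctrl);
        /* ... and read it back so the posted write is flushed and no new
         * messages can be generated once this returns */
        ioread32(ctrl);
        return 0;
}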
>
> Jason
>
On Tue, Jul 21, 2020 at 9:29 AM Greg KH <[email protected]> wrote:
>
> On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
> > v2:
>
> "RFC" to me means "I don't really think this is mergable, so I'm
> throwing it out there." Which implies you know it needs more work
> before others should review it as you are not comfortable with it :(
There's a full-blown Reviewed-by from me on the irq changes. The VFIO /
mdev changes looked ok to me, but I did not feel comfortable with / did not
have time for signing off on them. At the same time I did not see much to
be gained by keeping those internal. So "RFC" in this case is a bit
modest. It's more that an internal reviewer said this looks like it is going
in the right direction, but wants more community discussion on the
approach.
> So, back-of-the-queue you go...
Let's consider this not RFC in that context. The drivers/base/ pieces
have my review for you, the rest are dmaengine and vfio subsystem
concerns that could use some commentary.
> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, July 22, 2020 12:45 AM
>
> > Link to previous discussions with Jason:
> > https://lore.kernel.org/lkml/57296ad1-20fe-caf2-b83f-
> [email protected]/
> > The emulation part that can be moved to user space is very small due to
> the majority of the
> > emulations being control bits and need to reside in the kernel. We can
> revisit the necessity of
> > moving the small emulation part to userspace and required architectural
> changes at a later time.
>
> The point here is that you already have a user space interface for
> these queues that already has kernel support to twiddle the control
> bits. Generally I'd expect extending that existing kernel code to do
> the small bit more needed for mapping the queue through to PCI
> emulation to be smaller than the 2kloc of new code here to put all the
> emulation and support framework in the kernel, and exposes a lower
> attack surface of kernel code to the guest.
>
We discussed in v1 why extending the user space interface is not a
strong motivation at the current stage:
https://lore.kernel.org/lkml/[email protected]/
In a nutshell, applications don't require raw WQ controllability as guest
kernel drivers may expect. Extending DSA user space interface to be another
passthrough interface just for virtualization needs is less compelling than
leveraging established VFIO/mdev framework (with the major merit that
existing user space VMMs just work w/o any change as long as they already
support VFIO uAPI).
And in this version we split the previous 2kloc mdev patch into three parts:
[09] mdev framework and callbacks; [10] mmio/pci_cfg emulation; and
[11] handling of control commands. Only patch [10] is purely about
emulation (~500 LOC), while the other two parts are tightly coupled to
physical resource management.
In the last review you said that you didn't hard-NAK this approach and would
like to hear opinions from virtualization folks. In this version we CCed the KVM
mailing list, Paolo (VFIO/Qemu), Alex (VFIO), Samuel (Rust-VMM/Cloud
hypervisor), etc. Let's see how they feel about this approach.
Thanks
Kevin
Hi Jason,
On 7/21/2020 9:13 AM, Jason Gunthorpe wrote:
> On Tue, Jul 21, 2020 at 09:02:28AM -0700, Dave Jiang wrote:
>> From: Megha Dey <[email protected]>
>>
>> Add support for the creation of a new DEV_MSI irq domain. It creates a
>> new irq chip associated with the DEV_MSI domain and adds the necessary
>> domain operations to it.
>>
>> Add a new config option DEV_MSI which must be enabled by any
>> driver that wants to support device-specific message-signaled-interrupts
>> outside of PCI-MSI(-X).
>>
>> Lastly, add device specific mask/unmask callbacks in addition to a write
>> function to the platform_msi_ops.
>>
>> Reviewed-by: Dan Williams <[email protected]>
>> Signed-off-by: Megha Dey <[email protected]>
>> Signed-off-by: Dave Jiang <[email protected]>
>> arch/x86/include/asm/hw_irq.h | 5 ++
>> drivers/base/Kconfig | 7 +++
>> drivers/base/Makefile | 1
>> drivers/base/dev-msi.c | 95 +++++++++++++++++++++++++++++++++++++++++
>> drivers/base/platform-msi.c | 45 +++++++++++++------
>> drivers/base/platform-msi.h | 23 ++++++++++
>> include/linux/msi.h | 8 +++
>> 7 files changed, 168 insertions(+), 16 deletions(-)
>> create mode 100644 drivers/base/dev-msi.c
>> create mode 100644 drivers/base/platform-msi.h
>>
>> diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
>> index 74c12437401e..8ecd7570589d 100644
>> +++ b/arch/x86/include/asm/hw_irq.h
>> @@ -61,6 +61,11 @@ struct irq_alloc_info {
>> irq_hw_number_t msi_hwirq;
>> };
>> #endif
>> +#ifdef CONFIG_DEV_MSI
>> + struct {
>> + irq_hw_number_t hwirq;
>> + };
>> +#endif
>
> Why is this in this patch? I didn't see an obvious place where it is
> used?
Since I have introduced the DEV-MSI domain and related ops, this is
required by dev_msi_get_hwirq and dev_msi_set_desc in this patch.
>>
>> +static void __platform_msi_desc_mask_unmask_irq(struct msi_desc *desc, u32 mask)
>> +{
>> + const struct platform_msi_ops *ops;
>> +
>> + ops = desc->platform.msi_priv_data->ops;
>> + if (!ops)
>> + return;
>> +
>> + if (mask) {
>> + if (ops->irq_mask)
>> + ops->irq_mask(desc);
>> + } else {
>> + if (ops->irq_unmask)
>> + ops->irq_unmask(desc);
>> + }
>> +}
>> +
>> +void platform_msi_mask_irq(struct irq_data *data)
>> +{
>> + __platform_msi_desc_mask_unmask_irq(irq_data_get_msi_desc(data), 1);
>> +}
>> +
>> +void platform_msi_unmask_irq(struct irq_data *data)
>> +{
>> + __platform_msi_desc_mask_unmask_irq(irq_data_get_msi_desc(data), 0);
>> +}
>
> This is a bit convoluted, just call the op directly:
>
> void platform_msi_unmask_irq(struct irq_data *data)
> {
> const struct platform_msi_ops *ops = desc->platform.msi_priv_data->ops;
>
> if (ops->irq_unmask)
> ops->irq_unmask(desc);
> }
>
Sure, I will update this.
>> diff --git a/include/linux/msi.h b/include/linux/msi.h
>> index 7f6a8eb51aca..1da97f905720 100644
>> +++ b/include/linux/msi.h
>> @@ -323,9 +323,13 @@ enum {
>>
>> /*
>> * platform_msi_ops - Callbacks for platform MSI ops
>> + * @irq_mask: mask an interrupt source
>> + * @irq_unmask: unmask an interrupt source
>> * @write_msg: write message content
>> */
>> struct platform_msi_ops {
>> + unsigned int (*irq_mask)(struct msi_desc *desc);
>> + unsigned int (*irq_unmask)(struct msi_desc *desc);
>
> Why do these functions return things if the only call site throws it
> away?
Hmmm, fair enough, I will change it to void.
>
> Jason
>
Hi Dan,
On 7/21/2020 9:21 AM, Jason Gunthorpe wrote:
> On Tue, Jul 21, 2020 at 09:02:35AM -0700, Dave Jiang wrote:
>> From: Megha Dey <[email protected]>
>>
>> When DEV_MSI is enabled, the dev_msi_default_domain is updated to the
>> base DEV-MSI irq domain. If interrupt remapping is enabled, we create
>> a new IR-DEV-MSI irq domain and update the dev_msi_default domain to
>> the same.
>>
>> For X86, introduce a new irq_alloc_type which will be used by the
>> interrupt remapping driver.
>
> Why? Shouldn't this by symmetrical with normal MSI? Does MSI do this?
Since I am introducing the new dev msi domain for the case when IR_REMAP
is turned on, I have introduced the new type in this patch.
MSI/MSIX have their own irq alloc types which are also only used by the
intel remapping driver..
>
> I would have thought you'd want to switch to this remapping mode as
> part of vfio or something like current cases.
Can you let me know what current case you are referring to?
>
>> +struct irq_domain *create_remap_dev_msi_irq_domain(struct irq_domain *parent,
>> + const char *name)
>> +{
>> + struct fwnode_handle *fn;
>> + struct irq_domain *domain;
>> +
>> + fn = irq_domain_alloc_named_fwnode(name);
>> + if (!fn)
>> + return NULL;
>> +
>> + domain = msi_create_irq_domain(fn, &dev_msi_ir_domain_info, parent);
>> + if (!domain) {
>> + pr_warn("failed to initialize irqdomain for IR-DEV-MSI.\n");
>> + return ERR_PTR(-ENXIO);
>> + }
>> +
>> + irq_domain_update_bus_token(domain, DOMAIN_BUS_PLATFORM_MSI);
>> +
>> + if (!dev_msi_default_domain)
>> + dev_msi_default_domain = domain;
>> +
>> + return domain;
>> +}
>
> What about this code creates a "remap" ? ie why is the function called
> "create_remap" ?
Well, this function creates a new domain for the case when IR_REMAP is
enabled, hence I called it create_remap...
>
>> diff --git a/include/linux/msi.h b/include/linux/msi.h
>> index 1da97f905720..7098ba566bcd 100644
>> +++ b/include/linux/msi.h
>> @@ -378,6 +378,9 @@ void *platform_msi_get_host_data(struct irq_domain *domain);
>> void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg);
>> void platform_msi_unmask_irq(struct irq_data *data);
>> void platform_msi_mask_irq(struct irq_data *data);
>> +
>> +int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
>> + int nvec, msi_alloc_info_t *arg);
>
> I wonder if this should use the popular #ifdef dev_msi_prepare scheme
> instead of a weak symbol?
Ok, I will look into the #ifdef option.
>
> Jason
>
On 7/21/2020 9:25 AM, Jason Gunthorpe wrote:
> On Tue, Jul 21, 2020 at 09:02:41AM -0700, Dave Jiang wrote:
>> From: Megha Dey <[email protected]>
>>
>> The dev-msi interrupts are to be allocated/freed only for custom devices,
>> not standard PCI-MSIX devices.
>>
>> These interrupts are device-defined and they are distinct from the already
>> existing msi interrupts:
>> pci-msi: Standard PCI MSI/MSI-X setup format
>> platform-msi: Platform custom, but device-driver opaque MSI setup/control
>> arch-msi: fallback for devices not assigned to the generic PCI domain
>> dev-msi: device defined IRQ domain for ancillary devices. For e.g. DSA
>> portal devices use device specific IMS(Interrupt message store) interrupts.
>>
>> dev-msi interrupts are represented by their own device-type. That means
>> dev->msi_list is never contended for different interrupt types. It
>> will either be all PCI-MSI or all device-defined.
>
> Not sure I follow this, where is the enforcement that only dev-msi or
> normal MSI is being used at one time on a single struct device?
>
So, in the dev_msi_alloc_irqs, I first check if the dev_is_pci..
If it is a pci device, it is forbidden to use dev-msi and must use the
pci subsystem calls. dev-msi is to be used for all other custom devices,
mdev or otherwise.
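Roughly (a sketch of the guard being described; the exact entry point and
signature in the series may differ):
int dev_msi_domain_alloc_irqs(struct device *dev, int nvec,
                              const struct platform_msi_ops *ops)
{
        /* dev-msi is only for non-PCI ancillary devices (mdev etc.);
         * PCI devices must keep using the PCI MSI/MSI-X interfaces */
        if (dev_is_pci(dev))
                return -EINVAL;
        /* ... continue with the normal dev-msi allocation ... */
        return 0;
}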
> Jason
>
On 7/21/2020 11:00 AM, Dave Jiang wrote:
>
>
> On 7/21/2020 9:45 AM, Jason Gunthorpe wrote:
>> On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
>>> v2:
>>> IMS (now dev-msi):
>>> With recommendations from Jason/Thomas/Dan on making IMS more generic:
>>> Pass a non-pci generic device(struct device) for IMS management
>>> instead of mdev
>>> Remove all references to mdev and symbol_get/put
>>> Remove all references to IMS in common code and replace with dev-msi
>>> remove dynamic allocation of platform-msi interrupts: no groups,no
>>> new msi list or list helpers
>>> Create a generic dev-msi domain with and without interrupt remapping
>>> enabled.
>>> Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs apis
>>
>> I didn't dig into the details of irq handling to really check this,
>> but the big picture of this is much more in line with what I would
>> expect for this kind of ability.
>>
>>> Link to previous discussions with Jason:
>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> The emulation part that can be moved to user space is very small due
>>> to the majority of the
>>> emulations being control bits and need to reside in the kernel. We
>>> can revisit the necessity of
>>> moving the small emulation part to userspace and required
>>> architectural changes at a later time.
>>
>> The point here is that you already have a user space interface for
>> these queues that already has kernel support to twiddle the control
>> bits. Generally I'd expect extending that existing kernel code to do
>> the small bit more needed for mapping the queue through to PCI
>> emulation to be smaller than the 2kloc of new code here to put all the
>> emulation and support framework in the kernel, and exposes a lower
>> attack surface of kernel code to the guest.
>>
>>> The kernel can specify the requirements for these callback functions
>>> (e.g., the driver is not expected to block, or not expected to take
>>> a lock in the callback function).
>>
>> I didn't notice any of this in the patch series? What is the calling
>> context for the platform_msi_ops ? I think I already mentioned that
>> ideally we'd need blocking/sleeping. The big selling point is that IMS
>> allows this data to move off-chip, which means accessing it is no
>> longer just an atomic write to some on-chip memory.
>>
>> These details should be documented in the comment on top of
>> platform_msi_ops
So the platform_msi_ops are called from the same context as the
existing msi_ops, for instance; we are not adding anything new. I think
the above comment is a little misleading, I will remove it next time around.
Also, I thought even the current write to on-chip memory is not atomic...
could you let me know which piece of code you are referring to?
Since the driver gets to write to the off-chip memory, shouldn't it be
the driver's responsibility to call it from a sleeping/blocking context?
>>
>> I'm actually a little confused how idxd_ims_irq_mask() manages this -
>> I thought IRQ masking should be synchronous, shouldn't there at least
>> be a
>> flushing read to ensure that new MSI's are stopped and any in flight
>> are flushed to the APIC?
>
> You are right Jason. It's missing a flushing read.
>
>>
>> Jason
>>
> .
On Wed, Jul 22, 2020 at 10:05:52AM -0700, Dey, Megha wrote:
>
>
> On 7/21/2020 9:25 AM, Jason Gunthorpe wrote:
> > On Tue, Jul 21, 2020 at 09:02:41AM -0700, Dave Jiang wrote:
> > > From: Megha Dey <[email protected]>
> > >
> > > The dev-msi interrupts are to be allocated/freed only for custom devices,
> > > not standard PCI-MSIX devices.
> > >
> > > These interrupts are device-defined and they are distinct from the already
> > > existing msi interrupts:
> > > pci-msi: Standard PCI MSI/MSI-X setup format
> > > platform-msi: Platform custom, but device-driver opaque MSI setup/control
> > > arch-msi: fallback for devices not assigned to the generic PCI domain
> > > dev-msi: device defined IRQ domain for ancillary devices. For e.g. DSA
> > > portal devices use device specific IMS(Interrupt message store) interrupts.
> > >
> > > dev-msi interrupts are represented by their own device-type. That means
> > > dev->msi_list is never contended for different interrupt types. It
> > > will either be all PCI-MSI or all device-defined.
> >
> > Not sure I follow this, where is the enforcement that only dev-msi or
> > normal MSI is being used at one time on a single struct device?
> >
>
> So, in the dev_msi_alloc_irqs, I first check if the dev_is_pci..
> If it is a pci device, it is forbidden to use dev-msi and must use the pci
> subsystem calls. dev-msi is to be used for all other custom devices, mdev or
> otherwise.
What prevents creating a dev-msi directly on the struct pci_device ?
Jason
On Wed, Jul 22, 2020 at 10:03:45AM -0700, Dey, Megha wrote:
> Hi Dan,
>
> On 7/21/2020 9:21 AM, Jason Gunthorpe wrote:
> > On Tue, Jul 21, 2020 at 09:02:35AM -0700, Dave Jiang wrote:
> > > From: Megha Dey <[email protected]>
> > >
> > > When DEV_MSI is enabled, the dev_msi_default_domain is updated to the
> > > base DEV-MSI irq domain. If interrupt remapping is enabled, we create
> > > a new IR-DEV-MSI irq domain and update the dev_msi_default domain to
> > > the same.
> > >
> > > For X86, introduce a new irq_alloc_type which will be used by the
> > > interrupt remapping driver.
> >
> > Why? Shouldn't this by symmetrical with normal MSI? Does MSI do this?
>
> Since I am introducing the new dev msi domain for the case when IR_REMAP is
> turned on, I have introduced the new type in this patch.
>
> MSI/MSIX have their own irq alloc types which are also only used by the
> intel remapping driver..
>
> >
> > I would have thought you'd want to switch to this remapping mode as
> > part of vfio or something like current cases.
>
> Can you let me know what current case you are referring to?
My mistake, I see Intel unconditionally globally enables IR, so this
seems consistent with Intel's MSI
> > > +struct irq_domain *create_remap_dev_msi_irq_domain(struct irq_domain *parent,
> > > + const char *name)
> > > +{
> > > + struct fwnode_handle *fn;
> > > + struct irq_domain *domain;
> > > +
> > > + fn = irq_domain_alloc_named_fwnode(name);
> > > + if (!fn)
> > > + return NULL;
> > > +
> > > + domain = msi_create_irq_domain(fn, &dev_msi_ir_domain_info, parent);
> > > + if (!domain) {
> > > + pr_warn("failed to initialize irqdomain for IR-DEV-MSI.\n");
> > > + return ERR_PTR(-ENXIO);
> > > + }
> > > +
> > > + irq_domain_update_bus_token(domain, DOMAIN_BUS_PLATFORM_MSI);
> > > +
> > > + if (!dev_msi_default_domain)
> > > + dev_msi_default_domain = domain;
> > > +
> > > + return domain;
> > > +}
> >
> > What about this code creates a "remap" ? ie why is the function called
> > "create_remap" ?
>
> Well, this function creates a new domain for the case when IR_REMAP is
> enabled, hence I called it create_remap...
This looks like it just creates a new domain - the thing that makes it
remapping is the caller putting it under the ir_domain - so this code
here in base shouldn't have the word 'remap' in it, this is just
creating a generic domain.
It also kinda looks like create_dev_msi_domain() can just call the
above directly instead of duplicating everything - eg why do we need
two identical dev_msi_ir_controller vs dev_msi_controller just to have
the irq_set_vcpu_affinity difference?
Jason
On Wed, Jul 22, 2020 at 10:31:28AM -0700, Dey, Megha wrote:
> > > I didn't notice any of this in the patch series? What is the calling
> > > context for the platform_msi_ops ? I think I already mentioned that
> > > ideally we'd need blocking/sleeping. The big selling point is that IMS
> > > allows this data to move off-chip, which means accessing it is no
> > > longer just an atomic write to some on-chip memory.
> > >
> > > These details should be documented in the comment on top of
> > > platform_msi_ops
>
> so the platform_msi_ops are called from the same context as the existing
> msi_ops for instance, we are not adding anything new. I think the above
> comment is a little misleading I will remove it next time around.
If it is true that all calls are under driver control then I think it
would be good to document that. I actually don't know off hand if
mask/unmask are restricted like that.
As this is an op a driver has to implement, vs the arch, it probably needs
a bit more hand-holding.
> Also, I thought even the current write to on-chip memory is not
> atomic..
The writel to the MSI-X table in MMIO memory is 'atomic'
Jason
On Tue, 21 Jul 2020 17:02:28 +0100,
Dave Jiang <[email protected]> wrote:
>
> From: Megha Dey <[email protected]>
>
> Add support for the creation of a new DEV_MSI irq domain. It creates a
> new irq chip associated with the DEV_MSI domain and adds the necessary
> domain operations to it.
>
> Add a new config option DEV_MSI which must be enabled by any
> driver that wants to support device-specific message-signaled-interrupts
> outside of PCI-MSI(-X).
Which is exactly what platform-MSI already does. Why do we need
something else?
>
> Lastly, add device specific mask/unmask callbacks in addition to a write
> function to the platform_msi_ops.
>
> Reviewed-by: Dan Williams <[email protected]>
> Signed-off-by: Megha Dey <[email protected]>
> Signed-off-by: Dave Jiang <[email protected]>
> ---
> arch/x86/include/asm/hw_irq.h | 5 ++
> drivers/base/Kconfig | 7 +++
> drivers/base/Makefile | 1
> drivers/base/dev-msi.c | 95 +++++++++++++++++++++++++++++++++++++++++
> drivers/base/platform-msi.c | 45 +++++++++++++------
> drivers/base/platform-msi.h | 23 ++++++++++
> include/linux/msi.h | 8 +++
> 7 files changed, 168 insertions(+), 16 deletions(-)
> create mode 100644 drivers/base/dev-msi.c
> create mode 100644 drivers/base/platform-msi.h
>
> diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
> index 74c12437401e..8ecd7570589d 100644
> --- a/arch/x86/include/asm/hw_irq.h
> +++ b/arch/x86/include/asm/hw_irq.h
> @@ -61,6 +61,11 @@ struct irq_alloc_info {
> irq_hw_number_t msi_hwirq;
> };
> #endif
> +#ifdef CONFIG_DEV_MSI
> + struct {
> + irq_hw_number_t hwirq;
> + };
> +#endif
> #ifdef CONFIG_X86_IO_APIC
> struct {
> int ioapic_id;
> diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> index 8d7001712062..f00901bac056 100644
> --- a/drivers/base/Kconfig
> +++ b/drivers/base/Kconfig
> @@ -210,4 +210,11 @@ config GENERIC_ARCH_TOPOLOGY
> appropriate scaling, sysfs interface for reading capacity values at
> runtime.
>
> +config DEV_MSI
> + bool "Device Specific Interrupt Messages"
> + select IRQ_DOMAIN_HIERARCHY
> + select GENERIC_MSI_IRQ_DOMAIN
> + help
> + Allow device drivers to generate device-specific interrupt messages
> + for devices independent of PCI MSI/-X.
> endmenu
> diff --git a/drivers/base/Makefile b/drivers/base/Makefile
> index 157452080f3d..ca1e4d39164e 100644
> --- a/drivers/base/Makefile
> +++ b/drivers/base/Makefile
> @@ -21,6 +21,7 @@ obj-$(CONFIG_REGMAP) += regmap/
> obj-$(CONFIG_SOC_BUS) += soc.o
> obj-$(CONFIG_PINCTRL) += pinctrl.o
> obj-$(CONFIG_DEV_COREDUMP) += devcoredump.o
> +obj-$(CONFIG_DEV_MSI) += dev-msi.o
> obj-$(CONFIG_GENERIC_MSI_IRQ_DOMAIN) += platform-msi.o
> obj-$(CONFIG_GENERIC_ARCH_TOPOLOGY) += arch_topology.o
>
> diff --git a/drivers/base/dev-msi.c b/drivers/base/dev-msi.c
> new file mode 100644
> index 000000000000..240ccc353933
> --- /dev/null
> +++ b/drivers/base/dev-msi.c
> @@ -0,0 +1,95 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright © 2020 Intel Corporation.
> + *
> + * Author: Megha Dey <[email protected]>
> + */
> +
> +#include <linux/irq.h>
> +#include <linux/irqdomain.h>
> +#include <linux/msi.h>
> +#include "platform-msi.h"
> +
> +struct irq_domain *dev_msi_default_domain;
> +
> +static irq_hw_number_t dev_msi_get_hwirq(struct msi_domain_info *info,
> + msi_alloc_info_t *arg)
> +{
> + return arg->hwirq;
> +}
> +
> +static irq_hw_number_t dev_msi_calc_hwirq(struct msi_desc *desc)
> +{
> + u32 devid;
> +
> + devid = desc->platform.msi_priv_data->devid;
> +
> + return (devid << (32 - DEV_ID_SHIFT)) | desc->platform.msi_index;
> +}
> +
> +static void dev_msi_set_desc(msi_alloc_info_t *arg, struct msi_desc *desc)
> +{
> + arg->hwirq = dev_msi_calc_hwirq(desc);
> +}
> +
> +static int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
> + int nvec, msi_alloc_info_t *arg)
> +{
> + memset(arg, 0, sizeof(*arg));
> +
> + return 0;
> +}
> +
> +static struct msi_domain_ops dev_msi_domain_ops = {
> + .get_hwirq = dev_msi_get_hwirq,
> + .set_desc = dev_msi_set_desc,
> + .msi_prepare = dev_msi_prepare,
> +};
> +
> +static struct irq_chip dev_msi_controller = {
> + .name = "DEV-MSI",
> + .irq_unmask = platform_msi_unmask_irq,
> + .irq_mask = platform_msi_mask_irq,
This seems pretty odd, see below.
> + .irq_write_msi_msg = platform_msi_write_msg,
> + .irq_ack = irq_chip_ack_parent,
> + .irq_retrigger = irq_chip_retrigger_hierarchy,
> + .flags = IRQCHIP_SKIP_SET_WAKE,
> +};
> +
> +static struct msi_domain_info dev_msi_domain_info = {
> + .flags = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
> + .ops = &dev_msi_domain_ops,
> + .chip = &dev_msi_controller,
> + .handler = handle_edge_irq,
> + .handler_name = "edge",
> +};
> +
> +static int __init create_dev_msi_domain(void)
> +{
> + struct irq_domain *parent = NULL;
> + struct fwnode_handle *fn;
> +
> + /*
> + * Modern code should never have to use irq_get_default_host. But since
> + * dev-msi is invisible to DT/ACPI, this is an exception case.
> + */
> + parent = irq_get_default_host();
Really? How is it going to work once you have devices sending their
MSIs to two different downstream blocks? This looks rather short-sighted.
> + if (!parent)
> + return -ENXIO;
> +
> + fn = irq_domain_alloc_named_fwnode("DEV_MSI");
> + if (!fn)
> + return -ENXIO;
> +
> + dev_msi_default_domain = msi_create_irq_domain(fn, &dev_msi_domain_info, parent);
> + if (!dev_msi_default_domain) {
> + pr_warn("failed to initialize irqdomain for DEV-MSI.\n");
> + return -ENXIO;
> + }
> +
> + irq_domain_update_bus_token(dev_msi_default_domain, DOMAIN_BUS_PLATFORM_MSI);
> + irq_domain_free_fwnode(fn);
> +
> + return 0;
> +}
> +device_initcall(create_dev_msi_domain);
> diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
> index 9d94cd699468..5e1f210d65ee 100644
> --- a/drivers/base/platform-msi.c
> +++ b/drivers/base/platform-msi.c
> @@ -12,21 +12,7 @@
> #include <linux/irqdomain.h>
> #include <linux/msi.h>
> #include <linux/slab.h>
> -
> -#define DEV_ID_SHIFT 21
> -#define MAX_DEV_MSIS (1 << (32 - DEV_ID_SHIFT))
> -
> -/*
> - * Internal data structure containing a (made up, but unique) devid
> - * and the platform-msi ops
> - */
> -struct platform_msi_priv_data {
> - struct device *dev;
> - void *host_data;
> - msi_alloc_info_t arg;
> - const struct platform_msi_ops *ops;
> - int devid;
> -};
> +#include "platform-msi.h"
>
> /* The devid allocator */
> static DEFINE_IDA(platform_msi_devid_ida);
> @@ -76,7 +62,7 @@ static void platform_msi_update_dom_ops(struct msi_domain_info *info)
> ops->set_desc = platform_msi_set_desc;
> }
>
> -static void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
> +void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
It really begs the question: Why are you inventing a whole new
"DEV-MSI" when this really is platform-MSI?
> {
> struct msi_desc *desc = irq_data_get_msi_desc(data);
> struct platform_msi_priv_data *priv_data;
> @@ -86,6 +72,33 @@ static void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
> priv_data->ops->write_msg(desc, msg);
> }
>
> +static void __platform_msi_desc_mask_unmask_irq(struct msi_desc *desc, u32 mask)
> +{
> + const struct platform_msi_ops *ops;
> +
> + ops = desc->platform.msi_priv_data->ops;
> + if (!ops)
> + return;
> +
> + if (mask) {
> + if (ops->irq_mask)
> + ops->irq_mask(desc);
> + } else {
> + if (ops->irq_unmask)
> + ops->irq_unmask(desc);
> + }
> +}
> +
> +void platform_msi_mask_irq(struct irq_data *data)
> +{
> + __platform_msi_desc_mask_unmask_irq(irq_data_get_msi_desc(data), 1);
> +}
> +
> +void platform_msi_unmask_irq(struct irq_data *data)
> +{
> + __platform_msi_desc_mask_unmask_irq(irq_data_get_msi_desc(data), 0);
> +}
> +
I don't immediately get why you have this code at the platform MSI
level. Until now, we only had the programming of the message into the
end-point, which is a device-specific action (and the whole reason why
this silly platform MSI exists).
On the other hand, masking an interrupt is an irqchip operation, and
only concerns the irqchip level. Here, you seem to be making it an
end-point operation, which doesn't really make sense to me. Or is this
device its own interrupt controller as well? That would be extremely
surprising, and I'd expect some block downstream of the device to be
able to control the masking of the interrupt.
> static void platform_msi_update_chip_ops(struct msi_domain_info *info)
> {
> struct irq_chip *chip = info->chip;
> diff --git a/drivers/base/platform-msi.h b/drivers/base/platform-msi.h
> new file mode 100644
> index 000000000000..1de8c2874218
> --- /dev/null
> +++ b/drivers/base/platform-msi.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright © 2020 Intel Corporation.
> + *
> + * Author: Megha Dey <[email protected]>
> + */
Or not. You are merely moving existing code, not authoring it. Either
keep the original copyright attribution, or drop this mention
altogether.
> +
> +#include <linux/msi.h>
> +
> +#define DEV_ID_SHIFT 21
> +#define MAX_DEV_MSIS (1 << (32 - DEV_ID_SHIFT))
> +
> +/*
> + * Data structure containing a (made up, but unique) devid
> + * and the platform-msi ops.
> + */
> +struct platform_msi_priv_data {
> + struct device *dev;
> + void *host_data;
> + msi_alloc_info_t arg;
> + const struct platform_msi_ops *ops;
> + int devid;
> +};
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index 7f6a8eb51aca..1da97f905720 100644
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -323,9 +323,13 @@ enum {
>
> /*
> * platform_msi_ops - Callbacks for platform MSI ops
> + * @irq_mask: mask an interrupt source
> + * @irq_unmask: unmask an interrupt source
> * @write_msg: write message content
> */
> struct platform_msi_ops {
> + unsigned int (*irq_mask)(struct msi_desc *desc);
> + unsigned int (*irq_unmask)(struct msi_desc *desc);
> irq_write_msi_msg_t write_msg;
> };
>
> @@ -370,6 +374,10 @@ int platform_msi_domain_alloc(struct irq_domain *domain, unsigned int virq,
> void platform_msi_domain_free(struct irq_domain *domain, unsigned int virq,
> unsigned int nvec);
> void *platform_msi_get_host_data(struct irq_domain *domain);
> +
> +void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg);
> +void platform_msi_unmask_irq(struct irq_data *data);
> +void platform_msi_mask_irq(struct irq_data *data);
> #endif /* CONFIG_GENERIC_MSI_IRQ_DOMAIN */
>
> #ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
>
>
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On Wed, Jul 22, 2020 at 07:52:33PM +0100, Marc Zyngier wrote:
> Which is exactly what platform-MSI already does. Why do we need
> something else?
It looks to me like all the code is around managing the
dev->msi_domain of the devices.
The intended use would have PCI drivers create children devices using
mdev or virtbus and those devices wouldn't have a msi_domain from the
platform. Looks like platform_msi_alloc_priv_data() fails immediately
because dev->msi_domain will be NULL for these kinds of devices.
Maybe that issue should be handled directly instead of wrappering
platform_msi_*?
For instance, a trivial addition to the platform_msi API:
platform_msi_assign_domain(struct device *newly_created_virtual_device,
                           struct device *physical_device);
Which could set the msi_domain of the new device using the topology of
physical_device to deduce the correct domain?
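e.g. a minimal sketch (the helper itself is hypothetical, but
dev_get_msi_domain()/dev_set_msi_domain() already exist):
void platform_msi_assign_domain(struct device *virt_dev,
                                struct device *phys_dev)
{
        /* let the new child device resolve MSI allocations against the
         * same irqdomain topology as the physical device that spawned it */
        dev_set_msi_domain(virt_dev, dev_get_msi_domain(phys_dev));
}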
Then the question is how to properly create a domain within the
hardware topology of physical_device with the correct parameters for
the platform.
Why do we need a dummy msi_domain anyhow? Can this just use
physical_device->msi_domain directly? (I'm at my limit here of how
much of this I remember, sorry)
If you solve that it should solve the remapping problem too, as the
physical_device is already assigned by the platform to a remapping irq
domain if that is what the platform wants.
>> + parent = irq_get_default_host();
> Really? How is it going to work once you have devices sending their
> MSIs to two different downstream blocks? This looks rather
> short-sighted.
.. and fix this too, the parent domain should be derived from the
topology of the physical_device which is originating the interrupt
messages.
> On the other hand, masking an interrupt is an irqchip operation, and
> only concerns the irqchip level. Here, you seem to be making it an
> end-point operation, which doesn't really make sense to me. Or is this
> device its own interrupt controller as well? That would be extremely
> surprising, and I'd expect some block downstream of the device to be
> able to control the masking of the interrupt.
These are message interrupts so they originate directly from the
device and generally travel directly to the CPU APIC. On the wire
there is no difference between a MSI, MSI-X and a device using the
dev-msi approach.
IIRC on Intel/AMD at least once a MSI is launched it is not maskable.
So the model for MSI is always "mask at source". The closest mapping
to the Linux IRQ model is to say the end device has a irqchip that
encapsulates the ability of the device to generate the MSI in the
first place.
It looks like existing platform_msi drivers deal with "masking"
implicitly by halting the device interrupt generation before releasing
the interrupt and have no way for the generic irqchip layer to mask
the interrupt.
I suppose the motivation to make it explicit is related to vfio using
the generic mask/unmask functionality?
Explicit seems better, IMHO.
Jason
Dave Jiang <[email protected]> writes:
> From: Megha Dey <[email protected]>
>
> When DEV_MSI is enabled, the dev_msi_default_domain is updated to the
> base DEV-MSI irq domain. If interrupt remapping is enabled, we create
s/we//
> a new IR-DEV-MSI irq domain and update the dev_msi_default domain to
> the same.
>
> For X86, introduce a new irq_alloc_type which will be used by the
> interrupt remapping driver.
>
> Reviewed-by: Dan Williams <[email protected]>
> Signed-off-by: Megha Dey <[email protected]>
> Signed-off-by: Dave Jiang <[email protected]>
> ---
> arch/x86/include/asm/hw_irq.h | 1 +
> arch/x86/kernel/apic/msi.c | 12 ++++++
> drivers/base/dev-msi.c | 66 +++++++++++++++++++++++++++++++----
> drivers/iommu/intel/irq_remapping.c | 11 +++++-
> include/linux/intel-iommu.h | 1 +
> include/linux/irqdomain.h | 11 ++++++
> include/linux/msi.h | 3 ++
Why is this mixing generic code, x86 core code and intel specific driver
code? This is new functionality so:
1) Provide the infrastructure
2) Add support to architecture specific parts
3) Enable it
> +
> +#ifdef CONFIG_DEV_MSI
> +int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
> + int nvec, msi_alloc_info_t *arg)
> +{
> + memset(arg, 0, sizeof(*arg));
> +
> + arg->type = X86_IRQ_ALLOC_TYPE_DEV_MSI;
> +
> + return 0;
> +}
> +#endif
What is this? Tons of new lines for taking up more space and not a
single comment.
> -static int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
> +int __weak dev_msi_prepare(struct irq_domain *domain, struct device *dev,
> int nvec, msi_alloc_info_t *arg)
> {
> memset(arg, 0, sizeof(*arg));
Oh well. So every architecture which needs to override this, and I assume
all which are eventually going to support it need to do the memset() in
their override.
memset(arg,,,);
arch_dev_msi_prepare();
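i.e. roughly (a sketch of that split; arch_dev_msi_prepare() is the
hypothetical arch hook, defaulting to a no-op when not overridden):
int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
                    int nvec, msi_alloc_info_t *arg)
{
        /* generic code owns the common initialization ... */
        memset(arg, 0, sizeof(*arg));
        /* ... and only the arch-specific part is delegated */
        return arch_dev_msi_prepare(domain, dev, nvec, arg);
}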
> - dev_msi_default_domain = msi_create_irq_domain(fn, &dev_msi_domain_info, parent);
> + /*
> + * This initcall may come after remap code is initialized. Ensure that
> + * dev_msi_default domain is updated correctly.
What? No, this is a disgusting hack. Get your ordering straight, that's
not rocket science.
> +#ifdef CONFIG_IRQ_REMAP
IRQ_REMAP is x86 specific. Is this file x86 only or intended to be for
general use? If it's x86 only, then this should be clearly
documented. If not, then these x86'isms have no place here.
> +struct irq_domain *create_remap_dev_msi_irq_domain(struct irq_domain *parent,
> + const char *name)
So we have msi_create_irq_domain() and this is about dev_msi, right? So
can you please stick with a consistent naming scheme?
> +{
> + struct fwnode_handle *fn;
> + struct irq_domain *domain;
> +
> + fn = irq_domain_alloc_named_fwnode(name);
> + if (!fn)
> + return NULL;
> +
> + domain = msi_create_irq_domain(fn, &dev_msi_ir_domain_info, parent);
> + if (!domain) {
> + pr_warn("failed to initialize irqdomain for IR-DEV-MSI.\n");
> + return ERR_PTR(-ENXIO);
> + }
> +
> + irq_domain_update_bus_token(domain, DOMAIN_BUS_PLATFORM_MSI);
> +
> + if (!dev_msi_default_domain)
> + dev_msi_default_domain = domain;
Can this be called several times? If so, then this lacks a comment. If
not, then this condition is useless.
Thanks,
tglx
On 2020-07-22 20:59, Jason Gunthorpe wrote:
> On Wed, Jul 22, 2020 at 07:52:33PM +0100, Marc Zyngier wrote:
>
>> Which is exactly what platform-MSI already does. Why do we need
>> something else?
>
> It looks to me like all the code is around managing the
> dev->msi_domain of the devices.
>
> The intended use would have PCI drivers create children devices using
> mdev or virtbus and those devices wouldn't have a msi_domain from the
> platform. Looks like platform_msi_alloc_priv_data() fails immediately
> because dev->msi_domain will be NULL for these kinds of devices.
>
> Maybe that issue should be handled directly instead of wrappering
> platform_msi_*?
>
> For instance a trivial addition to the platform_msi API:
>
> platform_msi_assign_domain(struct_device
> *newly_created_virtual_device,
> struct device *physical_device);
>
> Which could set the msi_domain of new device using the topology of
> physical_device to deduce the correct domain?
That would seem like a sensible course of action, as losing
the topology information will likely result in problems down
the line.
> Then the question is how to properly create a domain within the
> hardware topology of physical_device with the correct parameters for
> the platform.
>
> Why do we need a dummy msi_domain anyhow? Can this just use
> physical_device->msi_domain directly? (I'm at my limit here of how
> much of this I remember, sorry)
The parent device would be a PCI device, if I follow you correctly.
It would thus expect to be able to program the MSI the PCI way,
which wouldn't work. So we end-up with this custom MSI domain
that knows about *this* particular family of devices.
> If you solve that it should solve the remapping problem too, as the
> physical_device is already assigned by the platform to a remapping irq
> domain if that is what the platform wants.
>
>>> + parent = irq_get_default_host();
>> Really? How is it going to work once you have devices sending their
>> MSIs to two different downstream blocks? This looks rather
>> short-sighted.
>
> .. and fix this too, the parent domain should be derived from the
> topology of the physical_device which is originating the interrupt
> messages.
>
>> On the other hand, masking an interrupt is an irqchip operation, and
>> only concerns the irqchip level. Here, you seem to be making it an
>> end-point operation, which doesn't really make sense to me. Or is this
>> device its own interrupt controller as well? That would be extremely
>> surprising, and I'd expect some block downstream of the device to be
>> able to control the masking of the interrupt.
>
> These are message interrupts so they originate directly from the
> device and generally travel directly to the CPU APIC. On the wire
> there is no difference between a MSI, MSI-X and a device using the
> dev-msi approach.
I understand that.
> IIRC on Intel/AMD at least once a MSI is launched it is not maskable.
Really? So you can't shut a device with a screaming interrupt,
for example, should it become otherwise unresponsive?
> So the model for MSI is always "mask at source". The closest mapping
> to the Linux IRQ model is to say the end device has a irqchip that
> encapsulates the ability of the device to generate the MSI in the
> first place.
This is an x86'ism, I'm afraid. Systems I deal with can mask any
interrupt at the interrupt controller level, MSI or not.
> It looks like existing platform_msi drivers deal with "masking"
> implicitly by halting the device interrupt generation before releasing
> the interrupt and have no way for the generic irqchip layer to mask
> the interrupt.
No. As I said above, the interrupt controller is perfectly capable
of masking interrupts on its own, without touching the device.
> I suppose the motivation to make it explicit is related to vfio using
> the generic mask/unmask functionality?
>
> Explicit seems better, IMHO.
If masking at the source is the only way to shut the device up,
and assuming that the device provides the expected semantics
(a MSI raised by the device while the interrupt is masked
isn't lost and gets sent when unmasked), that's fair enough.
It's just ugly.
Thanks,
M.
--
Jazz is not dead. It just smells funny...
On Thu, Jul 23, 2020 at 09:51:52AM +0100, Marc Zyngier wrote:
> > IIRC on Intel/AMD at least once a MSI is launched it is not maskable.
>
> Really? So you can't shut a device with a screaming interrupt,
> for example, should it become otherwise unresponsive?
Well, it used to be like that in the APICv1 days. I suppose modern
interrupt remapping probably changes things.
> > So the model for MSI is always "mask at source". The closest mapping
> > to the Linux IRQ model is to say the end device has a irqchip that
> > encapsulates the ability of the device to generate the MSI in the
> > first place.
>
> This is an x86'ism, I'm afraid. Systems I deal with can mask any
> interrupt at the interrupt controller level, MSI or not.
Sure. However it feels like a bad practice to leave the source
unmasked and potentially continuing to generate messages if the
intention was to disable the IRQ that was assigned to it - even if the
messages do not result in CPU interrupts they will still consume
system resources.
> > I suppose the motivation to make it explicit is related to vfio using
> > the generic mask/unmask functionality?
> >
> > Explicit seems better, IMHO.
>
> If masking at the source is the only way to shut the device up,
> and assuming that the device provides the expected semantics
> (a MSI raised by the device while the interrupt is masked
> isn't lost and gets sent when unmasked), that's fair enough.
> It's just ugly.
It makes sense that the masking should follow the same semantics for
PCI MSI masking.
Jason
On Tue, Jul 21, 2020 at 11:54:49PM +0000, Tian, Kevin wrote:
> In a nutshell, applications don't require raw WQ controllability as guest
> kernel drivers may expect. Extending DSA user space interface to be another
> passthrough interface just for virtualization needs is less compelling than
> leveraging established VFIO/mdev framework (with the major merit that
> existing user space VMMs just work w/o any change as long as they already
> support VFIO uAPI).
Sure, but the above is how the cover letter should have summarized
that discussion, not as "it is not much code difference"
> In last review you said that you didn't hard nak this approach and would
> like to hear opinion from virtualization guys. In this version we CCed KVM
> mailing list, Paolo (VFIO/Qemu), Alex (VFIO), Samuel (Rust-VMM/Cloud
> hypervisor), etc. Let's see how they feel about this approach.
Yes, the VFIO community should decide.
If we are doing emulation tasks in the kernel now, then I can think of
several nice semi-emulated mdevs to propose.
This will not be some one off, but the start of a widely copied
pattern.
Jason
Jason Gunthorpe <[email protected]> writes:
> On Thu, Jul 23, 2020 at 09:51:52AM +0100, Marc Zyngier wrote:
>> > IIRC on Intel/AMD at least once a MSI is launched it is not maskable.
>>
>> Really? So you can't shut a device with a screaming interrupt,
>> for example, should it become otherwise unresponsive?
>
> Well, it used to be like that in the APICv1 days. I suppose modern
> interrupt remapping probably changes things.
The MSI side of affairs has nothing to do with Intel and neither with
APICv1. It's a trainwreck on the PCI side.
MSI interrupts do not have mandatory masking. For those which do not
implement it (and that's still the case with devices designed today
especially CPU internal peripherals) there are only a few options to
shut them up:
1) Disable MSI which has the problem that the interrupt gets
redirected to legacy PCI #A-#D interrupt unless the hardware
supports to disable that redirection, which is another optional
thing and hopeless case
2) Disable it at the IRQ remapping level which fortunately allows by
design to do so.
3) Disable it at the device level which is feasible for a device
driver but impossible for the irq side
>> > So the model for MSI is always "mask at source". The closest mapping
>> > to the Linux IRQ model is to say the end device has a irqchip that
>> > encapsulates the ability of the device to generate the MSI in the
>> > first place.
>>
>> This is an x86'ism, I'm afraid. Systems I deal with can mask any
>> interrupt at the interrupt controller level, MSI or not.
Yes, it's a pain, but reality.
> Sure. However it feels like a bad practice to leave the source
> unmasked and potentially continuing to generate messages if the
> intention was to disable the IRQ that was assigned to it - even if the
> messages do not result in CPU interrupts they will still consume
> system resources.
See above. You cannot reach out to the device driver to disable the
underlying interrupt source, which is the ultimate ratio if #1 or #2 are
not working or not there. That would be squaring the circle and
violating all rules of layering and locking at once.
The bad news is that we can't change the hardware. We have to deal with
it. And yes, I told HW people publicly and in private conversations that
unmaskable interrupts are broken by definition for more than a
decade. They still get designed that way ...
>> If masking at the source is the only way to shut the device up,
>> and assuming that the device provides the expected semantics
>> (a MSI raised by the device while the interrupt is masked
>> isn't lost and gets sent when unmasked), that's fair enough.
>> It's just ugly.
>
> It makes sense that the masking should follow the same semantics for
> PCI MSI masking.
Which semantics? The horrors of MSI or the halfway reasonable MSI-X
variant?
Thanks,
tglx
Hi Marc,
> -----Original Message-----
> From: Marc Zyngier <[email protected]>
> Sent: Wednesday, July 22, 2020 11:53 AM
> Subject: Re: [PATCH RFC v2 02/18] irq/dev-msi: Add support for a new DEV_MSI
> irq domain
>
> On Tue, 21 Jul 2020 17:02:28 +0100,
> Dave Jiang <[email protected]> wrote:
> >
> > From: Megha Dey <[email protected]>
> >
> > Add support for the creation of a new DEV_MSI irq domain. It creates a
> > new irq chip associated with the DEV_MSI domain and adds the necessary
> > domain operations to it.
> >
> > Add a new config option DEV_MSI which must be enabled by any driver
> > that wants to support device-specific message-signaled-interrupts
> > outside of PCI-MSI(-X).
>
> Which is exactly what platform-MSI already does. Why do we need something
> else?
True, dev-msi is a mere extension of platform-msi, which apart from providing a
custom write msg also provides a custom mask/unmask to the device.
Also, we introduce a new IRQ domain to be associated with these classes of devices.
There is nothing more to dev-msi than this currently.
>
> >
> > Lastly, add device specific mask/unmask callbacks in addition to a
> > write function to the platform_msi_ops.
> >
> > Reviewed-by: Dan Williams <[email protected]>
> > Signed-off-by: Megha Dey <[email protected]>
> > Signed-off-by: Dave Jiang <[email protected]>
> > ---
> > arch/x86/include/asm/hw_irq.h | 5 ++
> > drivers/base/Kconfig | 7 +++
> > drivers/base/Makefile | 1
> > drivers/base/dev-msi.c | 95
> +++++++++++++++++++++++++++++++++++++++++
> > drivers/base/platform-msi.c | 45 +++++++++++++------
> > drivers/base/platform-msi.h | 23 ++++++++++
> > include/linux/msi.h | 8 +++
> > 7 files changed, 168 insertions(+), 16 deletions(-) create mode
> > 100644 drivers/base/dev-msi.c create mode 100644
> > drivers/base/platform-msi.h
> >
> > diff --git a/arch/x86/include/asm/hw_irq.h
> > b/arch/x86/include/asm/hw_irq.h index 74c12437401e..8ecd7570589d
> > 100644
> > --- a/arch/x86/include/asm/hw_irq.h
> > +++ b/arch/x86/include/asm/hw_irq.h
> > @@ -61,6 +61,11 @@ struct irq_alloc_info {
> > irq_hw_number_t msi_hwirq;
> > };
> > #endif
> > +#ifdef CONFIG_DEV_MSI
> > + struct {
> > + irq_hw_number_t hwirq;
> > + };
> > +#endif
> > #ifdef CONFIG_X86_IO_APIC
> > struct {
> > int ioapic_id;
> > diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index
> > 8d7001712062..f00901bac056 100644
> > --- a/drivers/base/Kconfig
> > +++ b/drivers/base/Kconfig
> > @@ -210,4 +210,11 @@ config GENERIC_ARCH_TOPOLOGY
> > appropriate scaling, sysfs interface for reading capacity values at
> > runtime.
> >
> > +config DEV_MSI
> > + bool "Device Specific Interrupt Messages"
> > + select IRQ_DOMAIN_HIERARCHY
> > + select GENERIC_MSI_IRQ_DOMAIN
> > + help
> > + Allow device drivers to generate device-specific interrupt messages
> > + for devices independent of PCI MSI/-X.
> > endmenu
> > diff --git a/drivers/base/Makefile b/drivers/base/Makefile index
> > 157452080f3d..ca1e4d39164e 100644
> > --- a/drivers/base/Makefile
> > +++ b/drivers/base/Makefile
> > @@ -21,6 +21,7 @@ obj-$(CONFIG_REGMAP) += regmap/
> > obj-$(CONFIG_SOC_BUS) += soc.o
> > obj-$(CONFIG_PINCTRL) += pinctrl.o
> > obj-$(CONFIG_DEV_COREDUMP) += devcoredump.o
> > +obj-$(CONFIG_DEV_MSI) += dev-msi.o
> > obj-$(CONFIG_GENERIC_MSI_IRQ_DOMAIN) += platform-msi.o
> > obj-$(CONFIG_GENERIC_ARCH_TOPOLOGY) += arch_topology.o
> >
> > diff --git a/drivers/base/dev-msi.c b/drivers/base/dev-msi.c new file
> > mode 100644 index 000000000000..240ccc353933
> > --- /dev/null
> > +++ b/drivers/base/dev-msi.c
> > @@ -0,0 +1,95 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright © 2020 Intel Corporation.
> > + *
> > + * Author: Megha Dey <[email protected]> */
> > +
> > +#include <linux/irq.h>
> > +#include <linux/irqdomain.h>
> > +#include <linux/msi.h>
> > +#include "platform-msi.h"
> > +
> > +struct irq_domain *dev_msi_default_domain;
> > +
> > +static irq_hw_number_t dev_msi_get_hwirq(struct msi_domain_info *info,
> > + msi_alloc_info_t *arg)
> > +{
> > + return arg->hwirq;
> > +}
> > +
> > +static irq_hw_number_t dev_msi_calc_hwirq(struct msi_desc *desc) {
> > + u32 devid;
> > +
> > + devid = desc->platform.msi_priv_data->devid;
> > +
> > + return (devid << (32 - DEV_ID_SHIFT)) | desc->platform.msi_index; }
> > +
> > +static void dev_msi_set_desc(msi_alloc_info_t *arg, struct msi_desc
> > +*desc) {
> > + arg->hwirq = dev_msi_calc_hwirq(desc); }
> > +
> > +static int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
> > + int nvec, msi_alloc_info_t *arg) {
> > + memset(arg, 0, sizeof(*arg));
> > +
> > + return 0;
> > +}
> > +
> > +static struct msi_domain_ops dev_msi_domain_ops = {
> > + .get_hwirq = dev_msi_get_hwirq,
> > + .set_desc = dev_msi_set_desc,
> > + .msi_prepare = dev_msi_prepare,
> > +};
> > +
> > +static struct irq_chip dev_msi_controller = {
> > + .name = "DEV-MSI",
> > + .irq_unmask = platform_msi_unmask_irq,
> > + .irq_mask = platform_msi_mask_irq,
>
> This seems pretty odd, see below.
Ok..
>
> > + .irq_write_msi_msg = platform_msi_write_msg,
> > + .irq_ack = irq_chip_ack_parent,
> > + .irq_retrigger = irq_chip_retrigger_hierarchy,
> > + .flags = IRQCHIP_SKIP_SET_WAKE,
> > +};
> > +
> > +static struct msi_domain_info dev_msi_domain_info = {
> > + .flags = MSI_FLAG_USE_DEF_DOM_OPS |
> MSI_FLAG_USE_DEF_CHIP_OPS,
> > + .ops = &dev_msi_domain_ops,
> > + .chip = &dev_msi_controller,
> > + .handler = handle_edge_irq,
> > + .handler_name = "edge",
> > +};
> > +
> > +static int __init create_dev_msi_domain(void) {
> > + struct irq_domain *parent = NULL;
> > + struct fwnode_handle *fn;
> > +
> > + /*
> > + * Modern code should never have to use irq_get_default_host. But
> since
> > + * dev-msi is invisible to DT/ACPI, this is an exception case.
> > + */
> > + parent = irq_get_default_host();
>
> Really? How is it going to work once you have devices sending their MSIs to two
> different downstream blocks? This looks rather short-sighted.
So after some thought, I've realized that we don't need to introduce two IRQ domains (one with and
one without interrupt remapping enabled).
Hence, the above goes away in the next version of the patches.
>
> > + if (!parent)
> > + return -ENXIO;
> > +
> > + fn = irq_domain_alloc_named_fwnode("DEV_MSI");
> > + if (!fn)
> > + return -ENXIO;
> > +
> > + dev_msi_default_domain = msi_create_irq_domain(fn,
> &dev_msi_domain_info, parent);
> > + if (!dev_msi_default_domain) {
> > + pr_warn("failed to initialize irqdomain for DEV-MSI.\n");
> > + return -ENXIO;
> > + }
> > +
> > + irq_domain_update_bus_token(dev_msi_default_domain,
> DOMAIN_BUS_PLATFORM_MSI);
> > + irq_domain_free_fwnode(fn);
> > +
> > + return 0;
> > +}
> > +device_initcall(create_dev_msi_domain);
> > diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
> > index 9d94cd699468..5e1f210d65ee 100644
> > --- a/drivers/base/platform-msi.c
> > +++ b/drivers/base/platform-msi.c
> > @@ -12,21 +12,7 @@
> > #include <linux/irqdomain.h>
> > #include <linux/msi.h>
> > #include <linux/slab.h>
> > -
> > -#define DEV_ID_SHIFT 21
> > -#define MAX_DEV_MSIS (1 << (32 - DEV_ID_SHIFT))
> > -
> > -/*
> > - * Internal data structure containing a (made up, but unique) devid
> > - * and the platform-msi ops
> > - */
> > -struct platform_msi_priv_data {
> > - struct device *dev;
> > - void *host_data;
> > - msi_alloc_info_t arg;
> > - const struct platform_msi_ops *ops;
> > - int devid;
> > -};
> > +#include "platform-msi.h"
> >
> > /* The devid allocator */
> > static DEFINE_IDA(platform_msi_devid_ida);
> > @@ -76,7 +62,7 @@ static void platform_msi_update_dom_ops(struct
> msi_domain_info *info)
> > ops->set_desc = platform_msi_set_desc; }
> >
> > -static void platform_msi_write_msg(struct irq_data *data, struct
> > msi_msg *msg)
> > +void platform_msi_write_msg(struct irq_data *data, struct msi_msg
> > +*msg)
>
> It really begs the question: Why are you inventing a whole new "DEV-MSI" when
> this really is platform-MSI?
platform-msi is platform-custom but device-driver-opaque MSI setup/control. With dev-msi, we add the
following:
1. device-specific mask/unmask functions
2. a new dev-msi domain to set up/control MSI on these devices
3. an explicit check that denies PCI devices the dev_msi alloc/free calls, something not currently in platform-msi..
We are not really inventing anything new, only extending platform-msi to cover new groups of devices
(a rough sketch of how a driver might fill in the extended ops follows below).
We will be sending out the next version of the patches shortly; please let me know if you have any
naming suggestions for this extension.
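As a minimal, purely illustrative sketch (not part of the patch set), a driver for a non-PCI
sub-device could fill in the extended platform_msi_ops along these lines; the FOO_* register layout,
foo_ims_base and the slot arithmetic are made-up assumptions:

#include <linux/bits.h>
#include <linux/io.h>
#include <linux/msi.h>

#define FOO_IMS_ENTRY_SIZE	16
#define FOO_IMS_ADDR		0x0	/* 64-bit message address */
#define FOO_IMS_DATA		0x8	/* 32-bit message data */
#define FOO_IMS_CTRL		0xc	/* per-entry control, bit 0 = mask */
#define FOO_IMS_MASK_BIT	BIT(0)

static void __iomem *foo_ims_base;	/* mapped from a device BAR at probe time */

static void __iomem *foo_ims_slot(struct msi_desc *desc)
{
	return foo_ims_base + desc->platform.msi_index * FOO_IMS_ENTRY_SIZE;
}

static unsigned int foo_ims_irq_mask(struct msi_desc *desc)
{
	void __iomem *slot = foo_ims_slot(desc);

	writel(readl(slot + FOO_IMS_CTRL) | FOO_IMS_MASK_BIT, slot + FOO_IMS_CTRL);
	return 0;
}

static unsigned int foo_ims_irq_unmask(struct msi_desc *desc)
{
	void __iomem *slot = foo_ims_slot(desc);

	writel(readl(slot + FOO_IMS_CTRL) & ~FOO_IMS_MASK_BIT, slot + FOO_IMS_CTRL);
	return 0;
}

static void foo_ims_write_msg(struct msi_desc *desc, struct msi_msg *msg)
{
	void __iomem *slot = foo_ims_slot(desc);

	writel(msg->address_lo, slot + FOO_IMS_ADDR);
	writel(msg->address_hi, slot + FOO_IMS_ADDR + 4);
	writel(msg->data, slot + FOO_IMS_DATA);
}

static const struct platform_msi_ops foo_ims_ops = {
	.irq_mask	= foo_ims_irq_mask,
	.irq_unmask	= foo_ims_irq_unmask,
	.write_msg	= foo_ims_write_msg,
};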
>
> > {
> > struct msi_desc *desc = irq_data_get_msi_desc(data);
> > struct platform_msi_priv_data *priv_data; @@ -86,6 +72,33 @@ static
> > void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
> > priv_data->ops->write_msg(desc, msg); }
> >
> > +static void __platform_msi_desc_mask_unmask_irq(struct msi_desc
> > +*desc, u32 mask) {
> > + const struct platform_msi_ops *ops;
> > +
> > + ops = desc->platform.msi_priv_data->ops;
> > + if (!ops)
> > + return;
> > +
> > + if (mask) {
> > + if (ops->irq_mask)
> > + ops->irq_mask(desc);
> > + } else {
> > + if (ops->irq_unmask)
> > + ops->irq_unmask(desc);
> > + }
> > +}
> > +
> > +void platform_msi_mask_irq(struct irq_data *data) {
> > + __platform_msi_desc_mask_unmask_irq(irq_data_get_msi_desc(data),
> 1);
> > +}
> > +
> > +void platform_msi_unmask_irq(struct irq_data *data) {
> > + __platform_msi_desc_mask_unmask_irq(irq_data_get_msi_desc(data),
> 0);
> > +}
> > +
>
> I don't immediately get why you have this code at the platform MSI level. Until
> now, we only had the programming of the message into the end-point, which is
> a device-specific action (and the whole reason why this silly platform MSI exists)
>
> On the other hand, masking an interrupt is an irqchip operation, and only
> concerns the irqchip level. Here, you seem to be making it an end-point
> operation, which doesn't really make sense to me. Or is this device its own
> interrupt controller as well? That would be extremely surprising, and I'd expect
> some block downstream of the device to be able to control the masking of the
> interrupt.
Hmmm, I don't fully understand this. Ultimately the mask/unmask is a device operation, right?
Some new devices may want to mask/unmask interrupts at a non-standard location.
These callbacks are a way for the device to describe how exactly interrupts can be masked/unmasked
on that device, no different from PCI mask/unmask, except that this is at a custom location...
>
> > static void platform_msi_update_chip_ops(struct msi_domain_info
> > *info) {
> > struct irq_chip *chip = info->chip;
> > diff --git a/drivers/base/platform-msi.h b/drivers/base/platform-msi.h
> > new file mode 100644 index 000000000000..1de8c2874218
> > --- /dev/null
> > +++ b/drivers/base/platform-msi.h
> > @@ -0,0 +1,23 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright © 2020 Intel Corporation.
> > + *
> > + * Author: Megha Dey <[email protected]> */
>
> Or not. You are merely moving existing code, not authoring it. Either keep the
> original copyright attribution, or drop this mention altogether.
sure
>
> > +
> > +#include <linux/msi.h>
> > +
> > +#define DEV_ID_SHIFT 21
> > +#define MAX_DEV_MSIS (1 << (32 - DEV_ID_SHIFT))
> > +
> > +/*
> > + * Data structure containing a (made up, but unique) devid
> > + * and the platform-msi ops.
> > + */
> > +struct platform_msi_priv_data {
> > + struct device *dev;
> > + void *host_data;
> > + msi_alloc_info_t arg;
> > + const struct platform_msi_ops *ops;
> > + int devid;
> > +};
> > diff --git a/include/linux/msi.h b/include/linux/msi.h index
> > 7f6a8eb51aca..1da97f905720 100644
> > --- a/include/linux/msi.h
> > +++ b/include/linux/msi.h
> > @@ -323,9 +323,13 @@ enum {
> >
> > /*
> > * platform_msi_ops - Callbacks for platform MSI ops
> > + * @irq_mask: mask an interrupt source
> > + * @irq_unmask: unmask an interrupt source
> > * @write_msg: write message content
> > */
> > struct platform_msi_ops {
> > + unsigned int (*irq_mask)(struct msi_desc *desc);
> > + unsigned int (*irq_unmask)(struct msi_desc *desc);
> > irq_write_msi_msg_t write_msg;
> > };
> >
> > @@ -370,6 +374,10 @@ int platform_msi_domain_alloc(struct irq_domain
> > *domain, unsigned int virq, void platform_msi_domain_free(struct irq_domain
> *domain, unsigned int virq,
> > unsigned int nvec);
> > void *platform_msi_get_host_data(struct irq_domain *domain);
> > +
> > +void platform_msi_write_msg(struct irq_data *data, struct msi_msg
> > +*msg); void platform_msi_unmask_irq(struct irq_data *data); void
> > +platform_msi_mask_irq(struct irq_data *data);
> > #endif /* CONFIG_GENERIC_MSI_IRQ_DOMAIN */
> >
> > #ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
> >
> >
>
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
Hi Thomas,
> -----Original Message-----
> From: Thomas Gleixner <[email protected]>
> Sent: Wednesday, July 22, 2020 1:45 PM
> To: Jiang, Dave <[email protected]>; [email protected]; Dey, Megha
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Pan, Jacob jun <[email protected]>; Raj,
> Ashok <[email protected]>; [email protected]; Liu, Yi L <[email protected]>;
> Lu, Baolu <[email protected]>; Tian, Kevin <[email protected]>; Kumar,
> Sanjay K <[email protected]>; Luck, Tony <[email protected]>; Lin,
> Jing <[email protected]>; Williams, Dan J <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; Hansen, Dave
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected]; Ortiz, Samuel
> <[email protected]>; Hossain, Mona <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH RFC v2 03/18] irq/dev-msi: Create IR-DEV-MSI irq domain
>
> Dave Jiang <[email protected]> writes:
> > From: Megha Dey <[email protected]>
> >
> > When DEV_MSI is enabled, the dev_msi_default_domain is updated to the
> > base DEV-MSI irq domain. If interrupt remapping is enabled, we create
>
> s/we//
ok
>
> > a new IR-DEV-MSI irq domain and update the dev_msi_default domain to
> > the same.
> >
> > For X86, introduce a new irq_alloc_type which will be used by the
> > interrupt remapping driver.
> >
> > Reviewed-by: Dan Williams <[email protected]>
> > Signed-off-by: Megha Dey <[email protected]>
> > Signed-off-by: Dave Jiang <[email protected]>
> > ---
> > arch/x86/include/asm/hw_irq.h | 1 +
> > arch/x86/kernel/apic/msi.c | 12 ++++++
> > drivers/base/dev-msi.c | 66 +++++++++++++++++++++++++++++++----
> > drivers/iommu/intel/irq_remapping.c | 11 +++++-
> > include/linux/intel-iommu.h | 1 +
> > include/linux/irqdomain.h | 11 ++++++
> > include/linux/msi.h | 3 ++
>
> Why is this mixing generic code, x86 core code and intel specific driver code?
> This is new functionality so:
>
> 1) Provide the infrastructure
> 2) Add support to architecture specific parts
> 3) Enable it
Ok, I will try to adhere to the layering next time around..
>
> > +
> > +#ifdef CONFIG_DEV_MSI
> > +int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
> > + int nvec, msi_alloc_info_t *arg) {
> > + memset(arg, 0, sizeof(*arg));
> > +
> > + arg->type = X86_IRQ_ALLOC_TYPE_DEV_MSI;
> > +
> > + return 0;
> > +}
> > +#endif
>
> What is this? Tons of new lines for taking up more space and not a single
> comment.
Hmm, I will add a comment..
>
> > -static int dev_msi_prepare(struct irq_domain *domain, struct device
> > *dev,
> > +int __weak dev_msi_prepare(struct irq_domain *domain, struct device
> > +*dev,
> > int nvec, msi_alloc_info_t *arg) {
> > memset(arg, 0, sizeof(*arg));
>
> Oh well. So every architecure which needs to override this and I assume all
> which are eventually going to support it need to do the memset() in their
> override.
>
> memset(arg,,,);
> arch_dev_msi_prepare();
>
>
Per your suggestion, I have introduced arch_dev_msi_prepare in the next patch set; it returns 0 by
default unless overridden by arch code, roughly as sketched below.
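A minimal sketch of that arrangement, assuming the arch_dev_msi_prepare name; the final patches may
differ:

#include <linux/irq.h>
#include <linux/irqdomain.h>
#include <linux/msi.h>

int __weak arch_dev_msi_prepare(struct irq_domain *domain, struct device *dev,
				int nvec, msi_alloc_info_t *arg)
{
	/* Architectures override this to fill in arch-specific alloc info */
	return 0;
}

static int dev_msi_prepare(struct irq_domain *domain, struct device *dev,
			   int nvec, msi_alloc_info_t *arg)
{
	/* Common code does the memset once, then defers to the arch hook */
	memset(arg, 0, sizeof(*arg));

	return arch_dev_msi_prepare(domain, dev, nvec, arg);
}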
> > - dev_msi_default_domain = msi_create_irq_domain(fn,
> &dev_msi_domain_info, parent);
> > + /*
> > + * This initcall may come after remap code is initialized. Ensure that
> > + * dev_msi_default domain is updated correctly.
>
> What? No, this is a disgusting hack. Get your ordering straight, that's not rocket
> science.
>
Hmm yeah, actually I realized we don't really need to have 2 new IRQ domains for dev-msi
(with and without interrupt remapping enabled). Hence all this will go away in the next round
of patches.
> > +#ifdef CONFIG_IRQ_REMAP
>
> IRQ_REMAP is x86 specific. Is this file x86 only or intended to be for general use?
> If it's x86 only, then this should be clearly documented. If not, then these
> x86'isms have no place here.
True, I will take care of this in the next patch set.
>
> > +struct irq_domain *create_remap_dev_msi_irq_domain(struct irq_domain
> *parent,
> > + const char *name)
>
> So we have msi_create_irq_domain() and this is about dev_msi, right? So can
> you please stick with a consistent naming scheme?
sure
>
> > +{
> > + struct fwnode_handle *fn;
> > + struct irq_domain *domain;
> > +
> > + fn = irq_domain_alloc_named_fwnode(name);
> > + if (!fn)
> > + return NULL;
> > +
> > + domain = msi_create_irq_domain(fn, &dev_msi_ir_domain_info,
> parent);
> > + if (!domain) {
> > + pr_warn("failed to initialize irqdomain for IR-DEV-MSI.\n");
> > + return ERR_PTR(-ENXIO);
> > + }
> > +
> > + irq_domain_update_bus_token(domain,
> DOMAIN_BUS_PLATFORM_MSI);
> > +
> > + if (!dev_msi_default_domain)
> > + dev_msi_default_domain = domain;
>
> Can this be called several times? If so, then this lacks a comment. If not, then
> this condition is useless.
Hmm, this will go away in the next patch set, thank you for your input!
>
> Thanks,
>
> tglx
Hi Jason,
> -----Original Message-----
> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, July 22, 2020 12:59 PM
> To: Marc Zyngier <[email protected]>
> Cc: Jiang, Dave <[email protected]>; [email protected]; Dey, Megha
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Pan, Jacob jun <[email protected]>; Raj,
> Ashok <[email protected]>; Liu, Yi L <[email protected]>; Lu, Baolu
> <[email protected]>; Tian, Kevin <[email protected]>; Kumar, Sanjay K
> <[email protected]>; Luck, Tony <[email protected]>; Lin, Jing
> <[email protected]>; Williams, Dan J <[email protected]>;
> [email protected]; [email protected]; [email protected];
> Hansen, Dave <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> Ortiz, Samuel <[email protected]>; Hossain, Mona
> <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH RFC v2 02/18] irq/dev-msi: Add support for a new DEV_MSI
> irq domain
>
> On Wed, Jul 22, 2020 at 07:52:33PM +0100, Marc Zyngier wrote:
>
> > Which is exactly what platform-MSI already does. Why do we need
> > something else?
>
> It looks to me like all the code is around managing the
> dev->msi_domain of the devices.
>
> The intended use would have PCI drivers create children devices using mdev or
> virtbus and those devices wouldn't have a msi_domain from the platform. Looks
> like platform_msi_alloc_priv_data() fails immediately because dev->msi_domain
> will be NULL for these kinds of devices.
>
> Maybe that issue should be handled directly instead of wrappering
> platform_msi_*?
>
> For instance a trivial addition to the platform_msi API:
>
> platform_msi_assign_domain(struct_device *newly_created_virtual_device,
> struct device *physical_device);
>
> Which could set the msi_domain of new device using the topology of
> physical_device to deduce the correct domain?
>
> Then the question is how to properly create a domain within the hardware
> topology of physical_device with the correct parameters for the platform.
>
> Why do we need a dummy msi_domain anyhow? Can this just use
> physical_device->msi_domain directly? (I'm at my limit here of how much of this
> I remember, sorry)
>
> If you solve that it should solve the remapping problem too, as the
> physical_device is already assigned by the platform to a remapping irq domain if
> that is what the platform wants.
Yeah, most of what you said is right. For the most part, we are simply introducing a new IRQ domain
which provides specific domain info ops for the classes of devices that want to provide custom
mask/unmask callbacks..
Also, from your other comments, I've realized the same IRQ domain can be used whether interrupt
remapping is enabled or disabled.
Hence we will only have one create_dev_msi_domain, which can be called by any device driver that
wants to use the dev-msi IRQ domain to alloc/free IRQs. It would be the responsibility of the device
driver to provide the correct device and update dev->msi_domain.
>
> >> + parent = irq_get_default_host();
> > Really? How is it going to work once you have devices sending their
> > MSIs to two different downstream blocks? This looks rather
> > short-sighted.
>
> .. and fix this too, the parent domain should be derived from the topology of the
> physical_device which is originating the interrupt messages.
>
Yes
> > On the other hand, masking an interrupt is an irqchip operation, and
> > only concerns the irqchip level. Here, you seem to be making it an
> > end-point operation, which doesn't really make sense to me. Or is this
> > device its own interrupt controller as well? That would be extremely
> > surprising, and I'd expect some block downstream of the device to be
> > able to control the masking of the interrupt.
>
> These are message interrupts so they originate directly from the device and
> generally travel directly to the CPU APIC. On the wire there is no difference
> between a MSI, MSI-X and a device using the dev-msi approach.
>
> IIRC on Intel/AMD at least once a MSI is launched it is not maskable.
>
> So the model for MSI is always "mask at source". The closest mapping to the
> Linux IRQ model is to say the end device has a irqchip that encapsulates the
> ability of the device to generate the MSI in the first place.
>
> It looks like existing platform_msi drivers deal with "masking"
> implicitly by halting the device interrupt generation before releasing the
> interrupt and have no way for the generic irqchip layer to mask the interrupt.
>
> I suppose the motivation to make it explicit is related to vfio using the generic
> mask/unmask functionality?
>
> Explicit seems better, IMHO.
I don't think I understand this fully. I've still kept the device-specific mask/unmask calls in the next
patch series; please let me know if they need further modification.
>
> Jason
Hi Jason,
> -----Original Message-----
> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, July 22, 2020 10:35 AM
> To: Dey, Megha <[email protected]>
> Cc: Jiang, Dave <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; Pan, Jacob
> jun <[email protected]>; Raj, Ashok <[email protected]>; Liu, Yi L
> <[email protected]>; Lu, Baolu <[email protected]>; Tian, Kevin
> <[email protected]>; Kumar, Sanjay K <[email protected]>; Luck,
> Tony <[email protected]>; Lin, Jing <[email protected]>; Williams, Dan J
> <[email protected]>; [email protected]; [email protected];
> [email protected]; Hansen, Dave <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; Ortiz, Samuel <[email protected]>; Hossain, Mona
> <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH RFC v2 04/18] irq/dev-msi: Introduce APIs to allocate/free
> dev-msi interrupts
>
> On Wed, Jul 22, 2020 at 10:05:52AM -0700, Dey, Megha wrote:
> >
> >
> > On 7/21/2020 9:25 AM, Jason Gunthorpe wrote:
> > > On Tue, Jul 21, 2020 at 09:02:41AM -0700, Dave Jiang wrote:
> > > > From: Megha Dey <[email protected]>
> > > >
> > > > The dev-msi interrupts are to be allocated/freed only for custom
> > > > devices, not standard PCI-MSIX devices.
> > > >
> > > > These interrupts are device-defined and they are distinct from the
> > > > already existing msi interrupts:
> > > > pci-msi: Standard PCI MSI/MSI-X setup format
> > > > platform-msi: Platform custom, but device-driver opaque MSI
> > > > setup/control
> > > > arch-msi: fallback for devices not assigned to the generic PCI
> > > > domain
> > > > dev-msi: device defined IRQ domain for ancillary devices. For e.g.
> > > > DSA portal devices use device specific IMS(Interrupt message store)
> interrupts.
> > > >
> > > > dev-msi interrupts are represented by their own device-type. That
> > > > means
> > > > dev->msi_list is never contended for different interrupt types. It
> > > > will either be all PCI-MSI or all device-defined.
> > >
> > > Not sure I follow this, where is the enforcement that only dev-msi
> > > or normal MSI is being used at one time on a single struct device?
> > >
> >
> > So, in the dev_msi_alloc_irqs, I first check if the dev_is_pci..
> > If it is a pci device, it is forbidden to use dev-msi and must use the
> > pci subsystem calls. dev-msi is to be used for all other custom
> > devices, mdev or otherwise.
>
> What prevents creating a dev-msi directly on the struct pci_device ?
In the next patch set, I have explicitly added code that denies PCI devices the use of the dev_msi alloc/free APIs (a rough sketch follows below).
>
> Jason
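A rough sketch of that guard, assuming the dev_msi_domain_alloc_irqs name used in this series; the
rest of the allocation path is omitted:

#include <linux/msi.h>
#include <linux/pci.h>

int dev_msi_domain_alloc_irqs(struct device *dev, unsigned int nvec,
			      const struct platform_msi_ops *ops)
{
	/*
	 * dev-msi is only for non-PCI devices (mdev, auxiliary/virtual
	 * sub-devices, ...). PCI devices must keep using the regular
	 * PCI MSI/MSI-X infrastructure.
	 */
	if (dev_is_pci(dev))
		return -EINVAL;

	/* The actual dev-msi descriptor and IRQ allocation would follow here. */
	return 0;
}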
On Wed, Aug 05, 2020 at 07:18:39PM +0000, Dey, Megha wrote:
> Hence we will only have one create_dev_msi_domain which can be
> called by any device driver that wants to use the dev-msi IRQ domain
> to alloc/free IRQs. It would be the responsibility of the device
> driver to provide the correct device and update the dev->msi_domain.
I'm not sure that sounds like a good idea, why should a device driver
touch dev->msi_domain?
There was a certain appeal to the api I suggested by having everything
related to setting up the new IRQs being in the core code.
Jason
Hi Jason,
> -----Original Message-----
> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, August 5, 2020 3:16 PM
> To: Dey, Megha <[email protected]>
> Cc: Marc Zyngier <[email protected]>; Jiang, Dave <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Pan, Jacob jun <[email protected]>; Raj,
> Ashok <[email protected]>; Liu, Yi L <[email protected]>; Lu, Baolu
> <[email protected]>; Tian, Kevin <[email protected]>; Kumar, Sanjay K
> <[email protected]>; Luck, Tony <[email protected]>; Lin, Jing
> <[email protected]>; Williams, Dan J <[email protected]>;
> [email protected]; [email protected]; [email protected];
> Hansen, Dave <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> Ortiz, Samuel <[email protected]>; Hossain, Mona
> <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH RFC v2 02/18] irq/dev-msi: Add support for a new DEV_MSI
> irq domain
>
> On Wed, Aug 05, 2020 at 07:18:39PM +0000, Dey, Megha wrote:
>
> > Hence we will only have one create_dev_msi_domain which can be called
> > by any device driver that wants to use the dev-msi IRQ domain to
> > alloc/free IRQs. It would be the responsibility of the device driver
> > to provide the correct device and update the dev->msi_domain.
>
> I'm not sure that sounds like a good idea, why should a device driver touch dev-
> >msi_domain?
>
> There was a certain appeal to the api I suggested by having everything related to
> setting up the new IRQs being in the core code.
The basic API to create the dev_msi domain would be:
struct irq_domain *create_dev_msi_irq_domain(struct irq_domain *parent)
This can be called by drivers according to their use case.
For example, in the DSA case it is called from the irq remapping driver:
iommu->ir_dev_msi_domain = create_dev_msi_domain(iommu->ir_domain)
and from the DSA mdev driver:
p_dev = get_parent_pci_dev(dev);
iommu = device_to_iommu(p_dev);
dev->msi_domain = iommu->ir_dev_msi_domain;
So we create the domain in the IRQ remapping driver, where it can be reused by other devices that want the same parent IRQ domain and the dev-msi APIs; the mdev driver only points its device's msi_domain at the already created dev-msi domain.
Other devices (your rdma driver etc.) can create their own dev-msi domain by passing the appropriate parent IRQ domain.
We cannot have this in the core code since the parent domain is not always the same?
Please let me know if you think otherwise..
>
> Jason
On Wed, Aug 05, 2020 at 10:36:23PM +0000, Dey, Megha wrote:
> Hi Jason,
>
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, August 5, 2020 3:16 PM
> > To: Dey, Megha <[email protected]>
> > Cc: Marc Zyngier <[email protected]>; Jiang, Dave <[email protected]>;
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; Pan, Jacob jun <[email protected]>; Raj,
> > Ashok <[email protected]>; Liu, Yi L <[email protected]>; Lu, Baolu
> > <[email protected]>; Tian, Kevin <[email protected]>; Kumar, Sanjay K
> > <[email protected]>; Luck, Tony <[email protected]>; Lin, Jing
> > <[email protected]>; Williams, Dan J <[email protected]>;
> > [email protected]; [email protected]; [email protected];
> > Hansen, Dave <[email protected]>; [email protected];
> > [email protected]; [email protected]; [email protected];
> > Ortiz, Samuel <[email protected]>; Hossain, Mona
> > <[email protected]>; [email protected]; linux-
> > [email protected]; [email protected]; [email protected];
> > [email protected]
> > Subject: Re: [PATCH RFC v2 02/18] irq/dev-msi: Add support for a new DEV_MSI
> > irq domain
> >
> > On Wed, Aug 05, 2020 at 07:18:39PM +0000, Dey, Megha wrote:
> >
> > > Hence we will only have one create_dev_msi_domain which can be called
> > > by any device driver that wants to use the dev-msi IRQ domain to
> > > alloc/free IRQs. It would be the responsibility of the device driver
> > > to provide the correct device and update the dev->msi_domain.
> >
> > I'm not sure that sounds like a good idea, why should a device driver touch dev-
> > >msi_domain?
> >
> > There was a certain appeal to the api I suggested by having everything related to
> > setting up the new IRQs being in the core code.
>
> The basic API to create the dev_msi domain would be :
>
> struct irq_domain *create_dev_msi_irq_domain(struct irq_domain *parent)
>
> This can be called by devices according to their use case.
>
> For e.g. in dsa case, it is called from the irq remapping driver:
> iommu->ir_dev_msi_domain = create_dev_msi_domain(iommu->ir_domain)
>
> and from the dsa mdev driver:
> p_dev = get_parent_pci_dev(dev);
> iommu = device_to_iommu(p_dev);
>
> dev->msi_domain = iommu->ir_dev_msi_domain;
>
> So we are creating the domain in the IRQ remapping domain which can be used by other devices which want to have the same IRQ parent domain and use dev-msi APIs. We are only updating that device's msi_domain to the already created dev-msi domain in the driver.
>
> Other devices (your rdma driver etc) can create their own dev-msi domain by passing the appropriate parent IRq domain.
>
> We cannot have this in the core code since the parent domain cannot
> be the same?
Well, I had suggested to pass in the parent struct device, but it
could certainly use an irq_domain instead:
platform_msi_assign_domain(dev, device_to_iommu(p_dev)->ir_domain);
Or
platform_msi_assign_domain(dev, pdev->msi_domain)
?
Any maybe the natural expression is to add a version of
platform_msi_create_device_domain() that accepts a parent irq_domain()
and if the device doesn't already have a msi_domain then it creates
one. Might be too tricky to manage lifetime of the new irq_domain
though..
It feels cleaner to me if everything related to this is contained in
the platform_msi and the driver using it. Not sure it makes sense to
involve the iommu?
Jason
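For reference, a minimal sketch of the helper Jason suggests; this is a proposal under discussion,
not an existing kernel API, though dev_get_msi_domain()/dev_set_msi_domain() do exist:

#include <linux/device.h>

/*
 * Give a newly created virtual/sub-device the MSI parentage of the
 * physical device it was carved out of.
 */
void platform_msi_assign_domain(struct device *virt_dev,
				struct device *phys_dev)
{
	dev_set_msi_domain(virt_dev, dev_get_msi_domain(phys_dev));
}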
Hi Jason,
> -----Original Message-----
> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, August 5, 2020 3:54 PM
> To: Dey, Megha <[email protected]>
> Cc: Marc Zyngier <[email protected]>; Jiang, Dave <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Pan, Jacob jun <[email protected]>; Raj,
> Ashok <[email protected]>; Liu, Yi L <[email protected]>; Lu, Baolu
> <[email protected]>; Tian, Kevin <[email protected]>; Kumar, Sanjay K
> <[email protected]>; Luck, Tony <[email protected]>; Lin, Jing
> <[email protected]>; Williams, Dan J <[email protected]>;
> [email protected]; [email protected]; [email protected];
> Hansen, Dave <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> Ortiz, Samuel <[email protected]>; Hossain, Mona
> <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH RFC v2 02/18] irq/dev-msi: Add support for a new DEV_MSI
> irq domain
>
> On Wed, Aug 05, 2020 at 10:36:23PM +0000, Dey, Megha wrote:
> > Hi Jason,
> >
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Wednesday, August 5, 2020 3:16 PM
> > > To: Dey, Megha <[email protected]>
> > > Cc: Marc Zyngier <[email protected]>; Jiang, Dave
> > > <[email protected]>; [email protected]; [email protected];
> > > [email protected]; [email protected]; [email protected];
> > > [email protected]; [email protected]; Pan, Jacob jun
> > > <[email protected]>; Raj, Ashok <[email protected]>; Liu, Yi
> > > L <[email protected]>; Lu, Baolu <[email protected]>; Tian, Kevin
> > > <[email protected]>; Kumar, Sanjay K <[email protected]>;
> > > Luck, Tony <[email protected]>; Lin, Jing <[email protected]>;
> > > Williams, Dan J <[email protected]>; [email protected];
> > > [email protected]; [email protected]; Hansen, Dave
> > > <[email protected]>; [email protected];
> > > [email protected]; [email protected];
> > > [email protected]; Ortiz, Samuel <[email protected]>;
> > > Hossain, Mona <[email protected]>; [email protected];
> > > linux- [email protected]; [email protected];
> > > [email protected]; [email protected]
> > > Subject: Re: [PATCH RFC v2 02/18] irq/dev-msi: Add support for a new
> > > DEV_MSI irq domain
> > >
> > > On Wed, Aug 05, 2020 at 07:18:39PM +0000, Dey, Megha wrote:
> > >
> > > > Hence we will only have one create_dev_msi_domain which can be
> > > > called by any device driver that wants to use the dev-msi IRQ
> > > > domain to alloc/free IRQs. It would be the responsibility of the
> > > > device driver to provide the correct device and update the dev-
> >msi_domain.
> > >
> > > I'm not sure that sounds like a good idea, why should a device
> > > driver touch dev-
> > > >msi_domain?
> > >
> > > There was a certain appeal to the api I suggested by having
> > > everything related to setting up the new IRQs being in the core code.
> >
> > The basic API to create the dev_msi domain would be :
> >
> > struct irq_domain *create_dev_msi_irq_domain(struct irq_domain
> > *parent)
> >
> > This can be called by devices according to their use case.
> >
> > For e.g. in dsa case, it is called from the irq remapping driver:
> > iommu->ir_dev_msi_domain = create_dev_msi_domain(iommu->ir_domain)
> >
> > and from the dsa mdev driver:
> > p_dev = get_parent_pci_dev(dev);
> > iommu = device_to_iommu(p_dev);
> >
> > dev->msi_domain = iommu->ir_dev_msi_domain;
> >
> > So we are creating the domain in the IRQ remapping domain which can be
> used by other devices which want to have the same IRQ parent domain and use
> dev-msi APIs. We are only updating that device's msi_domain to the already
> created dev-msi domain in the driver.
> >
> > Other devices (your rdma driver etc) can create their own dev-msi domain by
> passing the appropriate parent IRq domain.
> >
> > We cannot have this in the core code since the parent domain cannot be
> > the same?
>
> Well, I had suggested to pass in the parent struct device, but it could certainly
> use an irq_domain instead:
>
> platform_msi_assign_domain(dev, device_to_iommu(p_dev)->ir_domain);
>
> Or
>
> platform_msi_assign_domain(dev, pdev->msi_domain)
>
> ?
>
> Any maybe the natural expression is to add a version of
> platform_msi_create_device_domain() that accepts a parent irq_domain() and if
> the device doesn't already have a msi_domain then it creates one. Might be too
> tricky to manage lifetime of the new irq_domain though..
>
> It feels cleaner to me if everything related to this is contained in the
> platform_msi and the driver using it. Not sure it makes sense to involve the
> iommu?
Well yeah, something like this can be done, but the missing piece is where the IRQ domain actually gets created, i.e. where this new version of platform_msi_create_device_domain() is called. That is the only piece that is currently done in the IOMMU driver, and only for the DSA mdev; not all devices need to do it this way.. Do you have suggestions as to where you would want this function to be called?
>
> Jason
On Thu, Aug 06, 2020 at 12:13:24AM +0000, Dey, Megha wrote:
> > Well, I had suggested to pass in the parent struct device, but it could certainly
> > use an irq_domain instead:
> >
> > platform_msi_assign_domain(dev, device_to_iommu(p_dev)->ir_domain);
> >
> > Or
> >
> > platform_msi_assign_domain(dev, pdev->msi_domain)
> >
> > ?
> >
> > Any maybe the natural expression is to add a version of
> > platform_msi_create_device_domain() that accepts a parent irq_domain() and if
> > the device doesn't already have a msi_domain then it creates one. Might be too
> > tricky to manage lifetime of the new irq_domain though..
> >
> > It feels cleaner to me if everything related to this is contained in the
> > platform_msi and the driver using it. Not sure it makes sense to involve the
> > iommu?
>
> Well yeah something like this can be done, but what is the missing
> piece is where the IRQ domain actually gets created, i.e where this
> new version of platform_msi_create_device_domain() is called. That
> is the only piece that is currently done in the IOMMU driver only
> for DSA mdev. Not that all devices need to do it this way.. do you
> have suggestions as to where you want to call this function?
Oops, I was thinking of platform_msi_domain_alloc_irqs() not
create_device_domain()
ie call it in the device driver that wishes to consume the extra
MSIs.
Is there a harm if each device driver creates a new irq_domain for its
use?
Jason
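A consumer-side sketch of what Jason describes, using the existing platform-msi entry point; the
foo_* names are illustrative only:

#include <linux/msi.h>

/*
 * Called by the platform-msi core whenever the message changes; the
 * driver programs it into its device-specific interrupt storage.
 */
static void foo_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
{
	/* write msg->address_lo/hi and msg->data for desc->platform.msi_index */
}

static int foo_request_extra_irqs(struct device *dev, unsigned int nvec)
{
	return platform_msi_domain_alloc_irqs(dev, nvec, foo_write_msi_msg);
}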
Hi Jason,
> -----Original Message-----
> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, August 5, 2020 5:19 PM
> To: Dey, Megha <[email protected]>
> Cc: Marc Zyngier <[email protected]>; Jiang, Dave <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Pan, Jacob jun <[email protected]>; Raj,
> Ashok <[email protected]>; Liu, Yi L <[email protected]>; Lu, Baolu
> <[email protected]>; Tian, Kevin <[email protected]>; Kumar, Sanjay K
> <[email protected]>; Luck, Tony <[email protected]>; Lin, Jing
> <[email protected]>; Williams, Dan J <[email protected]>;
> [email protected]; [email protected]; [email protected];
> Hansen, Dave <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> Ortiz, Samuel <[email protected]>; Hossain, Mona
> <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH RFC v2 02/18] irq/dev-msi: Add support for a new DEV_MSI
> irq domain
>
> On Thu, Aug 06, 2020 at 12:13:24AM +0000, Dey, Megha wrote:
> > > Well, I had suggested to pass in the parent struct device, but it
> > > could certainly use an irq_domain instead:
> > >
> > > platform_msi_assign_domain(dev,
> > > device_to_iommu(p_dev)->ir_domain);
> > >
> > > Or
> > >
> > > platform_msi_assign_domain(dev, pdev->msi_domain)
> > >
> > > ?
> > >
> > > Any maybe the natural expression is to add a version of
> > > platform_msi_create_device_domain() that accepts a parent
> > > irq_domain() and if the device doesn't already have a msi_domain
> > > then it creates one. Might be too tricky to manage lifetime of the new
> irq_domain though..
> > >
> > > It feels cleaner to me if everything related to this is contained in
> > > the platform_msi and the driver using it. Not sure it makes sense to
> > > involve the iommu?
> >
> > Well yeah something like this can be done, but what is the missing
> > piece is where the IRQ domain actually gets created, i.e where this
> > new version of platform_msi_create_device_domain() is called. That is
> > the only piece that is currently done in the IOMMU driver only for DSA
> > mdev. Not that all devices need to do it this way.. do you have
> > suggestions as to where you want to call this function?
>
> Oops, I was thinking of platform_msi_domain_alloc_irqs() not
> create_device_domain()
>
> ie call it in the device driver that wishes to consume the extra MSIs.
>
> Is there a harm if each device driver creates a new irq_domain for its use?
Well, the only harm is if we want to reuse the irq domain.
As of today, only the DSA mdev uses the dev-msi domain. In the IRQ domain hierarchy we will have this:
vector -> intel-ir -> dev-msi
So tomorrow, if we have a new device that also wants intel-ir as its parent and uses the same domain ops, we will simply be creating a copy of this IRQ domain, which may not be very fruitful.
But apart from that, I don't think there are any issues..
What do you think is the best approach here?
>
> Jason
On Thu, Aug 06, 2020 at 12:32:31AM +0000, Dey, Megha wrote:
> > Oops, I was thinking of platform_msi_domain_alloc_irqs() not
> > create_device_domain()
> >
> > ie call it in the device driver that wishes to consume the extra MSIs.
> >
> > Is there a harm if each device driver creates a new irq_domain for its use?
>
> Well, the only harm is if we want to reuse the irq domain.
>
> As of today, we only have DSA mdev which uses the dev-msi domain. In the IRQ domain hierarchy,
> We will have this:
>
> Vector-> intel-ir->dev-msi
>
> So tmrw if we have a new device, which would also want to have the
> intel-ir as the parent and use the same domain ops, we will simply
> be creating a copy of this IRQ domain, which may not be very
> fruitful.
>
> But apart from that, I don't think there are any issues..
>
> What do you think is the best approach here?
I've surely forgotten these details, I can't advise if duplicate
irq_domains are very bad.
A single domain per parent irq_domain does seem more elegant, but I'm
not sure it is worth the extra work to do it?
In any event the API seems cleaner if it is all contained in the
platform_msi and strongly connected to the driver, not spread to the
iommu as well. If it had to create single dev-msi domain per parent
irq_domain then it certainly could be done in a few ways.
Jason
On Thu, 23 Jul 2020 21:19:30 -0300
Jason Gunthorpe <[email protected]> wrote:
> On Tue, Jul 21, 2020 at 11:54:49PM +0000, Tian, Kevin wrote:
> > In a nutshell, applications don't require raw WQ controllability as guest
> > kernel drivers may expect. Extending DSA user space interface to be another
> > passthrough interface just for virtualization needs is less compelling than
> > leveraging established VFIO/mdev framework (with the major merit that
> > existing user space VMMs just work w/o any change as long as they already
> > support VFIO uAPI).
>
> Sure, but the above is how the cover letter should have summarized
> that discussion, not as "it is not much code difference"
>
> > In last review you said that you didn't hard nak this approach and would
> > like to hear opinion from virtualization guys. In this version we CCed KVM
> > mailing list, Paolo (VFIO/Qemu), Alex (VFIO), Samuel (Rust-VMM/Cloud
> > hypervisor), etc. Let's see how they feel about this approach.
>
> Yes, the VFIO community should decide.
>
> If we are doing emulation tasks in the kernel now, then I can think of
> several nice semi-emulated mdevs to propose.
>
> This will not be some one off, but the start of a widely copied
> pattern.
And that's definitely a concern, there should be a reason for
implementing device emulation in the kernel beyond an easy path to get
a device exposed up through a virtualization stack. The entire idea of
mdev is the mediation of access to a device to make it safe for a user
and to fit within the vfio device API. Mediation, emulation, and
virtualization can be hard to differentiate, and there is some degree of
emulation required to fill out the device API, for vfio-pci itself
included. So I struggle with a specific measure of where to draw the
line, and also whose authority it is to draw that line. I don't think
it's solely mine, that's something we need to decide as a community.
If you see this as an abuse of the framework, then let's identify those
specific issues and come up with a better approach. As we've discussed
before, things like basic PCI config space emulation are acceptable
overhead and low risk (imo) and some degree of register emulation is
well within the territory of an mdev driver. Drivers are accepting
some degree of increased attack surface by each addition of a uAPI and
the complexity of those uAPIs, but it seems largely a decision for
those drivers whether they're willing to take on that responsibility
and burden.
At some point, possibly in the near-ish future, we might have a
vfio-user interface with userspace vfio-over-socket servers that might
be able to consume existing uAPIs and offload some of this complexity
and emulation to userspace while still providing an easy path to insert
devices into the virtualization stack. Hopefully if/when that comes
along, it would provide these sorts of drivers an opportunity to
offload some of the current overhead out to userspace, but I'm not sure
it's worth denying a mainline implementation now. Thanks,
Alex
Megha,
"Dey, Megha" <[email protected]> writes:
>> -----Original Message-----
>> From: Jason Gunthorpe <[email protected]>
<SNIP>
>> Subject: Re: [PATCH RFC v2 02/18] irq/dev-msi: Add support for a new DEV_MSI
>> irq domain
can you please fix your mail client not to copy the whole header of the
mail you are replying to into the mail body?
>> > > Well, I had suggested to pass in the parent struct device, but it
>> Oops, I was thinking of platform_msi_domain_alloc_irqs() not
>> create_device_domain()
>>
>> ie call it in the device driver that wishes to consume the extra MSIs.
>>
>> Is there a harm if each device driver creates a new irq_domain for its use?
>
> Well, the only harm is if we want to reuse the irq domain.
You cannot reuse the irq domain if you create a domain per driver. The
way how hierarchical domains work is:
vector --- DMAR-MSI
         |
         |-- ....
         |
         |-- IR-0 --- IO/APIC-0
         |          |
         |          |-- IO/APIC-1
         |          |
         |          |-- PCI/MSI-0
         |          |
         |          |-- HPET/MSI-0
         |
         |-- IR-1 --- PCI/MSI-1
                    |
The outermost domain is what the actual device driver uses. I.e. for
PCI-MSI it's the msi domain which is associated to the bus the device is
connected to. Each domain has its own interrupt chip instance and its
own data set.
Domains of the same type share the code, but neither the data nor the
interrupt chip instance.
Also there is a strict parent child relationship in terms of resources.
Let's look at PCI.
PCI/MSI-0 depends on IR-0 which depends on the vector domain. That's
reflecting both the flow of the interrupt and the steps required for
various tasks, e.g. allocation/deallocation and also interrupt chip
operations. In order to allocate a PCI/MSI interrupt in domain PCI/MSI-0
a slot in the remapping unit and a vector needs to be allocated.
If you disable interrupt remapping, all the outermost domains in the
scheme above become children of the vector domain.
So if we look at DEV/MSI as a infrastructure domain then the scheme
looks like this:
vector --- DMAR-MSI
         |
         |-- ....
         |
         |-- IR-0 --- IO/APIC-0
         |          |
         |          |-- IO/APIC-1
         |          |
         |          |-- PCI/MSI-0
         |          |
         |          |-- HPET/MSI-0
         |          |
         |          |-- DEV/MSI-0
         |
         |-- IR-1 --- PCI/MSI-1
                    |
                    |-- DEV/MSI-1
But if you make it per device then you have multiple DEV/MSI domains per
IR unit.
What's the right thing to do?
If the DEV/MSI domain has it's own per IR unit resource management, then
you need one per IR unit.
If the resource management is solely per device then having a domain per
device is the right choice.
Thanks,
tglx
Hi Thomas,
On 8/6/2020 10:10 AM, Thomas Gleixner wrote:
> Megha,
>
> "Dey, Megha" <[email protected]> writes:
>
>>> -----Original Message-----
>>> From: Jason Gunthorpe <[email protected]>
> <SNIP>
>>> Subject: Re: [PATCH RFC v2 02/18] irq/dev-msi: Add support for a new DEV_MSI
>>> irq domain
> can you please fix your mail client not to copy the whole header of the
> mail you are replying to into the mail body?
Oops, I hope I have fixed it now..
>
>>>>> Well, I had suggested to pass in the parent struct device, but it
>>> Oops, I was thinking of platform_msi_domain_alloc_irqs() not
>>> create_device_domain()
>>>
>>> ie call it in the device driver that wishes to consume the extra MSIs.
>>>
>>> Is there a harm if each device driver creates a new irq_domain for its use?
>> Well, the only harm is if we want to reuse the irq domain.
> You cannot reuse the irq domain if you create a domain per driver. The
> way how hierarchical domains work is:
>
> vector --- DMAR-MSI
>          |
>          |-- ....
>          |
>          |-- IR-0 --- IO/APIC-0
>          |          |
>          |          |-- IO/APIC-1
>          |          |
>          |          |-- PCI/MSI-0
>          |          |
>          |          |-- HPET/MSI-0
>          |
>          |-- IR-1 --- PCI/MSI-1
>                     |
>
> The outermost domain is what the actual device driver uses. I.e. for
> PCI-MSI it's the msi domain which is associated to the bus the device is
> connected to. Each domain has its own interrupt chip instance and its
> own data set.
>
> Domains of the same type share the code, but neither the data nor the
> interrupt chip instance.
>
> Also there is a strict parent child relationship in terms of resources.
> Let's look at PCI.
>
> PCI/MSI-0 depends on IR-0 which depends on the vector domain. That's
> reflecting both the flow of the interrupt and the steps required for
> various tasks, e.g. allocation/deallocation and also interrupt chip
> operations. In order to allocate a PCI/MSI interrupt in domain PCI/MSI-0
> a slot in the remapping unit and a vector needs to be allocated.
>
> If you disable interrupt remapping all the outermost domains in the
> scheme above become childs of the vector domain.
>
> So if we look at DEV/MSI as a infrastructure domain then the scheme
> looks like this:
>
> vector --- DMAR-MSI
>          |
>          |-- ....
>          |
>          |-- IR-0 --- IO/APIC-0
>          |          |
>          |          |-- IO/APIC-1
>          |          |
>          |          |-- PCI/MSI-0
>          |          |
>          |          |-- HPET/MSI-0
>          |          |
>          |          |-- DEV/MSI-0
>          |
>          |-- IR-1 --- PCI/MSI-1
>                     |
>                     |-- DEV/MSI-1
>
>
> But if you make it per device then you have multiple DEV/MSI domains per
> IR unit.
>
> What's the right thing to do?
>
> If the DEV/MSI domain has it's own per IR unit resource management, then
> you need one per IR unit.
>
> If the resource management is solely per device then having a domain per
> device is the right choice.
Thanks a lot, Thomas, for this detailed explanation!
The dev-msi domain can be used by other devices if they too want to follow the
vector -> intel IR -> dev-msi IRQ hierarchy.
I do create one dev-msi IRQ domain instance per IR unit. So I guess for this case,
it makes most sense to have a dev-msi IRQ domain per IR unit as opposed to creating
one per individual driver..
>
> Thanks,
>
> tglx
Megha,
"Dey, Megha" <[email protected]> writes:
> On 8/6/2020 10:10 AM, Thomas Gleixner wrote:
>> If the DEV/MSI domain has it's own per IR unit resource management, then
>> you need one per IR unit.
>>
>> If the resource management is solely per device then having a domain per
>> device is the right choice.
>
> The dev-msi domain can be used by other devices if they too would want
> to follow the vector->intel IR->dev-msi IRQ hierarchy. I do create
> one dev-msi IRQ domain instance per IR unit. So I guess for this case,
> it makes most sense to have a dev-msi IRQ domain per IR unit as
> opposed to create one per individual driver..
I'm not really convinced. I looked at the idxd driver and that has its
own interrupt related resource management for the IMS slots and provides
the mask,unmask callbacks for the interrupt chip via this crude platform
data indirection.
So I don't see the value of the dev-msi domain per IR unit. The domain
itself does not provide much functionality other than indirections and
you clearly need per device interrupt resource management on the side
and a customized irq chip. I rather see it as a plain layering
violation.
The point is that your IDXD driver manages the per device IMS slots
which is a interrupt related resource. The story would be different if
the IMS slots would be managed by some central or per IR unit entity,
but in that case you'd need IMS specific domain(s).
So the obvious consequence of the hierarchical irq design is:
vector -> IR -> IDXD
which makes the control flow of allocating an interrupt for a subdevice
straight forward following the irq hierarchy rules.
This still wants to inherit the existing msi domain functionality, but
the amount of code required is small and removes all these pointless
indirections and integrates the slot management naturally.
If you expect or know that there are other devices coming up with IMS
integrated then most of that code can be made a common library. But for
this to make sense, you really want to make sure that these other
devices do not require yet another horrible layer of indirection.
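For concreteness, a sketch of how the per-device arrangement described above could be set up; the
function name, the msi_domain_info and the way the parent domain is looked up are assumptions for
illustration, not code from this series:

#include <linux/irqdomain.h>
#include <linux/msi.h>
#include <linux/pci.h>

static struct irq_domain *idxd_create_ims_domain(struct pci_dev *pdev,
						 struct msi_domain_info *info)
{
	struct irq_domain *parent, *domain;
	struct fwnode_handle *fn;

	/*
	 * The parent is whatever sits above the device's PCI/MSI domain:
	 * the IR unit when remapping is enabled, the vector domain otherwise.
	 */
	parent = dev_get_msi_domain(&pdev->dev)->parent;

	fn = irq_domain_alloc_named_id_fwnode("IDXD-IMS", pci_dev_id(pdev));
	if (!fn)
		return NULL;

	domain = msi_create_irq_domain(fn, info, parent);
	if (!domain)
		irq_domain_free_fwnode(fn);

	return domain;
}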
A side note: I just read back on the specification and stumbled over
the following gem:
"IMS may also optionally support per-message masking and pending bit
status, similar to the per-vector mask and pending bit array in the
PCI Express MSI-X capability."
Optionally? Please tell the hardware folks to make this mandatory. We
have enough pain with non maskable MSI interrupts already so introducing
yet another non maskable interrupt trainwreck is not an option.
It's more than a decade now that I tell HW people not to repeat the
non-maskable MSI failure, but obviously they still think that
non-maskable interrupts are a brilliant idea. I know that HW folks
believe that everything they omit can be fixed in software, but they
have to finally understand that this particular issue _cannot_ be fixed
at all.
Thanks,
tglx
Hi Thomas,
On 8/6/2020 1:21 PM, Thomas Gleixner wrote:
> Megha,
>
> "Dey, Megha" <[email protected]> writes:
>> On 8/6/2020 10:10 AM, Thomas Gleixner wrote:
>>> If the DEV/MSI domain has it's own per IR unit resource management, then
>>> you need one per IR unit.
>>>
>>> If the resource management is solely per device then having a domain per
>>> device is the right choice.
>> The dev-msi domain can be used by other devices if they too would want
>> to follow the vector->intel IR->dev-msi IRQ hierarchy. I do create
>> one dev-msi IRQ domain instance per IR unit. So I guess for this case,
>> it makes most sense to have a dev-msi IRQ domain per IR unit as
>> opposed to create one per individual driver..
> I'm not really convinced. I looked at the idxd driver and that has it's
> own interrupt related resource management for the IMS slots and provides
> the mask,unmask callbacks for the interrupt chip via this crude platform
> data indirection.
>
> So I don't see the value of the dev-msi domain per IR unit. The domain
> itself does not provide much functionality other than indirections and
> you clearly need per device interrupt resource management on the side
> and a customized irq chip. I rather see it as a plain layering
> violation.
>
> The point is that your IDXD driver manages the per device IMS slots
> which is a interrupt related resource. The story would be different if
> the IMS slots would be managed by some central or per IR unit entity,
> but in that case you'd need IMS specific domain(s).
>
> So the obvious consequence of the hierarchical irq design is:
>
> vector -> IR -> IDXD
>
> which makes the control flow of allocating an interrupt for a subdevice
> straight forward following the irq hierarchy rules.
>
> This still wants to inherit the existing msi domain functionality, but
> the amount of code required is small and removes all these pointless
> indirections and integrates the slot management naturally.
>
> If you expect or know that there are other devices coming up with IMS
> integrated then most of that code can be made a common library. But for
> this to make sense, you really want to make sure that these other
> devices do not require yet another horrible layer of indirection.
Yes Thomas, for now this may look odd since there is only one device using this
IRQ domain. But there will be other devices following suit, so I have added
all the IRQ chip/domain bits in a separate file under drivers/irqchip in the next
version of the patches. I'll submit the patches shortly and it would be great
if I can get more feedback on them.
> A side note: I just read back on the specification and stumbled over
> the following gem:
>
> "IMS may also optionally support per-message masking and pending bit
> status, similar to the per-vector mask and pending bit array in the
> PCI Express MSI-X capability."
>
> Optionally? Please tell the hardware folks to make this mandatory. We
> have enough pain with non maskable MSI interrupts already so introducing
> yet another non maskable interrupt trainwreck is not an option.
>
> It's more than a decade now that I tell HW people not to repeat the
> non-maskable MSI failure, but obviously they still think that
> non-maskable interrupts are a brilliant idea. I know that HW folks
> believe that everything they omit can be fixed in software, but they
> have to finally understand that this particular issue _cannot_ be fixed
> at all.
Hmm, I asked the hardware folks and they have informed me that all IMS devices
will support per-vector masking and pending bits. This will be reflected in the
next SIOV spec, which will be published soon.
>
> Thanks,
>
> tglx
Megha,
"Dey, Megha" <[email protected]> writes:
> On 8/6/2020 1:21 PM, Thomas Gleixner wrote:
>> If you expect or know that there are other devices coming up with IMS
>> integrated then most of that code can be made a common library. But for
>> this to make sense, you really want to make sure that these other
>> devices do not require yet another horrible layer of indirection.
>
> Yes Thomas, for now this may look odd since there is only one device
> using this IRQ domain. But there will be other devices following suit,
> hence I have added all the IRQ chip/domain bits in a separate file in
> drivers/irqchip in the next version of patches. I'll submit the
> patches shortly and it will be great if I can get more feedback on it.
Again. The common domain makes only sense if it provides actual
functionality and resource management at the domain level. The IMS slot
management CANNOT happen at the common domain level simply because IMS
is strictly per device. So your "common" domain is just a shim layer
which pretends to be common and requires warts at the side to do the IMS
management at the device level.
Let's see what you came up with this time :)
>> A side note: I just read back on the specification and stumbled over
>> the following gem:
>>
>> "IMS may also optionally support per-message masking and pending bit
>> status, similar to the per-vector mask and pending bit array in the
>> PCI Express MSI-X capability."
>>
>> Optionally? Please tell the hardware folks to make this mandatory. We
>> have enough pain with non maskable MSI interrupts already so introducing
>> yet another non maskable interrupt trainwreck is not an option.
>>
>> It's more than a decade now that I tell HW people not to repeat the
>> non-maskable MSI failure, but obviously they still think that
>> non-maskable interrupts are a brilliant idea. I know that HW folks
>> believe that everything they omit can be fixed in software, but they
>> have to finally understand that this particular issue _cannot_ be fixed
>> at all.
>
> hmm, I asked the hardware folks and they have informed me that all IMS
> devices will support per vector masking/pending bit. This will be
> updated in the next SIOV spec which will be published soon.
I seriously hope so...
Thanks,
tglx
On Thu, Aug 06, 2020 at 10:21:11PM +0200, Thomas Gleixner wrote:
> Optionally? Please tell the hardware folks to make this mandatory. We
> have enough pain with non maskable MSI interrupts already so introducing
> yet another non maskable interrupt trainwreck is not an option.
Can you elaborate on the flows where Linux will need to trigger
masking?
I expect that masking will be available in our NIC HW too - but it
will require a spin loop if masking has to be done in an atomic
context.
> It's more than a decade now that I tell HW people not to repeat the
> non-maskable MSI failure, but obviously they still think that
> non-maskable interrupts are a brilliant idea. I know that HW folks
> believe that everything they omit can be fixed in software, but they
> have to finally understand that this particular issue _cannot_ be fixed
> at all.
Sure, the CPU should always be able to shut off an interrupt!
Maybe explaining the goals would help understand the HW perspective.
Today HW can process > 100k queues of work at once. Interrupt delivery
works by having a MSI index in each queue's metadata and the interrupt
indirects through a MSI-X table on-chip which has the
addr/data/mask/etc.
What IMS proposes is that the interrupt data can move into the queue
metadata (which is not required to be on-chip), e.g. alongside the
producer/consumer pointers, and the central MSI-X table is not
needed. This is necessary because the PCI spec has very harsh design
requirements for a MSI-X table that make scaling it prohibitive.
So an IRQ can be silenced by deleting or stopping the queue(s)
triggering it. It can be masked by including masking in the queue
metadata. We can detect pending by checking the producer/consumer
values.
However synchronizing all the HW and all the state is now more
complicated than just writing a mask bit via MMIO to an on-die memory.
Jason
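As a purely illustrative sketch of the layout described above (the structure
and field names here are invented, not taken from any real driver), the
interrupt fields simply live next to the queue's own bookkeeping in host
memory:

#include <linux/types.h>

/*
 * Hypothetical queue context living in system memory. With IMS the
 * message address/data sit next to the queue's own bookkeeping instead
 * of in a central on-chip MSI-X table.
 */
struct hw_queue_ctx {
    u64 prod_idx;   /* producer pointer */
    u64 cons_idx;   /* consumer pointer */
    u64 msi_addr;   /* IMS: message address for this queue */
    u32 msi_data;   /* IMS: message data */
    u32 flags;      /* e.g. a per-queue mask bit, if supported */
};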
On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> If you see this as an abuse of the framework, then let's identify those
> specific issues and come up with a better approach. As we've discussed
> before, things like basic PCI config space emulation are acceptable
> overhead and low risk (imo) and some degree of register emulation is
> well within the territory of an mdev driver.
What troubles me is that idxd already has a direct userspace interface
to its HW, and does userspace DMA. The purpose of this mdev is to
provide a second direct userspace interface that is a little different
and trivially plugs into the virtualization stack.
I don't think VFIO should be the only entry point to
virtualization. If we say the universe of devices doing user space DMA
must also implement a VFIO mdev to plug into virtualization then it
will be a lot of mdevs.
I would prefer to see that the existing userspace interface have the
extra needed bits for virtualization (eg by having appropriate
internal kernel APIs to make this easy) and all the emulation to build
the synthetic PCI device be done in userspace.
Not only is it better for security, it keeps things to one device
driver per device..
Jason
On Fri, Aug 07, 2020 at 09:06:50AM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 06, 2020 at 10:21:11PM +0200, Thomas Gleixner wrote:
>
> > Optionally? Please tell the hardware folks to make this mandatory. We
> > have enough pain with non maskable MSI interrupts already so introducing
> > yet another non maskable interrupt trainwreck is not an option.
>
> Can you elaborate on the flows where Linux will need to trigger
> masking?
>
> I expect that masking will be available in our NIC HW too - but it
> will require a spin loop if masking has to be done in an atomic
> context.
>
> > It's more than a decade now that I tell HW people not to repeat the
> > non-maskable MSI failure, but obviously they still think that
> > non-maskable interrupts are a brilliant idea. I know that HW folks
> > believe that everything they omit can be fixed in software, but they
> > have to finally understand that this particular issue _cannot_ be fixed
> > at all.
>
> Sure, the CPU should always be able to shut off an interrupt!
>
> Maybe explaining the goals would help understand the HW perspective.
>
> Today HW can process > 100k queues of work at once. Interrupt delivery
> works by having a MSI index in each queue's metadata and the interrupt
> indirects through a MSI-X table on-chip which has the
> addr/data/mask/etc.
>
> What IMS proposes is that the interrupt data can move into the queue
> meta data (which is not required to be on-chip), eg along side the
> producer/consumer pointers, and the central MSI-X table is not
> needed. This is necessary because the PCI spec has very harsh design
> requirements for a MSI-X table that make scaling it prohibitive.
>
> So an IRQ can be silenced by deleting or stopping the queue(s)
> triggering it. It can be masked by including masking in the queue
> metadata. We can detect pending by checking the producer/consumer
> values.
>
> However synchronizing all the HW and all the state is now more
> complicated than just writing a mask bit via MMIO to an on-die memory.
Because doing all of the work that used to be done in HW in software is
so much faster and scalable? Feels really wrong to me :(
Do you all have a pointer to the spec for this newly proposed stuff
anywhere to try to figure out how the HW wants this to all work?
thanks,
greg k-h
On Fri, Aug 07, 2020 at 02:38:31PM +0200, [email protected] wrote:
> On Fri, Aug 07, 2020 at 09:06:50AM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 06, 2020 at 10:21:11PM +0200, Thomas Gleixner wrote:
> >
> > > Optionally? Please tell the hardware folks to make this mandatory. We
> > > have enough pain with non maskable MSI interrupts already so introducing
> > > yet another non maskable interrupt trainwreck is not an option.
> >
> > Can you elaborate on the flows where Linux will need to trigger
> > masking?
> >
> > I expect that masking will be available in our NIC HW too - but it
> > will require a spin loop if masking has to be done in an atomic
> > context.
> >
> > > It's more than a decade now that I tell HW people not to repeat the
> > > non-maskable MSI failure, but obviously they still think that
> > > non-maskable interrupts are a brilliant idea. I know that HW folks
> > > believe that everything they omit can be fixed in software, but they
> > > have to finally understand that this particular issue _cannot_ be fixed
> > > at all.
> >
> > Sure, the CPU should always be able to shut off an interrupt!
> >
> > Maybe explaining the goals would help understand the HW perspective.
> >
> > Today HW can process > 100k queues of work at once. Interrupt delivery
> > works by having a MSI index in each queue's metadata and the interrupt
> > indirects through a MSI-X table on-chip which has the
> > addr/data/mask/etc.
> >
> > What IMS proposes is that the interrupt data can move into the queue
> > meta data (which is not required to be on-chip), eg along side the
> > producer/consumer pointers, and the central MSI-X table is not
> > needed. This is necessary because the PCI spec has very harsh design
> > requirements for a MSI-X table that make scaling it prohibitive.
> >
> > So an IRQ can be silenced by deleting or stopping the queue(s)
> > triggering it. It can be masked by including masking in the queue
> > metadata. We can detect pending by checking the producer/consumer
> > values.
> >
> > However synchronizing all the HW and all the state is now more
> > complicated than just writing a mask bit via MMIO to an on-die memory.
>
> Because doing all of the work that used to be done in HW in software is
> so much faster and scalable? Feels really wrong to me :(
Yes, it is more scalable. The problem with MSI-X is you need actual
physical silicon for each and every vector. This really limits the
number of vectors.
Placing the vector metadata with the queue means it can potentially
live in system memory which is significantly more scalable.
Setup/mask/unmask will be slower. The driver might have more
complexity. But they are not in the performance path, right?
I don't think it is wrong or right. IMHO the current design where the
addr/data is hidden inside the platform is an artifact of x86's
compatibility legacy back when there was no such thing as message
interrupts.
If you were starting from a green field I don't think a design would
include the IOAPIC/MSI/MSI-X indirection tables.
> Do you all have a pointer to the spec for this newly proposed stuff
> anywhere to try to figure out how the HW wants this to all work?
Intel's SIOV document is an interesting place to start:
https://software.intel.com/content/www/us/en/develop/download/intel-scalable-io-virtualization-technical-specification.html
Though it is more of a rationale and a cookbook on how to combine
existing technology pieces. (eg PASID, platform_msi, etc)
The basic approach of SIOV's IMS is that there is no longer a generic
interrupt indirection from numbers to addr/data pairs like
IOAPIC/MSI/MSI-X owned by the common OS code.
Instead the driver itself is responsible to set the addr/data pair
into the device in a device specific way, deal with masking, etc.
This lets the device use an implementation that is not limited by the
harsh MSI-X semantics.
In Linux we already have 'IMS' it is called platform_msi and a few
embedded drivers already work like this. The idea here is to bring it
to PCI.
Jason
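For reference, a minimal sketch of how an existing platform_msi user looks
with the current kernel API; platform_msi_domain_alloc_irqs() and the
callback signature are the real ones, while my_dev and my_dev_program_vector()
are placeholders:

#include <linux/device.h>
#include <linux/msi.h>

/* Device specific message write: the driver stores the address/data pair
 * wherever its hardware expects it (queue metadata, MMIO registers, ...). */
static void my_dev_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
{
    struct my_dev *md = dev_get_drvdata(msi_desc_to_dev(desc));

    my_dev_program_vector(md, desc->platform.msi_index, msg);
}

static int my_dev_setup_irqs(struct device *dev, unsigned int nvec)
{
    /* The core allocates nvec interrupts and calls back into
     * my_dev_write_msi_msg() so the driver can store the messages. */
    return platform_msi_domain_alloc_irqs(dev, nvec, my_dev_write_msi_msg);
}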
Jason,
Jason Gunthorpe <[email protected]> writes:
> On Thu, Aug 06, 2020 at 10:21:11PM +0200, Thomas Gleixner wrote:
>
>> Optionally? Please tell the hardware folks to make this mandatory. We
>> have enough pain with non maskable MSI interrupts already so introducing
>> yet another non maskable interrupt trainwreck is not an option.
>
> Can you elaborate on the flows where Linux will need to trigger
> masking?
1) disable/enable_irq() obviously needs masking
2) Affinity changes are preferably done with masking to avoid a
   boatload of nasty side effects. We have a "fix" for 32bit addressing
   mode which works by chance due to the layout but it would fail
   miserably with 64bit addressing mode. 64bit addressing mode is only
   relevant for more than 256 CPUs which requires X2APIC which in turn
   requires interrupt remapping. Interrupt remapping saves us here
   because the interrupt can be disabled at the remapping level.
3) The ability to shutdown an irq at the interrupt level in case of
malfunction. Of course that's pure paranoia because devices are
perfect and never misbehave :)
So it's nowhere in the hot path of interrupt handling itself.
> I expect that masking will be available in our NIC HW too - but it
> will require a spin loop if masking has to be done in an atomic
> context.
Yes, it's all in atomic context.
We have functionality in the interrupt core to do #1 and #2 from task
context (requires the caller to be in task context as well). #3 not so
much.
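For illustration, #1 above is just the ordinary driver-facing API; a trivial
sketch (my_dev and reconfigure_queue() are placeholders):

#include <linux/interrupt.h>

static void my_driver_quiesce(struct my_dev *md)
{
    disable_irq(md->irq);       /* relies on the irq chip being maskable */
    reconfigure_queue(md);      /* placeholder for the actual work */
    enable_irq(md->irq);
}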
>> It's more than a decade now that I tell HW people not to repeat the
>> non-maskable MSI failure, but obviously they still think that
>> non-maskable interrupts are a brilliant idea. I know that HW folks
>> believe that everything they omit can be fixed in software, but they
>> have to finally understand that this particular issue _cannot_ be fixed
>> at all.
>
> Sure, the CPU should always be able to shut off an interrupt!
Oh yes!
> Maybe explaining the goals would help understand the HW perspective.
>
> Today HW can process > 100k queues of work at once. Interrupt delivery
> works by having a MSI index in each queue's metadata and the interrupt
> indirects through a MSI-X table on-chip which has the
> addr/data/mask/etc.
>
> What IMS proposes is that the interrupt data can move into the queue
> meta data (which is not required to be on-chip), eg along side the
> producer/consumer pointers, and the central MSI-X table is not
> needed. This is necessary because the PCI spec has very harsh design
> requirements for a MSI-X table that make scaling it prohibitive.
I know.
> So an IRQ can be silenced by deleting or stopping the queue(s)
> triggering it.
We cannot do that from the interrupt layer without squaring the
circle and violating all locking and layering rules in one go.
> It can be masked by including masking in the queue metadata. We can
> detect pending by checking the producer/consumer values.
>
> However synchronizing all the HW and all the state is now more
> complicated than just writing a mask bit via MMIO to an on-die memory.
That's one of the reasons why I think that the IMS handling has to be a
per device irqdomain with its own interrupt chip because the way
IMS is managed is completely device specific.
There is certainly opportunity for sharing some of the functionality and
code, but not by creating a pseudo-shared entity which is customized per
device with indirections and magic storage plus device specific IMS slot
management glued at it as a wart. Such concepts fall apart in no time or
end up in a completely unmaintainable mess.
Coming back to mask/unmask. We could lift that requirement if and only
if irq remapping is mandatory to make use of those magic devices because
the remapping unit allows us to do the masking. That still would not
justify the pseudo-shared irqdomain because the IMS slot management
still stays per device.
Thanks,
tglx
Jason Gunthorpe <[email protected]> writes:
> Though it is more of a rational and a cookbook on how to combine
> existing technology pieces. (eg PASID, platform_msi, etc)
>
> The basic approach of SIOV's IMS is that there is no longer a generic
> interrupt indirection from numbers to addr/data pairs like
> IOAPIC/MSI/MSI-X owned by the common OS code.
>
> Instead the driver itself is responsible to set the addr/data pair
> into the device in a device specific way, deal with masking, etc.
>
> This lets the device use an implementation that is not limited by the
> harsh MSI-X semantics.
>
> In Linux we already have 'IMS' it is called platform_msi and a few
> embedded drivers already work like this. The idea here is to bring it
> to PCI.
platform_msi as it exists today is a crutch and in hindsight I should
have paid more attention back then and shot it down before it got
merged.
IMS can be somehow mapped to platform MSI but the proposed approach to
extend platform MSI with the extra bolts for IMS (valid for one
particular incarnation) is just going in the wrong direction.
We've been there and the main reason why hierarchical irq domains exist
is that we needed to make a clear cut between the involved hardware
pieces and their drivers. The pre hierarchy model was a maze of stuff
calling back and forth between layers with lots of duct tape added to
make it "work". This finally fell apart when Intel tried to support
I/O-APIC hotplug. The ARM people had similar issues with all the special
irq related SoC specific IP blocks which are placed between the CPU
level interrupt controller and the device.
The hierarchy strictly separates the per layer resource management and
each layer can work mostly independently of the actual available parent
layer.
Now looking at IMS. It's a subsystem inside a physical device. It has
slot management (where to place the Message) and mask/unmask. Resource
management at that level is what irq domains are for and mask/unmask is
what an irq chip handles.
So the right thing to do is to create shared infrastructure which is
utilized by the device drivers by providing a few bog standard data
structures and the handful of device specific domain and irq functions.
That keeps the functionality common, but avoids that we end up with
- msi_desc becoming a dumping ground for random driver data
- a zoo of platform callbacks
- glued on driver specific resource management
and all the great hacks which it requires to work on hundreds of
different devices which all implement IMS differently.
I'm all for sharing code and making the life of driver writers simple
because that makes my life simple as well, but not by creating a layer
at the wrong level and then hacking it into submission until it finally
collapses.
Designing the infrastructure following the clear layering rules of
hierarchical domains so it works for IMS and also replaces the platform
MSI hack is the only sane way to go forward, not the other way round.
Thanks,
tglx
Hi Thomas,
On 8/7/2020 9:47 AM, Thomas Gleixner wrote:
> Jason Gunthorpe <[email protected]> writes:
>> Though it is more of a rational and a cookbook on how to combine
>> existing technology pieces. (eg PASID, platform_msi, etc)
>>
>> The basic approach of SIOV's IMS is that there is no longer a generic
>> interrupt indirection from numbers to addr/data pairs like
>> IOAPIC/MSI/MSI-X owned by the common OS code.
>>
>> Instead the driver itself is responsible to set the addr/data pair
>> into the device in a device specific way, deal with masking, etc.
>>
>> This lets the device use an implementation that is not limited by the
>> harsh MSI-X semantics.
>>
>> In Linux we already have 'IMS' it is called platform_msi and a few
>> embedded drivers already work like this. The idea here is to bring it
>> to PCI.
> platform_msi as it exists today is a crutch and in hindsight I should
> have payed more attention back then and shoot it down before it got
> merged.
>
> IMS can be somehow mapped to platform MSI but the proposed approach to
> extend platform MSI with the extra bolts for IMS (valid for one
> particular incarnation) is just going into the wrong direction.
>
> We've been there and the main reason why hierarchical irq domains exist
> is that we needed to make a clear cut between the involved hardware
> pieces and their drivers. The pre hierarchy model was a maze of stuff
> calling back and forth between layers with lots of duct tape added to
> make it "work". This finally fell apart when Intel tried to support
> I/O-APIC hotplug. The ARM people had similar issues with all the special
> irq related SoC specific IP blocks which are placed between the CPU
> level interrupt controller and the device.
>
> The hierarchy strictly seperates the per layer resource management and
> each layer can work mostly independent of the actual available parent
> layer.
>
> Now looking at IMS. It's a subsystem inside a physical device. It has
> slot management (where to place the Message) and mask/unmask. Resource
> management at that level is what irq domains are for and mask/unmask is
> what a irq chip handles.
>
> So the right thing to do is to create shared infrastructure which is
> utilized by the device drivers by providing a few bog standard data
> structures and the handful of device specific domain and irq functions.
>
> That keeps the functionality common, but avoids that we end up with
>
> - msi_desc becoming a dump ground for random driver data
>
> - a zoo of platform callbacks
>
> - glued on driver specific resource management
>
> and all the great hacks which it requires to work on hundreds of
> different devices which all implement IMS differently.
>
> I'm all for sharing code and making the life of driver writers simple
> because that makes my life simple as well, but not by creating a layer
> at the wrong level and then hacking it into submission until it finally
> collapses.
>
> Designing the infrastructure following the clear layering rules of
> hierarchical domains so it works for IMS and also replaces the platform
> MSI hack is the only sane way to go forward, not the other way round.
From what I've gathered, I need to:
1. Get rid of the mantra that "IMS" is an extension of platform-msi.
2. Make this new infra devoid of any platform-msi references
3. Come up with a ground up approach which adheres to the layering
constraints of the IRQ subsystem
4. Have common code (drivers/irqchip maybe??) where we put in all the
generic ims-specific bits for the IRQ chip and domain
which can be used by all device drivers belonging to this "IMS" class.
5. Have the device driver do the rest:
   - create the chip/domain (one chip/domain per device?)
   - provide device specific callbacks for masking, unmasking, write message
So from the hierarchical domain standpoint, we will have:
- For DSA device: vector->intel-IR->IDXD
- For Jason's device: root domain-> domain A-> Jason's device's IRQ domain
- For any other intel IMS device in the future which
  - does not require interrupt remapping: vector->new device IRQ domain
  - requires interrupt remapping: vector->intel-IR->new device IRQ domain
    (i.e. create a new domain even though IDXD is already present?)
Please let me know if my understanding is correct.
What I still don't understand fully: if all the IMS devices need
the same domain ops and chip callbacks, we will be creating various
instances of the same IRQ chip and domain, right? Is that ok?
Currently the creation of the IRQ domain happens at the IR level so that
we can reuse the same domain, but if it is advisable to have a per device
interrupt domain, I will shift this to the device driver.
>
> Thanks,
>
> tglx
On Fri, Aug 07, 2020 at 10:54:51AM -0700, Dey, Megha wrote:
> So from the hierarchical domain standpoint, we will have:
> - For DSA device: vector->intel-IR->IDXD
> - For Jason's device: root domain-> domain A-> Jason's device's IRQ domain
> - For any other intel IMS device in the future which
> does not require interrupt remapping: vector->new device IRQ domain
> requires interrupt remapping: vector->intel-IR->new device IRQ domain
I think you need a better classification than Jason's device or
Intel's device :)
Shouldn't the two cases be either you take the parent domain from the
IOMMU or you take the parent domain from the pci device?
What other choices could a PCI driver make?
Jason
On 8/7/2020 11:39 AM, Jason Gunthorpe wrote:
> On Fri, Aug 07, 2020 at 10:54:51AM -0700, Dey, Megha wrote:
>
>> So from the hierarchical domain standpoint, we will have:
>> - For DSA device: vector->intel-IR->IDXD
>> - For Jason's device: root domain-> domain A-> Jason's device's IRQ domain
>> - For any other intel IMS device in the future which
>> does not require interrupt remapping: vector->new device IRQ domain
>> requires interrupt remapping: vector->intel-IR->new device IRQ domain
> I think you need a better classification than Jason's device or
> Intel's device :)
hehe yeah, for sure, just wanted to get my point across :)
>
> Shouldn't the two cases be either you take the parent domain from the
> IOMMU or you take the parent domain from the pci device?
Hmm yeah, this makes sense.
Although in the case of DSA, we find the IOMMU corresponding to the
parent PCI device.
>
> What other choices could a PCI driver make?
Based on the devices we have currently, I don't think there are
any others.
>
> Jason
Megha,
"Dey, Megha" <[email protected]> writes:
> On 8/7/2020 9:47 AM, Thomas Gleixner wrote:
>> I'm all for sharing code and making the life of driver writers simple
>> because that makes my life simple as well, but not by creating a layer
>> at the wrong level and then hacking it into submission until it finally
>> collapses.
>>
>> Designing the infrastructure following the clear layering rules of
>> hierarchical domains so it works for IMS and also replaces the platform
>> MSI hack is the only sane way to go forward, not the other way round.
> From what I've gathered, I need to:
>
> 1. Get rid of the mantra that "IMS" is an extension of platform-msi.
> 2. Make this new infra devoid of any platform-msi references
See below.
> 3. Come up with a ground up approach which adheres to the layering
> constraints of the IRQ subsystem
Yes. It's something which can be used by all devices which have:
1) A device specific irq chip implementation including a msi write function
2) Device specific resource management (slots in the IMS case)
The infrastructure you need is basically a wrapper around the core MSI
domain (similar to PCI, platform-MSI etc.) which provides the specific
functionality to handle the above.
> 4. Have common code (drivers/irqchip maybe??) where we put in all the
> generic ims-specific bits for the IRQ chip and domain
> which can be used by all device drivers belonging to this "IMS"class.
Yes, you can provide a common implementation for devices which share the
same irq chip and domain (slot management functionality)
> 5. Have the device driver do the rest:
> create the chip/domain (one chip/domain per device?)
> provide device specific callbacks for masking, unmasking, write
> message
Correct, but you don't need any magic new data structures for that, the
existing msi_domain_info/msi_domain_ops and related structures are
either sufficient or can be extended when necessary.
So for the IDXD case you need:
1) An irq chip with mask/unmask callbacks and a write msg function
2) A slot allocation or association function and their 'free'
counterpart (irq_domain_ops)
The function and struct pointers go into the appropriate
msi_info/msi_ops structures along with the correct flags to set up the
whole thing and then the infrastructure creates your domain, fills in
the shared functions and sets the whole thing up.
That's all a device driver needs to provide, i.e. stick the device
specific functionality into the right data structures and let the common
infrastructure deal with it. The rest just works and the device specific
functions are invoked from the right places when required.
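As a rough sketch of 1), with an IDXD-like storage array of IMS slots: the
irq_chip callback signatures are the real ones, but the register layout, the
IMS_* offsets and the ims_base field are made up for illustration only:

#include <linux/io.h>
#include <linux/irq.h>
#include <linux/msi.h>

#define IMS_ENTRY_SIZE      16      /* made-up slot layout */
#define IMS_ADDR_LO         0x00
#define IMS_ADDR_HI         0x04
#define IMS_DATA            0x08
#define IMS_CTRL            0x0c
#define IMS_VECTOR_MASK     0x1     /* made-up per-slot mask bit */

static void idxd_ims_irq_mask(struct irq_data *data)
{
    struct idxd_device *idxd = irq_data_get_irq_chip_data(data);
    void __iomem *entry = idxd->ims_base + data->hwirq * IMS_ENTRY_SIZE;

    /* per-slot mask bit in the device's IMS storage array */
    iowrite32(ioread32(entry + IMS_CTRL) | IMS_VECTOR_MASK, entry + IMS_CTRL);
}

static void idxd_ims_irq_unmask(struct irq_data *data)
{
    struct idxd_device *idxd = irq_data_get_irq_chip_data(data);
    void __iomem *entry = idxd->ims_base + data->hwirq * IMS_ENTRY_SIZE;

    iowrite32(ioread32(entry + IMS_CTRL) & ~IMS_VECTOR_MASK, entry + IMS_CTRL);
}

static void idxd_ims_write_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
    struct idxd_device *idxd = irq_data_get_irq_chip_data(data);
    void __iomem *entry = idxd->ims_base + data->hwirq * IMS_ENTRY_SIZE;

    iowrite32(msg->address_lo, entry + IMS_ADDR_LO);
    iowrite32(msg->address_hi, entry + IMS_ADDR_HI);
    iowrite32(msg->data, entry + IMS_DATA);
}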
> So from the hierarchical domain standpoint, we will have:
> - For DSA device: vector->intel-IR->IDXD
> - For Jason's device: root domain-> domain A-> Jason's device's IRQ domain
> - For any other intel IMS device in the future which
> does not require interrupt remapping: vector->new device IRQ domain
> requires interrupt remapping: vector->intel-IR->new device IRQ
> domain (i.e. create a new domain even though IDXD is already present?)
What's special about IDXD? It's just one specific implementation of IMS
and any other device implementing IMS is completely independent and as
documented in the specification the IMS slot management and therefore
the mask/unmask functionality can and will be completely different. IDXD
has a storage array with slots, Jason's hardware puts the IMS slot into
the queue storage.
It does not matter whether a device comes from Intel or any other vendor,
nor does it matter whether the device works with direct vector
delivery or interrupt remapping.
IDXD is not any different from any other IMS capable device when you
look at it from the interrupt hierarchy. It's either:
vector -> IR -> device
or
vector -> device
The only point where this is differentiated is when the irq domain is
created. Anything else just falls into place.
To answer Jason's question: No, the parent is never the PCI/MSI irq
domain because that sits at the same level as that device
domain. Remember the scheme:
   vector --- DMAR-MSI
     |
     |-- ....
     |
     |-- IR-0 --- IO/APIC-0
     |        |
     |        |-- IO/APIC-1
     |        |
     |        |-- PCI/MSI-0
     |        |
     |        |-- HPET/MSI-0
     |        |
     |        |-- DEV-A/MSI-0
     |        |-- DEV-A/MSI-1
     |        |-- DEV-B/MSI-2
     |
     |-- IR-1 --- PCI/MSI-1
     |        |
     |        |-- DEV-C/MSI-3
The PCI/MSI domain(s) are dealing solely with PCI standard compliant
MSI/MSI-X. IMS or similar (platform-MSI being one variant) sit at the
same level as the PCI/MSI domains.
Why? It's how the hardware operates.
The PCI/MSI "irq chip" is configured by the PCI/MSI domain level and it
sends its message to the interrupt parent in the hierarchy, i.e. either
to the Interrupt Remap unit or to the configured vector of the target
CPU.
IMS does not send it to some magic PCI layer first at least not at the
conceptual level. The fact that the message is transported by PCIe does
not change that at all. PCIe in that case is solely the transport, but
the "irq chip" at the PCI/MSI level of the device is not involved at
all. If it were that would be a different story.
So now you might ask why we have a single PCI/MSI domain per IR unit and
why I want separate IMS domains.
The answer is in the hardware again. PCI/MSI is uniform across devices
so the irq chip and all of the domain functionality can be shared. But
then we have two PCI/MSI domains in the above example because again the
hardware has one connected to IR unit 0 and the other to IR unit 1.
IR 0 and IR 1 manage different resources (remap tables) so PCI/MSI-0
depends on IR-0 and PCI/MSI-1 on IR-1 which is reflected in the
parent/child relationship of the domains.
There is another reason why we can spawn a single PCI/MSI domain per
root complex / IR unit. The PCI/MSI domains are not doing any resource
management at all. The resulting message is created from the allocated
vector (direct CPU delivery) or from the allocated Interrupt remapping
slot information. The domain just deals with the logic required to
handle PCI/MSI(X) and the necessary resources are provided by the parent
interrupt layers.
IMS is different. It needs device specific resource management to
allocate an IMS slot which is clearly part of the "irq chip" management
layer, aka. irq domain. If the IMS slot management would happen in a
global or per IR unit table and as a consequence the management, layout,
mask/unmask operations would be uniform then an IMS domain per system or
IR unit would be the right choice, but that's not how the hardware is
specified and implemented.
Now coming back to platform MSI. The way it looks is:
   CPU --- (IR) ---- PLATFORM-MSI --- PLATFORM-DEVICE-MSI-0
                                  |-- PLATFORM-DEVICE-MSI-1
                                  |...
PLATFORM-MSI is a common resource management which also provides a
shared interrupt chip which operates at the PLATFORM-MSI level with one
exception:
The irq_msi_write_msg() callback has an indirection so the actual
devices can provide their device specific msi_write_msg() function.
That's a borderline abuse of the hierarchy, but it makes sense to some
extent as the actual PLATFORM-MSI domain is a truly shared resource and
the only device specific functionality required is the message
write. But that message write is not something which has its own
resource management, it's just a non-uniform storage accessor. IOW, the
underlying PLATFORM-MSI domain does all resource management including
message creation and the quirk allows writing the message in the device
specific way. Not that I love it, but ...
That is the main difference between platform MSI and IMS. IMS is
completely non-uniform and the devices do not share any common resource
or chip functionality. Each device has its own message store management,
slot allocation/assignment and a device specific interrupt chip
functionality which goes way beyond the nasty write msg quirk.
> What I still don't understand fully is what if all the IMS devices
> need the same domain ops and chip callbacks, we will be creating
> various instances of the same IRQ chip and domain right? Is that ok?
Why would it be not ok? Are you really worried about a few hundred bytes
of memory required for this?
Sharing an instance only makes sense if the instance handles a shared or
uniform resource space, which is clearly not the case with IMS.
We create several PCI/MSI domains and several IO/APIC domains on larger
systems. They all share the code, but they are dealing with separate
resources so they have separate storage.
> Currently the creation of the IRQ domain happens at the IR level so that
> we can reuse the same domain but if it advisable to have a per device
> interrupt domain, I will shift this to the device driver.
Again. Look at the layering. What you created now is a pseudo shared
domain which needs
1) An indirection layer for providing device specific functions
2) An extra allocation layer in the device specific driver to assign
IMS slots completely outside of the domain allocation mechanism.
In other words you try to make things which are neither uniform nor
share a resource space look the same way. That's the "all I have is a
hammer so everything is a nail" approach. That never worked out well.
With a per device domain/chip approach you get one consistent domain
per device which provides
1) The device specific resource management (i.e. slot allocation
becomes part of the irq domain operations)
2) The device specific irq chip functions at the correct point in the
layering without the horrid indirections
3) Consolidated data storage at the device level where the actual
data is managed.
This is of course sharing as much code as possible with the MSI core
implementation.
As a side effect any extension of this be it on the domain or the irq
chip side is just a matter of adding the functionality to that
particular incarnation and not by having yet another indirection
logic at the wrong place.
The price you pay is a bit of memory but you get a clean layering and
separation of functionality as a reward. The amount of code in the
actual IMS device driver is not going to be much more than with the
approach you have now.
The infrastructure itself is not more than a thin wrapper around the
existing msi domain infrastructure and might even share code with
platform-msi.
Thanks,
tglx
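To make the "slot allocation becomes part of the irq domain operations" point
concrete, a sketch: idxd_ims_alloc_slot()/idxd_ims_free_slot() and
idxd_ims_irq_chip are placeholders, while the irq domain helpers are the real
ones; a real implementation would hook this up through the MSI domain
infrastructure rather than raw irq_domain_ops:

#include <linux/irq.h>
#include <linux/irqdomain.h>

static int idxd_ims_domain_alloc(struct irq_domain *domain, unsigned int virq,
                                 unsigned int nr_irqs, void *arg)
{
    struct idxd_device *idxd = domain->host_data;
    int i, slot, ret;

    ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, arg);
    if (ret < 0)
        return ret;

    for (i = 0; i < nr_irqs; i++) {
        /* device private IMS slot management lives right here */
        slot = idxd_ims_alloc_slot(idxd);
        if (slot < 0)
            return slot;    /* error unwinding omitted in this sketch */
        irq_domain_set_hwirq_and_chip(domain, virq + i, slot,
                                      &idxd_ims_irq_chip, idxd);
    }
    return 0;
}

static void idxd_ims_domain_free(struct irq_domain *domain, unsigned int virq,
                                 unsigned int nr_irqs)
{
    int i;

    for (i = 0; i < nr_irqs; i++) {
        struct irq_data *data = irq_domain_get_irq_data(domain, virq + i);

        idxd_ims_free_slot(domain->host_data, data->hwirq);
    }
    irq_domain_free_irqs_parent(domain, virq, nr_irqs);
}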
> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, August 7, 2020 8:20 PM
>
> On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
>
> > If you see this as an abuse of the framework, then let's identify those
> > specific issues and come up with a better approach. As we've discussed
> > before, things like basic PCI config space emulation are acceptable
> > overhead and low risk (imo) and some degree of register emulation is
> > well within the territory of an mdev driver.
>
> What troubles me is that idxd already has a direct userspace interface
> to its HW, and does userspace DMA. The purpose of this mdev is to
> provide a second direct userspace interface that is a little different
> and trivially plugs into the virtualization stack.
No. Userspace DMA and subdevice passthrough (what mdev provides)
are two distinct usages IMO (at least in the idxd context), and this might
be the main divergence between us, thus let me put more words here.
If we could reach consensus in this matter, which direction to go
would be clearer.
First, a passthrough interface has some unique requirements
which are not commonly observed in a userspace DMA interface, e.g.:
- Tracking DMA dirty pages for live migration;
- A set of interfaces for using SVA inside guest;
* PASID allocation/free (on some platforms);
* bind/unbind guest mm/page table (nested translation);
* invalidate IOMMU cache/iotlb for guest page table changes;
* report page request from device to guest;
* forward page response from guest to device;
- Configuring irqbypass for posted interrupt;
- ...
Second, a passthrough interface requires delegating raw controllability
of the subdevice to the guest driver, while the same delegation might not be
required for implementing a userspace DMA interface (especially for
modern devices which support SVA). For example, idxd allows the following
settings per wq (the guest driver may configure them in any combination):
- put in dedicated or shared mode;
- enable/disable SVA;
- Associate guest-provided PASID to MSI/IMS entry;
- set threshold;
- allow/deny privileged access;
- allocate/free interrupt handle (enlightened for guest);
- collect error status;
- ...
We plan to support idxd userspace DMA with SVA. The driver just needs
to prepare a wq with a predefined configuration (e.g. shared, SVA,
etc.), bind the process mm to the IOMMU (non-nested) and then map
the portal to userspace. Allowing userspace to do DMA to the
associated wq doesn't change the fact that the wq is still *owned*
and *controlled* by the kernel driver. However, as far as passthrough
is concerned, the wq is considered 'owned' by the guest driver, thus
we need an interface which can support low-level *controllability*
from the guest driver. It is sort of a mess in uAPI when mixing the
two together.
Based on the above two reasons, we see distinct requirements between
userspace DMA and passthrough interfaces, at least in the idxd context
(though other devices may have less distinction in-between). Therefore,
we didn't see the value/necessity of reinventing the wheel that mdev
already handles well to evolve a simple application-oriented userspace
DMA interface into a complex guest-driver-oriented passthrough interface.
The complexity of doing so would incur far more kernel-side changes
than the portion of emulation code that you've been concerned about...
>
> I don't think VFIO should be the only entry point to
> virtualization. If we say the universe of devices doing user space DMA
> must also implement a VFIO mdev to plug into virtualization then it
> will be alot of mdevs.
Certainly VFIO will not be the only entry point, and this has to be a
case-by-case decision. If a userspace DMA interface can be easily
adapted to be a passthrough one, it might be the choice. But for idxd,
we see mdev as a much better fit here, given the big difference between
what userspace DMA requires and what the guest driver requires in this hw.
>
> I would prefer to see that the existing userspace interface have the
> extra needed bits for virtualization (eg by having appropriate
> internal kernel APIs to make this easy) and all the emulation to build
> the synthetic PCI device be done in userspace.
In the end what decides the direction is the amount of changes that
we have to put in the kernel, not whether we call it 'emulation'. For idxd,
adding special passthrough requirements (guest SVA, dirty tracking,
etc.) and raw controllability to the simple userspace DMA interface
is for sure making the kernel more complex than reusing the mdev
framework (plus some degree of emulation mockup behind). Not to
mention the merit of uAPI compatibility with mdev...
Thanks
Kevin
Thomas Gleixner <[email protected]> writes:
> The infrastructure itself is not more than a thin wrapper around the
> existing msi domain infrastructure and might even share code with
> platform-msi.
And the annoying fact that you need XEN support which opens another can
of worms...
Thomas Gleixner <[email protected]> writes:
CC+: XEN folks
> Thomas Gleixner <[email protected]> writes:
>> The infrastructure itself is not more than a thin wrapper around the
>> existing msi domain infrastructure and might even share code with
>> platform-msi.
>
> And the annoying fact that you need XEN support which opens another can
> of worms...
which needs some real cleanup first.
x86 still does not associate the irq domain to devices at device
discovery time, i.e. the device::msi_domain pointer is never populated.
So to support this new fangled device MSI stuff we'd need yet more
x86/xen specific arch_*msi_irqs() indirection and hackery, which is not
going to happen.
The right thing to do is to convert XEN MSI support over to proper irq
domains. This allows to populate device::msi_domain which makes a lot of
things simpler and also more consistent.
Thanks,
tglx
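For reference, device::msi_domain already exists in struct device and has
accessors; a trivial usage sketch of what the cleanup would enable (the two
wrapper functions are only for illustration):

#include <linux/device.h>
#include <linux/irqdomain.h>

/* At device discovery time the bus/arch code records the domain ... */
static void bus_setup_msi_domain(struct device *dev, struct irq_domain *d)
{
    dev_set_msi_domain(dev, d);
}

/* ... so later users can simply look it up instead of going through
 * x86/xen specific arch_*msi_irqs() indirections. */
static struct irq_domain *lookup_msi_domain(struct device *dev)
{
    return dev_get_msi_domain(dev);
}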
On Mon, 10 Aug 2020 07:32:24 +0000
"Tian, Kevin" <[email protected]> wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Friday, August 7, 2020 8:20 PM
> >
> > On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> >
> > > If you see this as an abuse of the framework, then let's identify those
> > > specific issues and come up with a better approach. As we've discussed
> > > before, things like basic PCI config space emulation are acceptable
> > > overhead and low risk (imo) and some degree of register emulation is
> > > well within the territory of an mdev driver.
> >
> > What troubles me is that idxd already has a direct userspace interface
> > to its HW, and does userspace DMA. The purpose of this mdev is to
> > provide a second direct userspace interface that is a little different
> > and trivially plugs into the virtualization stack.
>
> No. Userspace DMA and subdevice passthrough (what mdev provides)
> are two distinct usages IMO (at least in idxd context). and this might
> be the main divergence between us, thus let me put more words here.
> If we could reach consensus in this matter, which direction to go
> would be clearer.
>
> First, a passthrough interface requires some unique requirements
> which are not commonly observed in an userspace DMA interface, e.g.:
>
> - Tracking DMA dirty pages for live migration;
> - A set of interfaces for using SVA inside guest;
> * PASID allocation/free (on some platforms);
> * bind/unbind guest mm/page table (nested translation);
> * invalidate IOMMU cache/iotlb for guest page table changes;
> * report page request from device to guest;
> * forward page response from guest to device;
> - Configuring irqbypass for posted interrupt;
> - ...
>
> Second, a passthrough interface requires delegating raw controllability
> of subdevice to guest driver, while the same delegation might not be
> required for implementing an userspace DMA interface (especially for
> modern devices which support SVA). For example, idxd allows following
> setting per wq (guest driver may configure them in any combination):
> - put in dedicated or shared mode;
> - enable/disable SVA;
> - Associate guest-provided PASID to MSI/IMS entry;
> - set threshold;
> - allow/deny privileged access;
> - allocate/free interrupt handle (enlightened for guest);
> - collect error status;
> - ...
>
> We plan to support idxd userspace DMA with SVA. The driver just needs
> to prepare a wq with a predefined configuration (e.g. shared, SVA,
> etc.), bind the process mm to IOMMU (non-nested) and then map
> the portal to userspace. The goal that userspace can do DMA to
> associated wq doesn't change the fact that the wq is still *owned*
> and *controlled* by kernel driver. However as far as passthrough
> is concerned, the wq is considered 'owned' by the guest driver thus
> we need an interface which can support low-level *controllability*
> from guest driver. It is sort of a mess in uAPI when mixing the
> two together.
>
> Based on above two reasons, we see distinct requirements between
> userspace DMA and passthrough interfaces, at least in idxd context
> (though other devices may have less distinction in-between). Therefore,
> we didn't see the value/necessity of reinventing the wheel that mdev
> already handles well to evolve an simple application-oriented usespace
> DMA interface to a complex guest-driver-oriented passthrough interface.
> The complexity of doing so would incur far more kernel-side changes
> than the portion of emulation code that you've been concerned about...
>
> >
> > I don't think VFIO should be the only entry point to
> > virtualization. If we say the universe of devices doing user space DMA
> > must also implement a VFIO mdev to plug into virtualization then it
> > will be alot of mdevs.
>
> Certainly VFIO will not be the only entry point. and This has to be a
> case-by-case decision. If an userspace DMA interface can be easily
> adapted to be a passthrough one, it might be the choice. But for idxd,
> we see mdev a much better fit here, given the big difference between
> what userspace DMA requires and what guest driver requires in this hw.
>
> >
> > I would prefer to see that the existing userspace interface have the
> > extra needed bits for virtualization (eg by having appropriate
> > internal kernel APIs to make this easy) and all the emulation to build
> > the synthetic PCI device be done in userspace.
>
> In the end what decides the direction is the amount of changes that
> we have to put in kernel, not whether we call it 'emulation'. For idxd,
> adding special passthrough requirements (guest SVA, dirty tracking,
> etc.) and raw controllability to the simple userspace DMA interface
> is for sure making kernel more complex than reusing the mdev
> framework (plus some degree of emulation mockup behind). Not to
> mention the merit of uAPI compatibility with mdev...
I agree with a lot of this argument; exposing a device through a
userspace interface versus allowing user access to a device through a
userspace interface are different levels of abstraction and control.
In an ideal world, perhaps we could compose one from the other, but I
don't think the existence of one is proof that the other is redundant.
That's not to say that mdev/vfio isn't ripe for abuse in this space,
but I'm afraid the test for that abuse is probably much more subtle.
I'll also remind folks that LPC is coming up in just a couple short
weeks and this might be something we should discuss (virtually)
in-person. uconf CfPs are currently open. </plug> Thanks,
Alex
Hi Thomas,
On 8/8/2020 12:47 PM, Thomas Gleixner wrote:
> Megha,
>
> "Dey, Megha" <[email protected]> writes:
>> On 8/7/2020 9:47 AM, Thomas Gleixner wrote:
>>> I'm all for sharing code and making the life of driver writers simple
>>> because that makes my life simple as well, but not by creating a layer
>>> at the wrong level and then hacking it into submission until it finally
>>> collapses.
>>>
>>> Designing the infrastructure following the clear layering rules of
>>> hierarchical domains so it works for IMS and also replaces the platform
>>> MSI hack is the only sane way to go forward, not the other way round.
>> From what I've gathered, I need to:
>>
>> 1. Get rid of the mantra that "IMS" is an extension of platform-msi.
>> 2. Make this new infra devoid of any platform-msi references
> See below.
ok..
>
>> 3. Come up with a ground up approach which adheres to the layering
>> constraints of the IRQ subsystem
> Yes. It's something which can be used by all devices which have:
>
> 1) A device specific irq chip implementation including a msi write function
> 2) Device specific resource management (slots in the IMS case)
>
> The infrastructure you need is basically a wrapper around the core MSI
> domain (similar to PCI, platform-MSI etc,) which provides the specific
> functionality to handle the above.
ok, I will create a per device irq chip which will directly have the
device specific callbacks instead of another layer of redirection.
This way I will get rid of the 'platform_msi_ops' data structure.
I am not sure what you mean by device specific resource management. Are
you referring to dev_msi_alloc/free_irqs?
>> 4. Have common code (drivers/irqchip maybe??) where we put in all the
>> generic ims-specific bits for the IRQ chip and domain
>> which can be used by all device drivers belonging to this "IMS"class.
> Yes, you can provide a common implementation for devices which share the
> same irq chip and domain (slot management functionality)
yeah, I think most of the msi_domain_ops (msi_prepare, set_desc etc.) are
common and can be moved into a common file.
>
>> 5. Have the device driver do the rest:
>> create the chip/domain (one chip/domain per device?)
>> provide device specific callbacks for masking, unmasking, write
>> message
> Correct, but you don't need any magic new data structures for that, the
> existing msi_domain_info/msi_domain_ops and related structures are
> either sufficient or can be extended when necessary.
>
> So for the IDXD case you need:
>
> 1) An irq chip with mask/unmask callbacks and a write msg function
> 2) A slot allocation or association function and their 'free'
> counterpart (irq_domain_ops)
This is one part I didn't understand.
Currently my dev_msi_alloc_irqs is simply a wrapper over
platform_msi_domain_alloc_irqs which again mostly calls
msi_domain_alloc_irqs.
When you say add a .alloc, .free, does this mean we should add a device
specific alloc/free and not use the default
msi_domain_alloc/msi_domain_free?
I don't see anything device specific to be done for IDXD at least. Can
you please let me know?
>
> The function and struct pointers go into the appropriate
> msi_info/msi_ops structures along with the correct flags to set up the
> whole thing and then the infrastructure creates your domain, fills in
> the shared functions and sets the whole thing up.
>
> That's all what a device driver needs to provide, i.e. stick the device
> specific functionality into right data structures and let the common
> infrastructure deal with it. The rest just works and the device specific
> functions are invoked from the right places when required.
yeah. makes sense..
>
>> So from the hierarchical domain standpoint, we will have:
>> - For DSA device: vector->intel-IR->IDXD
>> - For Jason's device: root domain-> domain A-> Jason's device's IRQ domain
>> - For any other intel IMS device in the future which
>> does not require interrupt remapping: vector->new device IRQ domain
>> requires interrupt remapping: vector->intel-IR->new device IRQ
>> domain (i.e. create a new domain even though IDXD is already present?)
> What's special about IDXD? It's just one specific implementation of IMS
> and any other device implementing IMS is completely independent and as
> documented in the specification the IMS slot management and therefore
> the mask/unmask functionality can and will be completely different. IDXD
> has a storage array with slots, Jason's hardware puts the IMS slot into
> the queue storage.
>
> It does not matter whether a device comes from Intel or any other vendor,
> it does neither matter whether the device works with direct vector
> delivery or interrupt remapping.
>
> IDXD is not any different from any other IMS capable device when you
> look at it from the interrupt hierarchy. It's either:
>
> vector -> IR -> device
> or
> vector -> device
>
> The only point where this is differentiated is when the irq domain is
> created. Anything else just falls into place.
yeah, so I will create the IRQ domain in the IDXD driver with INTEL-IR
as the parent, instead of creating a common per IR unit domain.
>
> To answer Jason's question: No, the parent is never the PCI/MSI irq
> domain because that sits at the same level as that device
> domain. Remember the scheme:
>
>    vector --- DMAR-MSI
>      |
>      |-- ....
>      |
>      |-- IR-0 --- IO/APIC-0
>      |        |
>      |        |-- IO/APIC-1
>      |        |
>      |        |-- PCI/MSI-0
>      |        |
>      |        |-- HPET/MSI-0
>      |        |
>      |        |-- DEV-A/MSI-0
>      |        |-- DEV-A/MSI-1
>      |        |-- DEV-B/MSI-2
>      |
>      |-- IR-1 --- PCI/MSI-1
>      |        |
>      |        |-- DEV-C/MSI-3
>
> The PCI/MSI domain(s) are dealing solely with PCI standard compliant
> MSI/MSI-X. IMS or similar (platform-MSI being one variant) sit at the
> same level as the PCI/MSI domains.
>
> Why? It's how the hardware operates.
>
> The PCI/MSI "irq chip" is configured by the PCI/MSI domain level and it
> sends its message to the interrupt parent in the hierarchy, i.e. either
> to the Interrupt Remap unit or to the configured vector of the target
> CPU.
>
> IMS does not send it to some magic PCI layer first at least not at the
> conceptual level. The fact that the message is transported by PCIe does
> not change that at all. PCIe in that case is solely the transport, but
> the "irq chip" at the PCI/MSI level of the device is not involved at
> all. If it were that would be a different story.
>
> So now you might ask why we have a single PCI/MSI domain per IR unit and
> why I want seperate IMS domains.
>
> The answer is in the hardware again. PCI/MSI is uniform accross devices
> so the irq chip and all of the domain functionality can be shared. But
> then we have two PCI/MSI domains in the above example because again the
> hardware has one connected to IR unit 0 and the other to IR unit 1.
> IR 0 and IR 1 manage different resources (remap tables) so PCI/MSI-0
> depends on IR-0 and PCI/MSI-1 on IR-1 which is reflected in the
> parent/child relation ship of the domains.
>
> There is another reason why we can spawn a single PCI/MSI domain per
> root complex / IR unit. The PCI/MSI domains are not doing any resource
> management at all. The resulting message is created from the allocated
> vector (direct CPU delivery) or from the allocated Interrupt remapping
> slot information. The domain just deals with the logic required to
> handle PCI/MSI(X) and the necessary resources are provided by the parent
> interrupt layers.
>
> IMS is different. It needs device specific resource management to
> allocate an IMS slot which is clearly part of the "irq chip" management
> layer, aka. irq domain. If the IMS slot management would happen in a
> global or per IR unit table and as a consequence the management, layout,
> mask/unmask operations would be uniform then an IMS domain per system or
> IR unit would be the right choice, but that's not how the hardware is
> specified and implemented.
>
> Now coming back to platform MSI. The way it looks is:
>
>    CPU --- (IR) ---- PLATFORM-MSI --- PLATFORM-DEVICE-MSI-0
>                                   |-- PLATFORM-DEVICE-MSI-1
>                                   |...
>
> PLATFORM-MSI is a common resource management which also provides a
> shared interrupt chip which operates at the PLATFORM-MSI level with one
> exception:
>
> The irq_msi_write_msg() callback has an indirection so the actual
> devices can provide their device specific msi_write_msg() function.
>
> That's a borderline abuse of the hierarchy, but it makes sense to some
> extent as the actual PLATFORM-MSI domain is a truly shared resource and
> the only device specific functionality required is the message
> write. But that message write is not something which has it's own
> resource management, it's just a non uniform storage accessor. IOW, the
> underlying PLATFORM-MSI domain does all resource management including
> message creation and the quirk allows to write the message in the device
> specific way. Not that I love it, but ...
>
> That is the main difference between platform MSI and IMS. IMS is
> completely non uniform and the devices do not share any common resource
> or chip functionality. Each device has its own message store management,
> slot allocation/assignment and a device specifc interrupt chip
> functionality which goes way beyond the nasty write msg quirk.
Thanks for giving such a detailed explanation! really helps :)
>
>> What I still don't understand fully is what if all the IMS devices
>> need the same domain ops and chip callbacks, we will be creating
>> various instances of the same IRQ chip and domain right? Is that ok?
> Why would it be not ok? Are you really worried about a few hundred bytes
> of memory required for this?
>
> Sharing an instance only makes sense if the instance handles a shared or
> uniform resource space, which is clearly not the case with IMS.
>
> We create several PCI/MSI domains and several IO/APIC domains on larger
> systems. They all share the code, but they are dealing with seperate
> resources so they have seperate storage.
ok, got it ..
>
>> Currently the creation of the IRQ domain happens at the IR level so that
>> we can reuse the same domain but if it advisable to have a per device
>> interrupt domain, I will shift this to the device driver.
> Again. Look at the layering. What you created now is a pseudo shared
> domain which needs
>
> 1) An indirection layer for providing device specific functions
>
> 2) An extra allocation layer in the device specific driver to assign
> IMS slots completely outside of the domain allocation mechanism.
hmmm, again I am not sure which extra allocation layer you are
referring to...
>
> In other words you try to make things which are neither uniform nor
> share a resource space look the same way. That's the "all I have is a
> hammer so everything is a nail" approach. That never worked out well.
>
> With a per device domain/chip approach you get one consistent domain
> per device which provides
>
> 1) The device specific resource management (i.e. slot allocation
> becomes part of the irq domain operations)
>
> 2) The device specific irq chip functions at the correct point in the
> layering without the horrid indirections
>
> 3) Consolidated data storage at the device level where the actual
> data is managed.
>
> This is of course sharing as much code as possible with the MSI core
> implementation.
>
> As a side effect any extension of this be it on the domain or the irq
> chip side is just a matter of adding the functionality to that
> particular incarnation and not by having yet another indirection
> logic at the wrong place.
>
> The price you pay is a bit of memory but you get a clean layering and
> seperation of functionality as a reward. The amount of code in the
> actual IMS device driver is not going to be much more than with the
> approach you have now.
>
> The infrastructure itself is not more than a thin wrapper around the
> existing msi domain infrastructure and might even share code with
> platform-msi.
From your explanation:
In the device driver:
static const struct irq_domain_ops idxd_irq_domain_ops = {
        .alloc = idxd_domain_alloc, //not sure what this should do
        .free = idxd_domain_free,
};
struct irq_chip idxd_irq_chip = {
        .name = "idxd",
        .irq_mask = idxd_irq_mask,
        .irq_unmask = idxd_irq_unmask,
        .irq_write_msi_msg = idxd_irq_write_msg,
        .irq_ack = irq_chip_ack_parent,
        .irq_retrigger = irq_chip_retrigger_hierarchy,
        .flags = IRQCHIP_SKIP_SET_WAKE,
};
struct msi_domain_info idxd_domain_info = {
        .flags = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
        .ops = &dev_msi_domain_ops, //can be common
        .chip = &idxd_irq_chip, //per device
        .handler = handle_edge_irq,
        .handler_name = "edge",
};
dev->msi_domain = dev_msi_create_irq_domain(iommu->ir_domain,
                                            &idxd_domain_info,
                                            &idxd_irq_domain_ops);
msi_domain_alloc_irqs(dev->msi_domain, dev, nvec);
Common code:
struct irq_domain *dev_msi_create_irq_domain(struct irq_domain *parent,
                                             struct msi_domain_info *dev_msi_domain_info,
                                             const struct irq_domain_ops *dev_msi_irq_domain_ops)
{
        struct irq_domain *domain;
        .......
        domain = irq_domain_create_hierarchy(parent, IRQ_DOMAIN_FLAG_MSI, 0,
                                             NULL, dev_msi_irq_domain_ops,
                                             dev_msi_domain_info);
        .......
        return domain;
}
static struct msi_domain_ops dev_msi_domain_ops = {
        .set_desc = dev_msi_set_desc,
        .msi_prepare = dev_msi_prepare,
        .get_hwirq = dev_msi_get_hwirq,
}; // can re-use platform-msi data structures
except the alloc/free irq_domain_ops, does this look fine to you?
-Megha
>
> Thanks,
>
> tglx
Hi Thomas,
On 8/11/2020 2:53 AM, Thomas Gleixner wrote:
> Thomas Gleixner <[email protected]> writes:
>
> CC+: XEN folks
>
>> Thomas Gleixner <[email protected]> writes:
>>> The infrastructure itself is not more than a thin wrapper around the
>>> existing msi domain infrastructure and might even share code with
>>> platform-msi.
>> And the annoying fact that you need XEN support which opens another can
>> of worms...
Hmm, I am not sure why we need Xen support... are you referring to idxd
using Xen?
> which needs some real cleanup first.
>
> x86 still does not associate the irq domain to devices at device
> discovery time, i.e. the device::msi_domain pointer is never populated.
>
> So to support this new fangled device MSI stuff we'd need yet more
> x86/xen specific arch_*msi_irqs() indirection and hackery, which is not
> going to happen.
>
> The right thing to do is to convert XEN MSI support over to proper irq
> domains. This allows to populate device::msi_domain which makes a lot of
> things simpler and also more consistent.
do you think this cleanup is meant to be a precursor to my patches? I could
look into it, but I am not familiar with the background of Xen
and this stuff. Can you please provide further guidance on where to look?
> Thanks,
>
> tglx
"Dey, Megha" <[email protected]> writes:
> On 8/11/2020 2:53 AM, Thomas Gleixner wrote:
>>> And the annoying fact that you need XEN support which opens another can
>>> of worms...
>
> hmm I am not sure why we need Xen support... are you referring to idxd
> using xen?
What about using IDXD when you are running on XEN? I might be missing
something and IDXD/IMS is hypervisor only, but that still does not solve
this problem on bare metal:
>> x86 still does not associate the irq domain to devices at device
>> discovery time, i.e. the device::msi_domain pointer is never
>> populated.
We can't do that right now due to the way X86 PCI/MSI allocation
works, but being able to do so would make things consistent and way
simpler, even for your stuff.
>> The right thing to do is to convert XEN MSI support over to proper irq
>> domains. This allows to populate device::msi_domain which makes a lot of
>> things simpler and also more consistent.
>
> do you think this cleanup is to be a precursor to my patches? I could
> look into it but I am not familiar with the background of Xen
>
> and this stuff. Can you please provide further guidance on where to
> look
As I said:
>> So to support this new fangled device MSI stuff we'd need yet more
>> x86/xen specific arch_*msi_irqs() indirection and hackery, which is not
>> going to happen.
git grep arch_.*msi_irq arch/x86
This indirection prevents storing the irq_domain pointer in the device
at probe/detection time. Native code already uses irq domains for
PCI/MSI but we can't exploit the full potential because then
pci_msi_setup_msi_irqs() would never end up in arch_setup_msi_irqs()
which breaks XEN.
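For reference, the fallback in question looks roughly like this (paraphrased
from drivers/pci/msi.c of that era, not a verbatim quote): without a
hierarchical msi_domain attached to the device, allocation drops back into
the arch_* indirection that XEN hooks.
static int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
{
        struct irq_domain *domain;

        domain = dev_get_msi_domain(&dev->dev);
        if (domain && irq_domain_is_hierarchy(domain))
                return msi_domain_alloc_irqs(domain, &dev->dev, nvec);

        /* XEN (and anything without a populated msi_domain) lands here */
        return arch_setup_msi_irqs(dev, nvec, type);
}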
I was reminded of that nastiness when I was looking at sensible ways to
integrate this device MSI maze proper.
From a conceptual POV this stuff, which is not restricted to IDXD at all,
looks like this:
]-------------------------------------------|
PCI BUS -- | PCI device |
]-------------------| |
| Physical function | |
]-------------------| |
]-------------------|----------| |
| Control block for subdevices | |
]------------------------------| |
| | <- "Subdevice BUS" |
| | |
| |-- Subdevice 0 |
| |-- Subdevice 1 |
| |-- ... |
| |-- Subdevice N |
]-------------------------------------------|
It does not matter whether this is IDXD with its magic devices or a
network card with a gazillion queues. Conceptually we need to look at
them as individual subdevices.
And obviously the above picture gives you the topology. The physical
function device belongs to PCI in all aspects including the MSI
interrupt control. The control block is part of the PCI device as well
and it even can have regular PCI/MSI interrupts for its own
purposes. There might be devices where the Physical function device does
not exist at all and the only true PCI functionality is the control
block to manage subdevices. That does not matter and does not change the
concept.
Now the subdevices belong, topology-wise, NOT to the PCI part. PCI is just
the transport they utilize. And their irq domain is distinct from the
PCI/MSI domain for reasons I explained before.
So looking at it from a Linux perspective:
pci-bus -> PCI device (managed by PCI/MSI domain)
- PF device
- CB device (hosts DEVMSI domain)
| "Subdevice bus"
| - subdevice
| - subdevice
| - subdevice
Now you would assume that figuring out the irq domain which the DEVMSI
domain serving the subdevices on the subdevice bus should take as parent
is pretty trivial when looking at the topology, right?
CB device's parent is PCI device and we know that PCI device MSI is
handled by the PCI/MSI domain which is either system wide or per IR
unit.
So getting the relevant PCI/MSI irq domain is as simple as doing:
pcimsi_domain = pcidevice->device->msi_domain;
and then because we know that this is a hierarchy the parent domain of
pcimsi_domain is the one which is the parent of our DEVMSI domain, i.e.:
parent = pcimsi_domain->parent;
Obvious, right?
What's not so obvious is that pcidevice->device->msi_domain is not
populated on x86 and trying to get the parent from there is a NULL
pointer dereference which does not work well.
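To make the intended lookup concrete, a minimal sketch (assuming
device::msi_domain were actually populated; the helper name is illustrative,
not existing API):
/* Derive the parent irq domain for a DEVMSI domain from the PCI device
 * that hosts the control block. This only works once device::msi_domain
 * is populated at discovery time, which is exactly the missing piece
 * on x86 discussed here. */
static struct irq_domain *devmsi_find_parent(struct pci_dev *pdev)
{
        struct irq_domain *pcimsi_domain = dev_get_msi_domain(&pdev->dev);

        if (!pcimsi_domain)
                return NULL;    /* the x86/XEN gap: nothing was stored */

        return pcimsi_domain->parent;
}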
So you surely can hack up some workaround for this, but that's just
proliferating crap. We want this to be consistent and there is
absolutely no reason why that network card with the MSI storage in the
queue data should not work on any other architecture.
We do the correct association already for IOMMU and whatever topological
stuff is attached to (PCI) devices at probe/detection time, so making it
consistent for irq domains is just a logical consequence and matter of
consistency.
Back in the days when x86 was converted to hierarchical irq domains in
order to support I/O APIC hotplug this workaround was accepted to make
progress and it was meant as a transitional step. Of course after the
goal was achieved nobody @Intel cared anymore and so far this did not
cause big problems. But now it does and we really want to make this
consistent first.
And no we are not making an exception for IDXD either just because
that's Intel only. Intel is not special and not exempt from cleaning
stuff up before adding new features especially not when the stuff to
cleanup is a leftover from Intel itself. IOW, we are not adding more
crap on top of crap which should not exists anymore.
It's not rocket science to fix this. All it needs is to let XEN create
irq domains and populate them during init.
On device detection/probe the proper domain needs to be determined which
is trivial and then stored in device->msi_domain. That makes
arch_.*_msi_irq() go away and a lot of code just simpler.
Thanks,
tglx
"Dey, Megha" <[email protected]> writes:
> On 8/8/2020 12:47 PM, Thomas Gleixner wrote:
>>> 3. Come up with a ground up approach which adheres to the layering
>>> constraints of the IRQ subsystem
>> Yes. It's something which can be used by all devices which have:
>>
>> 1) A device specific irq chip implementation including a msi write function
>> 2) Device specific resource management (slots in the IMS case)
>>
>> The infrastructure you need is basically a wrapper around the core MSI
>> domain (similar to PCI, platform-MSI etc,) which provides the specific
>> functionality to handle the above.
>
> ok, i will create a per device irq chip which will directly have the
> device specific callbacks instead of another layer of redirection.
>
> This way i will get rid of the 'platform_msi_ops' data structure.
>
> I am not sure what you mean by device specific resource management, are
> you referring to dev_msi_alloc/free_irqs?
I think I gave you a hint:
>> 2) Device specific resource management (slots in the IMS case)
The IMS storage is an array with slots in the IDXD case and these slots
are assigned at interrupt allocation time, right? In other cases where
the IMS storage is in some other place, e.g. queue memory, then this
still needs to be associated with the interrupt at allocation time.
But of course because you create some disconnected irqdomain you have to
do that assignment separately on the side and then stick this
information into msi_desc after the fact.
And as this is device specific, every device driver which utilizes IMS
has to do that, which is bonkers at best.
>> 2) A slot allocation or association function and their 'free'
>> counterpart (irq_domain_ops)
>
> This is one part I didn't understand.
>
> Currently my dev_msi_alloc_irqs is simply a wrapper over
> platform_msi_domain_alloc_irqs which again mostly calls
> msi_domain_alloc_irqs.
>
> When you say add a .alloc, .free, does this mean we should add a device
> specific alloc/free and not use the default
> msi_domain_alloc/msi_domain_free?
>
> I don't see anything device specific to be done for IDXD atleast, can
> you please let me know?
Each and every time I mentioned this, I explicitly mentioned "slot
allocation (array) or slot association (IMS store is not in an
array)".
But you keep asking over and over what's device specific and where the
resource management is.
The storage slot array is a resource which needs to be managed and it is
device specific by specification. And it's part of the interrupt
allocation obviously because without a place to store the MSI message
the whole thing would not work. This slot does not come out of thin air,
right?
https://github.com/intel/idxd-driver/commit/fb9a2f4e36525a1f18d9e654d472aa87a9adcb30
int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd)
{
        struct idxd_device *idxd = vidxd->idxd;
        struct ims_irq_entry *irq_entry;
        struct mdev_device *mdev = vidxd->vdev.mdev;
        struct device *dev = mdev_dev(mdev);
        struct msi_desc *desc;
        int err, i = 0;
        int index;

        /*
         * MSIX vec 0 is emulated by the vdcm and does not take up an IMS. The total MSIX vecs used
         * by the mdev will be total IMS + 1. vec 0 is used for misc interrupts such as command
         * completion, error notification, PMU, etc. The other vectors are used for descriptor
         * completion. Thus only the number of IMS vectors need to be allocated, which is
         * VIDXD_MAX_MSIX_VECS - 1.
         */
        err = dev_msi_domain_alloc_irqs(dev, VIDXD_MAX_MSIX_VECS - 1, &idxd_ims_ops);
        if (err < 0) {
                dev_dbg(dev, "Enabling IMS entry! %d\n", err);
                return err;
        }

        i = 0;
        for_each_msi_entry(desc, dev) {
                index = idxd_alloc_ims_index(idxd);
                if (index < 0) {
                        err = index;
                        break;
                }
                vidxd->ims_index[i] = index;

                irq_entry = &vidxd->irq_entries[i];
                irq_entry->vidxd = vidxd;
                irq_entry->int_src = i;
                irq_entry->irq = desc->irq;
                i++;
        }

        if (err)
                vidxd_free_ims_entries(vidxd);

        return 0;
}
idxd_alloc_ims_index() is an allocation, right? And the above, aside from
having 3 redundant levels of storage for exactly the same information, is
just a violation of all layering concepts at once.
I just wish I've never seen that code.
>> Again. Look at the layering. What you created now is a pseudo shared
>> domain which needs
>>
>> 1) An indirection layer for providing device specific functions
>>
>> 2) An extra allocation layer in the device specific driver to assign
>> IMS slots completely outside of the domain allocation mechanism.
> hmmm, again I am not sure of which extra allocation layer you are
> referring to..
See above.
>> The infrastructure itself is not more than a thin wrapper around the
>> existing msi domain infrastructure and might even share code with
>> platform-msi.
>
> From your explanation:
>
> In the device driver:
>
> static const struct irq_domain_ops idxd_irq_domain_ops = {
>
> .alloc= idxd_domain_alloc, //not sure what this should do
You might know by now. Also it's not necessarily the .alloc callback
which needs to be implemented. As I said we can add ops if necessary and
if it makes sense. This needs some thought to provide proper layering
and for sharing as much code as possible.
> except the alloc/free irq_domain_ops, does this look fine to you?
It's at least heading into the right direction.
But before we talk about the details at this level the
device::msi_domain pointer issue wants to be resolved. It's part of the
solution to share code at various levels and to make utilization of this
technology as simple as possible for driver writers.
We need to think about infrastructure which can be used by various types
of IMS devices, e.g. those with storage arrays and those with storage in
random places, like the network card Jason was talking about. And to get
there we need to do the house cleaning first.
Also if you do a proper infrastructure then you need exactly ONE
implementation of an irqdomain and an irqchip for devices which have an
IMS slot storage array. Every driver for a device which has this kind of
storage can reuse that even with different array sizes.
If done right then your IDXD driver needs:
idxd->domain = create_ims_array_domain(...., basepointer, max_slots);
in the init function of the control block. create_ims_array_domain() is
not part of IDXD, it's a common irq domain/irq chip implementation which
deals with IMS slot storage arrays of arbitrary size.
And then when creating a subdevice you do:
subdevice->msi_domain = idxd->domain;
and to allocate the interrupts you just do:
device_msi_alloc_irqs(subdevice, nrirqs);
and device_msi_alloc_irqs() is shared infrastructure which has nothing
to do with idxd or the ims array domain.
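For illustration only, a rough sketch of what such a shared ims-array domain
could keep per device and how the slot ends up tied to the interrupt (every
name, the 16-byte slot layout and the exact hook points are assumptions, not
existing kernel API):
/* Per device-instance state of a shared "IMS slot array" domain. */
struct ims_array_data {
        void __iomem    *slots;         /* base of the message store       */
        unsigned int    max_slots;      /* array size, passed in by driver */
        unsigned long   *in_use;        /* slot allocation bitmap          */
};

/* Slot allocation happens inside the domain, so the slot number simply
 * becomes the hwirq of the allocated interrupt instead of being glued
 * on afterwards by every driver. */
static int ims_array_alloc_slot(struct ims_array_data *ims)
{
        unsigned int slot = find_first_zero_bit(ims->in_use, ims->max_slots);

        if (slot >= ims->max_slots)
                return -ENOSPC;
        set_bit(slot, ims->in_use);
        return slot;
}

/* Device specific irq_chip callback: write the composed message into
 * the slot owned by this interrupt (16-byte slot entries assumed). */
static void ims_array_write_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
        struct ims_array_data *ims = irq_data_get_irq_chip_data(data);
        void __iomem *slot = ims->slots + data->hwirq * 16;

        writel(msg->address_lo, slot);
        writel(msg->address_hi, slot + 4);
        writel(msg->data, slot + 8);
}
The point being that the slot management lives exactly once in the common
ims-array code, and every driver with such a storage array just instantiates
it with its own base pointer and size.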
The same can be done for devices which have their storage embedded into
whatever other data structure on the device, e.g. queue memory, and
share the same message storage layout.
And we need to put thoughts into the shared infrastructure upfront
because all of this can also be used on bare metal.
The next thing you completely missed is to think about the ability to
support managed interrupts which we have in PCI/MSIX today. It's just a
matter of time before an IMS device comes along which wants its subdevice
interrupts managed properly when running on bare metal.
Can we please just go back to proper engineering and figure out how to
create something which is not just yet another half-baked works-for-IDXD
"solution"?
This means we need a proper description of possible IMS usage scenarios
and the foreseeable storage scenarios (arrays, queue data, ....). Along
with requirements like managed interrupts etc. I'm sure quite a bit of
this information is scattered over a wide range of mail threads, but
it's not my job to hunt it down.
Without consistent information at least to the point which is available
today this is going to end up in a major tinkering trainwreck. I have
zero interest in dealing with those especially if the major pain can be
avoided by doing proper analysis and design upfront.
Thanks,
tglx
> From: Alex Williamson <[email protected]>
> Sent: Wednesday, August 12, 2020 1:01 AM
>
> On Mon, 10 Aug 2020 07:32:24 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Friday, August 7, 2020 8:20 PM
> > >
> > > On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> > >
> > > > If you see this as an abuse of the framework, then let's identify those
> > > > specific issues and come up with a better approach. As we've discussed
> > > > before, things like basic PCI config space emulation are acceptable
> > > > overhead and low risk (imo) and some degree of register emulation is
> > > > well within the territory of an mdev driver.
> > >
> > > What troubles me is that idxd already has a direct userspace interface
> > > to its HW, and does userspace DMA. The purpose of this mdev is to
> > > provide a second direct userspace interface that is a little different
> > > and trivially plugs into the virtualization stack.
> >
> > No. Userspace DMA and subdevice passthrough (what mdev provides)
> > are two distinct usages IMO (at least in idxd context). and this might
> > be the main divergence between us, thus let me put more words here.
> > If we could reach consensus in this matter, which direction to go
> > would be clearer.
> >
> > First, a passthrough interface requires some unique requirements
> > which are not commonly observed in an userspace DMA interface, e.g.:
> >
> > - Tracking DMA dirty pages for live migration;
> > - A set of interfaces for using SVA inside guest;
> > * PASID allocation/free (on some platforms);
> > * bind/unbind guest mm/page table (nested translation);
> > * invalidate IOMMU cache/iotlb for guest page table changes;
> > * report page request from device to guest;
> > * forward page response from guest to device;
> > - Configuring irqbypass for posted interrupt;
> > - ...
> >
> > Second, a passthrough interface requires delegating raw controllability
> > of subdevice to guest driver, while the same delegation might not be
> > required for implementing an userspace DMA interface (especially for
> > modern devices which support SVA). For example, idxd allows following
> > setting per wq (guest driver may configure them in any combination):
> > - put in dedicated or shared mode;
> > - enable/disable SVA;
> > - Associate guest-provided PASID to MSI/IMS entry;
> > - set threshold;
> > - allow/deny privileged access;
> > - allocate/free interrupt handle (enlightened for guest);
> > - collect error status;
> > - ...
> >
> > We plan to support idxd userspace DMA with SVA. The driver just needs
> > to prepare a wq with a predefined configuration (e.g. shared, SVA,
> > etc.), bind the process mm to IOMMU (non-nested) and then map
> > the portal to userspace. The goal that userspace can do DMA to
> > associated wq doesn't change the fact that the wq is still *owned*
> > and *controlled* by kernel driver. However as far as passthrough
> > is concerned, the wq is considered 'owned' by the guest driver thus
> > we need an interface which can support low-level *controllability*
> > from guest driver. It is sort of a mess in uAPI when mixing the
> > two together.
> >
> > Based on above two reasons, we see distinct requirements between
> > userspace DMA and passthrough interfaces, at least in idxd context
> > (though other devices may have less distinction in-between). Therefore,
> > we didn't see the value/necessity of reinventing the wheel that mdev
> > already handles well to evolve a simple application-oriented userspace
> > DMA interface to a complex guest-driver-oriented passthrough interface.
> > The complexity of doing so would incur far more kernel-side changes
> > than the portion of emulation code that you've been concerned about...
> >
> > >
> > > I don't think VFIO should be the only entry point to
> > > virtualization. If we say the universe of devices doing user space DMA
> > > must also implement a VFIO mdev to plug into virtualization then it
> > > will be alot of mdevs.
> >
> > Certainly VFIO will not be the only entry point. and This has to be a
> > case-by-case decision. If an userspace DMA interface can be easily
> > adapted to be a passthrough one, it might be the choice. But for idxd,
> > we see mdev a much better fit here, given the big difference between
> > what userspace DMA requires and what guest driver requires in this hw.
> >
> > >
> > > I would prefer to see that the existing userspace interface have the
> > > extra needed bits for virtualization (eg by having appropriate
> > > internal kernel APIs to make this easy) and all the emulation to build
> > > the synthetic PCI device be done in userspace.
> >
> > In the end what decides the direction is the amount of changes that
> > we have to put in kernel, not whether we call it 'emulation'. For idxd,
> > adding special passthrough requirements (guest SVA, dirty tracking,
> > etc.) and raw controllability to the simple userspace DMA interface
> > is for sure making kernel more complex than reusing the mdev
> > framework (plus some degree of emulation mockup behind). Not to
> > mention the merit of uAPI compatibility with mdev...
>
> I agree with a lot of this argument, exposing a device through a
> userspace interface versus allowing user access to a device through a
> userspace interface are different levels of abstraction and control.
> In an ideal world, perhaps we could compose one from the other, but I
> don't think the existence of one is proof that the other is redundant.
> That's not to say that mdev/vfio isn't ripe for abuse in this space,
> but I'm afraid the test for that abuse is probably much more subtle.
>
> I'll also remind folks that LPC is coming up in just a couple short
> weeks and this might be something we should discuss (virtually)
> in-person. uconf CfPs are currently open. </plug> Thanks,
>
Yes, LPC is a good place to reach consensus. Btw, I saw there is
already one VFIO topic called "device assignment/sub-assignment".
Do you think this can be covered under that topic, or does it
make more sense to be a new one?
Thanks
Kevin
On Wed, 12 Aug 2020 01:58:00 +0000
"Tian, Kevin" <[email protected]> wrote:
> > From: Alex Williamson <[email protected]>
> > Sent: Wednesday, August 12, 2020 1:01 AM
> >
> > On Mon, 10 Aug 2020 07:32:24 +0000
> > "Tian, Kevin" <[email protected]> wrote:
> >
> > > > From: Jason Gunthorpe <[email protected]>
> > > > Sent: Friday, August 7, 2020 8:20 PM
> > > >
> > > > On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> > > >
> > > > > If you see this as an abuse of the framework, then let's identify those
> > > > > specific issues and come up with a better approach. As we've discussed
> > > > > before, things like basic PCI config space emulation are acceptable
> > > > > overhead and low risk (imo) and some degree of register emulation is
> > > > > well within the territory of an mdev driver.
> > > >
> > > > What troubles me is that idxd already has a direct userspace interface
> > > > to its HW, and does userspace DMA. The purpose of this mdev is to
> > > > provide a second direct userspace interface that is a little different
> > > > and trivially plugs into the virtualization stack.
> > >
> > > No. Userspace DMA and subdevice passthrough (what mdev provides)
> > > are two distinct usages IMO (at least in idxd context). and this might
> > > be the main divergence between us, thus let me put more words here.
> > > If we could reach consensus in this matter, which direction to go
> > > would be clearer.
> > >
> > > First, a passthrough interface requires some unique requirements
> > > which are not commonly observed in an userspace DMA interface, e.g.:
> > >
> > > - Tracking DMA dirty pages for live migration;
> > > - A set of interfaces for using SVA inside guest;
> > > * PASID allocation/free (on some platforms);
> > > * bind/unbind guest mm/page table (nested translation);
> > > * invalidate IOMMU cache/iotlb for guest page table changes;
> > > * report page request from device to guest;
> > > * forward page response from guest to device;
> > > - Configuring irqbypass for posted interrupt;
> > > - ...
> > >
> > > Second, a passthrough interface requires delegating raw controllability
> > > of subdevice to guest driver, while the same delegation might not be
> > > required for implementing an userspace DMA interface (especially for
> > > modern devices which support SVA). For example, idxd allows following
> > > setting per wq (guest driver may configure them in any combination):
> > > - put in dedicated or shared mode;
> > > - enable/disable SVA;
> > > - Associate guest-provided PASID to MSI/IMS entry;
> > > - set threshold;
> > > - allow/deny privileged access;
> > > - allocate/free interrupt handle (enlightened for guest);
> > > - collect error status;
> > > - ...
> > >
> > > We plan to support idxd userspace DMA with SVA. The driver just needs
> > > to prepare a wq with a predefined configuration (e.g. shared, SVA,
> > > etc.), bind the process mm to IOMMU (non-nested) and then map
> > > the portal to userspace. The goal that userspace can do DMA to
> > > associated wq doesn't change the fact that the wq is still *owned*
> > > and *controlled* by kernel driver. However as far as passthrough
> > > is concerned, the wq is considered 'owned' by the guest driver thus
> > > we need an interface which can support low-level *controllability*
> > > from guest driver. It is sort of a mess in uAPI when mixing the
> > > two together.
> > >
> > > Based on above two reasons, we see distinct requirements between
> > > userspace DMA and passthrough interfaces, at least in idxd context
> > > (though other devices may have less distinction in-between). Therefore,
> > > we didn't see the value/necessity of reinventing the wheel that mdev
> > > already handles well to evolve a simple application-oriented userspace
> > > DMA interface to a complex guest-driver-oriented passthrough interface.
> > > The complexity of doing so would incur far more kernel-side changes
> > > than the portion of emulation code that you've been concerned about...
> > >
> > > >
> > > > I don't think VFIO should be the only entry point to
> > > > virtualization. If we say the universe of devices doing user space DMA
> > > > must also implement a VFIO mdev to plug into virtualization then it
> > > > will be alot of mdevs.
> > >
> > > Certainly VFIO will not be the only entry point. and This has to be a
> > > case-by-case decision. If an userspace DMA interface can be easily
> > > adapted to be a passthrough one, it might be the choice. But for idxd,
> > > we see mdev a much better fit here, given the big difference between
> > > what userspace DMA requires and what guest driver requires in this hw.
> > >
> > > >
> > > > I would prefer to see that the existing userspace interface have the
> > > > extra needed bits for virtualization (eg by having appropriate
> > > > internal kernel APIs to make this easy) and all the emulation to build
> > > > the synthetic PCI device be done in userspace.
> > >
> > > In the end what decides the direction is the amount of changes that
> > > we have to put in kernel, not whether we call it 'emulation'. For idxd,
> > > adding special passthrough requirements (guest SVA, dirty tracking,
> > > etc.) and raw controllability to the simple userspace DMA interface
> > > is for sure making kernel more complex than reusing the mdev
> > > framework (plus some degree of emulation mockup behind). Not to
> > > mention the merit of uAPI compatibility with mdev...
> >
> > I agree with a lot of this argument, exposing a device through a
> > userspace interface versus allowing user access to a device through a
> > userspace interface are different levels of abstraction and control.
> > In an ideal world, perhaps we could compose one from the other, but I
> > don't think the existence of one is proof that the other is redundant.
> > That's not to say that mdev/vfio isn't ripe for abuse in this space,
> > but I'm afraid the test for that abuse is probably much more subtle.
> >
> > I'll also remind folks that LPC is coming up in just a couple short
> > weeks and this might be something we should discuss (virtually)
> > in-person. uconf CfPs are currently open. </plug> Thanks,
> >
>
> Yes, LPC is a good place to reach consensus. btw I saw there is
> already one VFIO topic called "device assignment/sub-assignment".
> Do you think whether this can be covered under that topic, or
> makes more sense to be a new one?
All the things listed in the CFP are only potential topics to get ideas
flowing, there is currently no proposal to talk about sub-assignment.
I'd suggest submitting separate topics for each and if we run into time
constraints we can ask that they might be combined together. Thanks,
Alex
On 2020/8/10 3:32 PM, Tian, Kevin wrote:
>> From: Jason Gunthorpe <[email protected]>
>> Sent: Friday, August 7, 2020 8:20 PM
>>
>> On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
>>
>>> If you see this as an abuse of the framework, then let's identify those
>>> specific issues and come up with a better approach. As we've discussed
>>> before, things like basic PCI config space emulation are acceptable
>>> overhead and low risk (imo) and some degree of register emulation is
>>> well within the territory of an mdev driver.
>> What troubles me is that idxd already has a direct userspace interface
>> to its HW, and does userspace DMA. The purpose of this mdev is to
>> provide a second direct userspace interface that is a little different
>> and trivially plugs into the virtualization stack.
> No. Userspace DMA and subdevice passthrough (what mdev provides)
> are two distinct usages IMO (at least in idxd context). and this might
> be the main divergence between us, thus let me put more words here.
> If we could reach consensus in this matter, which direction to go
> would be clearer.
>
> First, a passthrough interface requires some unique requirements
> which are not commonly observed in an userspace DMA interface, e.g.:
>
> - Tracking DMA dirty pages for live migration;
> - A set of interfaces for using SVA inside guest;
> * PASID allocation/free (on some platforms);
> * bind/unbind guest mm/page table (nested translation);
> * invalidate IOMMU cache/iotlb for guest page table changes;
> * report page request from device to guest;
> * forward page response from guest to device;
> - Configuring irqbypass for posted interrupt;
> - ...
>
> Second, a passthrough interface requires delegating raw controllability
> of subdevice to guest driver, while the same delegation might not be
> required for implementing an userspace DMA interface (especially for
> modern devices which support SVA). For example, idxd allows following
> setting per wq (guest driver may configure them in any combination):
> - put in dedicated or shared mode;
> - enable/disable SVA;
> - Associate guest-provided PASID to MSI/IMS entry;
> - set threshold;
> - allow/deny privileged access;
> - allocate/free interrupt handle (enlightened for guest);
> - collect error status;
> - ...
>
> We plan to support idxd userspace DMA with SVA. The driver just needs
> to prepare a wq with a predefined configuration (e.g. shared, SVA,
> etc.), bind the process mm to IOMMU (non-nested) and then map
> the portal to userspace. The goal that userspace can do DMA to
> associated wq doesn't change the fact that the wq is still *owned*
> and *controlled* by kernel driver. However as far as passthrough
> is concerned, the wq is considered 'owned' by the guest driver thus
> we need an interface which can support low-level *controllability*
> from guest driver. It is sort of a mess in uAPI when mixing the
> two together.
So for userspace drivers like DPDK, can they use both of the two uAPIs?
>
> Based on above two reasons, we see distinct requirements between
> userspace DMA and passthrough interfaces, at least in idxd context
> (though other devices may have less distinction in-between). Therefore,
> we didn't see the value/necessity of reinventing the wheel that mdev
> already handles well to evolve a simple application-oriented userspace
> DMA interface to a complex guest-driver-oriented passthrough interface.
> The complexity of doing so would incur far more kernel-side changes
> than the portion of emulation code that you've been concerned about...
>
>> I don't think VFIO should be the only entry point to
>> virtualization. If we say the universe of devices doing user space DMA
>> must also implement a VFIO mdev to plug into virtualization then it
>> will be alot of mdevs.
> Certainly VFIO will not be the only entry point. and This has to be a
> case-by-case decision.
The problem is that if we tie all controls to the VFIO uAPI, other
subsystems like vDPA are likely to duplicate them. I wonder if there is a
way to decouple vSVA from the VFIO uAPI?
> If an userspace DMA interface can be easily
> adapted to be a passthrough one, it might be the choice.
It's not that easy even for VFIO, which requires a lot of new uAPIs and
infrastructures (e.g. mdev) to be invented.
> But for idxd,
> we see mdev a much better fit here, given the big difference between
> what userspace DMA requires and what guest driver requires in this hw.
A weak point for mdev is that it can't serve kernel subsystems other than
VFIO. In this case, you need some other infrastructures (like [1]) to do
this.
(For idxd, you probably don't need this, but it's pretty common in the
case of networking or storage device.)
Thanks
[1] https://patchwork.kernel.org/patch/11280547/
>
>> I would prefer to see that the existing userspace interface have the
>> extra needed bits for virtualization (eg by having appropriate
>> internal kernel APIs to make this easy) and all the emulation to build
>> the synthetic PCI device be done in userspace.
> In the end what decides the direction is the amount of changes that
> we have to put in kernel, not whether we call it 'emulation'. For idxd,
> adding special passthrough requirements (guest SVA, dirty tracking,
> etc.) and raw controllability to the simple userspace DMA interface
> is for sure making kernel more complex than reusing the mdev
> framework (plus some degree of emulation mockup behind). Not to
> mention the merit of uAPI compatibility with mdev...
>
> Thanks
> Kevin
>
> From: Alex Williamson <[email protected]>
> Sent: Wednesday, August 12, 2020 10:36 AM
> On Wed, 12 Aug 2020 01:58:00 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > >
> > > I'll also remind folks that LPC is coming up in just a couple short
> > > weeks and this might be something we should discuss (virtually)
> > > in-person. uconf CfPs are currently open. </plug> Thanks,
> > >
> >
> > Yes, LPC is a good place to reach consensus. btw I saw there is
> > already one VFIO topic called "device assignment/sub-assignment".
> > Do you think whether this can be covered under that topic, or
> > makes more sense to be a new one?
>
> All the things listed in the CFP are only potential topics to get ideas
> flowing, there is currently no proposal to talk about sub-assignment.
> I'd suggest submitting separate topics for each and if we run into time
> constraints we can ask that they might be combined together. Thanks,
>
Done.
--
title: Criteria of using VFIO mdev (vs. userspace DMA)
Content:
VFIO mdev provides a framework for subdevice assignment and reuses
existing VFIO uAPI to handle common passthrough-related requirements.
However, a subdevice (e.g. an ADI as defined in Intel Scalable IOV) might not be
a PCI endpoint (e.g. just a work queue), and thus requires some degree of
emulation/mediation in the kernel to fit into the VFIO device API. Then there is
a concern about putting emulation in the kernel and how to judge abuse of
the mdev framework when it is simply used as an easy path to hook into the
virtualization stack. An associated open question is how to differentiate mdev
from userspace DMA frameworks (such as uacce), and whether building
passthrough features on top of a userspace DMA framework is a better
choice than using mdev.
Thanks
Kevin
> From: Jason Wang <[email protected]>
> Sent: Wednesday, August 12, 2020 11:28 AM
>
>
> On 2020/8/10 3:32 PM, Tian, Kevin wrote:
> >> From: Jason Gunthorpe <[email protected]>
> >> Sent: Friday, August 7, 2020 8:20 PM
> >>
> >> On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> >>
> >>> If you see this as an abuse of the framework, then let's identify those
> >>> specific issues and come up with a better approach. As we've discussed
> >>> before, things like basic PCI config space emulation are acceptable
> >>> overhead and low risk (imo) and some degree of register emulation is
> >>> well within the territory of an mdev driver.
> >> What troubles me is that idxd already has a direct userspace interface
> >> to its HW, and does userspace DMA. The purpose of this mdev is to
> >> provide a second direct userspace interface that is a little different
> >> and trivially plugs into the virtualization stack.
> > No. Userspace DMA and subdevice passthrough (what mdev provides)
> > are two distinct usages IMO (at least in idxd context). and this might
> > be the main divergence between us, thus let me put more words here.
> > If we could reach consensus in this matter, which direction to go
> > would be clearer.
> >
> > First, a passthrough interface requires some unique requirements
> > which are not commonly observed in an userspace DMA interface, e.g.:
> >
> > - Tracking DMA dirty pages for live migration;
> > - A set of interfaces for using SVA inside guest;
> > * PASID allocation/free (on some platforms);
> > * bind/unbind guest mm/page table (nested translation);
> > * invalidate IOMMU cache/iotlb for guest page table changes;
> > * report page request from device to guest;
> > * forward page response from guest to device;
> > - Configuring irqbypass for posted interrupt;
> > - ...
> >
> > Second, a passthrough interface requires delegating raw controllability
> > of subdevice to guest driver, while the same delegation might not be
> > required for implementing an userspace DMA interface (especially for
> > modern devices which support SVA). For example, idxd allows following
> > setting per wq (guest driver may configure them in any combination):
> > - put in dedicated or shared mode;
> > - enable/disable SVA;
> > - Associate guest-provided PASID to MSI/IMS entry;
> > - set threshold;
> > - allow/deny privileged access;
> > - allocate/free interrupt handle (enlightened for guest);
> > - collect error status;
> > - ...
> >
> > We plan to support idxd userspace DMA with SVA. The driver just needs
> > to prepare a wq with a predefined configuration (e.g. shared, SVA,
> > etc.), bind the process mm to IOMMU (non-nested) and then map
> > the portal to userspace. The goal that userspace can do DMA to
> > associated wq doesn't change the fact that the wq is still *owned*
> > and *controlled* by kernel driver. However as far as passthrough
> > is concerned, the wq is considered 'owned' by the guest driver thus
> > we need an interface which can support low-level *controllability*
> > from guest driver. It is sort of a mess in uAPI when mixing the
> > two together.
>
>
> So for userspace drivers like DPDK, it can use both of the two uAPIs?
yes.
>
>
> >
> > Based on above two reasons, we see distinct requirements between
> > userspace DMA and passthrough interfaces, at least in idxd context
> > (though other devices may have less distinction in-between). Therefore,
> > we didn't see the value/necessity of reinventing the wheel that mdev
> > already handles well to evolve a simple application-oriented userspace
> > DMA interface to a complex guest-driver-oriented passthrough interface.
> > The complexity of doing so would incur far more kernel-side changes
> > than the portion of emulation code that you've been concerned about...
> >
> >> I don't think VFIO should be the only entry point to
> >> virtualization. If we say the universe of devices doing user space DMA
> >> must also implement a VFIO mdev to plug into virtualization then it
> >> will be alot of mdevs.
> > Certainly VFIO will not be the only entry point. and This has to be a
> > case-by-case decision.
>
>
> The problem is that if we tie all controls via VFIO uAPI, the other
> subsystem like vDPA is likely to duplicate them. I wonder if there is a
> way to decouple the vSVA out of VFIO uAPI?
vSVA is a per-device (either pdev or mdev) feature and thus should naturally
be managed by its device driver (VFIO or vDPA). From this angle some
duplication is inevitable given VFIO and vDPA are orthogonal passthrough
frameworks. Within the kernel the majority of vSVA handling is done by the
IOMMU and IOASID modules, thus most of the logic is shared.
>
>
> > If an userspace DMA interface can be easily
> > adapted to be a passthrough one, it might be the choice.
>
>
> It's not that easy even for VFIO which requires a lot of new uAPIs and
> infrastructures(e.g mdev) to be invented.
>
>
> > But for idxd,
> > we see mdev a much better fit here, given the big difference between
> > what userspace DMA requires and what guest driver requires in this hw.
>
>
> A weak point for mdev is that it can't serve kernel subsystem other than
> VFIO. In this case, you need some other infrastructures (like [1]) to do
> this.
mdev does not exclude kernel usages. It's perfectly fine for a driver
to reserve some work queues for host usages, while wrapping others
into mdevs.
Thanks
Kevin
>
> (For idxd, you probably don't need this, but it's pretty common in the
> case of networking or storage device.)
>
> Thanks
>
> [1] https://patchwork.kernel.org/patch/11280547/
>
>
> >
> >> I would prefer to see that the existing userspace interface have the
> >> extra needed bits for virtualization (eg by having appropriate
> >> internal kernel APIs to make this easy) and all the emulation to build
> >> the synthetic PCI device be done in userspace.
> > In the end what decides the direction is the amount of changes that
> > we have to put in kernel, not whether we call it 'emulation'. For idxd,
> > adding special passthrough requirements (guest SVA, dirty tracking,
> > etc.) and raw controllability to the simple userspace DMA interface
> > is for sure making kernel more complex than reusing the mdev
> > framework (plus some degree of emulation mockup behind). Not to
> > mention the merit of uAPI compatibility with mdev...
> >
> > Thanks
> > Kevin
> >
On 2020/8/12 12:05 PM, Tian, Kevin wrote:
>> The problem is that if we tie all controls via VFIO uAPI, the other
>> subsystem like vDPA is likely to duplicate them. I wonder if there is a
>> way to decouple the vSVA out of VFIO uAPI?
> vSVA is a per-device (either pdev or mdev) feature thus naturally should
> be managed by its device driver (VFIO or vDPA). From this angle some
> duplication is inevitable given VFIO and vDPA are orthogonal passthrough
> frameworks. Within the kernel the majority of vSVA handling is done by
> IOMMU and IOASID modules thus most logic are shared.
So why not introduce vSVA uAPI at IOMMU or IOASID layer?
>
>>> If an userspace DMA interface can be easily
>>> adapted to be a passthrough one, it might be the choice.
>> It's not that easy even for VFIO which requires a lot of new uAPIs and
>> infrastructures(e.g mdev) to be invented.
>>
>>
>>> But for idxd,
>>> we see mdev a much better fit here, given the big difference between
>>> what userspace DMA requires and what guest driver requires in this hw.
>> A weak point for mdev is that it can't serve kernel subsystem other than
>> VFIO. In this case, you need some other infrastructures (like [1]) to do
>> this.
> mdev is not exclusive from kernel usages. It's perfectly fine for a driver
> to reserve some work queues for host usages, while wrapping others
> into mdevs.
I meant you may want slices to be an independent device from the kernel
point of view:
E.g. for ethernet devices, you may want 10K mdevs to be passed to guests.
Similarly, you may want 10K net devices which are connected to the kernel
networking subsystems.
In this case it's not simply reserving queues but you need some other
type of device abstraction. There could be some kind of duplication
between this and mdev.
Thanks
>
> Thanks
> Kevin
>
> From: Jason Wang <[email protected]>
> Sent: Thursday, August 13, 2020 12:34 PM
>
>
> On 2020/8/12 12:05 PM, Tian, Kevin wrote:
> >> The problem is that if we tie all controls via VFIO uAPI, the other
> >> subsystem like vDPA is likely to duplicate them. I wonder if there is a
> >> way to decouple the vSVA out of VFIO uAPI?
> > vSVA is a per-device (either pdev or mdev) feature thus naturally should
> > be managed by its device driver (VFIO or vDPA). From this angle some
> > duplication is inevitable given VFIO and vDPA are orthogonal passthrough
> > frameworks. Within the kernel the majority of vSVA handling is done by
> > IOMMU and IOASID modules thus most logic are shared.
>
>
> So why not introduce vSVA uAPI at IOMMU or IOASID layer?
One may ask a similar question: why doesn't the IOMMU expose map/unmap
as uAPI...
>
>
> >
> >>> If an userspace DMA interface can be easily
> >>> adapted to be a passthrough one, it might be the choice.
> >> It's not that easy even for VFIO which requires a lot of new uAPIs and
> >> infrastructures(e.g mdev) to be invented.
> >>
> >>
> >>> But for idxd,
> >>> we see mdev a much better fit here, given the big difference between
> >>> what userspace DMA requires and what guest driver requires in this hw.
> >> A weak point for mdev is that it can't serve kernel subsystem other than
> >> VFIO. In this case, you need some other infrastructures (like [1]) to do
> >> this.
> > mdev is not exclusive from kernel usages. It's perfectly fine for a driver
> > to reserve some work queues for host usages, while wrapping others
> > into mdevs.
>
>
> I meant you may want slices to be an independent device from the kernel
> point of view:
>
> E.g for ethernet devices, you may want 10K mdevs to be passed to guest.
>
> Similarly, you may want 10K net devices which is connected to the kernel
> networking subsystems.
>
> In this case it's not simply reserving queues but you need some other
> type of device abstraction. There could be some kind of duplication
> between this and mdev.
>
Yes, some abstraction is required, but isn't that what the driver should
care about instead of the mdev framework itself? If the driver reports
the same set of resources to both mdev and networking, it needs to
make sure that when a resource is claimed via one interface it is
marked in-use in the other. E.g. each mdev type includes an
available_instances attribute; the driver could report 10K available
instances initially and then update it to 5K when another 5K is used
for net devices later.
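As an illustration only (the attribute follows the mdev sysfs convention of
that time, but the idxd-side pool field is made up), such bookkeeping could
be surfaced like this:
static ssize_t available_instances_show(struct kobject *kobj,
                                        struct device *dev, char *buf)
{
        /* 'wq_pool_free' is a made-up counter standing in for whatever
         * pool the driver shares between mdev and kernel-internal users. */
        struct idxd_device *idxd = dev_get_drvdata(dev);

        return sprintf(buf, "%d\n", atomic_read(&idxd->wq_pool_free));
}
static MDEV_TYPE_ATTR_RO(available_instances);
Whenever wqs are claimed for host-side net devices (or released again), the
driver just updates the same counter, so both paths see a consistent view.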
Mdev definitely has its usage limitations. Some may be improved
in the future, some may not. But those are distracting from the
original purpose of this thread (mdev vs. userspace DMA) and better
be discussed in other places e.g. LPC...
Thanks
Kevin
On 2020/8/13 1:26 PM, Tian, Kevin wrote:
>> From: Jason Wang <[email protected]>
>> Sent: Thursday, August 13, 2020 12:34 PM
>>
>>
>> On 2020/8/12 12:05 PM, Tian, Kevin wrote:
>>>> The problem is that if we tie all controls via VFIO uAPI, the other
>>>> subsystem like vDPA is likely to duplicate them. I wonder if there is a
>>>> way to decouple the vSVA out of VFIO uAPI?
>>> vSVA is a per-device (either pdev or mdev) feature thus naturally should
>>> be managed by its device driver (VFIO or vDPA). From this angle some
>>> duplication is inevitable given VFIO and vDPA are orthogonal passthrough
>>> frameworks. Within the kernel the majority of vSVA handling is done by
>>> IOMMU and IOASID modules thus most logic are shared.
>>
>> So why not introduce vSVA uAPI at IOMMU or IOASID layer?
> One may ask a similar question why IOMMU doesn't expose map/unmap
> as uAPI...
I think this is probably a good idea as well. If there's anything missing
in the infrastructure, we can invent it. Besides vhost-vDPA, there are
other subsystems that relay their uAPI to the IOMMU API. Duplicating
uAPIs is usually a hint of code duplication. Simple map/unmap could
be easy, but the vSVA uAPI is much more complicated.
>
>>
>>>>> If an userspace DMA interface can be easily
>>>>> adapted to be a passthrough one, it might be the choice.
>>>> It's not that easy even for VFIO which requires a lot of new uAPIs and
>>>> infrastructures(e.g mdev) to be invented.
>>>>
>>>>
>>>>> But for idxd,
>>>>> we see mdev a much better fit here, given the big difference between
>>>>> what userspace DMA requires and what guest driver requires in this hw.
>>>> A weak point for mdev is that it can't serve kernel subsystem other than
>>>> VFIO. In this case, you need some other infrastructures (like [1]) to do
>>>> this.
>>> mdev is not exclusive from kernel usages. It's perfectly fine for a driver
>>> to reserve some work queues for host usages, while wrapping others
>>> into mdevs.
>>
>> I meant you may want slices to be an independent device from the kernel
>> point of view:
>>
>> E.g for ethernet devices, you may want 10K mdevs to be passed to guest.
>>
>> Similarly, you may want 10K net devices which is connected to the kernel
>> networking subsystems.
>>
>> In this case it's not simply reserving queues but you need some other
>> type of device abstraction. There could be some kind of duplication
>> between this and mdev.
>>
> yes, some abstraction required but isn't it what the driver should
> care about instead of mdev framework itself?
With mdev you present a "PCI" device, but what kind of device does it try
to present to the kernel? If it's still PCI, there's duplication with mdev;
if it's something new, maybe we can switch to that API.
> If the driver reports
> the same set of resource to both mdev and networking, it needs to
> make sure when the resource is claimed in one interface then it
> should be marked in-use in another. e.g. each mdev includes a
> available_instances attribute. the driver could report 10k available
> instances initially and then update it to 5K when another 5K is used
> for net devices later.
Right, but this probably means you need another management layer under mdev.
>
> Mdev definitely has its usage limitations. Some may be improved
> in the future, some may not. But those are distracting from the
> original purpose of this thread (mdev vs. userspace DMA) and better
> be discussed in other places e.g. LPC...
Ok.
Thanks
>
> Thanks
> Kevin
On Thu, Aug 13, 2020 at 02:01:58PM +0800, Jason Wang wrote:
>
> On 2020/8/13 1:26 PM, Tian, Kevin wrote:
> > > From: Jason Wang <[email protected]>
> > > Sent: Thursday, August 13, 2020 12:34 PM
> > >
> > >
> > > On 2020/8/12 12:05 PM, Tian, Kevin wrote:
> > > > > The problem is that if we tie all controls via VFIO uAPI, the other
> > > > > subsystem like vDPA is likely to duplicate them. I wonder if there is a
> > > > > way to decouple the vSVA out of VFIO uAPI?
> > > > vSVA is a per-device (either pdev or mdev) feature thus naturally should
> > > > be managed by its device driver (VFIO or vDPA). From this angle some
> > > > duplication is inevitable given VFIO and vDPA are orthogonal passthrough
> > > > frameworks. Within the kernel the majority of vSVA handling is done by
> > > > IOMMU and IOASID modules thus most logic are shared.
> > >
> > > So why not introduce vSVA uAPI at IOMMU or IOASID layer?
> > One may ask a similar question why IOMMU doesn't expose map/unmap
> > as uAPI...
>
>
> I think this is probably a good idea as well. If there's anything missed in
> the infrastructure, we can invent. Besides vhost-vDPA, there are other
> subsystems that relaying their uAPI to IOMMU API. Duplicating uAPIs is
> usually a hint of the codes duplication. Simple map/unmap could be easy but
> vSVA uAPI is much more complicated.
A way to create the vSVA objects unrelated to VFIO and then pass those
objects for device use into existing uAPIs, to me, makes a lot of
sense.
You should not have to use the VFIO API just to get vSVA.
Or stated another way, the existing user driver should be able to get
a PASID from the vSVA components as well as just create a PASID from
the local mm_struct.
The same basic argument goes for all the points - the issue is really
the only uAPI we have for this stuff is under VFIO, and the better
solution is to disaggregate that uAPI, not to try and make everything
pretend to be a VFIO device.
Jason
On Mon, Aug 10, 2020 at 07:32:24AM +0000, Tian, Kevin wrote:
> > I would prefer to see that the existing userspace interface have the
> > extra needed bits for virtualization (eg by having appropriate
> > internal kernel APIs to make this easy) and all the emulation to build
> > the synthetic PCI device be done in userspace.
>
> In the end what decides the direction is the amount of changes that
> we have to put in kernel, not whether we call it 'emulation'.
No, this is not right. The decision should be based on what will end
up more maintainable in the long run.
Yes it would be more code to dis-aggregate some of the things
currently only bundled as uAPI inside VFIO (eg your vSVA argument
above) but once it is disaggregated the maintainability of the whole
solution will be better overall, and more drivers will be able to use
this functionality.
Jason
> From: Jason Gunthorpe
> Sent: Friday, August 14, 2020 9:35 PM
>
> On Mon, Aug 10, 2020 at 07:32:24AM +0000, Tian, Kevin wrote:
>
> > > I would prefer to see that the existing userspace interface have the
> > > extra needed bits for virtualization (eg by having appropriate
> > > internal kernel APIs to make this easy) and all the emulation to build
> > > the synthetic PCI device be done in userspace.
> >
> > In the end what decides the direction is the amount of changes that
> > we have to put in kernel, not whether we call it 'emulation'.
>
> No, this is not right. The decision should be based on what will end
> up more maintable in the long run.
>
> Yes it would be more code to dis-aggregate some of the things
> currently only bundled as uAPI inside VFIO (eg your vSVA argument
> above) but once it is disaggregated the maintability of the whole
> solution will be better overall, and more drivers will be able to use
> this functionality.
>
Disaggregation is an orthogonal topic to the main divergence in
this thread, which is passthrough vs. userspace DMA. I gave a detailed
explanation of the difference between the two in my last reply.
The possibility of dis-aggregating something between passthrough
frameworks (e.g. VFIO and vDPA) is not a reason for growing
every userspace DMA framework into a passthrough framework.
Doing that instead hurts maintainability in general...
Thanks
Kevin
> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, August 14, 2020 9:24 PM
>
> The same basic argument goes for all the points - the issue is really
> the only uAPI we have for this stuff is under VFIO, and the better
> solution is to disaggregate that uAPI, not to try and make everything
> pretend to be a VFIO device.
>
Nobody is proposing to make everything VFIO. There must be some
criteria, which can be brainstormed at LPC. But the opposite also holds -
the fact that we should not make everything VFIO doesn't imply a
prohibition on anyone using it. There is a clear difference between the
passthrough and userspace DMA requirements in the idxd context, and we
see good reasons to use VFIO for our passthrough requirements.
Thanks
Kevin
On Mon, Aug 17, 2020 at 02:12:44AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Friday, August 14, 2020 9:35 PM
> >
> > On Mon, Aug 10, 2020 at 07:32:24AM +0000, Tian, Kevin wrote:
> >
> > > > I would prefer to see that the existing userspace interface have the
> > > > extra needed bits for virtualization (eg by having appropriate
> > > > internal kernel APIs to make this easy) and all the emulation to build
> > > > the synthetic PCI device be done in userspace.
> > >
> > > In the end what decides the direction is the amount of changes that
> > > we have to put in kernel, not whether we call it 'emulation'.
> >
> > No, this is not right. The decision should be based on what will end
> > up more maintable in the long run.
> >
> > Yes it would be more code to dis-aggregate some of the things
> > currently only bundled as uAPI inside VFIO (eg your vSVA argument
> > above) but once it is disaggregated the maintability of the whole
> > solution will be better overall, and more drivers will be able to use
> > this functionality.
> >
>
> Disaggregation is an orthogonal topic to the main divergence in
> this thread, which is passthrough vs. userspace DMA. I gave detail
> explanation about the difference between the two in last reply.
You said the first issue was related to SVA, which is understandable
because we have no SVA uAPIs outside VFIO.
Imagine if we had some /dev/sva that provided this API and user space
DMA drivers could simply accept an FD and work properly. It is not
such a big leap anymore, nor is it customized code in idxd.
The other pass through issue was IRQ, which last time I looked, was
fairly trivial to connect via interrupt remapping in the kernel, or
could be made extremely trivial.
The last seemed to be a concern that the current uAPI for idxd is
lacking some idxd-specific features, which seems like a quite weak
reason to use VFIO.
Jason
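(As a purely illustrative sketch of the /dev/sva idea above: no such uAPI exists, and the device node, ioctl number, struct layout, and wq path below are all invented for illustration. The point is only that a userspace DMA driver could accept an SVA fd, bind its address space, and get a PASID back without any VFIO involvement.)

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical uAPI -- nothing in this block exists in the kernel today. */
struct sva_bind_req {
	int device_fd;		/* fd of the device that will do the DMA */
	unsigned int pasid;	/* out: PASID bound to the current mm */
};
#define SVA_BIND_MM	_IOWR('s', 0x01, struct sva_bind_req)

int main(void)
{
	int sva_fd = open("/dev/sva", O_RDWR);		/* hypothetical node */
	int wq_fd  = open("/dev/dsa/wq0.0", O_RDWR);	/* example idxd wq node */
	struct sva_bind_req req = { .device_fd = wq_fd };

	if (sva_fd < 0 || wq_fd < 0)
		return 1;

	/* Bind the current process address space to the device; the device
	 * can then DMA to/from user virtual addresses under this PASID. */
	if (ioctl(sva_fd, SVA_BIND_MM, &req) < 0)
		return 1;

	printf("bound with PASID %u\n", req.pasid);
	return 0;
}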
> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, August 18, 2020 8:44 AM
>
> On Mon, Aug 17, 2020 at 02:12:44AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Friday, August 14, 2020 9:35 PM
> > >
> > > On Mon, Aug 10, 2020 at 07:32:24AM +0000, Tian, Kevin wrote:
> > >
> > > > > I would prefer to see that the existing userspace interface have the
> > > > > extra needed bits for virtualization (eg by having appropriate
> > > > > internal kernel APIs to make this easy) and all the emulation to build
> > > > > the synthetic PCI device be done in userspace.
> > > >
> > > > In the end what decides the direction is the amount of changes that
> > > > we have to put in kernel, not whether we call it 'emulation'.
> > >
> > > No, this is not right. The decision should be based on what will end
> > > up more maintainable in the long run.
> > >
> > > Yes it would be more code to dis-aggregate some of the things
> > > currently only bundled as uAPI inside VFIO (eg your vSVA argument
> > > above) but once it is disaggregated the maintainability of the whole
> > > solution will be better overall, and more drivers will be able to use
> > > this functionality.
> > >
> >
> > Disaggregation is an orthogonal topic to the main divergence in
> > this thread, which is passthrough vs. userspace DMA. I gave a detailed
> > explanation of the difference between the two in my last reply.
>
> You said the first issue was related to SVA, which is understandable
> because we have no SVA uAPIs outside VFIO.
>
> Imagine if we had some /dev/sva that provided this API and user space
> DMA drivers could simply accept an FD and work properly. It is not
> such a big leap anymore, nor is it customized code in idxd.
>
> The other pass through issue was IRQ, which last time I looked, was
> fairly trivial to connect via interrupt remapping in the kernel, or
> could be made extremely trivial.
>
> The last seemed to be a concern that the current uAPI for idxd is
> lacking some idxd-specific features, which seems like a quite weak
> reason to use VFIO.
>
The difference in my reply is not just about the implementation gap
of growing a userspace DMA framework into a passthrough framework.
My real point is about the different goals that each wants to achieve.
Userspace DMA is purely about allowing userspace to directly access
the portal and do DMA, but the wq configuration always stays under the
kernel driver's control. On the other hand, passthrough means delegating
full control of the wq to the guest, plus the associated machinery (live
migration, vSVA, posted interrupts, etc.) required for that to work. I really
don't see the value of mixing them together when there is already a good
candidate to handle passthrough...
Thanks
Kevin
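To make the userspace DMA side of that distinction concrete, a rough sketch follows. It assumes the wq has already been configured and enabled by the kernel driver; the /dev/dsa/wq0.0 path, the one-page portal mapping, and the empty 64-byte descriptor are illustrative (the descriptor layout from the DSA spec is elided), and the MOVDIR64B submission only applies to a dedicated wq on hardware that supports it.

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* The wq was configured/enabled by the kernel driver beforehand;
	 * userspace only gets the submission portal. */
	int fd = open("/dev/dsa/wq0.0", O_RDWR);
	if (fd < 0)
		return 1;

	/* Map the wq submission portal. */
	void *portal = mmap(NULL, 0x1000, PROT_WRITE, MAP_SHARED, fd, 0);
	if (portal == MAP_FAILED)
		return 1;

	/* Build a 64-byte descriptor (layout per the DSA spec, omitted) and
	 * push it to the portal with a 64-byte atomic write.  The .byte
	 * sequence encodes MOVDIR64B rax, [rdx] (x86-64 only). */
	uint8_t desc[64] = { 0 };
	asm volatile(".byte 0x66, 0x0f, 0x38, 0xf8, 0x02"
		     : : "a" (portal), "d" (desc) : "memory");

	munmap(portal, 0x1000);
	close(fd);
	return 0;
}

Nothing in this flow touches wq configuration; that stays with the kernel driver, which is exactly the asymmetry being described.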
On Tue, Aug 18, 2020 at 01:09:01AM +0000, Tian, Kevin wrote:
> The difference in my reply is not just about the implementation gap
> of growing a userspace DMA framework into a passthrough framework.
> My real point is about the different goals that each wants to achieve.
> Userspace DMA is purely about allowing userspace to directly access
> the portal and do DMA, but the wq configuration always stays under the
> kernel driver's control. On the other hand, passthrough means delegating
> full control of the wq to the guest, plus the associated machinery (live
> migration, vSVA, posted interrupts, etc.) required for that to work. I really
> don't see the value of mixing them together when there is already a good
> candidate to handle passthrough...
In Linux, a 'VM' and virtualization have always been a normal system
process that uses a few extra kernel features. This has been more or
less the cornerstone of that design since the start.
In that view it doesn't make any sense to say that uAPI from idxd that
is useful for virtualization somehow doesn't belong as part of the
standard uAPI.
Especially when it is such a small detail like what APIs are used to
configure the wq.
For instance, what about suspend/resume of containers using idxd?
Wouldn't you want to have the same basic approach of controlling the
wq from userspace that virtualization uses?
Jason
On 18/08/20 13:50, Jason Gunthorpe wrote:
> For instance, what about suspend/resume of containers using idxd?
> Wouldn't you want to have the same basic approach of controlling the
> wq from userspace that virtualization uses?
The difference is that VFIO more or less standardizes the approach you
use for live migration. With another interface you'd have to come up
with something for every driver, and add support in CRIU for every
driver as well.
Paolo
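For reference, the standardization being referred to is the VFIO migration region uAPI as it stood around the time of this thread (the v1 interface built around struct vfio_device_migration_info, since reworked in later kernels). A rough sketch of the saving side follows, assuming the migration region's offset within the device fd was already discovered via VFIO_DEVICE_GET_REGION_INFO (not shown), so the VMM drives every device the same way regardless of vendor:

#include <linux/vfio.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Move the device into pre-copy (SAVING while still RUNNING).  Clearing
 * _RUNNING later enters stop-and-copy, after which the migration data is
 * read out through the data_offset/data_size fields in the same region. */
static int start_precopy(int device_fd, uint64_t mig_region_off)
{
	uint32_t state = VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING;
	off_t off = mig_region_off +
		    offsetof(struct vfio_device_migration_info, device_state);

	if (pwrite(device_fd, &state, sizeof(state), off) != sizeof(state))
		return -1;
	return 0;
}

With a per-driver uAPI, each device would instead need its own save/restore interface and its own CRIU support, which is the point being made here.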
On Tue, Aug 18, 2020 at 06:27:21PM +0200, Paolo Bonzini wrote:
> On 18/08/20 13:50, Jason Gunthorpe wrote:
> > For instance, what about suspend/resume of containers using idxd?
> > Wouldn't you want to have the same basic approach of controlling the
> > wq from userspace that virtualization uses?
>
> The difference is that VFIO more or less standardizes the approach you
> use for live migration. With another interface you'd have to come up
> with something for every driver, and add support in CRIU for every
> driver as well.
VFIO is very unsuitable for use as a general userspace interface. It is
only 1:1 with a single process and just can't absorb what the existing
idxd userspace is doing.
So VFIO is already not a solution for normal userspace idxd where CRIU
becomes interesting. Not sure what you are trying to say?
My point was the opposite, if you want to enable CRIU for idxd then
you probably need all the same stuff as for qemu/VFIO except in the
normal idxd user API.
Jason
On 18/08/20 18:49, Jason Gunthorpe wrote:
> On Tue, Aug 18, 2020 at 06:27:21PM +0200, Paolo Bonzini wrote:
>> On 18/08/20 13:50, Jason Gunthorpe wrote:
>>> For instance, what about suspend/resume of containers using idxd?
>>> Wouldn't you want to have the same basic approach of controlling the
>>> wq from userspace that virtualization uses?
>>
>> The difference is that VFIO more or less standardizes the approach you
>> use for live migration. With another interface you'd have to come up
>> with something for every driver, and add support in CRIU for every
>> driver as well.
>
> VFIO is very unsuitable for use as a general userspace interface. It is
> only 1:1 with a single process and just can't absorb what the existing
> idxd userspace is doing.
The point of mdev is that it's not 1:1 anymore.
Paolo
> So VFIO is already not a solution for normal userspace idxd where CRIU
> becomes interesting. Not sure what you are trying to say?
>
> My point was the opposite, if you want to enable CRIU for idxd then
> you probably need all the same stuff as for qemu/VFIO except in the
> normal idxd user API.
>
> Jason
>
On Tue, Aug 18, 2020 at 07:05:16PM +0200, Paolo Bonzini wrote:
> On 18/08/20 18:49, Jason Gunthorpe wrote:
> > On Tue, Aug 18, 2020 at 06:27:21PM +0200, Paolo Bonzini wrote:
> >> On 18/08/20 13:50, Jason Gunthorpe wrote:
> >>> For instance, what about suspend/resume of containers using idxd?
> >>> Wouldn't you want to have the same basic approach of controlling the
> >>> wq from userspace that virtualization uses?
> >>
> >> The difference is that VFIO more or less standardizes the approach you
> >> use for live migration. With another interface you'd have to come up
> >> with something for every driver, and add support in CRIU for every
> >> driver as well.
> >
> > VFIO is very unsuitable for use as a general userspace interface. It is
> > only 1:1 with a single process and just can't absorb what the existing
> > idxd userspace is doing.
>
> The point of mdev is that it's not 1:1 anymore.
The lifecycle model of mdev is not compatible with how something like
idxd works; it needs a multi-open cdev.
Jason
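The "multi-open cdev" model contrasted with mdev here is just the ordinary character-device pattern in which each open() gets its own context, so many processes can share one wq node concurrently. A minimal kernel-side sketch (the device name and the contents of the per-open state are placeholders, not idxd code):

#include <linux/cdev.h>
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

/* Per-open state: unlike an mdev, which is bound to a single VFIO user for
 * its whole lifetime, every open() of this node gets its own context. */
struct wq_ctx {
	int placeholder;	/* e.g. PASID, mmap bookkeeping, ... */
};

static int wq_open(struct inode *inode, struct file *filp)
{
	struct wq_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

	if (!ctx)
		return -ENOMEM;
	filp->private_data = ctx;
	return 0;
}

static int wq_release(struct inode *inode, struct file *filp)
{
	kfree(filp->private_data);
	return 0;
}

static const struct file_operations wq_fops = {
	.owner	 = THIS_MODULE,
	.open	 = wq_open,
	.release = wq_release,
};

static dev_t wq_devt;
static struct cdev wq_cdev;

static int __init wq_demo_init(void)
{
	int ret = alloc_chrdev_region(&wq_devt, 0, 1, "wq-demo");

	if (ret)
		return ret;
	cdev_init(&wq_cdev, &wq_fops);
	ret = cdev_add(&wq_cdev, wq_devt, 1);
	if (ret)
		unregister_chrdev_region(wq_devt, 1);
	return ret;
}
module_init(wq_demo_init);

static void __exit wq_demo_exit(void)
{
	cdev_del(&wq_cdev);
	unregister_chrdev_region(wq_devt, 1);
}
module_exit(wq_demo_exit);

MODULE_LICENSE("GPL");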
> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, August 18, 2020 7:50 PM
>
> On Tue, Aug 18, 2020 at 01:09:01AM +0000, Tian, Kevin wrote:
> > The difference in my reply is not just about the implementation gap
> > of growing a userspace DMA framework into a passthrough framework.
> > My real point is about the different goals that each wants to achieve.
> > Userspace DMA is purely about allowing userspace to directly access
> > the portal and do DMA, but the wq configuration always stays under the
> > kernel driver's control. On the other hand, passthrough means delegating
> > full control of the wq to the guest, plus the associated machinery (live
> > migration, vSVA, posted interrupts, etc.) required for that to work. I really
> > don't see the value of mixing them together when there is already a good
> > candidate to handle passthrough...
>
> In Linux, a 'VM' and virtualization have always been a normal system
> process that uses a few extra kernel features. This has been more or
> less the cornerstone of that design since the start.
>
> In that view it doesn't make any sense to say that uAPI from idxd that
> is useful for virtualization somehow doesn't belong as part of the
> standard uAPI.
The point is that we already have a more standard uAPI (VFIO) which
is unified and vendor-agnostic to userspace. Creating an idxd-specific
uAPI to absorb requirements that VFIO already handles is not
compelling and instead causes more trouble for Qemu or other VMMs,
as they need to deal with every such driver uAPI even though Qemu
itself has no interest in the device details (since the real user is inside
the guest).
>
> Especially when it is such a small detail like what APIs are used to
> configure the wq.
>
> For instance, what about suspend/resume of containers using idxd?
> Wouldn't you want to have the same basic approach of controlling the
> wq from userspace that virtualization uses?
>
I'm not familiar with how container suspend/resume is done today.
But my gut feeling is that it's different from virtualization. For
virtualization, the whole wq is assigned to the guest, so the uAPI
must provide a way to save the wq state, including its configuration,
at suspend, and then restore that state to what the guest expects at
resume. However, in the container case, which does userspace DMA, the
wq is managed by the host kernel and could be shared between multiple
containers. So the wq state is irrelevant to the container. The only
relevant state is the in-flight workloads, which need a draining
interface. In this view I think the two have a major difference.
Thanks
Kevin
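To illustrate that asymmetry, the container/userspace-DMA side would need little more than a way to drain in-flight work on an already-open wq fd before checkpointing; there is no wq configuration to save because the host kernel owns it. A purely hypothetical sketch (no such ioctl exists in the idxd uAPI; the name and number are invented for illustration):

#include <sys/ioctl.h>

/* Hypothetical: block until every descriptor submitted through this fd has
 * completed, so the container can be checkpointed with no device-side wq
 * state in the image.  Not part of the real idxd uAPI. */
#define IDXD_WQ_DRAIN	_IO('i', 0x10)

static int quiesce_for_checkpoint(int wq_fd)
{
	/* The wq itself (size, mode, group, PASID setup) stays owned by the
	 * host kernel driver and never enters the checkpoint; only in-flight
	 * work has to finish first. */
	return ioctl(wq_fd, IDXD_WQ_DRAIN);
}

The passthrough case, by contrast, has to save and restore the full wq configuration on behalf of the guest, which is what the VFIO path is being proposed for.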